Project author: namsor

Project description:
NamSor Python command line tools, to append gender, origin, diaspora or US 'race'/ethnicity to a CSV file.
Primary language: Python
Repository: git://github.com/namsor/namsor-python-tools-v2.git
Created: 2019-10-25T07:33:04Z
Project page: https://github.com/namsor/namsor-python-tools-v2

License: GNU Affero General Public License v3.0


namsor-python-tools

NamSor Python command line tools, to append gender, origin, diaspora or US ‘race’/ethnicity to a CSV file. The CSV file should be UTF-8 encoded and pipe-delimited (|). It can be very large. See also the Java command line tools (https://github.com/namsor/namsor-tools-v2).
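As an illustration of the expected input, here is a minimal sketch that writes a pipe-delimited text in Python. The helper name `write_pipe_delimited` and the sample rows are made up for this example. Whether the tool accepts quoted fields is not stated here, so the sketch assumes names contain no `|` character.

```python
import csv
import io


def write_pipe_delimited(rows, stream):
    """Write each row as one pipe-delimited line (hypothetical helper)."""
    writer = csv.writer(stream, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


# Made-up sample rows: first name, last name (the fnln layout).
rows = [("John", "Smith"), ("Zoë", "Müller")]

buffer = io.StringIO()
write_pipe_delimited(rows, buffer)
sample_text = buffer.getvalue()
print(sample_text, end="")
```

To produce a real input file, open it with `open(path, "w", encoding="utf-8", newline="")` and pass the file object instead of the `StringIO` buffer.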

Please install the https://github.com/namsor/namsor-python-sdk2 and synchronized-set dependencies:

  pip install git+https://github.com/namsor/namsor-python-sdk2.git
  pip install synchronized-set

NB: we use Unix conventions for file paths, e.g. samples/some_fnln.txt, but on MS Windows that would be samples\some_fnln.txt
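If you build these paths from a script, the standard `pathlib` module can render the same relative path in either convention. A small sketch (the variable names are illustrative only):

```python
from pathlib import PurePosixPath, PureWindowsPath

relative = "samples/some_fnln.txt"

# Pure paths only manipulate strings; they never touch the file system.
posix_form = str(PurePosixPath(relative))
windows_form = str(PureWindowsPath(relative))

print(posix_form)
print(windows_form)
```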

Then clone this project and start from the base directory.

  git clone https://github.com/namsor/namsor-python-tools-v2/
  cd namsor-python-tools-v2

Running

  python namsor_tools.py
  python3 namsor_tools.py [-h] -apiKey APIKEY -i INPUTFILE [-countryIso2 COUNTRYISO2] [-o OUTPUTFILE] [-w]
                          [-r] -f INPUTDATAFORMAT [-header] [-uid] [-digest] -service SERVICE [-e ENCODING]
                          [-usraceethnicityoption USRACEETHNICITYOPTION]

Detailed usage

  python3
  usage: namsor_tools.py [-h] -apiKey APIKEY -i INPUTFILE [-countryIso2 COUNTRYISO2] [-o OUTPUTFILE] [-w]
                         [-r] -f INPUTDATAFORMAT [-header] [-uid] [-digest] -service SERVICE [-e ENCODING]
                         [-usraceethnicityoption USRACEETHNICITYOPTION]

  Main parser for namsor_commandline_tool

  options:
    -h, --help            show this help message and exit
    -apiKey APIKEY, --apiKey APIKEY
                          NamSor API Key
    -i INPUTFILE, --inputFile INPUTFILE
                          input file name
    -countryIso2 COUNTRYISO2, --countryIso2 COUNTRYISO2
                          countryIso2 default
    -o OUTPUTFILE, --outputFile OUTPUTFILE
                          output file name
    -w, --overwrite       overwrite existing output file
    -r, --recover         continue from a job (requires uid)
    -f INPUTDATAFORMAT, --inputDataFormat INPUTDATAFORMAT
                          input data format : first name, last name (fnln) / first name, last name, geo country iso2
                          (fnlngeo) / / first name, last name, geo country iso2, subdivision (fnlngeosub) / full name
                          (name) / full name, geo country iso2 (namegeo) / full name, geo country iso2, subdivision
                          (namegeosub)
    -header, --header     output header
    -uid, --uid           input data has an ID prefix
    -digest, --digest     SHA-256 digest names in output
    -service SERVICE, --endpoint SERVICE
                          service : parse / gender / origin / country / diaspora / usraceethnicity / religion /
                          castegroup
    -e ENCODING, --encoding ENCODING
                          encoding : UTF-8 by default
    -usraceethnicityoption USRACEETHNICITYOPTION, --usraceethnicityoption USRACEETHNICITYOPTION
                          extra usraceethnicity option USRACEETHNICITY-4CLASSES USRACEETHNICITY-4CLASSES-CLASSIC
                          USRACEETHNICITY-6CLASSES
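If you want to drive the tool from another Python script, one option is to assemble the argument list yourself and hand it to `subprocess`. The sketch below is not part of the project: `build_command` and its parameter names are hypothetical, and it only emits flags that appear in the help text above.

```python
def build_command(api_key, input_file, data_format, service, *,
                  output_file=None, overwrite=False, header=False,
                  uid=False, digest=False,
                  python="python3", script="namsor_tools.py"):
    """Assemble an argv list for namsor_tools.py (hypothetical helper)."""
    # Required arguments, as listed in the usage line.
    cmd = [python, script, "-apiKey", api_key, "-i", input_file,
           "-f", data_format, "-service", service]
    if output_file is not None:
        cmd += ["-o", output_file]
    # Optional boolean switches are appended only when enabled.
    for enabled, flag in ((overwrite, "-w"), (header, "-header"),
                          (uid, "-uid"), (digest, "-digest")):
        if enabled:
            cmd.append(flag)
    return cmd


example = build_command("<yourAPIKey>", "samples/some_idfnlngeo.txt",
                        "fnlngeo", "gender",
                        overwrite=True, header=True, uid=True)
print(" ".join(example))
```

The resulting list can be passed to `subprocess.run(example, check=True)` from the project's base directory. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.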

Examples

To append the likely name gender to a list of first and last names (John|Smith):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service gender

To append the likely name origin to a list of first and last names (John|Smith):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service origin

To append the likely country of residence to a list of full names (John Smith):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service country

To parse names into first and last name components (John Smith or Smith, John -> John|Smith):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service parse

The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.

To append gender to a list of id, first and last names, and geographic context (id12|John|Smith|US):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender
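A minimal sketch of how one such input line (id, first name, last name, countryIso2) could be produced from existing records. The function `to_idfnlngeo_line` is hypothetical, and the checks it performs (two-letter country code, no `|` inside a field) are assumptions made for the example rather than rules documented by the tool.

```python
def to_idfnlngeo_line(uid, first_name, last_name, country_iso2):
    """Format one record as id|first|last|countryIso2 (hypothetical helper)."""
    if len(country_iso2) != 2 or not country_iso2.isalpha():
        raise ValueError(f"not a two-letter country code: {country_iso2!r}")
    fields = [uid, first_name, last_name, country_iso2.upper()]
    # A pipe inside a field would shift the columns, so reject it up front.
    if any("|" in field for field in fields):
        raise ValueError("fields must not contain the '|' delimiter")
    return "|".join(fields)


line = to_idfnlngeo_line("id12", "John", "Smith", "us")
print(line)
```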

To append the ethnicity (in the sense of cultural heritage / country of origin of ancestors) to a list of id, first and last names, and geographic context (id12|John|Smith|US):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service diaspora

To append US ‘race’/ethnicity to a list of id, first and last names, and geographic context (id12|John|Smith|US):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlnUS.txt -service usraceethnicity

To append US ‘race’/ethnicity to a list of id, first and last names, and geographic context (id12|John|Smith|US), with the option USRACEETHNICITY-6CLASSES:

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlnUS.txt -service usraceethnicity --usraceethnicityoption USRACEETHNICITY-6CLASSES

To parse names into first and last name components, a geographic context is recommended, especially for Latin American names (id12|John Smith|US):

  python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f namegeo -i samples/some_idnamegeo.txt -service parse

On large input files with a unique ID, it is possible to recover from where the process crashed and append to the existing output file, for example:

  python namsor_tools.py -apiKey <yourAPIKey> -r -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender
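To see why recovery needs a unique ID, consider the following conceptual sketch. It is not the tool's actual implementation: the function `pending_lines` is hypothetical and the output line shown is a placeholder. The idea is that IDs already present in the output identify which input lines can be skipped.

```python
def pending_lines(input_lines, output_lines):
    """Return input lines whose leading ID is absent from the output."""
    # The ID is assumed to be the first pipe-delimited field of each line.
    done = {line.split("|", 1)[0] for line in output_lines if line.strip()}
    return [line for line in input_lines
            if line.strip() and line.split("|", 1)[0] not in done]


input_lines = ["id1|John|Smith|US", "id2|Jane|Doe|US", "id3|Ana|Silva|BR"]
# Placeholder output: only id1 was processed before the interruption.
output_lines = ["id1|John|Smith|US|<appended columns>"]

remaining = pending_lines(input_lines, output_lines)
print(remaining)
```

Without an ID column there would be no reliable key for matching output rows to input rows, since the same name can occur more than once in the input.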

For Indian names (for now), you can infer the likely Indian state or union territory (i.e. a subdivision of the country as per ISO 3166-2:IN):

  python namsor_tools.py -apiKey <yourAPIKey> -r -header -uid -f fnlngeo -i samples/some_indian_idfnlngeo.txt -service subdivision

For Indian names (for now), you can infer the likely religion (provided the IN country code and state/union territory as per ISO 3166-2:IN):

  python namsor_tools.py -apiKey <yourAPIKey> -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service religion

For Indian names (for now), you can infer the likely caste group (provided the IN country code and state/union territory as per ISO 3166-2:IN):

  python namsor_tools.py -apiKey <yourAPIKey> -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service castegroup

Anonymizing output data

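The -digest option will digest personal names in file outputs, using a non-reversible MD-5 hash. For example, John Smith will become 6117323d2cabbc17d44c2b44587f682c. Note that the built-in help text describes this option as a SHA-256 digest, whereas the example value above has the 32-hexadecimal-digit length of an MD5 digest.
Please note that this doesn’t apply to the PARSE output.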
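For readers unfamiliar with digests, here is a minimal sketch using Python's standard `hashlib`. The function `digest_name` is hypothetical. Exactly which string the tool hashes (casing, separators, encoding) is not specified here, so the values it produces may differ from the tool's output; the sketch only shows the general technique and the digest lengths.

```python
import hashlib


def digest_name(name, algorithm="md5"):
    """Return the hex digest of a UTF-8 encoded name (hypothetical helper)."""
    return hashlib.new(algorithm, name.encode("utf-8")).hexdigest()


md5_hex = digest_name("John Smith")
sha256_hex = digest_name("John Smith", "sha256")

# MD5 yields 128 bits (32 hex digits); SHA-256 yields 256 bits (64 hex digits).
print(len(md5_hex), len(sha256_hex))
```

The same input always gives the same digest, so digested names can still be grouped or joined across files. Keep in mind that unsalted hashes of common names can be guessed by hashing a list of candidate names.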

Understanding output

Please read and contribute to the wiki:
https://github.com/namsor/namsor-tools-v2/wiki/NamSor-Tools-V2

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.