The country converter (coco) - a Python package for converting country names between different classification schemes.

HamedxRF/country_converter

 
 

country converter

The country converter (coco) is a Python package to convert and match country names between different classifications and between different naming versions. Internally it uses regular expressions to match country names. Coco can also be used to build aggregation concordance matrices between different classification schemes.

To date, there is no single standard for naming or specifying individual countries in a (meta) data description. While some data sources follow ISO 3166, this standard itself defines a two-letter and a three-letter code in addition to a numerical classification. To further complicate matters, many databases use unstandardised country names instead of one of the existing standards.

The country converter (coco) automates the conversion between different standards and versions of country names. Internally, coco is based on a table specifying the different ISO and UN standards per country, together with the official name and a regular expression which aims to match all English versions of a specific country name. In addition, coco includes classifications based on UN, EU and OECD membership, UN region specifications, continents and various MRIO databases (see Classification schemes below).
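The table-plus-regular-expression idea described above can be sketched with the standard library alone. This is a minimal illustration, not coco's implementation: the five rows, the column layout and the regular expressions below are hand-written assumptions, and the helper names find_country and convert_by_regex are hypothetical.

```python
import re

# Illustrative rows only (assumption): coco's real table has many more
# countries and columns, and more carefully tuned regular expressions.
COUNTRY_TABLE = [
    {"name_short": "Germany", "ISO2": "DE", "ISO3": "DEU", "ISOnumeric": 276,
     "regex": r"germany"},
    {"name_short": "Tanzania", "ISO2": "TZ", "ISO3": "TZA", "ISOnumeric": 834,
     "regex": r"tanzania"},
    {"name_short": "Myanmar", "ISO2": "MM", "ISO3": "MMR", "ISOnumeric": 104,
     "regex": r"myanmar|burma"},
    {"name_short": "South Korea", "ISO2": "KR", "ISO3": "KOR", "ISOnumeric": 410,
     "regex": r"^(?!.*\b(dem|north)).*\bkorea"},
    {"name_short": "North Korea", "ISO2": "KP", "ISO3": "PRK", "ISOnumeric": 408,
     "regex": r"\bdem.*\bkorea|\bkorea.*\bdem|north korea"},
]


def find_country(name):
    """Return the first table row whose regex matches `name`, else None."""
    for row in COUNTRY_TABLE:
        if re.search(row["regex"], name, flags=re.IGNORECASE):
            return row
    return None


def convert_by_regex(names, to="name_short", not_found="not found"):
    """Look up each name by regex and return the requested column."""
    result = []
    for name in names:
        row = find_country(name)
        result.append(row[to] if row else not_found)
    return result


print(convert_by_regex(["United Rep. of Tanzania", "Burma",
                        "Korea, Republic of", "Dem. People's Rep. of Korea"]))
# ['Tanzania', 'Myanmar', 'South Korea', 'North Korea']
```

Keeping one expression per country in the same row as its codes means that any matched name can be translated to any other column with a single lookup.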

Country_converter is registered at PyPI. From the command line:

pip install country_converter --upgrade

To install from the Anaconda Cloud use:

conda install -c konstantinstadler country_converter

Alternatively, the source code is available on GitHub.

The package depends on Pandas; for testing py.test is required. For further information on running the tests see CONTRIBUTING.rst.

Convert various country names to some standard names:

import country_converter as coco
some_names = ['United Rep. of Tanzania', 'DE', 'Cape Verde', '788', 'Burma', 'COG',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)

Which results in ['Tanzania', 'Germany', 'Cabo Verde', 'Tunisia', 'Myanmar', 'Congo Republic', 'Iran', 'South Korea', 'North Korea']. The input format is determined automatically, based on ISO two letter, ISO three letter, ISO numeric or regular expression matching. In case of any ambiguity, the source format can be specified with the parameter 'src'.
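One plausible way to determine the input format automatically is to look at the shape of each entry. The heuristic below is an assumption made for illustration and may differ from what coco does internally; guess_source_format is a hypothetical helper, not part of the package.

```python
def guess_source_format(name):
    """Guess the classification of a single input entry (illustrative heuristic)."""
    text = str(name).strip()
    if text.isdigit():
        return "ISOnumeric"   # e.g. '788'
    if text.isalpha() and len(text) == 2:
        return "ISO2"         # e.g. 'DE'
    if text.isalpha() and len(text) == 3:
        return "ISO3"         # e.g. 'COG'
    return "regex"            # anything else is treated as a country name


for entry in ["United Rep. of Tanzania", "DE", "Cape Verde", "788", "Burma", "COG"]:
    print(entry, "->", guess_source_format(entry))
```

A shape-based guess like this cannot tell a short name or abbreviation from a code, which is the kind of ambiguity the 'src' parameter resolves.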

In case of multiple conversions, better performance can be achieved by instantiating a single CountryConverter object for all conversions:

import country_converter as coco
cc = coco.CountryConverter()

some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]

standard_names = cc.convert(names = some_names, to = 'name_short')
UNmembership = cc.convert(names = some_names, to = 'UNmember')
print(standard_names)
print(UNmembership)

Convert between classification schemes:

iso3_codes = ['USA', 'VUT', 'TKL', 'AUT', 'XXX' ]
iso2_codes = coco.convert(names=iso3_codes, to='ISO2')
print(iso2_codes)

Which results in ['US', 'VU', 'TK', 'AT', 'not found']

The not found indication can be specified (e.g. not_found = 'not there'). If None is passed for 'not_found', the original entry gets passed through:

iso2_codes = coco.convert(names=iso3_codes, to='ISO2', not_found=None)
print(iso2_codes)

results in ['US', 'VU', 'TK', 'AT', 'XXX']
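The three not_found behaviours (default marker, custom marker, pass-through) can be mimicked with a plain dictionary lookup. This is only a sketch of the semantics described above: the four-entry mapping and the function convert_codes are illustrative assumptions and do not use coco.

```python
# Hand-written excerpt of an ISO3 -> ISO2 mapping (illustration only).
ISO3_TO_ISO2 = {"USA": "US", "VUT": "VU", "TKL": "TK", "AUT": "AT"}


def convert_codes(codes, not_found="not found"):
    """Convert ISO3 codes to ISO2; unknown codes are marked or passed through."""
    converted = []
    for code in codes:
        if code in ISO3_TO_ISO2:
            converted.append(ISO3_TO_ISO2[code])
        elif not_found is None:
            converted.append(code)       # pass the original entry through
        else:
            converted.append(not_found)  # use the (custom) marker
    return converted


iso3_codes = ["USA", "VUT", "TKL", "AUT", "XXX"]
print(convert_codes(iso3_codes))                         # ['US', 'VU', 'TK', 'AT', 'not found']
print(convert_codes(iso3_codes, not_found="not there"))  # ['US', 'VU', 'TK', 'AT', 'not there']
print(convert_codes(iso3_codes, not_found=None))         # ['US', 'VU', 'TK', 'AT', 'XXX']
```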

Internally the data is stored in a Pandas DataFrame, which can be accessed directly. For example, this can be used to filter countries for membership organisations (per year). Note: for this, an instance of CountryConverter is required.

import country_converter as coco
cc = coco.CountryConverter()

some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic',
                  'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary',
                  'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania',
                  'Luxembourg', 'Malta', 'Romania', 'Russia', 'Turkey', 'United Kingdom',
                  'United States']

oecd_since_1995 = cc.data[(cc.data.OECD >= 1995) & cc.data.name_short.isin(some_countries)].name_short
eu_until_1980 = cc.data[(cc.data.EU <= 1980) & cc.data.name_short.isin(some_countries)].name_short
print(oecd_since_1995)
print(eu_until_1980)
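In the filter above, the membership columns hold the year of accession, so a condition such as EU <= 1980 selects countries that had joined by 1980, and OECD >= 1995 selects those that joined in 1995 or later. The selection logic can be sketched in plain Python. The small dictionary below is typed in by hand for illustration and is not read from coco; members_joined is a hypothetical helper.

```python
# Year of EU accession for a few countries (hand-entered for illustration);
# None marks a country that is not a member.
EU_JOIN_YEAR = {
    "Denmark": 1973, "Ireland": 1973, "Greece": 1981,
    "Finland": 1995, "Estonia": 2004, "Japan": None,
}


def members_joined(join_years, countries, earliest=None, latest=None):
    """Select countries whose accession year lies within [earliest, latest]."""
    selected = []
    for country in countries:
        year = join_years.get(country)
        if year is None:
            continue  # not a member, or not in the table
        if earliest is not None and year < earliest:
            continue
        if latest is not None and year > latest:
            continue
        selected.append(country)
    return selected


some_countries = ["Denmark", "Estonia", "Finland", "Greece", "Ireland", "Japan", "Brazil"]
print(members_joined(EU_JOIN_YEAR, some_countries, latest=1980))    # ['Denmark', 'Ireland']
print(members_joined(EU_JOIN_YEAR, some_countries, earliest=1995))  # ['Estonia', 'Finland']
```

The explicit None check plays the role of the missing values in the DataFrame, which never satisfy a comparison.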

Some properties provide direct access to affiliations:

cc.EU28
cc.OECD

cc.EU27as('ISO3')

and the classification schemes available:

cc.valid_class

The regular expressions can also be used to match any list of countries to any other. For example:

match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too',
               'Peoples Republic of China', 'Republic of China' ]

matching_dict = coco.match(match_these, master_list)
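Conceptually, matching two lists means reducing every entry on both sides to the same canonical country and then joining on that key. The sketch below shows the idea with the standard library. The five regular expressions are simplified assumptions, and canonical and match_lists are hypothetical helpers; the result format of coco.match itself is not taken from this sketch.

```python
import re

# Simplified, hand-written patterns (assumption); coco ships its own.
PATTERNS = {
    "Norway": r"norway",
    "Sweden": r"swed",
    "United States": r"united.?states|^usa$",
    "China": r"^china$|people.?s republic of china",
    "Taiwan": r"taiwan|^republic of china$",
}


def canonical(name):
    """Return the canonical country for `name`, or None if nothing matches."""
    for country, pattern in PATTERNS.items():
        if re.search(pattern, name.strip(), flags=re.IGNORECASE):
            return country
    return None


def match_lists(match_these, master_list):
    """Map each entry of `match_these` to the master entry for the same country."""
    master_by_country = {}
    for entry in master_list:
        country = canonical(entry)
        if country is not None:
            master_by_country.setdefault(country, entry)
    return {name: master_by_country.get(canonical(name)) for name in match_these}


match_these = ["norway", "united_states", "china", "taiwan"]
master_list = ["USA", "The Swedish Kingdom", "Norway is a Kingdom too",
               "Peoples Republic of China", "Republic of China"]
print(match_lists(match_these, master_list))
# {'norway': 'Norway is a Kingdom too', 'united_states': 'USA',
#  'china': 'Peoples Republic of China', 'taiwan': 'Republic of China'}
```

Master entries without a counterpart (here 'The Swedish Kingdom') are simply left out of the result.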

See the IPython Notebook (country_converter_examples.ipynb) for more information.

The country converter package also provides a command line interface called coco.

Minimal example:

coco Cyprus DE Denmark Estonia 4 'United Kingdom' AUT

This converts the given names to ISO3 codes by matching the input against ISO2, ISO3 and ISOnumeric codes or, failing that, against the regular expressions. The names in the list must be separated by spaces; country names consisting of multiple words must be put in quotes ('').

The input classification can be specified with '--src' or '-s' (or will be determined automatically), the target classification with '--to' or '-t'.

The default output is a space separated list. This can be changed by passing a separator with '--output_sep' or '-o' (e.g. -o '|').

Thus, to convert from ISO3 to UN number codes and receive the output as comma separated list use:

coco AUT DEU VAT AUS -s ISO3 -t UNcode -o ', '

The command line tool also lets you specify the output for entries that are not found, including passing them through to the output by passing None:

coco CAN Peru US Mexico Venezuela UK Arendelle --not_found=None

and to specify an additional data file which will overwrite existing country matchings:

coco Congo --additional_data path/to/datafile.csv

See https://github.com/konstantinstadler/country_converter/tree/master/tests/custom_data_example.txt for an example of an additional datafile.

The flags --UNmember_only (-u) and --include_obsolete (-i) restrict the search to UN member states only or extend it to also include currently obsolete countries. For example, the Netherlands Antilles were dissolved in 2010.

Thus:

coco "Netherlands Antilles"

results in "not found". The search, however, can be extended to recently dissolved countries by:

coco "Netherlands Antilles" -i

which results in 'ANT'.

In addition to the countries, the coco command line tool also accepts various country classifications (EXIO1, EXIO2, EXIO3, WIOD, Eora, MESSAGE, OECD, EU27, EU28, UN, obsolete, Cecilia2050, BRIC, APEC, BASIC, CIS, G7, G20). One of these can be passed by

coco G20

which lists all countries in that classification.

For the classifications covering almost all countries (MRIO and IAM classifications)

coco EXIO3

lists the unique classification names. When passing a --to parameter, a simplified correspondence of the chosen classification is printed:

coco EXIO3 --to ISO3

For further information call the help by

coco -h

Newer (tested in 2016a) versions of Matlab can call Python functions and libraries directly. This requires a Python version >= 3.4 installed in the system path (e.g. through Anaconda).

To test, try this in Matlab:

py.print(py.sys.version)

If this works, you can also use coco after installing it through pip (at the Windows command line, see the installation instructions above):

pip install country_converter --upgrade

And in Matlab:

coco = py.country_converter.CountryConverter()
countries = {'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China'};
ISO2_pythontype = coco.convert(countries, pyargs('to', 'ISO2'));
ISO2_cellarray = cellfun(@char,cell(ISO2_pythontype),'UniformOutput',false);

Alternatively, as a long one-liner:

short_names = cellfun(@char, cell(py.country_converter.convert({56, 276}, pyargs('src', 'UNcode', 'to', 'name_short'))), 'UniformOutput',false);

All properties of coco as explained above are also available in Matlab:

coco = py.country_converter.CountryConverter();
coco.EU27
EU27ISO3 = coco.EU27as('ISO3');

These functions return a Pandas DataFrame. The underlying values can be accessed with .values, e.g.:

EU27ISO3.values

I leave it to professional Matlab users to figure out how to further process them.

See also the IPython Notebook (country_converter_examples.ipynb) for more information. All functions available in Python (for example passing additional data files or specifying the output in case of missing data) also work in Matlab by passing arguments through the pyargs function.

Coco provides a function for building concordance vectors, matrices and dictionaries between different classifications. This can be used in Python as well as in Matlab. For further information see country_converter_aggregation_helper.ipynb.
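To make the term concrete: a concordance matrix has one row per original country and one column per aggregate region, with a 1 where the country belongs to the region. The sketch below builds such a matrix by hand and uses it to aggregate a data vector. It does not call coco; the mapping, the numbers and the helpers build_concordance and aggregate are illustrative assumptions.

```python
def build_concordance(countries, country_to_region):
    """Return (regions, matrix) with matrix[i][j] == 1 if country i is in region j."""
    regions = sorted(set(country_to_region[c] for c in countries))
    matrix = [[1 if country_to_region[c] == r else 0 for r in regions]
              for c in countries]
    return regions, matrix


def aggregate(values, matrix):
    """Sum country-level values into regions (values times concordance matrix)."""
    n_regions = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(values, matrix))
            for j in range(n_regions)]


# Hand-written country -> region mapping (illustration only).
countries = ["AUT", "DEU", "USA", "CAN", "JPN"]
country_to_region = {"AUT": "Europe", "DEU": "Europe",
                     "USA": "America", "CAN": "America", "JPN": "Asia"}

regions, matrix = build_concordance(countries, country_to_region)
print(regions)                                  # ['America', 'Asia', 'Europe']
print(matrix)                                   # [[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0], [0, 1, 0]]
print(aggregate([10, 20, 30, 40, 50], matrix))  # [70, 50, 30]
```

The same mapping read as a list of region labels per country is the concordance vector, and read as a key-value mapping it is the concordance dictionary.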

Currently the following classification schemes are available (see also Data sources below for further information):

  1. ISO2 (ISO 3166-1 alpha-2)
  2. ISO3 (ISO 3166-1 alpha-3)
  3. ISO - numeric (ISO 3166-1 numeric)
  4. UN numeric code (M.49 - follows to a large extent ISO-numeric)
  5. A standard or short name
  6. The "official" name
  7. Continent
  8. UN region
  9. EXIOBASE 1 classification
  10. EXIOBASE 2 classification
  11. EXIOBASE 3 classification
  12. WIOD classification
  13. Eora
  14. OECD membership (per year)
  15. MESSAGE 11-region classification
  16. UN membership (per year)
  17. EU membership (per year)
  18. Cecilia 2050 classification
  19. APEC
  20. BRIC
  21. BASIC
  22. CIS (as of 2019, excl. Turkmenistan)
  23. G7
  24. G20 (listing all EU member states as individual members)

Coco contains officially recognised codes as well as non-standard codes for disputed or dissolved countries. To restrict the set to officially recognised UN members only, or to include obsolete countries, pass the corresponding option when creating the converter:

import country_converter as coco
cc = coco.CountryConverter()
cc_UN = coco.CountryConverter(only_UNmember=True)
cc_all = coco.CountryConverter(include_obsolete=True)

cc.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')
cc_UN.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')
cc_all.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')

cc results in ['Palestine', 'Kosovo', 'not found', 'France'], whereas cc_UN converts to ['not found', 'not found', 'not found', 'France'] and cc_all converts to ['Palestine', 'Kosovo', 'Zanzibar', 'France']. Note that the underlying dataframe is available at the attribute .data (e.g. cc_all.data).
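The effect of the two options can be pictured as filtering the rows of the country table before any lookup takes place. The sketch below reproduces the three results above with a hand-written four-row table. The column names "UNmember" and "obsolete", their values and the helpers select_rows and convert_iso3 are assumptions made for illustration and need not match coco's actual data layout.

```python
# Hand-written rows (assumption): "UNmember" holds the accession year or None,
# "obsolete" holds the year the code became obsolete or None.
ROWS = [
    {"ISO3": "FRA", "name_short": "France", "UNmember": 1945, "obsolete": None},
    {"ISO3": "PSE", "name_short": "Palestine", "UNmember": None, "obsolete": None},
    {"ISO3": "XKX", "name_short": "Kosovo", "UNmember": None, "obsolete": None},
    {"ISO3": "EAZ", "name_short": "Zanzibar", "UNmember": None, "obsolete": 1964},
]


def select_rows(rows, only_UNmember=False, include_obsolete=False):
    """Return the rows a converter with the given options would search."""
    selected = []
    for row in rows:
        if only_UNmember and row["UNmember"] is None:
            continue
        if not include_obsolete and row["obsolete"] is not None:
            continue
        selected.append(row)
    return selected


def convert_iso3(codes, rows, not_found="not found"):
    """Translate ISO3 codes to short names using only the given rows."""
    lookup = {row["ISO3"]: row["name_short"] for row in rows}
    return [lookup.get(code, not_found) for code in codes]


codes = ["PSE", "XKX", "EAZ", "FRA"]
print(convert_iso3(codes, select_rows(ROWS)))
# ['Palestine', 'Kosovo', 'not found', 'France']
print(convert_iso3(codes, select_rows(ROWS, only_UNmember=True)))
# ['not found', 'not found', 'not found', 'France']
print(convert_iso3(codes, select_rows(ROWS, include_obsolete=True)))
# ['Palestine', 'Kosovo', 'Zanzibar', 'France']
```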

Most of the underlying data can be found in Wikipedia. https://en.wikipedia.org/wiki/ISO_3166-1 is a good starting point. UN regions/codes are given on the United Nations Statistics Division (unstats) webpage. For the differences between the ISO numeric and UN (M.49) codes see https://en.wikipedia.org/wiki/UN_M.49. The EXIOBASE, WIOD and Eora classifications were extracted from the respective databases. For Eora, the names are based on the 'Country names' csv file provided on the webpage, but updated for different names used in the Eora26 database. The MESSAGE classification follows the 11-region aggregation given in the MESSAGE model regions description. The membership of OECD, UN and EU can be found on the membership organisations' webpages, and information about obsolete country codes on the Statoids webpage.

Please use the issue tracker for documenting bugs, proposing enhancements and all other communication related to coco.

You can follow me on twitter to get the latest news about all my open-source and research projects (and occasionally some random retweets).

Want to contribute? Great! Please check CONTRIBUTING.rst if you want to help to improve coco.

The package pycountry provides access to the official ISO databases for historic countries, country subdivisions, languages and currencies. In case you need to convert non-English country names, countrynames includes an extensive database of country names in different languages and functions to convert them to the different ISO 3166 standards. Python-iso3166 focuses on conversion between the two-letter, three-letter and three-digit codes defined in the ISO 3166 standard.

If you are using R, you should have a look at countrycode.

Version 0.5 of the country converter was published in the Journal of Open Source Software. To cite the country converter in publications please use:

Stadler, K. (2017). The country converter coco - a Python package for converting country names between different classification schemes. The Journal of Open Source Software. doi: http://dx.doi.org/10.21105/joss.00332

For the full bibtex key see CITATION

This package was inspired by (and the regular expressions are mostly based on) the R-package countrycode by Vincent Arel-Bundock and his (defunct) port to Python (pycountrycode). Many thanks to Robert Gieseke for the review of the source code and paper for the publication in the Journal of Open Source Software.
