Country Converter

The country converter (coco) is a Python package to convert country names into different classifications and between different naming versions. Internally it uses regular expressions to match country names.

Installation

The package is available on PyPI. Use

pip install country_converter --upgrade

from the command line, or use your preferred Python package installer. The source code is available on GitHub: https://github.com/konstantinstadler/country_converter

Conversion

The country converter provides one main class which is used for the conversion:



In [1]:

    
import country_converter as coco



In [2]:

    
converter = coco.CountryConverter()

Given a list of countries in a certain classification:



In [3]:

    
iso3_codes = ['USA', 'VUT', 'TKL', 'AUT', 'AFG', 'ALB']

This list can be converted to any classification the package provides:



In [4]:

    
converter.convert(names = iso3_codes, src = 'ISO3', to = 'name_official')









    Out[4]:





['United States of America',
 'Republic of Vanuatu',
 'Tokelau',
 'Republic of Austria',
 'Islamic Republic of Afghanistan',
 'Republic of Albania']



In [5]:

    
converter.convert(names = iso3_codes, src = 'ISO3', to = 'continent')









    Out[5]:





['America', 'Oceania', 'Oceania', 'Europe', 'Asia', 'Europe']

The parameter "src" specifies the input format and "to" the output format. Possible values for both parameters can be listed with:



In [6]:

    
converter.valid_class









    Out[6]:





['APEC',
 'BASIC',
 'BRIC',
 'CIS',
 'Cecilia2050',
 'EU',
 'EURO',
 'EXIO1',
 'EXIO2',
 'EXIO3',
 'Eora',
 'G20',
 'G7',
 'ISO2',
 'ISO3',
 'ISOnumeric',
 'MESSAGE',
 'OECD',
 'UNcode',
 'UNmember',
 'UNregion',
 'WIOD',
 'continent',
 'name_official',
 'name_short',
 'obsolete',
 'regex']

Internally, these names are the column headers of the underlying pandas dataframe (see below).

The convert function can also be called without instantiating CountryConverter. This is useful for one-off conversions. For repeated conversions, instantiating CountryConverter once avoids re-reading the file with the matching data for each conversion.



In [7]:

    
coco.convert(names = iso3_codes, src = 'ISO3', to = 'ISO2')









    Out[7]:





['US', 'VU', 'TK', 'AT', 'AF', 'AL']

Some of the classifications can be accessed through shortcuts. For example:



In [8]:

    
converter.EU27









    Out[8]:







  
    
      
     name_short
14   Austria
21   Belgium
35   Bulgaria
58   Cyprus
59   Czech Republic
60   Denmark
70   Estonia
76   Finland
77   France
84   Germany
87   Greece
101  Hungary
107  Ireland
110  Italy
122  Latvia
128  Lithuania
129  Luxembourg
137  Malta
156  Netherlands
177  Poland
178  Portugal
182  Romania
196  Slovakia
197  Slovenia
204  Spain
215  Sweden
235  United Kingdom



In [9]:

    
converter.OECDas('ISO2')

Handling missing data

The return value for entries that are not found is set to 'not found' by default:



In [10]:

    
iso3_codes_missing = ['ABC', 'AUT', 'XXX']
converter.convert(iso3_codes_missing, src='ISO3')









    



WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3






    Out[10]:





['not found', 'AUT', 'not found']

but can also be set to something else:



In [11]:

    
converter.convert(iso3_codes_missing, src='ISO3', not_found='missing')









    



WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3






    Out[11]:





['missing', 'AUT', 'missing']

Alternatively, entries that are not found can be passed through unchanged by passing None to not_found:



In [12]:

    
converter.convert(iso3_codes_missing, src='ISO3', not_found=None)









    



WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3






    Out[12]:





['ABC', 'AUT', 'XXX']
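
The three not_found behaviours shown above can be summarised in a small stand-alone sketch. It does not call coco: the helper `lookup` and the three-entry table `ISO3_TO_NAME` are hypothetical names made up for illustration, and the package itself may implement this differently.

```python
# Hypothetical stand-in for the package's lookup table (ISO3 -> short name).
ISO3_TO_NAME = {"AUT": "Austria", "AFG": "Afghanistan", "ALB": "Albania"}


def lookup(codes, table, not_found="not found"):
    """Map each code through table.

    not_found is the placeholder used for unknown codes;
    None passes the unknown code through unchanged.
    """
    result = []
    for code in codes:
        if code in table:
            result.append(table[code])
        elif not_found is None:
            result.append(code)  # pass the input through untouched
        else:
            result.append(not_found)
    return result


codes = ["ABC", "AUT", "XXX"]
default_result = lookup(codes, ISO3_TO_NAME)
custom_result = lookup(codes, ISO3_TO_NAME, not_found="missing")
passthrough_result = lookup(codes, ISO3_TO_NAME, not_found=None)

print(default_result)      # placeholder 'not found' for unknown codes
print(custom_result)       # custom placeholder
print(passthrough_result)  # unknown codes kept as they are
```

Passing the input through is handy when a list mixes known codes with labels that should survive the conversion untouched.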

To extend the underlying dataset, an additional dataframe (or file) can be passed.



In [13]:

    
import pandas as pd
add_data = pd.DataFrame.from_dict({
       'name_short' : ['xxx country', 'abc country'],
       'name_official' : ['The XXX country', 'The ABC country'],
       'regex' : ['xxx country', 'abc country'], 
       'ISO3': ['xxx', 'abc']}
)



In [14]:

    
add_data









    Out[14]:







  
    
      
   name_short   name_official    regex        ISO3
0  xxx country  The XXX country  xxx country  xxx
1  abc country  The ABC country  abc country  abc



In [15]:

    
extended_converter = coco.CountryConverter(additional_data=add_data)
extended_converter.convert(iso3_codes_missing, src='ISO3', to='name_short')









    Out[15]:





['abc country', 'Austria', 'xxx country']

As an alternative to an ad hoc dataframe, additional data files can be passed. These must have the same format as the basic data set. An example can be found here: https://github.com/konstantinstadler/country_converter/tree/master/tests/custom_data_example.txt

The custom data example contains the ISO3 code mapping for Romania before 2002 and switches the regex matching for Congo between DR Congo and Congo Republic.

To use it, pass the path to the additional country file:



In [16]:

    
# extended_converter = coco.CountryConverter(additional_data='path/to/datafile')

The passed data (file or dataframe) must contain at least the headers 'name_official', 'name_short' and 'regex'. If the additional data is to be used for a conversion to any other field, that field must also be included.

Additionally passed data always overwrites the existing data. This can be used to adjust coco for datasets with wrong country names. For example, assume a dataset erroneously switched the ISO2 codes for India (IN) and Indonesia (ID), and therefore uses 'ID' for India and 'IN' for Indonesia. One can accommodate this by:



In [17]:

    
switched_converter = coco.CountryConverter(additional_data=pd.DataFrame.from_dict({
       'name_short' : ['India', 'Indonesia'],
       'name_official' : ['India', 'Indonesia'],
       'regex' : ['india', 'indonesia'], 
       'ISO2': ['ID', 'IN']}))









    



WARNING:root:Duplicated values in column name_short of merged data - keep last one
WARNING:root:Duplicated values in column regex of merged data - keep last one



In [18]:

    
converter.convert('IN', src='ISO2', to='name_short')









    Out[18]:





'India'



In [19]:

    
switched_converter.convert('ID', src='ISO2', to='name_short')









    Out[19]:





'India'
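
The "keep last one" behaviour reported in the warnings above can be illustrated without coco or pandas. The following is only a sketch of the idea: `merge_keep_last` is a hypothetical helper, the rows are plain dictionaries, and the package's actual merge logic is not shown in this document.

```python
def merge_keep_last(base_rows, extra_rows, key="name_short"):
    """Concatenate two row lists; on a duplicated key the later row wins."""
    merged = {}
    for row in base_rows + extra_rows:
        merged.pop(row[key], None)  # drop the earlier duplicate, if any
        merged[row[key]] = row
    return list(merged.values())


# A tiny excerpt standing in for the built-in data.
base = [
    {"name_short": "India", "ISO2": "IN"},
    {"name_short": "Indonesia", "ISO2": "ID"},
    {"name_short": "Austria", "ISO2": "AT"},
]
# Additional rows mimicking the switched codes of the faulty dataset.
extra = [
    {"name_short": "India", "ISO2": "ID"},
    {"name_short": "Indonesia", "ISO2": "IN"},
]

merged = merge_keep_last(base, extra)
iso2_to_name = {row["ISO2"]: row["name_short"] for row in merged}
print(iso2_to_name["ID"])  # the switched code now resolves to India
```

Because the additional rows come last, they replace the built-in entries with the same short name, while unrelated rows such as Austria are untouched.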

Regular expression matching

The input parameter "src" can be set to "regex" to use regular expression matching for a given country list. For example:



In [20]:

    
some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma', 'Iran (Islamic Republic of)', 'Korea, Republic of', "Dem. People's Rep. of Korea"]



In [21]:

    
coco.convert(names = some_names, src = "regex", to = "name_short")









    Out[21]:





['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']
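
To give an idea of how such regular expression matching can work, here is a minimal sketch using only Python's re module. The five patterns are copied from the regex column shown in the Internals section; the helper `to_short_name` is hypothetical, and case-insensitive searching is an assumption made for this sketch, not a statement about the package's implementation.

```python
import re

# name_short -> regex, copied from the first rows of the data shown under
# "Internals". Everything else in this block is illustrative.
PATTERNS = {
    "Afghanistan": r"afghan",
    "Aland Islands": r"\b(a|å)land",
    "Albania": r"albania",
    "Algeria": r"algeria",
    "American Samoa": r"^(?=.*americ).*samoa",
}

# Assumption: patterns are applied case-insensitively.
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in PATTERNS.items()}


def to_short_name(raw_name, not_found="not found"):
    """Return the short name whose pattern matches raw_name.

    Returns not_found if nothing matches and a list if several match.
    """
    hits = [short for short, rx in COMPILED.items() if rx.search(raw_name)]
    if len(hits) == 1:
        return hits[0]
    if not hits:
        return not_found
    return hits


names = ["Islamic Rep. of Afghanistan", "Aland Is.", "Samoa, American", "Atlantis"]
converted = [to_short_name(n) for n in names]
print(converted)
```

The lookahead in the American Samoa pattern requires "americ" anywhere in the string, so the word order of the input does not matter.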

The regular expressions can also be used to match any list of countries to any other. For example:



In [22]:

    
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China' ]

coco.match(match_these, master_list)









    Out[22]:





{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China'}

If the regular expression matches several times, all results are given as a list and a warning is generated:



In [23]:

    
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Taiwan, province of china', 'Republic of China' ]

coco.match(match_these, master_list)









    



WARNING:root:Multiple matches for name taiwan in list_b






    Out[23]:





{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': ['Taiwan, province of china', 'Republic of China']}

The parameter "enforce_sublist" can be set to ensure consistent output:



In [24]:

    
coco.match(match_these, master_list, enforce_sublist = True)









    



WARNING:root:Multiple matches for name taiwan in list_b






    Out[24]:





{'norway': ['Norway is a Kingdom too'],
 'united_states': ['USA'],
 'china': ['Peoples Republic of China'],
 'taiwan': ['Taiwan, province of china', 'Republic of China']}
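
The benefit of a consistent output becomes clear when the result is processed further. Without enforce_sublist, code consuming the dictionary has to handle both strings and lists. The sketch below shows that normalisation done by hand on the mixed result from above; `as_lists` is a hypothetical helper, not part of coco.

```python
def as_lists(matches):
    """Return a new dict in which every value is a list."""
    return {
        key: list(value) if isinstance(value, list) else [value]
        for key, value in matches.items()
    }


# Result copied from the output obtained without enforce_sublist.
mixed = {
    "norway": "Norway is a Kingdom too",
    "united_states": "USA",
    "china": "Peoples Republic of China",
    "taiwan": ["Taiwan, province of china", "Republic of China"],
}

uniform = as_lists(mixed)
for name, hits in uniform.items():
    # Every value can now be treated the same way.
    print(name, len(hits))
```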

A warning also occurs if one of the names couldn't be found:



In [25]:

    
match_these = ['norway', 'united_states', 'china', 'taiwan', 'some other country']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China',  'Republic of China' ]
coco.match(match_these, master_list)









    



WARNING:root:Could not identify some other country in list_a






    Out[25]:





{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'not_found'}

The value for countries that are not found can be specified:



In [26]:

    
coco.match(match_these, master_list, not_found = 'its not there')









    



WARNING:root:Could not identify some other country in list_a






    Out[26]:





{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'its not there'}

Passing None instead passes the country that was not found through to the new classification unchanged:



In [27]:

    
coco.match(match_these, master_list, not_found = None)









    



WARNING:root:Could not identify some other country in list_a






    Out[27]:





{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'some other country'}
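
The "WARNING:root:" prefix in the outputs above is the default format of Python's logging module for the root logger. Assuming the warnings are indeed emitted that way (other versions of the package may use a differently named logger), they can be captured or silenced with standard logging tools. In this sketch, `noisy_lookup` and `ListHandler` are hypothetical stand-ins, and no coco function is called.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def noisy_lookup(name):
    # Hypothetical stand-in for a call that warns via the root logger.
    logging.warning("Could not identify %s in list_a", name)
    return "not_found"


root = logging.getLogger()
previous_level = root.level
root.setLevel(logging.WARNING)

collector = ListHandler()
root.addHandler(collector)

# 1) Capture the warning so it can be inspected programmatically.
noisy_lookup("some other country")

# 2) Raise the threshold to silence warnings for a noisy batch run.
root.setLevel(logging.ERROR)
noisy_lookup("another country")  # dropped: below the ERROR threshold

# Restore the previous logging configuration.
root.setLevel(previous_level)
root.removeHandler(collector)
print(collector.messages)
```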

Internals

Within a converter instance, the raw data for the conversion is stored in a pandas dataframe. This dataframe can be accessed directly with:



In [28]:

    
converter.data.head()









    Out[28]:







  
    
      
   APEC  BASIC  BRIC  CIS  Cecilia2050  EU   EURO  EXIO1  EXIO2  EXIO3  ...  OECD  UNcode  UNmember  UNregion         WIOD  continent  name_official                            name_short      obsolete  regex
0  NaN   NaN    NaN   NaN  RoW          NaN  NaN   WW     WA     WA     ...  NaN   4.0     1946.0    Southern Asia    RoW   Asia       Islamic Republic of Afghanistan          Afghanistan     NaN       afghan
1  NaN   NaN    NaN   NaN  RoW          NaN  NaN   WW     WE     WE     ...  NaN   248.0   NaN       Northern Europe  RoW   Europe     Åland Islands                            Aland Islands   NaN       \b(a|å)land
2  NaN   NaN    NaN   NaN  RoW          NaN  NaN   WW     WE     WE     ...  NaN   8.0     1955.0    Southern Europe  RoW   Europe     Republic of Albania                      Albania         NaN       albania
3  NaN   NaN    NaN   NaN  RoW          NaN  NaN   WW     WF     WF     ...  NaN   12.0    1962.0    Northern Africa  RoW   Africa     People's Democratic Republic of Algeria  Algeria         NaN       algeria
4  NaN   NaN    NaN   NaN  RoW          NaN  NaN   WW     WA     WA     ...  NaN   16.0    NaN       Polynesia        RoW   Oceania    American Samoa                           American Samoa  NaN       ^(?=.*americ).*samoa

5 rows × 27 columns

This dataframe can be extended in both directions. The only requirement is to provide unique values for name_short, name_official and regex.

Internally, the data is saved in country_data.txt as tab-separated values (utf-8 encoded).

All pandas indexing and matching methods can be used on this dataframe. For example, to find the countries in a list that joined the OECD in 1995 or later:



In [29]:

    
some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Romania', 'Russia',  'Turkey', 'United Kingdom', 'United States']
converter.data[(converter.data.OECD >= 1995) & converter.data.name_short.isin(some_countries)].name_short









    Out[29]:





59     Czech Republic
70            Estonia
101           Hungary
122            Latvia
Name: name_short, dtype: object

Further information can be found here: http://pandas.pydata.org/pandas-docs/stable/

Testing

All regular expressions of the country converter are tested for a unique match to name_short and name_official. Test sets for alternative names found in various databases are also available.

The test sets are stored in the tests subfolder. The tests require py.test. I recommend rerunning the tests whenever a regular expression is changed.

To specify a new test set, add a tab-separated file with the headers "name_short" and "name_test". Each row provides a name (corresponding to the short name in the main classification file) and the alternative name which should be tested, one pair per row. If the file name starts with "test_regex_", it is automatically recognised by the test functions.
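
As an illustration of this file format, the following sketch writes a small test set and checks each alternative name against a pattern. The rows, the file name suffix and the ".txt" extension are assumptions made for this example; the two patterns are copied from the regex column shown in the Internals section. The check only imitates the idea of the test and is not the package's test code.

```python
import csv
import re
import tempfile
from pathlib import Path

# Hypothetical test pairs: short name plus an alternative spelling.
rows = [
    {"name_short": "Afghanistan", "name_test": "Islamic Rep. of Afghanistan"},
    {"name_short": "Albania", "name_test": "Rep. of Albania"},
]
# Patterns copied from the regex column of the main data.
patterns = {"Afghanistan": "afghan", "Albania": "albania"}

with tempfile.TemporaryDirectory() as folder:
    # The "test_regex_" prefix follows the naming rule described above.
    path = Path(folder) / "test_regex_example.txt"

    # Write the tab-separated file with the two required headers.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["name_short", "name_test"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)

    # Read it back the way a test function might.
    with path.open(encoding="utf-8", newline="") as handle:
        loaded = list(csv.DictReader(handle, delimiter="\t"))

# Collect every pair whose alternative name is not matched by the pattern.
failures = [
    row
    for row in loaded
    if not re.search(patterns[row["name_short"]], row["name_test"], re.IGNORECASE)
]
print(failures)
```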

Please see the file CONTRIBUTING.rst for further information.

Konstantin Stadler



Output of cell In [9], converter.OECDas('ISO2'):

	ISO2
13	AU
14	AT
21	BE
41	CA
45	CL
59	CZ
60	DK
70	EE
76	FI
77	FR
84	DE
87	GR
101	HU
102	IS
107	IE
109	IL
110	IT
112	JP
122	LV
129	LU
143	MX
156	NL
158	NZ
166	NO
177	PL
178	PT
196	SK
197	SI
202	KR
204	ES
215	SE
216	CH
228	TR
235	GB
236	US
