JSON examples and exercise


  • get familiar with packages for dealing with JSON
  • study examples with JSON strings and files
  • work on exercise to be completed and submitted



In [150]:
import pandas as pd
import numpy as np

imports for Python, Pandas


In [2]:
import json
from pandas.io.json import json_normalize

JSON example, with string


In [3]:
# define json string
data = [{'state': 'Florida', 
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [4]:
# use normalization to create tables from nested element
json_normalize(data, 'counties')


Out[4]:
name population
0 Dade 12345
1 Broward 40000
2 Palm Beach 60000
3 Summit 1234
4 Cuyahoga 1337

In [5]:
# further populate tables created from nested element
json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])


Out[5]:
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH

JSON example, with file

  • demonstrates reading in a json file as a string and as a table
  • uses small sample file containing data about projects funded by the World Bank
  • data source: http://jsonstudio.com/resources/

In [6]:
# load json as string
json.load((open('data/world_bank_projects_less.json')))


Out[6]:
[{u'_id': {u'$oid': u'52b213b38594d8a2be17c780'},
  u'approvalfy': 1999,
  u'board_approval_month': u'November',
  u'boardapprovaldate': u'2013-11-12T00:00:00Z',
  u'borrower': u'FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA',
  u'closingdate': u'2018-07-07T00:00:00Z',
  u'country_namecode': u'Federal Democratic Republic of Ethiopia!$!ET',
  u'countrycode': u'ET',
  u'countryname': u'Federal Democratic Republic of Ethiopia',
  u'countryshortname': u'Ethiopia',
  u'docty': u'Project Information Document,Indigenous Peoples Plan,Project Information Document',
  u'envassesmentcategorycode': u'C',
  u'grantamt': 0,
  u'ibrdcommamt': 0,
  u'id': u'P129828',
  u'idacommamt': 130000000,
  u'impagency': u'MINISTRY OF EDUCATION',
  u'lendinginstr': u'Investment Project Financing',
  u'lendinginstrtype': u'IN',
  u'lendprojectcost': 550000000,
  u'majorsector_percent': [{u'Name': u'Education', u'Percent': 46},
   {u'Name': u'Education', u'Percent': 26},
   {u'Name': u'Public Administration, Law, and Justice', u'Percent': 16},
   {u'Name': u'Education', u'Percent': 12}],
  u'mjsector_namecode': [{u'code': u'EX', u'name': u'Education'},
   {u'code': u'EX', u'name': u'Education'},
   {u'code': u'BX', u'name': u'Public Administration, Law, and Justice'},
   {u'code': u'EX', u'name': u'Education'}],
  u'mjtheme': [u'Human development'],
  u'mjtheme_namecode': [{u'code': u'8', u'name': u'Human development'},
   {u'code': u'11', u'name': u''}],
  u'mjthemecode': u'8,11',
  u'prodline': u'PE',
  u'prodlinetext': u'IBRD/IDA',
  u'productlinetype': u'L',
  u'project_abstract': {u'cdata': u'The development objective of the Second Phase of General Education Quality Improvement Project for Ethiopia is to improve learning conditions in primary and secondary schools and strengthen institutions at different levels of educational administration. The project has six components. The first component is curriculum, textbooks, assessment, examinations, and inspection. This component will support improvement of learning conditions in grades KG-12 by providing increased access to teaching and learning materials and through improvements to the curriculum by assessing the strengths and weaknesses of the current curriculum. This component has following four sub-components: (i) curriculum reform and implementation; (ii) teaching and learning materials; (iii) assessment and examinations; and (iv) inspection. The second component is teacher development program (TDP). This component will support improvements in learning conditions in both primary and secondary schools by advancing the quality of teaching in general education through: (a) enhancing the training of pre-service teachers in teacher education institutions; and (b) improving the quality of in-service teacher training. This component has following three sub-components: (i) pre-service teacher training; (ii) in-service teacher training; and (iii) licensing and relicensing of teachers and school leaders. The third component is school improvement plan. This component will support the strengthening of school planning in order to improve learning outcomes, and to partly fund the school improvement plans through school grants. It has following two sub-components: (i) school improvement plan; and (ii) school grants. The fourth component is management and capacity building, including education management information systems (EMIS). This component will support management and capacity building aspect of the project. This component has following three sub-components: (i) capacity building for education planning and management; (ii) capacity building for school planning and management; and (iii) EMIS. The fifth component is improving the quality of learning and teaching in secondary schools and universities through the use of information and communications technology (ICT). It has following five sub-components: (i) national policy and institution for ICT in general education; (ii) national ICT infrastructure improvement plan for general education; (iii) develop an integrated monitoring, evaluation, and learning system specifically for the ICT component; (iv) teacher professional development in the use of ICT; and (v) provision of limited number of e-Braille display readers with the possibility to scale up to all secondary education schools based on the successful implementation and usage of the readers. The sixth component is program coordination, monitoring and evaluation, and communication. It will support institutional strengthening by developing capacities in all aspects of program coordination, monitoring and evaluation; a new sub-component on communications will support information sharing for better management and accountability. It has following three sub-components: (i) program coordination; (ii) monitoring and evaluation (M and E); and (iii) communication.'},
  u'project_name': u'Ethiopia General Education Quality Improvement Project II',
  u'projectdocs': [{u'DocDate': u'28-AUG-2013',
    u'DocType': u'PID',
    u'DocTypeDesc': u'Project Information Document (PID),  Vol.',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=090224b081e545fb_1_0',
    u'EntityID': u'090224b081e545fb_1_0'},
   {u'DocDate': u'01-JUL-2013',
    u'DocType': u'IP',
    u'DocTypeDesc': u'Indigenous Peoples Plan (IP),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000442464_20130920111729',
    u'EntityID': u'000442464_20130920111729'},
   {u'DocDate': u'22-NOV-2012',
    u'DocType': u'PID',
    u'DocTypeDesc': u'Project Information Document (PID),  Vol.',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=090224b0817b19e2_1_0',
    u'EntityID': u'090224b0817b19e2_1_0'}],
  u'projectfinancialtype': u'IDA',
  u'projectstatusdisplay': u'Active',
  u'regionname': u'Africa',
  u'sector': [{u'Name': u'Primary education'},
   {u'Name': u'Secondary education'},
   {u'Name': u'Public administration- Other social services'},
   {u'Name': u'Tertiary education'}],
  u'sector1': {u'Name': u'Primary education', u'Percent': 46},
  u'sector2': {u'Name': u'Secondary education', u'Percent': 26},
  u'sector3': {u'Name': u'Public administration- Other social services',
   u'Percent': 16},
  u'sector4': {u'Name': u'Tertiary education', u'Percent': 12},
  u'sector_namecode': [{u'code': u'EP', u'name': u'Primary education'},
   {u'code': u'ES', u'name': u'Secondary education'},
   {u'code': u'BS', u'name': u'Public administration- Other social services'},
   {u'code': u'ET', u'name': u'Tertiary education'}],
  u'sectorcode': u'ET,BS,ES,EP',
  u'source': u'IBRD',
  u'status': u'Active',
  u'supplementprojectflg': u'N',
  u'theme1': {u'Name': u'Education for all', u'Percent': 100},
  u'theme_namecode': [{u'code': u'65', u'name': u'Education for all'}],
  u'themecode': u'65',
  u'totalamt': 130000000,
  u'totalcommamt': 130000000,
  u'url': u'http://www.worldbank.org/projects/P129828/ethiopia-general-education-quality-improvement-project-ii?lang=en'},
 {u'_id': {u'$oid': u'52b213b38594d8a2be17c781'},
  u'approvalfy': 2015,
  u'board_approval_month': u'November',
  u'boardapprovaldate': u'2013-11-04T00:00:00Z',
  u'borrower': u'GOVERNMENT OF TUNISIA',
  u'country_namecode': u'Republic of Tunisia!$!TN',
  u'countrycode': u'TN',
  u'countryname': u'Republic of Tunisia',
  u'countryshortname': u'Tunisia',
  u'docty': u'Project Information Document,Integrated Safeguards Data Sheet,Integrated Safeguards Data Sheet,Project Information Document,Integrated Safeguards Data Sheet,Project Information Document',
  u'envassesmentcategorycode': u'C',
  u'grantamt': 4700000,
  u'ibrdcommamt': 0,
  u'id': u'P144674',
  u'idacommamt': 0,
  u'impagency': u'MINISTRY OF FINANCE',
  u'lendinginstr': u'Specific Investment Loan',
  u'lendinginstrtype': u'IN',
  u'lendprojectcost': 5700000,
  u'majorsector_percent': [{u'Name': u'Public Administration, Law, and Justice',
    u'Percent': 70},
   {u'Name': u'Public Administration, Law, and Justice', u'Percent': 30}],
  u'mjsector_namecode': [{u'code': u'BX',
    u'name': u'Public Administration, Law, and Justice'},
   {u'code': u'BX', u'name': u'Public Administration, Law, and Justice'}],
  u'mjtheme': [u'Economic management',
   u'Social protection and risk management'],
  u'mjtheme_namecode': [{u'code': u'1', u'name': u'Economic management'},
   {u'code': u'6', u'name': u'Social protection and risk management'}],
  u'mjthemecode': u'1,6',
  u'prodline': u'RE',
  u'prodlinetext': u'Recipient Executed Activities',
  u'productlinetype': u'L',
  u'project_name': u'TN: DTF Social Protection Reforms Support',
  u'projectdocs': [{u'DocDate': u'29-MAR-2013',
    u'DocType': u'PID',
    u'DocTypeDesc': u'Project Information Document (PID),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000333037_20131024115616',
    u'EntityID': u'000333037_20131024115616'},
   {u'DocDate': u'29-MAR-2013',
    u'DocType': u'ISDS',
    u'DocTypeDesc': u'Integrated Safeguards Data Sheet (ISDS),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000356161_20131024151611',
    u'EntityID': u'000356161_20131024151611'},
   {u'DocDate': u'29-MAR-2013',
    u'DocType': u'ISDS',
    u'DocTypeDesc': u'Integrated Safeguards Data Sheet (ISDS),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000442464_20131031112136',
    u'EntityID': u'000442464_20131031112136'},
   {u'DocDate': u'29-MAR-2013',
    u'DocType': u'PID',
    u'DocTypeDesc': u'Project Information Document (PID),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000333037_20131031105716',
    u'EntityID': u'000333037_20131031105716'},
   {u'DocDate': u'16-JAN-2013',
    u'DocType': u'ISDS',
    u'DocTypeDesc': u'Integrated Safeguards Data Sheet (ISDS),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000356161_20130305113209',
    u'EntityID': u'000356161_20130305113209'},
   {u'DocDate': u'16-JAN-2013',
    u'DocType': u'PID',
    u'DocTypeDesc': u'Project Information Document (PID),  Vol.1 of 1',
    u'DocURL': u'http://www-wds.worldbank.org/servlet/WDSServlet?pcont=details&eid=000356161_20130305113716',
    u'EntityID': u'000356161_20130305113716'}],
  u'projectfinancialtype': u'OTHER',
  u'projectstatusdisplay': u'Active',
  u'regionname': u'Middle East and North Africa',
  u'sector': [{u'Name': u'Public administration- Other social services'},
   {u'Name': u'General public administration sector'}],
  u'sector1': {u'Name': u'Public administration- Other social services',
   u'Percent': 70},
  u'sector2': {u'Name': u'General public administration sector',
   u'Percent': 30},
  u'sector_namecode': [{u'code': u'BS',
    u'name': u'Public administration- Other social services'},
   {u'code': u'BZ', u'name': u'General public administration sector'}],
  u'sectorcode': u'BZ,BS',
  u'source': u'IBRD',
  u'status': u'Active',
  u'supplementprojectflg': u'N',
  u'theme1': {u'Name': u'Other economic management', u'Percent': 30},
  u'theme_namecode': [{u'code': u'24', u'name': u'Other economic management'},
   {u'code': u'54', u'name': u'Social safety nets'}],
  u'themecode': u'54,24',
  u'totalamt': 0,
  u'totalcommamt': 4700000,
  u'url': u'http://www.worldbank.org/projects/P144674?lang=en'}]

In [27]:
sample_json_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null int64
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 non-null object
lendprojectcost             500 non-null int64
majorsector_percent         500 non-null object
mjsector_namecode           500 non-null object
mjtheme                     491 non-null object
mjtheme_namecode            500 non-null object
mjthemecode                 500 non-null object
prodline                    500 non-null object
prodlinetext                500 non-null object
productlinetype             500 non-null object
project_abstract            362 non-null object
project_name                500 non-null object
projectdocs                 446 non-null object
projectfinancialtype        500 non-null object
projectstatusdisplay        500 non-null object
regionname                  500 non-null object
sector                      500 non-null object
sector1                     500 non-null object
sector2                     380 non-null object
sector3                     265 non-null object
sector4                     174 non-null object
sector_namecode             500 non-null object
sectorcode                  500 non-null object
source                      500 non-null object
status                      500 non-null object
supplementprojectflg        498 non-null object
theme1                      500 non-null object
theme_namecode              491 non-null object
themecode                   491 non-null object
totalamt                    500 non-null int64
totalcommamt                500 non-null int64
url                         500 non-null object
dtypes: int64(7), object(43)
memory usage: 199.2+ KB

In [252]:
wb_projects['country_namecode'].str.contains()


Out[252]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21     True
22     True
23     True
24     True
25     True
26     True
27     True
28     True
29     True
       ... 
470    True
471    True
472    True
473    True
474    True
475    True
476    True
477    True
478    True
479    True
480    True
481    True
482    True
483    True
484    True
485    True
486    True
487    True
488    True
489    True
490    True
491    True
492    True
493    True
494    True
495    True
496    True
497    True
498    True
499    True
Name: country_namecode, dtype: bool

In [ ]:


JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,

  1. Find the 10 countries with most projects
  2. Find the top 10 major project themes (using column 'mjtheme_namecode')
  3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [261]:
# load as Pandas dataframe
wb_projects = pd.read_json('data/world_bank_projects.json')

Part 1


In [265]:
topcountries=wb_projects['countrycode'].value_counts()[:10]

In [290]:
topcountries.name='count'

In [286]:
countrydict=wb_projects['country_namecode'].str.split(pat='!\$!',expand=True).set_index(1).groupby(level=0).last()
countrydict.columns=['name']

In [291]:
pd.concat([countrydict,topcountries],axis=1,join='inner').sort('count',ascending=False)


Out[291]:
name count
CN People's Republic of China 19
ID Republic of Indonesia 19
VN Socialist Republic of Vietnam 17
IN Republic of India 16
RY Republic of Yemen 13
BD People's Republic of Bangladesh 12
MA Kingdom of Morocco 12
NP Nepal 12
3A Africa 11
MZ Republic of Mozambique 11

Part 2


In [224]:
codecounts=(wb_projects['mjthemecode'].str.split(pat=',',expand=True)).apply(lambda x: x.value_counts()).sum(axis=1)
codecounts


Out[224]:
1      38
10    216
11    250
2     199
3      15
4     146
5      77
6     168
7     130
8     210
9      50
dtype: int64

In [210]:
c=wb_projects['mjtheme_namecode'].map(lambda x: pd.DataFrame(x).set_index('code').groupby(level=0).last())
#sample_json_df['mjtheme_namecode']=c
#codedict=c[0]
#for i in c[1:20]:
#    codedict=codedict.merge(i,how='outer')
#codedict

In [167]:
codedict=pd.concat(c.values, axis=1, join='outer').transpose().replace(to_replace='',value=np.nan).fillna(method='bfill').iloc[0]

In [231]:
a=pd.DataFrame([codedict,codecounts])
a.columns=a.columns.astype('int64')
a.index=pd.Index([u'name', u'count'], dtype='object')
a=a.transpose()
a.sort(columns='count',ascending=False).iloc[:10]


Out[231]:
name count
11 Environment and natural resources management 250
10 Rural development 216
8 Human development 210
2 Public sector governance 199
6 Social protection and risk management 168
4 Financial and private sector development 146
7 Social dev/gender/inclusion 130
5 Trade and integration 77
9 Urban development 50
1 Economic management 38

Part 3


In [221]:
wb_projects['mjtheme_namecode']=c.map(lambda x: pd.concat([x,codedict],axis=1,join='inner').iloc[:,1])

Scrap


In [191]:
codedict


Out[191]:
1                              Economic management
10                               Rural development
11    Environment and natural resources management
2                         Public sector governance
3                                      Rule of law
4         Financial and private sector development
5                            Trade and integration
6            Social protection and risk management
7                      Social dev/gender/inclusion
8                                Human development
9                                Urban development
Name: name, dtype: object

In [203]:
pd.concat([c[5],codedict],axis=1,join='inner').iloc[:,1]


Out[203]:
6    Social protection and risk management
Name: name, dtype: object

In [98]:
codes.apply(lambda x: x.value_counts()).sum(axis=1).order(ascending=False)


Out[98]:
11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
3      15
dtype: int64

In [166]:
codes = sample_json_df['mjthemecode'].str.split(pat=',',expand=True)
tots=codes[0].value_counts()
codes.columns
for i in codes.columns[1:]:
    tots += codes[i].value_counts()
tots


Out[166]:
11    250
8     210
2     199
6     168
10    216
4     146
7     130
5      77
9      50
1      38
3      15
dtype: int64

In [46]:
codes[0].value_counts()


Out[46]:
11    88
8     72
2     66
10    57
6     57
4     54
7     44
5     25
9     24
1     11
3      2
dtype: int64

In [26]:
sample_json_df['mjtheme_namecode'].map(len).min()


Out[26]:
1

In [84]:
pd.DataFrame(sample_json_df['mjtheme_namecode'][4])


Out[84]:
code name
0 5 Trade and integration
1 4 Financial and private sector development

In [85]:
pd.DataFrame(sample_json_df['mjtheme_namecode'][2])


Out[85]:
code name
0 5 Trade and integration
1 2 Public sector governance
2 11 Environment and natural resources management
3 6 Social protection and risk management

In [83]:
pd.DataFrame(sample_json_df['mjtheme_namecode'][2]).merge(pd.DataFrame(sample_json_df['mjtheme_namecode'][4]),how='outer')


Out[83]:
code name
0 5 Trade and integration
1 2 Public sector governance
2 11 Environment and natural resources management
3 6 Social protection and risk management
4 4 Financial and private sector development

In [141]:
c.values[3].groupby(level=0).last()


Out[141]:
name
code
7 Social dev/gender/inclusion

In [162]:
codecounts.index


Out[162]:
Float64Index([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0], dtype='float64')

In [164]:
codedict.index=codedict.index.astype('float64')

In [182]:



Out[182]:
Index([u'1', u'10', u'11', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9'], dtype='object')

In [112]:
pd.concat((sample_json_df['mjtheme_namecode'].map(lambda x: pd.DataFrame(x).set_index('code'))).values)[11]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-112-2ac8979869c4> in <module>()
----> 1 pd.concat((sample_json_df['mjtheme_namecode'].map(lambda x: pd.DataFrame(x).set_index('code'))).values)[11]

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1802         # get column
   1803         if self.columns.is_unique:
-> 1804             return self._get_item_cache(key)
   1805 
   1806         # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1082         res = cache.get(item)
   1083         if res is None:
-> 1084             values = self._data.get(item)
   1085             res = self._box_item_values(item, values)
   1086             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   2849 
   2850             if not isnull(item):
-> 2851                 loc = self.items.get_loc(item)
   2852             else:
   2853                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key, method)
   1570         """
   1571         if method is None:
-> 1572             return self._engine.get_loc(_values_from_object(key))
   1573 
   1574         indexer = self.get_indexer([key], method=method)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()

KeyError: 11

In [ ]: