JSON examples and exercise


  • get familiar with packages for dealing with JSON
  • study examples with JSON strings and files
  • work on exercise to be completed and submitted



In [3]:

imports for Python, Pandas


In [ ]:

JSON example, with string


In [ ]:


In [ ]:


In [ ]:


JSON example, with file

  • demonstrates reading in a json file as a string and as a table
  • uses small sample file containing data about projects funded by the World Bank
  • data source: http://jsonstudio.com/resources/

In [ ]:


In [ ]:


JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,

  1. Find the 10 countries with most projects
  2. Find the top 10 major project themes (using column 'mjtheme_namecode')
  3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [2]:
# Answer to Exercise 1 (Find the 10 countries with most projects.)

import pandas as pd

import json from pandas.io.json 
import json_normalize

# load as Pandas dataframe
exercise_json_df = pd.read_json('data/world_bank_projects.json')
exercise_json_df

#Get frequency that the country name value occurs, this will signify the amount of projects each country has
country_project_counts = exercise_json_df.countryname.value_counts()

#Display only the top 10 countries with most projects
country_project_counts[:10]


Out[2]:
People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Nepal                              12
Kingdom of Morocco                 12
People's Republic of Bangladesh    12
Republic of Mozambique             11
Africa                             11
Name: countryname, dtype: int64

In [6]:
#Answer to Exercise 2 (Find the top 10 major project themes (using column 'mjtheme_namecode').)

import pandas as pd

import json
from pandas.io.json import json_normalize

# load json as string
wbprojects = json.load((open('data/world_bank_projects.json')))

# normalize 'mjtheme_namecode' JSON data into a flat table
# NOTE that there are blank names associated with the code.
mjtheme_normalized = json_normalize(wbprojects, 'mjtheme_namecode')

# Use both 'code' and 'name' as index but get a frequency count by the code,
# This way we won't run into the blank name issue.
mjtheme_normalized['Counts'] = mjtheme_normalized.groupby('code')['name'].transform('count')

# Now, sort by the frequency count field in a descending manner.
# Display only the first 10 most occuring project themes.
freq_sort = mjtheme_normalized.set_index(['Counts']).sort_index(axis=0,ascending=False)
freq_sort_no_dup = freq_sort.drop_duplicates('code',inplace = False)
freq_sort_no_dup[:10]


Out[6]:
code name
Counts
250 11 Environment and natural resources management
216 10 Rural development
210 8 Human development
199 2 Public sector governance
168 6 Social protection and risk management
146 4 Financial and private sector development
130 7 Social dev/gender/inclusion
77 5 Trade and integration
50 9 Urban development
38 1 Economic management

In [7]:
# Answer to Exercise 3 (In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.)

import pandas as pd
import numpy as np

import json
from pandas.io.json import json_normalize

# load json as string
wbprojects = json.load((open('data/world_bank_projects.json')))

# normalize 'mjtheme_namecode' JSON data into a flat table
# NOTE that there are blank names associated with the code.
mjtheme_normalized = json_normalize(wbprojects, 'mjtheme_namecode')

# We want to create a reference table of all valid names and values in 'mjtheme_namecode'
# In order to replace the missing name values in the original dataframe.
# First replace the empty name(s) with a null qualifier so pandas can recognize it as such.
mjtheme_normalized['name'].replace('', np.nan, inplace = True)

# Drop the rows with null name values and then make a unique set of rows to make our proper reference table.
mjtheme_no_dup = mjtheme_normalized.dropna(subset = ['name']).drop_duplicates('name', inplace = False, keep = 'last')
mjtheme_no_dup

# Now, merge reference table with mjtheme_namecode dataframe to populate the missing blank names.
left = mjtheme_normalized
right = mjtheme_no_dup 
complete_mjtheme = pd.merge(left,right, on = ['code','name'], right_index= 'TRUE')

# Restore the sort
complete_mjtheme.sort_index()


Out[7]:
code name
0 8 Human development
2 1 Economic management
3 6 Social protection and risk management
4 5 Trade and integration
5 2 Public sector governance
6 11 Environment and natural resources management
7 6 Social protection and risk management
8 7 Social dev/gender/inclusion
9 7 Social dev/gender/inclusion
10 5 Trade and integration
11 4 Financial and private sector development
12 6 Social protection and risk management
14 2 Public sector governance
15 4 Financial and private sector development
16 11 Environment and natural resources management
18 10 Rural development
20 2 Public sector governance
21 2 Public sector governance
22 2 Public sector governance
23 10 Rural development
25 10 Rural development
26 6 Social protection and risk management
27 6 Social protection and risk management
28 11 Environment and natural resources management
30 5 Trade and integration
31 11 Environment and natural resources management
32 10 Rural development
33 6 Social protection and risk management
34 6 Social protection and risk management
35 6 Social protection and risk management
... ... ...
1465 11 Environment and natural resources management
1466 8 Human development
1467 2 Public sector governance
1468 2 Public sector governance
1469 8 Human development
1470 9 Urban development
1471 6 Social protection and risk management
1472 6 Social protection and risk management
1473 9 Urban development
1474 2 Public sector governance
1475 2 Public sector governance
1476 10 Rural development
1478 8 Human development
1479 7 Social dev/gender/inclusion
1480 11 Environment and natural resources management
1482 6 Social protection and risk management
1484 4 Financial and private sector development
1485 7 Social dev/gender/inclusion
1486 8 Human development
1487 5 Trade and integration
1488 2 Public sector governance
1489 8 Human development
1490 10 Rural development
1492 10 Rural development
1493 10 Rural development
1494 10 Rural development
1495 9 Urban development
1496 8 Human development
1497 5 Trade and integration
1498 4 Financial and private sector development

1377 rows × 2 columns


In [ ]: