JSON examples and exercise


  • get familiar with packages for dealing with JSON
  • study examples with JSON strings and files
  • work on exercise to be completed and submitted



In [1]:
import pandas as pd

In [2]:
import json
# pandas.io.json.json_normalize has been deprecated since pandas 1.0; import it from the top level
from pandas import json_normalize

In [3]:
from numpy import nan

JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,

  1. Find the 10 countries with most projects
  2. Find the top 10 major project themes (using column 'mjtheme_namecode')
  3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

Part 1


In [4]:
pd.read_json('data/world_bank_projects.json')['countryname'].value_counts()[:10]


Out[4]:
People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Kingdom of Morocco                 12
People's Republic of Bangladesh    12
Nepal                              12
Republic of Mozambique             11
Africa                             11
dtype: int64

Part 2


In [5]:
# load the raw JSON as a list of dicts; the with-block closes the file afterwards
with open('data/world_bank_projects.json') as f:
    wb_projects_str=json.load(f)

In [6]:
themes=json_normalize(wb_projects_str,'mjtheme_namecode')
codecounts=themes['code'].value_counts()
codecounts.name='counts'
codecounts


Out[6]:
11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
3      15
Name: counts, dtype: int64
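
To make the second argument of json_normalize less opaque, here is a minimal sketch on made-up data. The `projects` list and its values are hypothetical and only imitate the shape of the real file (each project holding a list of code/name dicts under 'mjtheme_namecode'). It assumes pandas 1.0 or later, where the function is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical records shaped like the World Bank file: each project
# carries a list of {code, name} dicts under 'mjtheme_namecode'.
projects = [
    {"countryname": "A", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""},
    ]},
    {"countryname": "B", "mjtheme_namecode": [
        {"code": "11", "name": "Environment and natural resources management"},
    ]},
]

# record_path names the nested list to unpack: one output row per inner dict.
toy_themes = pd.json_normalize(projects, record_path="mjtheme_namecode")

# Counting on 'code' rather than 'name' means blank names do not split a theme.
toy_counts = toy_themes["code"].value_counts()

print(toy_themes)
print(toy_counts)
```

Counting the codes instead of the names is what keeps entries with a blank name from being tallied as a separate theme.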

Build a lookup table that tells us what each code means


In [7]:
codedict=themes.replace('',nan).dropna().groupby('code').last()
codedict


Out[7]:
name
code
1 Economic management
10 Rural development
11 Environment and natural resources management
2 Public sector governance
3 Rule of law
4 Financial and private sector development
5 Trade and integration
6 Social protection and risk management
7 Social dev/gender/inclusion
8 Human development
9 Urban development

Answer:

Combine the two


In [8]:
# DataFrame.sort was removed from pandas; sort_values is its replacement
pd.DataFrame([codedict['name'],codecounts]).transpose().sort_values('counts',ascending=False)


Out[8]:
name counts
11 Environment and natural resources management 250
10 Rural development 216
8 Human development 210
2 Public sector governance 199
6 Social protection and risk management 168
4 Financial and private sector development 146
7 Social dev/gender/inclusion 130
5 Trade and integration 77
9 Urban development 50
1 Economic management 38
3 Rule of law 15
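
As a possible alternative for the combining step, the sketch below joins a names Series and a counts Series with pd.concat. The two small Series are hypothetical stand-ins for codedict['name'] and codecounts, with a few values copied from the outputs above. This is one option, not necessarily better than the transpose approach.

```python
import pandas as pd

# Hypothetical miniature versions of codedict['name'] and codecounts.
names = pd.Series(
    {"1": "Economic management",
     "11": "Environment and natural resources management",
     "3": "Rule of law"},
    name="name",
)
counts = pd.Series({"11": 250, "1": 38, "3": 15}, name="counts")

# concat with axis=1 aligns the two Series on their shared index (the code)
# and uses each Series' name as the column label.
summary = pd.concat([names, counts], axis=1).sort_values("counts", ascending=False)

print(summary)
```

Because the Series are aligned on the code index, the order in which the codes appear in each input does not matter.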

Part 3

We copy the code column, then replace every code with its corresponding name. This disturbs my sensibilities, since it rewrites every row rather than only the rows whose name is missing, but it's faster than anything else I came up with.


In [9]:
%%time
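# start from a copy of the codes, then swap each code for its name
themes['name']=themes['code']
# assign the result back instead of using inplace=True on a selected column,
# and pass the names as a 1-D array (codedict.values is 2-D)
themes['name']=themes['name'].replace(to_replace=codedict.index.values,value=codedict['name'].values)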


CPU times: user 4.82 ms, sys: 32 µs, total: 4.85 ms
Wall time: 4.49 ms
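
For comparison, here is a minimal sketch of the same fill done with Series.map and a plain dict. The `toy` frame is hypothetical, and I have not timed this against the replace call on the real data, so treat it as an alternative to try and not as a known improvement.

```python
import pandas as pd

# Hypothetical miniature of the themes frame, with some names left blank.
toy = pd.DataFrame({
    "code": ["8", "11", "8", "11"],
    "name": ["Human development", "",
             "", "Environment and natural resources management"],
})

# Build a code -> name dict from the rows that do have a name.
named = toy[toy["name"] != ""]
code_to_name = dict(zip(named["code"], named["name"]))

# map looks each code up in the dict, one lookup per row.
toy["name"] = toy["code"].map(code_to_name)

print(toy)
```

A code that never appears with a name would come out of map as NaN, which makes any gap in the lookup easy to spot.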

In [10]:
themes


Out[10]:
code name
0 8 Human development
1 11 Environment and natural resources management
2 1 Economic management
3 6 Social protection and risk management
4 5 Trade and integration
5 2 Public sector governance
6 11 Environment and natural resources management
7 6 Social protection and risk management
8 7 Social dev/gender/inclusion
9 7 Social dev/gender/inclusion
10 5 Trade and integration
11 4 Financial and private sector development
12 6 Social protection and risk management
13 6 Social protection and risk management
14 2 Public sector governance
15 4 Financial and private sector development
16 11 Environment and natural resources management
17 8 Human development
18 10 Rural development
19 7 Social dev/gender/inclusion
20 2 Public sector governance
21 2 Public sector governance
22 2 Public sector governance
23 10 Rural development
24 2 Public sector governance
25 10 Rural development
26 6 Social protection and risk management
27 6 Social protection and risk management
28 11 Environment and natural resources management
29 4 Financial and private sector development
... ... ...
1469 8 Human development
1470 9 Urban development
1471 6 Social protection and risk management
1472 6 Social protection and risk management
1473 9 Urban development
1474 2 Public sector governance
1475 2 Public sector governance
1476 10 Rural development
1477 11 Environment and natural resources management
1478 8 Human development
1479 7 Social dev/gender/inclusion
1480 11 Environment and natural resources management
1481 5 Trade and integration
1482 6 Social protection and risk management
1483 8 Human development
1484 4 Financial and private sector development
1485 7 Social dev/gender/inclusion
1486 8 Human development
1487 5 Trade and integration
1488 2 Public sector governance
1489 8 Human development
1490 10 Rural development
1491 6 Social protection and risk management
1492 10 Rural development
1493 10 Rural development
1494 10 Rural development
1495 9 Urban development
1496 8 Human development
1497 5 Trade and integration
1498 4 Financial and private sector development

1499 rows × 2 columns