In [204]:
import pandas as pd
In [2]:
import json
from pandas.io.json import json_normalize
Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
In [227]:
df = pd.read_json('data/world_bank_projects.json')
df.head(3)
Out[227]:
In [228]:
df.countryshortname.value_counts().sort_values(ascending=False).head(10)
Out[228]:
Thus we have the ten countries with the most projects.
In [229]:
# load json as string
json_string = json.load((open('data/world_bank_projects.json')))
In [230]:
unstacked_mjthemes = json_normalize(json_string, 'mjtheme_namecode','_id')
In [231]:
unstacked_mjthemes.head(15)
Out[231]:
In [232]:
unstacked_mjthemes.code.value_counts().head(10)
Out[232]:
Now we have the top 10 project themes (out of 11 total). However these give us only the code, so unless you have the corresponding string names memorized, it's not too useful. Using $value\_counts()$ on the name column of our new dataframe won't quite do, since there will be missing entries counted as blanks. So let's fill these in using the map method.
In [233]:
themes = unstacked_mjthemes[unstacked_mjthemes.name != '']
paired_themes = themes.name.groupby(themes.code).max()
paired_themes = paired_themes.to_dict()
paired_themes
Out[233]:
In [234]:
unstacked_mjthemes['name'] = unstacked_mjthemes.code.map(paired_themes)
unstacked_mjthemes.head(15)
Out[234]:
In [235]:
any(unstacked_mjthemes.name == '')
Out[235]:
So we see we have used the dict of paired codes and names to them map the blank entries correctly, and that there are no empty strings left in the name column. We can thus now use value counts to correctly see the ten top themes by name and not just code.
In [236]:
unstacked_mjthemes.name.value_counts().head(10)
Out[236]:
Reindex our new dataframe to use the project id as index. Could then use this to reinsert corrected major themes into large dataframe if required.
In [257]:
unstacked = unstacked_mjthemes[['_id','code','name']]
unstacked.set_index('_id')
Out[257]: