This Jupyter notebook shows the Python library jsonstat.py in action.
It shows how to explore datasets downloaded from a data provider. This notebook uses some datasets from Eurostat. Eurostat provides a REST API to download its datasets. You can find details about the API here
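A request to this REST API is an ordinary URL with a query string, so it can be assembled programmatically. The following is a minimal sketch, assuming Python 3; the helper name `build_query_url` is hypothetical, while the base URL, the dataset code and the parameter names are copied from the requests used later in this notebook.

```python
from urllib.parse import urlencode

# Base endpoint copied from the request URLs used later in this notebook.
BASE_URL = "http://ec.europa.eu/eurostat/wdds/rest/data/v1.1/json/en/"


def build_query_url(dataset_code, params):
    """Return the REST URL for a dataset code and a list of (name, value) filters.

    Hypothetical helper, not part of jsonstat.py.
    """
    # A list of pairs keeps the parameter order and lets a name such as
    # 'geo' appear more than once.
    return BASE_URL + dataset_code + "?" + urlencode(params)


url_it = build_query_url(
    "nama_gdp_c",
    [("precision", 1), ("geo", "IT"), ("unit", "EUR_HAB"), ("indic_na", "B1GM")],
)
print(url_it)
```

Passing `("geo", "IT")` and `("geo", "FR")` as two separate pairs yields a repeated `geo` parameter, which is the form the second request in this notebook uses.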
You can use a query builder to discover the REST API parameters. The following image shows the query builder:
In [1]:
# all imports here
from __future__ import print_function
import os
import pandas as pd
import jsonstat
import matplotlib.pyplot as plt  # 'plt' is the conventional alias for pyplot
%matplotlib inline
The following cell downloads a dataset from Eurostat. If the file has already been downloaded, the copy present on disk is used instead. Caching the file avoids downloading the dataset every time the notebook runs, which speeds up development and gives consistent results. You can see the raw data here
In [2]:
url_1 = 'http://ec.europa.eu/eurostat/wdds/rest/data/v1.1/json/en/nama_gdp_c?precision=1&geo=IT&unit=EUR_HAB&indic_na=B1GM'
file_name_1 = "eurostat-name_gpd_c-geo_IT.json"
file_path_1 = os.path.abspath(os.path.join("..", "tests", "fixtures", "www.ec.europa.eu_eurostat", file_name_1))
if os.path.exists(file_path_1):
    # reuse the copy cached on disk
    print("using already downloaded file {}".format(file_path_1))
else:
    # no cached copy: download into the current directory
    print("download file")
    jsonstat.download(url_1, file_name_1)
    file_path_1 = file_name_1
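The same "download only if missing" idea can be written as a small reusable function. This is a sketch using only the standard library, not the mechanism jsonstat.py uses internally; the names `cached_path` and `fake_fetch` are hypothetical, and the fetcher is faked so the example runs without network access.

```python
import os
import tempfile


def cached_path(file_name, fetch, cache_dir="."):
    """Return the path of file_name in cache_dir, calling fetch() only if it is missing."""
    path = os.path.join(cache_dir, file_name)
    if not os.path.exists(path):
        data = fetch()  # expected to return the file content as bytes
        with open(path, "wb") as f:
            f.write(data)
    return path


# Demo with a fake fetcher (made-up content, no real download).
calls = []


def fake_fetch():
    calls.append(1)
    return b'{"demo": true}'


with tempfile.TemporaryDirectory() as tmp:
    first = cached_path("demo.json", fake_fetch, cache_dir=tmp)
    second = cached_path("demo.json", fake_fetch, cache_dir=tmp)
    same_path = first == second

print(len(calls))  # the fetcher ran once; the second call reused the file
```

For a real download, `fetch` could be something like `lambda: urllib.request.urlopen(url_1).read()`.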
Initialize JsonStatCollection with eurostat data and print some info about the collection.
In [3]:
collection_1 = jsonstat.from_file(file_path_1)
collection_1
Out[3]:
The previous collection contains only one dataset, named 'nama_gdp_c'.
In [4]:
nama_gdp_c_1 = collection_1.dataset('nama_gdp_c')
nama_gdp_c_1
Out[4]:
All dimensions of the dataset 'nama_gdp_c' have size 1, with the exception of the time dimension. Let's explore the time dimension.
In [5]:
nama_gdp_c_1.dimension('time')
Out[5]:
Get the value for the year 2012.
In [6]:
nama_gdp_c_1.value(time='2012')
Out[6]:
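To see what a lookup by category involves, recall that the JSON-stat format stores all values in one flat array, ordered row-major over the dimensions. The sketch below illustrates that layout on a made-up toy dataset; it is not the jsonstat.py implementation, and `flat_index`, `lookup` and the numbers are assumptions chosen for illustration.

```python
def flat_index(sizes, positions):
    """Row-major offset of a cell, given dimension sizes and 0-based positions."""
    index = 0
    for size, pos in zip(sizes, positions):
        if not 0 <= pos < size:
            raise IndexError("position out of range")
        index = index * size + pos
    return index


# Made-up toy dataset: 2 geo categories x 3 time categories.
dim_ids = ["geo", "time"]
sizes = [2, 3]
categories = {"geo": ["FR", "IT"], "time": ["2010", "2011", "2012"]}
values = [10, 11, 12, 20, 21, 22]  # illustrative numbers, not Eurostat data


def lookup(**selection):
    """Return the value for one category per dimension, e.g. geo='IT', time='2012'."""
    positions = [categories[d].index(selection[d]) for d in dim_ids]
    return values[flat_index(sizes, positions)]


print(lookup(geo="IT", time="2012"))  # prints 22
```

When a dimension has size 1, its position is always 0 and it does not affect the offset, which is why the call above on the single-country dataset only needs `time`.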
Convert the jsonstat data into a pandas dataframe.
In [7]:
df_1 = nama_gdp_c_1.to_data_frame('time', content='id')
df_1.tail()
Out[7]:
Add a simple plot.
In [8]:
df_1 = df_1.dropna() # remove rows with NaN values
df_1.plot(grid=True, figsize=(20,5))
Out[8]:
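The `dropna()` step matters because years without data show up as NaN rows. A minimal sketch of its effect, on a made-up frame rather than the Eurostat data:

```python
import pandas as pd

# Made-up values: the first year has no observation.
toy = pd.DataFrame(
    {"Value": [float("nan"), 1.5, 2.5]},
    index=["2010", "2011", "2012"],
)

clean = toy.dropna()  # drops every row that contains at least one NaN
print(clean)
```

Only the rows for 2011 and 2012 remain in `clean`.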
Download the jsonstat file, or use the copy cached on disk. The cache avoids internet downloads during development, which makes things a bit faster. You can see the raw data here
In [9]:
url_2 = 'http://ec.europa.eu/eurostat/wdds/rest/data/v1.1/json/en/nama_gdp_c?precision=1&geo=IT&geo=FR&unit=EUR_HAB&indic_na=B1GM'
file_name_2 = "eurostat-name_gpd_c-geo_IT_FR.json"
file_path_2 = os.path.abspath(os.path.join("..", "tests", "fixtures", "www.ec.europa.eu_eurostat", file_name_2))
if os.path.exists(file_path_2):
    # reuse the copy cached on disk
    print("using already downloaded file {}".format(file_path_2))
else:
    # no cached copy: download into the current directory
    print("download file and storing on disk")
    jsonstat.download(url_2, file_name_2)
    file_path_2 = file_name_2
In [10]:
collection_2 = jsonstat.from_file(file_path_2)
nama_gdp_c_2 = collection_2.dataset('nama_gdp_c')
nama_gdp_c_2
Out[10]:
In [11]:
nama_gdp_c_2.dimension('geo')
Out[11]:
In [12]:
nama_gdp_c_2.value(time='2012',geo='IT')
Out[12]:
In [13]:
nama_gdp_c_2.value(time='2012',geo='FR')
Out[13]:
In [14]:
df_2 = nama_gdp_c_2.to_table(content='id',rtype=pd.DataFrame)
df_2.tail()
Out[14]:
In [15]:
df_FR_IT = df_2.dropna()[['time', 'geo', 'Value']]
df_FR_IT = df_FR_IT.pivot(index='time', columns='geo', values='Value')  # keyword arguments are required by recent pandas versions
df_FR_IT.plot(grid=True, figsize=(20,5))
Out[15]:
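The pivot step turns a long table (one row per time and country) into a wide one (one column per country), which is the shape `plot` needs to draw one line per country. A small sketch on made-up numbers, reusing the column names `time`, `geo` and `Value` from the cell above:

```python
import pandas as pd

# Made-up long table: one row per (time, geo) pair.
long_df = pd.DataFrame(
    {
        "time": ["2011", "2011", "2012", "2012"],
        "geo": ["FR", "IT", "FR", "IT"],
        "Value": [1.0, 2.0, 3.0, 4.0],
    }
)

# One row per year, one column per country.
wide_df = long_df.pivot(index="time", columns="geo", values="Value")
print(wide_df)
```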
In [16]:
df_3 = nama_gdp_c_2.to_data_frame('time', content='id', blocked_dims={'geo':'FR'})
df_3 = df_3.dropna()
df_3.plot(grid=True,figsize=(20,5))
Out[16]:
In [17]:
df_4 = nama_gdp_c_2.to_data_frame('time', content='id', blocked_dims={'geo':'IT'})
df_4 = df_4.dropna()
df_4.plot(grid=True,figsize=(20,5))
Out[17]:
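Fixing `geo` to a single country, as the last two cells do, is comparable to selecting that country's rows from a long table and indexing the result by time. The sketch below shows that equivalent in plain pandas on made-up numbers; it makes no claim about how `blocked_dims` is implemented, and `series_for` is a hypothetical helper.

```python
import pandas as pd

# Made-up long table: one row per (time, geo) pair.
long_df = pd.DataFrame(
    {
        "time": ["2011", "2011", "2012", "2012"],
        "geo": ["FR", "IT", "FR", "IT"],
        "Value": [1.0, 2.0, 3.0, 4.0],
    }
)


def series_for(df, geo):
    """Return the Value column for one country, indexed by time."""
    subset = df[df["geo"] == geo]
    return subset.set_index("time")["Value"]


it_series = series_for(long_df, "IT")
print(it_series)
```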