datahub.io Linked Open Data Growth

This notebook plots the growth over time of the number of datasets on datahub.io that carry a certain tag (e.g. "lod") or provide a certain resource format (e.g. "api/sparql").

Set up the notebook and import the necessary libraries


In [1]:
# Display images inline and as SVG
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Import the libraries
import json  # For loading and converting JSON
import matplotlib.pyplot as plt  # For plotting
import pandas as pd  # For dataframes and time series handling
import seaborn as sns  # For prettier plotting
import urllib2  # For loading the data from the API
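
Note: this notebook targets Python 2, where urllib2 is available. Under Python 3 the same functionality lives in urllib.request; a minimal sketch of the equivalent import (assuming Python 3.6+, where json.load accepts the binary response object returned by urlopen):

# Python 3 equivalent of the urllib2 import above (sketch)
from urllib.request import urlopen

# Usage would then be, e.g.:
# data = json.load(urlopen('http://datahub.io/api/3/action/package_search?fq=tags:lod&rows=1000&start=0'))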

Get the JSON from the datahub.io API for all datasets tagged lod

Since there are more than 1000 results, the query is split into two requests of 1000 rows each.


In [2]:
data_lod_0000 = json.load(urllib2.urlopen('http://datahub.io/api/3/action/package_search?fq=tags:lod&rows=1000&start=0'))
data_lod_1000 = json.load(urllib2.urlopen('http://datahub.io/api/3/action/package_search?fq=tags:lod&rows=1000&start=1000'))
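
The two hard-coded requests work because the number of matches is known to lie between 1000 and 2000. A more general alternative is to page through the results until the total reported by the API is reached; a sketch using the imports from the first cell, assuming (as in CKAN's package_search) that the response carries the total number of matches in result.count:

# Sketch: page through all results instead of hard-coding two requests
def fetch_all(base_url, rows=1000):
    results = []
    start = 0
    while True:
        url = '%s&rows=%d&start=%d' % (base_url, rows, start)
        page = json.load(urllib2.urlopen(url))
        results.extend(page['result']['results'])
        start += rows
        if start >= page['result']['count']:  # 'count' is the total number of matches
            return results

# all_lod = fetch_all('http://datahub.io/api/3/action/package_search?fq=tags:lod')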

Combine, convert, and plot the data


In [3]:
# Turn the loaded JSON files into Pandas dataframes
df_lod_0000 = pd.io.json.json_normalize(data_lod_0000['result']['results'])
df_lod_1000 = pd.io.json.json_normalize(data_lod_1000['result']['results'])

# Concatenate the two dataframes
df_lod = pd.concat([df_lod_0000, df_lod_1000])

# Extract and sort the 'metadata_created' column in order to use it as an index
dti_lod = pd.to_datetime(df_lod['metadata_created'])
dti_lod.sort()

# Create a TimeSeries with the prepared index and set the value of each entry to 1
ts_lod = pd.Series(1, index=dti_lod)

# Resample the TimeSeries
ts_lod = ts_lod.resample('M', how='sum', kind='period')  # yearly = A, monthly = M

# If there is no value for a given period, fill with 0
ts_lod = ts_lod.fillna(0)

# Create a cumulative TimeSeries, print and plot it
ts_lod_cumsum = ts_lod.cumsum()
ts_lod_cumsum.plot()
ts_lod_cumsum


Out[3]:
metadata_created
2007-04                1
2007-05                1
2007-06                1
2007-07                2
2007-08                4
2007-09                5
2007-10                5
2007-11                8
2007-12                8
2008-01                8
2008-02                8
2008-03                9
2008-04               10
2008-05               10
2008-06               12
...
2013-08              896
2013-09              898
2013-10              899
2013-11              901
2013-12              903
2014-01              905
2014-02              908
2014-03              915
2014-04              921
2014-05              933
2014-06              955
2014-07              982
2014-08             1007
2014-09             1019
2014-10             1029
Freq: M, Length: 91
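
The cell above relies on the pandas API of 2014: Series.sort() sorts in place, resample takes how and kind arguments, and json_normalize lives at pd.io.json (it is now exposed as pd.json_normalize). A rough equivalent of the same pipeline under a recent pandas (a sketch, not what was run here):

# Sketch: the same pipeline with a current pandas API
dti_lod = pd.to_datetime(df_lod['metadata_created'])
ts_lod = pd.Series(1, index=dti_lod).sort_index()

# Monthly counts; months without new datasets become 0 automatically
ts_lod = ts_lod.resample('M').sum()  # pandas >= 2.2 prefers 'ME' over 'M'

# Cumulative count with period labels, as in the output above
ts_lod_cumsum = ts_lod.cumsum()
ts_lod_cumsum.index = ts_lod_cumsum.index.to_period('M')
ts_lod_cumsum.plot()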

Get the JSON from the datahub.io API for all datasets providing an api/sparql resource

Since there are fewer than 1000 results, a single request is sufficient.


In [4]:
data_sparql = json.load(urllib2.urlopen('http://datahub.io/api/3/action/package_search?fq=res_format:api%2Fsparql&rows=1000&start=0'))
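
The / in api/sparql has to be percent-encoded as %2F in the query string. Instead of encoding it by hand, the query could be assembled with urllib's helpers, which escape it automatically (a sketch; in Python 3 the function is urllib.parse.urlencode):

# Sketch: build the query string instead of hand-encoding the '/'
import urllib
query = urllib.urlencode({'fq': 'res_format:api/sparql', 'rows': 1000, 'start': 0})
url = 'http://datahub.io/api/3/action/package_search?' + query
# data_sparql = json.load(urllib2.urlopen(url))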

Convert and plot the data


In [5]:
# Turn the loaded JSON file into a Pandas dataframe
df_sparql = pd.io.json.json_normalize(data_sparql['result']['results'])

# Extract and sort the 'metadata_created' column in order to use it as an index
dti_sparql = pd.to_datetime(df_sparql['metadata_created'])
dti_sparql.sort()

# Create a TimeSeries with the prepared index and set the value of each entry to 1
ts_sparql = pd.Series(1, index=dti_sparql)

# Resample the TimeSeries
ts_sparql = ts_sparql.resample('M', how='sum', kind='period')  # yearly = A, monthly = M

# If there is no value for a given period, fill with 0
ts_sparql = ts_sparql.fillna(0)

# Create a cumulative TimeSeries, print and plot it
ts_sparql_cumsum = ts_sparql.cumsum()
ts_sparql_cumsum.plot()
ts_sparql_cumsum


Out[5]:
metadata_created
2007-07               2
2007-08               3
2007-09               3
2007-10               3
2007-11               3
2007-12               3
2008-01               3
2008-02               3
2008-03               5
2008-04               5
2008-05               5
2008-06               6
2008-07               6
2008-08               6
2008-09               6
...
2013-08             437
2013-09             439
2013-10             442
2013-11             446
2013-12             449
2014-01             454
2014-02             456
2014-03             461
2014-04             470
2014-05             474
2014-06             493
2014-07             519
2014-08             548
2014-09             550
2014-10             554
Freq: M, Length: 88

Combine both time series into one plot and save it as PNG and PDF


In [6]:
# Combine the two cumulative TimeSeries and fill missing values with 0
ts_combined_cumsum = pd.concat([ts_lod_cumsum, ts_sparql_cumsum], axis=1).fillna(0)

# Rename the index (used as x-axis label)
ts_combined_cumsum.index.name = 'Time'

# Rename the two columns (used for the legend)
ts_combined_cumsum.columns = ['LOD', 'SPARQL']

# Plot the combined TimeSeries
ax = ts_combined_cumsum.plot()

# Set the y-axis label
ax.set_ylabel('Number of datasets')

# Move the legend to the upper left
ax.legend(loc='upper left')

# Left-align the x tick labels
labels = ax.get_xticklabels()
for label in labels:
    label.set_ha('left')

# Save the plot as PDF and PNG (the doubled file suffixes work around authorea.com's figure handling)
plt.savefig('../figures/datahubio_datasets.png.pdf')
plt.savefig('../figures/datahubio_datasets.pdf.png', dpi=200)
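
If the left-aligned tick labels get clipped at the figure edges in the saved files, matplotlib's bounding-box option can help (a sketch):

# Sketch: save with a tight bounding box so tick labels are not clipped
plt.savefig('../figures/datahubio_datasets.png.pdf', bbox_inches='tight')
plt.savefig('../figures/datahubio_datasets.pdf.png', dpi=200, bbox_inches='tight')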