Linked Open Data Growth

This notebook plots the growth in the number of datasets with a certain tag (e.g. "lod") or a certain resource format (e.g. "api/sparql").

Set up the notebook and import the necessary libraries

In [1]:
# Display images inline and as SVG
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Import the libraries
import json  # For loading and converting JSON
import matplotlib.pyplot as plt  # For plotting
import pandas as pd  # For the statistical stuff
import seaborn as sns  # For prettier plotting
import urllib2  # For loading the data from the API

Get the JSON from the API for all datasets tagged lod

Since there are more than 1000 results, the query is split into two requests.

In [2]:
data_lod_0000 = json.load(urllib2.urlopen(''))
data_lod_1000 = json.load(urllib2.urlopen(''))
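The two-request split above generalises to any number of pages. The following is a minimal sketch of that idea, not the notebook's own method. It assumes only the `['result']['results']` response structure used in this notebook. The names `fetch_all`, `fetch_page`, `fake_fetch_page` and `FAKE_RECORDS` are hypothetical, and the fake fetcher stands in for a real HTTP request so the example runs offline.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect all results by requesting consecutive pages.

    `fetch_page(start, rows)` is a hypothetical callable that returns one
    decoded JSON response containing ['result']['results'].
    """
    results = []
    start = 0
    while True:
        batch = fetch_page(start, page_size)['result']['results']
        results.extend(batch)
        start += len(batch)
        # A short (or empty) page means there is nothing left to fetch
        if len(batch) < page_size:
            break
    return results


# Made-up records that stand in for the API, for illustration only
FAKE_RECORDS = [{'id': i} for i in range(2500)]


def fake_fetch_page(start, rows):
    """Serve one slice of the made-up records in the assumed response shape."""
    return {'result': {'results': FAKE_RECORDS[start:start + rows]}}


all_records = fetch_all(fake_fetch_page)
```

With a real API, `fetch_page` would build the request URL from the offset and page size and decode the response with `json.load`, as the cell above does for each request.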

Combine, convert, and plot the data

In [3]:
# Turn the loaded JSON files into Pandas dataframes
df_lod_0000 = pd.DataFrame(data_lod_0000['result']['results'])
df_lod_1000 = pd.DataFrame(data_lod_1000['result']['results'])

# Concatenate the two dataframes
df_lod = pd.concat([df_lod_0000, df_lod_1000])

# Parse the 'metadata_created' column as datetimes in order to use it as an index
dti_lod = pd.to_datetime(df_lod['metadata_created'])

# Create a TimeSeries with the prepared index and set the value of each entry to 1
ts_lod = pd.Series(1, index=dti_lod)

# Resample the TimeSeries into monthly periods and count the entries per period
ts_lod = ts_lod.resample('M').sum().to_period('M')  # yearly = A, monthly = M

# If there is no value for a given period, fill with 0
ts_lod = ts_lod.fillna(0)

# Create a cumulated TimeSeries, print and plot it
ts_lod_cumsum = ts_lod.cumsum()
print(ts_lod_cumsum)
ts_lod_cumsum.plot()

2007-04       1
2007-05       1
2007-06       1
2007-07       2
2007-08       4
2007-09       5
2007-10       5
2007-11       8
2007-12       8
2008-01       8
2008-02       8
2008-03       9
2008-04      10
2008-05      10
2008-06      12
...
2013-08     896
2013-09     898
2013-10     899
2013-11     901
2013-12     903
2014-01     905
2014-02     908
2014-03     915
2014-04     921
2014-05     933
2014-06     955
2014-07     982
2014-08    1007
2014-09    1019
2014-10    1029
Freq: M, Length: 91
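The counting technique used above (one entry per dataset, summed per month, then accumulated) can be sketched on a handful of made-up timestamps. This is a sketch rather than the notebook's exact code: it groups by monthly period and reindexes over the full month range instead of calling `resample`, and every date in it is invented for illustration.

```python
import pandas as pd

# Made-up creation timestamps (illustration only, not data from the API)
created = pd.to_datetime([
    '2020-01-05T10:00:00',
    '2020-01-20T12:30:00',
    '2020-03-11T08:15:00',
    '2020-04-02T17:45:00',
])

# Map every timestamp to its month and count the entries per month
months = created.to_period('M')
counts = pd.Series(1, index=months).groupby(level=0).sum()

# Add the months without any entry, filled with 0
full_range = pd.period_range(months.min(), months.max(), freq='M')
counts = counts.reindex(full_range, fill_value=0)

# Running total of datasets over time
cumulative = counts.cumsum()
print(cumulative)
```

February has no entries here, so its count is 0 and the cumulative value stays flat for that month, which is what the `fillna(0)` step achieves in the notebook.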

Get the JSON from the API for all datasets providing an api/sparql resource

Since there are fewer than 1000 results, a single request is sufficient.

In [4]:
data_sparql = json.load(urllib2.urlopen(''))

Convert and plot the data

In [5]:
# Turn the loaded JSON file into a Pandas dataframe
df_sparql = pd.DataFrame(data_sparql['result']['results'])

# Parse the 'metadata_created' column as datetimes in order to use it as an index
dti_sparql = pd.to_datetime(df_sparql['metadata_created'])

# Create a TimeSeries with the prepared index and set the value of each entry to 1
ts_sparql = pd.Series(1, index=dti_sparql)

# Resample the TimeSeries into monthly periods and count the entries per period
ts_sparql = ts_sparql.resample('M').sum().to_period('M')  # yearly = A, monthly = M

# If there is no value for a given period, fill with 0
ts_sparql = ts_sparql.fillna(0)

# Create a cumulated TimeSeries, print and plot it
ts_sparql_cumsum = ts_sparql.cumsum()
print(ts_sparql_cumsum)
ts_sparql_cumsum.plot()

2007-07      2
2007-08      3
2007-09      3
2007-10      3
2007-11      3
2007-12      3
2008-01      3
2008-02      3
2008-03      5
2008-04      5
2008-05      5
2008-06      6
2008-07      6
2008-08      6
2008-09      6
...
2013-08    437
2013-09    439
2013-10    442
2013-11    446
2013-12    449
2014-01    454
2014-02    456
2014-03    461
2014-04    470
2014-05    474
2014-06    493
2014-07    519
2014-08    548
2014-09    550
2014-10    554
Freq: M, Length: 88

Combine both plots and save as PNG and PDF

In [6]:
# Combine the two cumulated TimeSeries and fill missing values with 0
ts_combined_cumsum = pd.concat([ts_lod_cumsum, ts_sparql_cumsum], axis=1).fillna(0)

# Rename the index (used as x-axis label)
ts_combined_cumsum.index.name = 'Time'

# Rename the two columns (used for the legend)
ts_combined_cumsum.columns = ['LOD', 'SPARQL']

# Plot the combined TimeSeries
ax = ts_combined_cumsum.plot()

# Set the y-axis label
ax.set_ylabel('Number of datasets')

# Move the legend to the left
ax.legend(loc='upper left')

# Format the x tick labels
labels = ax.get_xticklabels()
for label in labels:
    label.set_rotation(0)  # Assumption: horizontal labels; the original loop body is not shown

# Save the plot as PDF and PNG (funny file suffixes because of figure workaround)
plt.savefig('../figures/datahubio_datasets.pdf.png', dpi=200)
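The alignment performed by `pd.concat(..., axis=1)` in the cell above can be illustrated with two short, made-up cumulative series that cover different month ranges. This is a sketch under stated assumptions: all values are invented, and the `ffill()` step is an addition that the notebook does not use. The notebook does not need it because both of its series end in the same month, but when one series ends earlier, forward-filling keeps its last cumulative value instead of dropping it to 0.

```python
import pandas as pd

# Made-up cumulative counts (illustration only)
lod = pd.Series([1, 2, 3, 4],
                index=pd.period_range('2020-01', periods=4, freq='M'))
sparql = pd.Series([1, 2],
                   index=pd.period_range('2020-02', periods=2, freq='M'))

# Align both series on the shared monthly index; months that one series
# does not cover become NaN
combined = pd.concat([lod, sparql], axis=1, keys=['LOD', 'SPARQL'])

# Carry the last known cumulative value forward, then set the months
# before a series starts to 0
combined = combined.ffill().fillna(0)
combined.index.name = 'Time'
print(combined)
```

Passing `keys` to `pd.concat` names the columns in the same step, which is an alternative to assigning `columns` afterwards as the notebook does.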