This post is a little different from my usual fare.
There was a tweet from MATRIX NI that caught my eye: the latest Office for National Statistics report on Internet Use in the UK.
https://twitter.com/Bolster/status/601689626496135168
Basically, NI "lost". So I thought it was a good opportunity to play around with the data a little bit instead of my usual stuff.
As such, this post has two main thrusts:
1) Demonstrating a small sample of my normal data-wrangling workflow using Python, Pandas, Plot.ly and a few other tools. This is not clean code and it's not pretty.
2) NI sucks at the internet, and I believe this statistic is a more realistic, reliable, and impactful metric to target for economic and cultural growth/stability/recovery/happiness/whatever.
When I'm working on something like this, I usually end up spending about 80% of my time just getting the data into a format that can be used reasonably; Excel documents are NOT an accessible open data standard and they should all go die in a fire... But this is what we've got.
Let's kick things off with a few 'preambles'.
In [1]:
import pandas as pd
import plotly.plotly as py
import statsmodels.api as sm  # the .api module exposes sm.tsa, used for the HP filter later
from sklearn.linear_model import LinearRegression
import scipy, scipy.stats
import cufflinks as cf # Awesome Pandas/Plot.ly integration module https://github.com/santosjorge/cufflinks
py.sign_in('bolster', 'q54pd5bbmd')
In [2]:
from IPython.display import Image
Image(filename='/home/bolster/Pictures/Screenshot from 2015-05-23 15:57:54.png')
Out[2]:
Pandas has the relatively intelligent read_excel method that... well... does what it says on the tin.
Since every gov department appears to use Excel as a formatting and layout tool rather than a data management platform, there are lots of pointlessly empty rows and columns that we can drop, so we end up with the top of a data structure like this.
As you can see, Pandas has "guessed" that the top row is a set of column headers... this is incorrect...
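For reference, if you want to stop read_excel guessing at a header at all, it takes header=None; I'm not doing that here (the stray header row gets cleaned up by hand below), so treat this as an aside and the raw variable as hypothetical:
raw = pd.read_excel("/home/bolster/Downloads/rftiatables152015q1_tcm77-405031.xls",
                    sheetname="4b", header=None)  # every row comes back as data; columns are just 0..N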
In [3]:
df = pd.read_excel("/home/bolster/Downloads/rftiatables152015q1_tcm77-405031.xls", sheetname="4b")
df = df.dropna(axis=1, how="all")  # drop completely empty columns
df = df.dropna(axis=0, how="all")  # drop completely empty rows
df.head()
Out[3]:
Remove the pointless rows (note that the index isn't automatically renumbered).
In [4]:
df.drop(df.index[[0, -3, -2, -1]], inplace=True)  # drop the leftover first row and the last three non-data rows
df.head()
Out[4]:
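Purely as an aside, and not applied here since nothing downstream cares: if the stale index bothers you, reset_index will renumber it.
df.reset_index(drop=True).head()  # returns a renumbered copy; the original df is left alone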
Fill in the 'headings' for the "Used in the last" section to wipe out those NaNs.
In [5]:
df.iloc[0] = df.iloc[0].fillna(method="ffill")  # forward-fill the merged-cell headings along the first row
df.head()
Out[5]:
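As a toy illustration of what that forward-fill is doing (made-up values, not from the spreadsheet):
pd.Series(["Used in the last 3 months", None, None, "Used over 3 months ago"]).fillna(method="ffill")
# 0    Used in the last 3 months
# 1    Used in the last 3 months
# 2    Used in the last 3 months
# 3    Used over 3 months ago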
This one's a bit complicated, so we'll split it up: first transpose the frame (rows become columns and vice versa), then set the new column headers to be the row currently containing the region names. Then drop that row since we don't need it any more, and finally rename the first two columns to something useful.
In [6]:
df = df.T
df.columns = df.iloc[0]
df.head()
Out[6]:
In [7]:
df.drop(df.index[[0]], inplace=True)
df.columns = ["Internet Use","Date"] + [s.strip() for s in df.columns[2:].tolist()] # Some idiots put spaces at the end of their cells
df.head()
Out[7]:
In [8]:
q_map = {"Q1":"March","Q2":"June","Q3":"September","Q4":"December"}
def _yr_q_parse(yq): return yq[0], q_map[yq[1]]                # ["2013", "Q1"] -> ("2013", "March")
def _yr_q_join(s): return " ".join(_yr_q_parse(s.split(" ")))  # "2013 Q1" -> "2013 March"
df['Date'] = pd.to_datetime(df['Date'].apply(_yr_q_join))
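A quick sanity check of what that mapping produces (illustrative only; the heavy lifting of parsing the "2013 March" style strings is left to pd.to_datetime, so the second line should come out as shown):
_yr_q_join("2013 Q1")                    # -> "2013 March"
pd.to_datetime(_yr_q_join("2013 Q1"))    # -> Timestamp('2013-03-01 00:00:00')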
Finally, set the Date to be the primary index (i.e. the row identifier), convert all the values from boring "objects" to "floats", give the column range a name (for when we're manipulating the data later), and, just for fun, let's look at the average internet use statistics between 2013 and 2015.
In [9]:
df.set_index(['Date','Internet Use'],inplace=True)
df.sort_index(inplace=True)
df=df.astype(float)
df.columns.name="Region"
df.groupby(level='Internet Use').mean().T
Out[9]:
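As a quick example of why the MultiIndex is handy (not part of the original flow, just an illustration), pulling one measure out for one region is now a one-liner:
df.xs("Never used", level="Internet Use")["Northern Ireland"].head()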
Now that everything's tidy, we can start to play.
The first thing to look at is the regional breakdown, so we select the regional records and make a fresh dataframe with just those values. We can also forget about the "Used in the last 3 months" records, because they're not particularly interesting and make the graphs look weird...
As you can see, there are two "West Midlands" columns; this is a pain, as one is the 'Region' and the other is the NUTS level 2 region, obviously... So we drop the local authority region. This was not as easy to do programmatically as I expected, and if someone has a cleverer way to do it, lemme know (one possible shortcut is sketched after the cell below).
In [10]:
def last_index_of(li, val):
    return (len(li) - 1) - li[::-1].index(val)

def non_dup_indexes_of_columns(df, dupkey):
    col_ids = list(range(len(df.columns)))  # positions of every column...
    col_ids.remove(last_index_of(df.columns.tolist(), dupkey))  # ...minus the last occurrence of the duplicate
    return col_ids
regions = ["UK",
"North East","North West",
"Yorkshire and the Humber",
"East Midlands", "West Midlands",
"East of England", "London",
"South East", "South West",
"Wales", "Scotland",
"Northern Ireland"
]
df_regional_use = df[regions]
df_regional_use = df_regional_use.iloc[:, non_dup_indexes_of_columns(df_regional_use, "West Midlands")]  # Dammit West Midlands...
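On the "cleverer way" front, something like Index.duplicated might do the same job more directly; I haven't tested it against this exact frame, so treat it as a sketch rather than a drop-in replacement:
tmp = df[regions]
df_regional_use_alt = tmp.loc[:, ~tmp.columns.duplicated()]  # keeps the first of each duplicated column label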
Now we can plot the regional internet use using Plot.ly, meaning that these graphs both look good in my IPython Notebook where I'm editing this, and hopefully on the blog when this goes up...
In [11]:
df_regional_use_means = df_regional_use.groupby(level='Internet Use').mean().T
cols = ['Never used','Used over 3 months ago']
df_regional_use_means[cols].sort_values(by=cols).iplot(kind="bar",
filename='Internet Region Use Means Bar',world_readable=True, theme="pearl",
title="Average Digital Illiteracy by Region (2013-2015)", xTitle="Region", yTitle="%")
Out[11]:
Now that doesn't look too good; Northern Ireland has the highest proportion of "Never Used", at just below 20%.
"But!" you might say, "What about all the great programs and processes put in place recently? We're rapidly recovering from a troubles, and we're on the up and up!"
Alright then, let's look at the standard deviation of these results (i.e. the volatility).
In [12]:
df_regional_use_stddevs = df_regional_use.groupby(level='Internet Use').std().T
df_regional_use_stddevs[cols].sort_values(by=cols).iplot(kind="bar",
filename='Internet Region Use Std Bar',world_readable=True, theme="pearl",
title="Standard Deviation in UK Digital Illiteracy since 2013", xTitle="Period", yTitle="Std Dev (Variability)")
Out[12]:
Nope, we still suck. So we've got the highest level of "Digital Illiteracy" and the lowest change in Digital Literacy of any region. Let's take a closer look at that, though, since standard deviation is a terrible measure for time series data and doesn't give a reliable sense of 'trend'.
In [13]:
df_regional_never = df_regional_use.xs("Never used", level="Internet Use")
df_regional_never_pct_change = ((df_regional_never/df_regional_never.iloc[0])-1)  # change relative to the first (2013 Q1) observation
df_regional_never_pct_change.iplot(filename='Internet Region Never Pct Change',world_readable=True, theme="pearl",
title="Reduction in UK Digital Illiteracy since 2013",
xTitle="Period", yTitle="%")
Out[13]:
Remember, this is the relative change in the number of people who said that they never use the internet; a value of -0.1 means a 10% drop from the first (2013 Q1) figure...
Ok, that's a nice interactive graph that you can poke around with, but it's very noisy... Let's see if we can clean that up a bit with a Hodrick-Prescott filter (we could have just done a linear regression, but that's not really that interesting).
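For completeness, the boring straight-line version would look something like this; it's a rough sketch (not run in the original notebook) that fits a line per region with scipy.stats.linregress and treats the observations as evenly spaced, which glosses over the 2013 gap discussed below:
import numpy as np
x = np.arange(len(df_regional_never_pct_change))
slopes = {}
for region in df_regional_never_pct_change.columns:
    slope, intercept, r, p, stderr = scipy.stats.linregress(
        x, df_regional_never_pct_change[region].values.astype(float))
    slopes[region] = slope  # slope of the fitted line: how fast "Never used" is falling
pd.Series(slopes).sort_values()  # most negative slope = fastest improvement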
In [25]:
cycle, trend = sm.tsa.filters.hpfilter(df_regional_never_pct_change)  # default lamb=1600, the usual choice for quarterly data
trend.iplot(filename='Internet Region Never Pct Change Reg',world_readable=True, theme="pearl",
title="Reduction in UK Digital Illiteracy since 2013",
xTitle="Period", yTitle="%")
Out[25]:
Note that because of the big gap at the beginning of the data (i.e. there's no information between Q1 2013 and Q1 2014), the filtered trend has a kink in it. Looking at the results, and the "knee" in the data, it's highly likely that this is just a mistake in the accounting and that it's actually 2013 Q4 data.
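If you want to check that gap yourself, the spacing of the date index shows it directly (an untested one-liner):
df_regional_never.index.to_series().diff()  # gaps show up as larger-than-quarterly jumps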
Either way, Northern Ireland is not only bottom of the Digital Literacy table, but has also made the least progress in this area of any region in the UK.