School Populations in Edinburgh/Scotland

The data set used in this notebook comes from http://www.opendatascotland.org/. It lists how many pupils attend each school, along with the school's geographical location. Let's mine this data and see whether we can discover any interesting trends.

Import Statements

Import statements are a bit like citations in a textbook, which let us quote and build on the work of another author. They bring extra tools into the notebook that we may need when analysing a data set.


In [2]:
# This is a comment. The notebook does not execute comments. They only provide information to the reader.
# We use pandas to manage tabular data such as spreadsheets, comma-separated values
# and tab-separated values.
import pandas as pd  # This is an import statement.
# Bokeh is a very cool library which we will use for visualizing the data in interesting ways.
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
# Numpy is a numerical package which we will use to deal with all the maths and metrics
# that we may want to compute from this data set.
import numpy as np

Reading and Visualizing Tabular Data

As mentioned in the comments of the previous code snippet, pandas is a great library for working with tabular data in Python.


In [3]:
# Load the data into the notebook:
gov_csv = pd.read_csv("../../data/opendata/tutorial_1/schools.csv")
# Display the tabular data
gov_csv


Out[3]:
school school_label latitude longitude pupils
0 http://data.opendatascotland.org/id/educationa... Linlithgow Academy 55.97160 -3.61259 1231
1 http://data.opendatascotland.org/id/educationa... St Kentigern's Academy 55.87101 -3.63367 1215
2 http://data.opendatascotland.org/id/educationa... James Young High,The 55.88093 -3.51523 1135
3 http://data.opendatascotland.org/id/educationa... St Margaret's Academy 55.88937 -3.52213 1094
4 http://data.opendatascotland.org/id/educationa... Inveralmond Community High 55.90146 -3.51932 1090
5 http://data.opendatascotland.org/id/educationa... West Calder High 55.86291 -3.54044 950
6 http://data.opendatascotland.org/id/educationa... Deans Community High 55.90581 -3.54977 941
7 http://data.opendatascotland.org/id/educationa... Broxburn Academy 55.93694 -3.48778 903
8 http://data.opendatascotland.org/id/educationa... Bathgate Academy 55.89838 -3.61313 899
9 http://data.opendatascotland.org/id/educationa... Whitburn Academy 55.86804 -3.67964 822
10 http://data.opendatascotland.org/id/educationa... Armadale Academy 55.89481 -3.71436 780
11 http://data.opendatascotland.org/id/educationa... Armadale 55.89717 -3.70321 440
12 http://data.opendatascotland.org/id/educationa... Balbardie 55.90518 -3.63735 423
13 http://data.opendatascotland.org/id/educationa... Linlithgow 55.97165 -3.60945 417
14 http://data.opendatascotland.org/id/educationa... Peel 55.89497 -3.53573 407
15 http://data.opendatascotland.org/id/educationa... Williamston 55.87527 -3.50414 404
16 http://data.opendatascotland.org/id/educationa... Carmondean 55.90662 -3.54190 402
17 http://data.opendatascotland.org/id/educationa... St Mary's, Bathgate 55.90016 -3.64731 401
18 http://data.opendatascotland.org/id/educationa... Harrysmuir 55.90146 -3.51932 401
19 http://data.opendatascotland.org/id/educationa... St John Ogilvie 55.90437 -3.55298 368
20 http://data.opendatascotland.org/id/educationa... St Nicholas 55.93284 -3.48396 364
21 http://data.opendatascotland.org/id/educationa... St Nicholas 55.93461 -3.47285 364
22 http://data.opendatascotland.org/id/educationa... Broxburn 55.93545 -3.47440 361
23 http://data.opendatascotland.org/id/educationa... Windyknowe 55.89849 -3.66481 359
24 http://data.opendatascotland.org/id/educationa... Parkhead 55.85232 -3.56651 354
25 http://data.opendatascotland.org/id/educationa... Whitdale Primary 55.86560 -3.67475 345
26 http://data.opendatascotland.org/id/educationa... Simpson Primary 55.89066 -3.62362 336
27 http://data.opendatascotland.org/id/educationa... Bankton 55.88157 -3.50703 331
28 http://data.opendatascotland.org/id/educationa... Eastertoun 55.89812 -3.71076 329
29 http://data.opendatascotland.org/id/educationa... Howden St Andrew's 55.89300 -3.50936 322
... ... ... ... ... ...
54 http://data.opendatascotland.org/id/educationa... St John The Baptist 55.83026 -3.70456 176
55 http://data.opendatascotland.org/id/educationa... Polkemmet 55.86204 -3.69082 169
56 http://data.opendatascotland.org/id/educationa... Falla Hill 55.82792 -3.71016 168
57 http://data.opendatascotland.org/id/educationa... Pumpherston and Uphall Station 55.90655 -3.49183 168
58 http://data.opendatascotland.org/id/educationa... Pumpherston and Uphall Station 55.90885 -3.49211 168
59 http://data.opendatascotland.org/id/educationa... Our Lady of Lourdes 55.87509 -3.62618 148
60 http://data.opendatascotland.org/id/educationa... Blackridge 55.88429 -3.77368 143
61 http://data.opendatascotland.org/id/educationa... St Columba's 55.90156 -3.60997 126
62 http://data.opendatascotland.org/id/educationa... St Joseph's, Linlithgow 55.97165 -3.60945 124
63 http://data.opendatascotland.org/id/educationa... St Paul's 55.89645 -3.45757 117
64 http://data.opendatascotland.org/id/educationa... St Mary's, Polbeth 55.85996 -3.55515 115
65 http://data.opendatascotland.org/id/educationa... Pinewood 55.87249 -3.61281 115
66 http://data.opendatascotland.org/id/educationa... Seafield 55.87864 -3.58856 115
67 http://data.opendatascotland.org/id/educationa... Winchburgh 55.95631 -3.46832 102
68 http://data.opendatascotland.org/id/educationa... Stoneyburn 55.84932 -3.62739 96
69 http://data.opendatascotland.org/id/educationa... Addiewell 55.84520 -3.61746 95
70 http://data.opendatascotland.org/id/educationa... Longridge 55.84493 -3.67811 92
71 http://data.opendatascotland.org/id/educationa... Bridgend 55.96256 -3.53472 78
72 http://data.opendatascotland.org/id/educationa... Cedarbank 55.89696 -3.51670 75
73 http://data.opendatascotland.org/id/educationa... Torphichen 55.93437 -3.65513 71
74 http://data.opendatascotland.org/id/educationa... Holy Family 55.95631 -3.46832 60
75 http://data.opendatascotland.org/id/educationa... Blackburn 55.87750 -3.63012 59
76 http://data.opendatascotland.org/id/educationa... Ogilvie School Campus 55.90653 -3.52634 56
77 http://data.opendatascotland.org/id/educationa... Our Lady's Primary 55.84601 -3.63337 53
78 http://data.opendatascotland.org/id/educationa... Westfield 55.92798 -3.70243 38
79 http://data.opendatascotland.org/id/educationa... St Thomas' 55.84520 -3.61746 36
80 http://data.opendatascotland.org/id/educationa... Beatlie School 55.89686 -3.49517 34
81 http://data.opendatascotland.org/id/educationa... Woodmuir 55.82852 -3.65678 27
82 http://data.opendatascotland.org/id/educationa... Burnhouse 55.86204 -3.69082 16
83 http://data.opendatascotland.org/id/educationa... Dechmont 55.91956 -3.54255 14

84 rows × 5 columns

Looking at the table, we can see that not all schools have the same number of pupils. But how unevenly are pupils actually spread across schools in Scotland? Can we display the table in a way that lets us draw a conclusion about the spread of pupils in Scottish schools?
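Before plotting, a few summary statistics can already put a number on the spread. The sketch below is only illustrative: it uses a handful of rows copied from the table above in a hypothetical `sample` frame rather than the full `gov_csv`, so the printed values describe the sample, not the whole data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for gov_csv: six rows copied from the table above.
sample = pd.DataFrame({
    "school_label": ["Linlithgow Academy", "Armadale Academy", "Armadale",
                     "Peel", "Winchburgh", "Dechmont"],
    "pupils": [1231, 780, 440, 407, 102, 14],
})

pupils = sample["pupils"].to_numpy()

# Basic measures of centre and spread for the pupil counts.
summary = {
    "min": int(pupils.min()),
    "median": float(np.median(pupils)),
    "mean": float(pupils.mean()),
    "max": int(pupils.max()),
    "std": float(pupils.std(ddof=1)),  # sample standard deviation
}
print(summary)
```

Swapping `sample` for `gov_csv` should give the same measures for all 84 schools.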


In [4]:
# Create a histogram-style plot of the tabular data, with one bar per school
# (the table is already ordered from most to fewest pupils):
pupils_histogram = gov_csv["pupils"]
number_of_schools = len(pupils_histogram)
bins = list(range(number_of_schools + 1))
# Plot a bokeh figure
hist_fig = figure(title="Pupil Population Within Scottish Schools")
hist_fig.quad(top=pupils_histogram, bottom=0, left=bins[:-1], right=bins[1:],
              fill_color="#036564", line_color="#033649")
hist_fig.xaxis.axis_label = 'schools'
hist_fig.yaxis.axis_label = 'number of pupils'
# The commented-out version below uses school names as labels, but sadly it looks very ugly.
# A hover tool could overcome this.
# x = list(gov_csv["school_label"])
# b = Bar(pupils_histogram, title="Sorted by School Name Length",cat=x,);

In [5]:
output_notebook()


BokehJS successfully loaded.

In [6]:
show(hist_fig)


Much better! We can see that the number of pupils decreases almost linearly, except for one sudden drop. Can we use the histogram to find out why?
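One way to locate the sudden drop is to look at the difference between neighbouring bars. The sketch below assumes the rows are ordered from most to fewest pupils, as in the table, and uses six consecutive rows (8 to 13) copied from it. The names `labels`, `drops` and `i` are made up for this illustration.

```python
import numpy as np

# Rows 8 to 13 of the table above, in their original order.
labels = ["Bathgate Academy", "Whitburn Academy", "Armadale Academy",
          "Armadale", "Balbardie", "Linlithgow"]
pupils = np.array([899, 822, 780, 440, 423, 417])

# Decrease in pupils from each school to the next one in the ordering.
drops = -np.diff(pupils)

# Position of the largest single decrease.
i = int(np.argmax(drops))
print(f"Largest drop: {labels[i]} ({pupils[i]}) -> "
      f"{labels[i + 1]} ({pupils[i + 1]}), a fall of {drops[i]} pupils")
```

On these rows the largest step falls between Armadale Academy and Armadale. The names on either side hint that this may be where secondary schools end and primary schools begin, but the data set has no school-type column to confirm it.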

Let's say I have a hypothesis: parents like sending their kids to schools that have long names because it sounds pretentious in their social circles.

To motivate this (NOTE: motivate only, not prove in any formal way!), I will generate the same histogram, except that the bins will be sorted by school name length rather than by number of pupils.


In [7]:
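# Obtain the length of each school name
int_index = gov_csv["school_label"].apply(len)
# Replace the table's index with int_index and sort the rows by it.
# sort_index() stands in for the older DataFrame.sort(), which newer pandas versions no longer provide.
gov_csv = gov_csv.set_index(int_index).sort_index()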

In [8]:
# Create a histogram to visualize the sorted tabular data.
# Note: this cell assumes an older Bokeh release that still ships bokeh.charts.
from bokeh.charts import Bar
# Group names with the same length and sum up their populations
pupils_histogram = gov_csv.groupby(gov_csv.index).sum()["pupils"]
number_of_schools = len(pupils_histogram)
# Plot a bokeh Bar since it allows us to label the x axis with relevant values
x = list(map(str, sorted(set(int_index))))
b = Bar(pupils_histogram, title="Sorted by School Name Length", cat=x,
        xlabel='school name lengths', ylabel='number of pupils')

In [9]:
show(b)


It's not a very strong motivation. Maybe a cumulative version of this plot would help in drawing a stronger conclusion. This could be partially set up and left as an exercise.
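As a possible starting point for that exercise, the sketch below computes the cumulative number of pupils as the name length grows. It is a partial setup under stated assumptions: `sample` is a hypothetical frame holding six rows copied from the table, and plotting the result is left open.

```python
import pandas as pd

# Hypothetical stand-in for gov_csv: six rows copied from the table above.
sample = pd.DataFrame({
    "school_label": ["Peel", "Armadale", "Linlithgow", "Armadale Academy",
                     "Linlithgow Academy", "Inveralmond Community High"],
    "pupils": [407, 440, 417, 780, 1231, 1090],
})

# Length of each school name, used as the grouping key.
name_length = sample["school_label"].str.len().rename("name_length")

# Total pupils per name length, ordered from shortest to longest name.
pupils_by_length = sample.groupby(name_length)["pupils"].sum().sort_index()

# Running total, and the same running total as a share of all pupils.
cumulative = pupils_by_length.cumsum()
cumulative_share = cumulative / cumulative.iloc[-1]
print(cumulative_share)
```

If long names really attracted more pupils, the cumulative share would be expected to stay low for short names and climb steeply towards the end. Note that summing per name length also reflects how many schools share that length, so comparing means instead of sums may be worth trying as part of the exercise.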

