The data set used in this notebook comes from http://www.opendatascotland.org/. It lists how many pupils attend each school, along with the geographical location of the school. Let's mine this data and see if we can discover any interesting trends.
In [2]:
# This is a comment. The notebook does not execute comments; they only provide information to the reader.
# We use pandas to manage tabular data such as spreadsheets, comma-separated values
# and tab-separated values.
import pandas as pd # This is an import statement.
# Bokeh is a very cool library which we will use for visualizing the data in interesting ways.
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
# Numpy is a numerical package which we will use to deal with all the maths and metrics
# that we may want to compute from this data set.
import numpy as np
In [3]:
# Load the data into the notebook:
gov_csv = pd.read_csv("../../data/opendata/tutorial_1/schools.csv")
# Display the tabular data
gov_csv
Out[3]:
Looking at the table, we can see that not all schools have the same number of pupils. But how unevenly are pupils actually spread across schools in Scotland? Can we display the table in a better way that lets us draw a conclusion about the spread of pupils in Scottish schools?
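Before plotting, the spread can also be summarised numerically. The following is a minimal sketch that assumes the table has a numeric "pupils" column; the values below are made up for illustration and are not taken from schools.csv.

```python
import pandas as pd

# Hypothetical stand-in for gov_csv["pupils"]; replace with the real column.
pupils = pd.Series([1200, 950, 400, 380, 60, 25], name="pupils")

# describe() reports count, mean, standard deviation, min, quartiles and max.
summary = pupils.describe()

# Coefficient of variation: standard deviation relative to the mean.
# A larger value means pupils are spread more unevenly across schools.
cv = pupils.std() / pupils.mean()

print(summary)
print(f"coefficient of variation: {cv:.2f}")
```

On the real data, the same two calls could be applied directly to gov_csv["pupils"].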
In [4]:
# Create a histogram to visualize the tabular data:
pupils_histogram = gov_csv["pupils"]
number_of_schools = len(pupils_histogram)
bins = range(number_of_schools + 1)
# plot a bokeh figure
hist_fig = figure(title="Pupils Population Within Scottish Schools")
hist_fig.quad(top=pupils_histogram, bottom=0, left=bins[:-1], right=bins[1:],
              fill_color="#036564", line_color="#033649");
hist_fig.xaxis.axis_label = 'schools'
hist_fig.yaxis.axis_label = 'number of pupils'
# The commented-out version below uses school names as labels; sadly it looks very ugly.
# A hover tool could overcome this.
# x = list(gov_csv["school_label"])
# b = Bar(pupils_histogram, title="Sorted by School Name Length",cat=x,);
In [5]:
output_notebook()
In [6]:
show(hist_fig)
Much better! We can see that the number of pupils decreases almost linearly, except for one sudden drop. Can we use the histogram to find out why?
Let's say I have a hypothesis: parents like sending their kids to schools that have long names because it sounds pretentious in their social circles.
To motivate this (NOTE: motivate only, not prove in any formal way!), I will generate the exact same histogram, except that the bins will be sorted by school name length as opposed to number of pupils.
In [7]:
# Obtain the length of each school name
int_index = gov_csv["school_label"].str.len()
# Replace the index of the table with int_index and sort the rows by it
# (sort_index() is used because newer pandas versions no longer provide DataFrame.sort())
gov_csv = gov_csv.set_index(int_index).sort_index()
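With the name lengths computed, a quick numerical check can sit alongside the plot. The sketch below computes the Pearson correlation between name length and pupil count. The school names and pupil numbers are invented for illustration, and a correlation would only motivate the hypothesis, not prove it.

```python
import pandas as pd

# Hypothetical rows standing in for schools.csv (names and counts are invented).
schools = pd.DataFrame({
    "school_label": ["Alpha", "Beta High", "Gamma Primary School Annexe", "Delta"],
    "pupils": [120, 300, 800, 150],
})

# Length of each school name, as in the cell above.
name_length = schools["school_label"].str.len()

# Pearson correlation between name length and number of pupils.
# Values near +1 or -1 indicate a strong linear relationship, values near 0 a weak one.
corr = name_length.corr(schools["pupils"])
print(f"correlation between name length and pupils: {corr:.2f}")
```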
In [8]:
# Create a histogram to visualize the sorted tabular data:
# (bokeh.charts is only available in older Bokeh releases)
from bokeh.charts import Bar
# Group schools whose names have the same length and sum up their pupil counts
pupils_histogram = gov_csv.groupby(gov_csv.index)["pupils"].sum()
number_of_schools = len(pupils_histogram)
# Plot a Bokeh Bar chart since it allows us to label the x axis with relevant values
x = list(map(str, sorted(set(int_index))))
b = Bar(pupils_histogram, title="Sorted by School Name Length", cat=x,
        xlabel='school name lengths', ylabel='number of pupils');
In [9]:
show(b)
It's not a very strong motivation; maybe a cumulative version of this would help in drawing a stronger conclusion. This could be partially set up and left as an exercise?
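One possible starting point for the cumulative version is sketched below. It assumes a Series of total pupils indexed by school name length, like pupils_histogram above; the numbers here are made up for illustration.

```python
import pandas as pd

# Hypothetical totals of pupils per school name length (index = name length).
pupils_by_length = pd.Series([500, 1200, 300, 2000],
                             index=[10, 14, 18, 25], name="pupils")

# Running total of pupils as the name length increases.
cumulative = pupils_by_length.sort_index().cumsum()

# Share of all pupils attending schools whose name is at most this long.
cumulative_share = cumulative / cumulative.iloc[-1]

print(cumulative_share)
```

Plotting cumulative_share against name length, and deciding how to read the resulting curve, is left as the exercise.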