UK schools cluster analysis

This notebook explores potential correlations between the features of our UK schools dataset, then performs an agglomerative clustering and saves the labelling results to disk for further visualisation.


In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering

from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline

Load schools features

Load the UK schools dataset cleaned in cleaning.ipynb and inspect its columns.


In [62]:
schools = pd.read_csv('/project/uk-schools-clustering/data/derived/2016-2017_england.csv')

In [76]:
schools.head(5)
schools.columns.tolist()


Out[76]:
['urn',
 'name',
 'english first language',
 'girls on roll',
 'english not first language',
 'total income pp',
 'total pupils on roll',
 'on free meal',
 'idaci score',
 'teacher headcount',
 'boys on roll',
 'mean salary fte',
 'total expenditure pp',
 'income score',
 'empl score',
 'perc pupils meeting reading standard',
 'perc pupils meeting math standard',
 'perc pupils meeting grammar standard',
 'perc pupils meeting writing standard',
 'avg reading scaled score',
 'avg grammar scaled score',
 'avg math scaled score']

Describe dataset


In [64]:
schools.describe()


Out[64]:
urn english first language girls on roll english not first language total income pp total pupils on roll on free meal idaci score teacher headcount boys on roll ... total expenditure pp income score empl score perc pupils meeting reading standard perc pupils meeting math standard perc pupils meeting grammar standard perc pupils meeting writing standard avg reading scaled score avg grammar scaled score avg math scaled score
count 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 ... 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000 7940.000000
mean 115127.058942 187.597733 143.216247 50.330227 4.956919 291.907179 19.874181 0.177044 15.919773 148.690932 ... 4.974790 0.141924 0.117558 0.740366 0.763241 0.786617 0.775000 104.479219 106.028086 104.255038
std 9517.978234 103.893017 80.647899 90.464302 0.994317 163.665806 20.551354 0.133490 8.621675 84.098239 ... 1.026957 0.102965 0.076813 0.139206 0.144220 0.127736 0.120463 3.113659 2.936253 2.942249
min 100000.000000 0.000000 0.000000 0.000000 3.305000 26.000000 0.000000 0.005000 4.000000 0.000000 ... 1.097000 0.005000 0.004000 0.170000 0.120000 0.170000 0.000000 90.000000 93.000000 91.000000
25% 107086.750000 111.000000 91.000000 3.000000 4.329000 190.000000 5.000000 0.071000 10.000000 94.000000 ... 4.336000 0.063000 0.061000 0.650000 0.680000 0.710000 0.710000 102.000000 104.000000 102.000000
50% 114555.500000 168.000000 121.000000 13.000000 4.774000 244.000000 13.000000 0.134000 14.000000 125.000000 ... 4.779500 0.108000 0.094000 0.750000 0.780000 0.800000 0.790000 104.000000 106.000000 104.000000
75% 121544.750000 249.000000 196.000000 55.000000 5.344000 405.000000 28.000000 0.259250 21.000000 204.000000 ... 5.380000 0.195250 0.154250 0.840000 0.870000 0.880000 0.860000 107.000000 108.000000 106.000000
max 142523.000000 1566.000000 966.000000 1488.000000 35.425000 1933.000000 345.000000 0.783000 147.000000 967.000000 ... 36.093000 0.636000 0.557000 1.000000 1.000000 1.000000 1.000000 115.000000 117.000000 115.000000

8 rows × 21 columns

Select only numerical features


In [65]:
X = np.array(schools[schools.columns[-19:]]).astype(float)
header = schools.columns[-19:]
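Slicing the last 19 columns positionally is fragile if the CSV's column order changes. A more robust alternative (sketched on a hypothetical toy frame, not the actual schools data) is to select columns by dtype:

```python
import pandas as pd

# Toy frame standing in for the schools DataFrame (hypothetical values).
df = pd.DataFrame({'name': ['a', 'b'], 'x': [1.0, 2.0], 'y': [3, 4]})

# Keep only numeric columns regardless of their position.
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())  # ['x', 'y']
```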

Show correlation matrix


In [66]:
fig = plt.figure(figsize=(12,8))
correlationMatrix = np.corrcoef(X, rowvar=False)

plt.pcolor(correlationMatrix, cmap = 'hot', vmin=-1, vmax=1)
plt.colorbar()

plt.yticks(np.arange(0.5, 19), range(0,19))
plt.xticks(np.arange(0.5, 19), range(0,19))

plt.show()
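The heatmap gives a visual overview; to read off which feature pairs are most correlated, the upper triangle of the matrix can be ranked by absolute correlation. A minimal sketch on synthetic stand-in data (the names and values are hypothetical, not the schools features):

```python
import numpy as np

# Synthetic stand-in for the feature matrix X (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
# Make columns 0 and 1 strongly correlated on purpose.
X_demo[:, 1] = 0.9 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
names = ['a', 'b', 'c', 'd']

corr = np.corrcoef(X_demo, rowvar=False)
# Upper triangle (excluding the diagonal), sorted by |r| descending.
iu = np.triu_indices_from(corr, k=1)
pairs = sorted(zip(iu[0], iu[1], corr[iu]), key=lambda t: -abs(t[2]))
for i, j, r in pairs[:3]:
    print(f"{names[i]} ~ {names[j]}: r = {r:.2f}")
```

On the real data, replacing `X_demo` with `X` and `names` with `header.tolist()` would rank the actual school feature pairs.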


Preprocess data with standard scaling


In [67]:
scaler  = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
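StandardScaler centres each column to zero mean and rescales it to unit standard deviation, so no single feature dominates the cosine distances used in the clustering below. A quick sanity check on a toy matrix (hypothetical values, not the schools data):

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix standing in for X (hypothetical values).
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler_demo = preprocessing.StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

print(X_demo_scaled.mean(axis=0))  # ~0 per column
print(X_demo_scaled.std(axis=0))   # ~1 per column
```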

Show names of numerical features


In [68]:
header = schools.columns[-19:]
header


Out[68]:
Index(['girls on roll', 'english not first language', 'total income pp',
       'total pupils on roll', 'on free meal', 'idaci score',
       'teacher headcount', 'boys on roll', 'mean salary fte',
       'total expenditure pp', 'income score', 'empl score',
       'perc pupils meeting reading standard',
       'perc pupils meeting math standard',
       'perc pupils meeting grammar standard',
       'perc pupils meeting writing standard', 'avg reading scaled score',
       'avg grammar scaled score', 'avg math scaled score'],
      dtype='object')

Agglomerative clustering

Perform an agglomerative clustering and visualise a 3D scatter plot of three features:

  • IDACI score
  • Number of pupils on free school meals
  • Number of pupils whose first language is not English
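The choice of two clusters below is fixed by hand; one way to sanity-check it is to compare silhouette scores across a few candidate cluster counts. A sketch on synthetic stand-in data (the blob data and the Euclidean/average-linkage settings here are illustrative assumptions, not the notebook's exact configuration):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled school features (hypothetical data).
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Score each candidate cluster count; higher silhouette is better.
scores = {}
for k in range(2, 6):
    labels_k = AgglomerativeClustering(n_clusters=k, linkage='average').fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```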

In [69]:
features = ['idaci score', 'on free meal', 'english not first language']

In [70]:
estimator = AgglomerativeClustering(n_clusters=2, linkage='average', affinity='cosine')

x_index = header.tolist().index(features[0])
y_index = header.tolist().index(features[1])
z_index = header.tolist().index(features[2])

fig = plt.figure(1, figsize=(8, 7))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=28, azim=134)

plt.cla()
estimator.fit(X_scaled)
labels = estimator.labels_

ax.scatter(X_scaled[:, x_index], X_scaled[:, y_index], X_scaled[:, z_index], c=labels.astype(float))

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
ax.set_zlabel(features[2])


Out[70]:
Text(0.5,0,'english not first language')

Save features with cluster labels

Save the clustering results to disk for later visualisation in a Plotly Dash app.


In [71]:
X_with_labels = np.insert(X, 19, labels, axis=1)
column_names = header.tolist()
column_names.append('cluster')
clustered_schools = pd.DataFrame(X_with_labels, columns=column_names)
clustered_schools['cluster'] = clustered_schools.cluster.astype(int)
clustered_schools


Out[71]:
girls on roll english not first language total income pp total pupils on roll on free meal idaci score teacher headcount boys on roll mean salary fte total expenditure pp income score empl score perc pupils meeting reading standard perc pupils meeting math standard perc pupils meeting grammar standard perc pupils meeting writing standard avg reading scaled score avg grammar scaled score avg math scaled score cluster
0 136.0 105.0 8.176 276.0 16.0 0.051 20.0 140.0 46.053 8.319 0.014 0.011 0.88 0.92 0.92 0.92 107.0 109.0 107.0 0
1 226.0 236.0 6.444 446.0 58.0 0.128 27.0 220.0 45.328 6.293 0.088 0.058 0.64 0.75 0.76 0.75 101.0 105.0 104.0 0
2 177.0 213.0 7.191 390.0 55.0 0.309 26.0 213.0 39.253 6.644 0.180 0.118 0.78 0.93 0.87 0.82 106.0 112.0 110.0 0
3 207.0 70.0 5.630 437.0 36.0 0.284 28.0 230.0 42.305 5.700 0.225 0.175 0.82 0.78 0.69 0.67 106.0 106.0 105.0 0
4 160.0 229.0 7.526 359.0 62.0 0.507 22.0 199.0 46.462 7.911 0.367 0.247 0.85 0.91 0.88 0.79 106.0 108.0 107.0 0
5 136.0 92.0 6.581 228.0 29.0 0.192 17.0 92.0 44.398 6.856 0.121 0.083 0.81 0.81 0.78 0.70 107.0 105.0 107.0 0
6 189.0 301.0 8.492 439.0 70.0 0.281 26.0 250.0 46.255 8.408 0.168 0.100 0.68 0.77 0.77 0.77 103.0 107.0 104.0 0
7 197.0 220.0 5.539 438.0 25.0 0.090 24.0 241.0 46.384 5.649 0.120 0.072 0.80 0.78 0.85 0.80 107.0 109.0 106.0 1
8 205.0 187.0 7.180 448.0 55.0 0.412 26.0 243.0 41.158 7.090 0.191 0.112 0.93 0.93 0.92 0.86 111.0 112.0 109.0 0
9 238.0 170.0 6.360 468.0 58.0 0.193 23.0 230.0 46.060 6.276 0.113 0.070 0.75 0.85 0.93 0.88 106.0 110.0 106.0 0
10 119.0 53.0 6.603 236.0 15.0 0.240 16.0 117.0 44.770 6.399 0.129 0.083 0.83 0.87 0.90 0.77 111.0 109.0 107.0 1
11 99.0 82.0 6.310 187.0 3.0 0.015 11.0 88.0 38.976 6.103 0.017 0.016 1.00 0.96 0.96 1.00 111.0 113.0 110.0 1
12 113.0 108.0 6.793 223.0 28.0 0.308 14.0 110.0 40.513 6.434 0.187 0.119 0.82 0.86 0.93 0.86 107.0 110.0 106.0 0
13 120.0 76.0 6.578 251.0 9.0 0.171 14.0 131.0 40.841 6.870 0.073 0.054 0.86 0.93 0.86 0.71 109.0 110.0 109.0 0
14 114.0 64.0 6.182 208.0 16.0 0.392 14.0 94.0 40.415 6.116 0.266 0.167 0.80 0.90 0.90 0.87 107.0 109.0 106.0 0
15 135.0 112.0 7.240 219.0 46.0 0.423 12.0 84.0 46.540 7.023 0.325 0.232 0.52 0.86 0.90 0.79 101.0 107.0 106.0 0
16 195.0 176.0 8.420 398.0 67.0 0.310 14.0 203.0 43.263 8.271 0.248 0.181 0.51 0.71 0.64 0.89 100.0 103.0 103.0 0
17 125.0 78.0 6.725 278.0 51.0 0.251 17.0 153.0 41.645 6.572 0.170 0.107 0.85 0.91 0.88 0.91 106.0 107.0 106.0 0
18 110.0 108.0 6.753 217.0 23.0 0.193 16.0 107.0 40.731 6.782 0.146 0.087 0.90 0.97 0.97 0.77 107.0 110.0 107.0 0
19 103.0 75.0 6.432 220.0 24.0 0.235 14.0 117.0 40.330 6.379 0.181 0.136 0.59 0.56 0.52 0.78 101.0 100.0 99.0 0
20 110.0 118.0 7.186 228.0 29.0 0.344 14.0 118.0 43.232 7.142 0.171 0.149 0.56 0.81 0.78 0.78 100.0 104.0 104.0 0
21 118.0 110.0 7.047 235.0 38.0 0.310 12.0 117.0 41.657 7.170 0.248 0.181 0.66 0.76 0.79 0.62 103.0 107.0 104.0 0
22 101.0 112.0 6.966 176.0 45.0 0.347 8.0 75.0 47.626 7.983 0.198 0.139 0.69 0.96 0.96 0.77 104.0 110.0 107.0 0
23 103.0 82.0 6.031 208.0 15.0 0.058 13.0 105.0 45.769 6.052 0.053 0.041 0.70 0.73 0.70 0.63 106.0 105.0 105.0 1
24 108.0 98.0 6.961 210.0 36.0 0.344 16.0 102.0 48.308 7.036 0.171 0.149 0.87 0.97 0.93 0.80 107.0 112.0 109.0 0
25 113.0 101.0 7.595 222.0 50.0 0.347 11.0 109.0 44.691 7.639 0.198 0.139 0.63 0.70 0.80 0.73 101.0 106.0 104.0 0
26 118.0 114.0 6.821 237.0 30.0 0.193 13.0 119.0 44.118 7.026 0.146 0.087 0.87 0.83 0.83 0.87 107.0 108.0 107.0 0
27 115.0 47.0 8.263 225.0 31.0 0.445 20.0 110.0 42.168 8.042 0.204 0.113 0.65 0.74 0.71 0.74 102.0 105.0 103.0 0
28 129.0 119.0 7.027 252.0 21.0 0.243 15.0 123.0 38.813 7.078 0.192 0.142 0.68 0.77 0.74 0.77 101.0 106.0 103.0 0
29 230.0 229.0 5.707 426.0 46.0 0.113 21.0 196.0 41.147 5.825 0.094 0.083 0.76 0.83 0.81 0.80 105.0 107.0 105.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7910 84.0 3.0 4.424 166.0 2.0 0.065 14.0 82.0 34.906 4.365 0.071 0.073 0.74 0.70 0.70 0.70 103.0 104.0 102.0 0
7911 138.0 9.0 4.233 265.0 15.0 0.170 15.0 127.0 35.481 4.236 0.155 0.108 0.58 0.46 0.62 0.69 100.0 103.0 96.0 0
7912 45.0 0.0 4.755 99.0 2.0 0.027 5.0 54.0 37.143 4.579 0.028 0.035 0.71 0.64 0.79 0.57 107.0 108.0 103.0 0
7913 157.0 43.0 4.265 339.0 7.0 0.144 21.0 182.0 36.151 4.160 0.124 0.092 0.61 0.52 0.65 0.71 101.0 102.0 100.0 0
7914 65.0 2.0 4.822 137.0 4.0 0.055 10.0 72.0 37.164 4.696 0.055 0.054 0.86 0.71 0.76 0.86 105.0 105.0 103.0 1
7915 212.0 7.0 4.512 421.0 15.0 0.091 20.0 209.0 35.228 4.064 0.079 0.085 0.76 0.60 0.68 0.76 102.0 104.0 102.0 0
7916 43.0 0.0 5.677 95.0 9.0 0.178 9.0 52.0 36.205 5.757 0.099 0.074 0.68 0.74 0.68 0.74 102.0 102.0 102.0 0
7917 168.0 18.0 4.226 370.0 17.0 0.111 20.0 202.0 32.027 4.192 0.070 0.063 0.68 0.76 0.81 0.73 103.0 106.0 104.0 0
7918 107.0 9.0 4.932 197.0 8.0 0.147 11.0 90.0 33.811 4.908 0.112 0.079 0.57 0.46 0.46 0.64 100.0 100.0 98.0 0
7919 62.0 2.0 4.620 109.0 5.0 0.111 7.0 47.0 34.682 4.799 0.072 0.075 0.41 0.82 0.71 0.65 101.0 104.0 104.0 0
7920 167.0 21.0 3.764 368.0 12.0 0.131 16.0 201.0 32.368 3.489 0.124 0.134 0.72 0.58 0.75 0.67 104.0 104.0 102.0 1
7921 325.0 149.0 4.858 652.0 34.0 0.101 35.0 327.0 35.225 4.958 0.081 0.073 0.70 0.82 0.86 0.81 103.0 108.0 104.0 1
7922 202.0 108.0 4.978 436.0 29.0 0.148 24.0 234.0 35.949 4.988 0.106 0.099 0.56 0.60 0.60 0.75 100.0 102.0 100.0 0
7923 95.0 4.0 4.059 206.0 3.0 0.065 10.0 111.0 36.665 4.166 0.057 0.031 0.83 0.87 0.83 0.93 109.0 108.0 106.0 1
7924 187.0 32.0 4.128 412.0 17.0 0.068 20.0 225.0 37.132 4.243 0.054 0.053 0.75 0.80 0.77 0.79 105.0 106.0 104.0 1
7925 150.0 13.0 4.137 349.0 19.0 0.167 21.0 199.0 35.613 4.005 0.124 0.091 0.89 0.79 0.87 0.77 107.0 108.0 106.0 1
7926 93.0 16.0 4.214 159.0 2.0 0.139 10.0 66.0 35.276 4.157 0.076 0.056 0.83 0.94 0.91 0.80 103.0 108.0 105.0 1
7927 118.0 9.0 5.027 271.0 9.0 0.178 13.0 153.0 34.952 4.971 0.167 0.137 0.94 0.91 0.89 0.87 108.0 109.0 107.0 1
7928 202.0 104.0 4.881 418.0 24.0 0.255 23.0 216.0 37.451 4.702 0.196 0.131 0.65 0.76 0.80 0.74 103.0 107.0 104.0 0
7929 213.0 42.0 3.946 417.0 12.0 0.073 20.0 204.0 34.956 3.902 0.056 0.065 0.86 0.91 0.83 0.86 107.0 106.0 107.0 1
7930 280.0 52.0 3.889 553.0 13.0 0.105 27.0 273.0 36.616 3.895 0.066 0.057 0.76 0.83 0.86 0.78 105.0 109.0 104.0 1
7931 103.0 29.0 5.170 203.0 20.0 0.327 10.0 100.0 35.618 5.461 0.209 0.155 0.63 0.67 0.63 0.77 102.0 103.0 102.0 0
7932 219.0 53.0 5.061 417.0 16.0 0.119 16.0 198.0 34.104 5.002 0.062 0.077 0.81 0.75 0.69 0.75 103.0 104.0 102.0 0
7933 246.0 63.0 6.042 543.0 15.0 0.063 37.0 297.0 37.614 6.015 0.046 0.043 0.75 0.63 0.75 0.78 105.0 107.0 102.0 1
7934 236.0 68.0 4.732 447.0 30.0 0.114 20.0 211.0 37.947 4.864 0.071 0.069 0.71 0.77 0.76 0.76 104.0 105.0 104.0 1
7935 212.0 38.0 3.916 418.0 6.0 0.078 23.0 206.0 35.807 3.887 0.047 0.031 0.78 0.83 0.83 0.78 107.0 107.0 105.0 1
7936 199.0 70.0 5.665 400.0 57.0 0.406 23.0 201.0 35.992 5.775 0.304 0.205 0.79 0.85 0.79 0.76 104.0 105.0 104.0 0
7937 158.0 22.0 5.289 344.0 33.0 0.217 18.0 186.0 35.330 5.054 0.212 0.199 0.51 0.43 0.54 0.63 100.0 101.0 99.0 0
7938 118.0 27.0 4.309 260.0 8.0 0.076 14.0 142.0 34.818 4.348 0.048 0.036 0.53 0.53 0.70 0.67 102.0 102.0 100.0 0
7939 217.0 23.0 3.816 467.0 8.0 0.138 27.0 250.0 34.602 3.885 0.088 0.053 0.73 0.72 0.78 0.72 105.0 105.0 102.0 0

7940 rows × 20 columns


In [72]:
clustered_schools.insert(loc=0, column='name', value=schools['name'])
clustered_schools.insert(loc=0, column='urn', value=schools['urn'])

In [73]:
clustered_schools.to_csv('/project/uk-schools-clustering/data/derived/2016-2017_england_clusters.csv', index=False)
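Downstream, the Dash app can read this CSV back and summarise each cluster, e.g. by averaging a few deprivation features per cluster label. A minimal sketch on a toy frame mimicking the saved file's shape (column names from the dataset, values hypothetical):

```python
import pandas as pd

# Toy frame mimicking the saved CSV (hypothetical values).
df = pd.DataFrame({
    'urn': [1, 2, 3, 4],
    'idaci score': [0.05, 0.40, 0.10, 0.35],
    'on free meal': [5, 60, 10, 55],
    'cluster': [0, 1, 0, 1],
})

# Per-cluster means of selected features.
summary = df.groupby('cluster')[['idaci score', 'on free meal']].mean()
print(summary)
```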

In [ ]: