Clean UK schools data

This notebook cleans some UK schools data sets and joins them with other data sources as deprivation data.

Main datasets:


In [1]:
import pandas as pd
import numpy as np

Utility functions

Some utility function to cast data to proper data types.


In [2]:
def is_int(value):
        try:
            int(value)
            return True
        except ValueError:
            return False
        
def is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

to_float = lambda x: float(x if is_float(x) else np.nan)
to_int = lambda x: int(x) if is_int(x) else np.nan

Read and clean SPINE data

Load and clean SPINE data. For each school the following fields are loaded:

  • school URN
  • la
  • school name
  • locality
  • town
  • postcode

In [3]:
def clean_spine(file_path):
    
    def la_codes(file_path):
        la_codes = pd.read_csv(file_path, usecols=['LEA', 'LA Name'])
        la_codes.columns = ['la', 'la name']
        return la_codes
    
    la_codes = la_codes('/project/uk-schools-clustering/data/meta/la_and_region_codes_meta.csv')
    spine = pd.read_csv(
        file_path,
        usecols=['URN', 'LA', 'SCHNAME', 'LOCALITY', 'TOWN', 'POSTCODE'],
        dtype={
            'URN': 'object'
        }
    )
    spine['POSTCODE'] = spine['POSTCODE'].str.replace(' ', '')
    spine.columns = ['urn', 'la', 'name', 'locality', 'town', 'postcode']
    return pd.merge(spine, la_codes, on=['la']).drop_duplicates('urn')

In [4]:
clean_spine('/project/uk-schools-clustering/data/raw/2016-2017_england_spine.csv').sample(5)


Out[4]:
urn la name locality town postcode la name
7646 106428 359 Lowton West Primary School Lowton Warrington WA32ED Wigan
8721 107873 383 Ninelands Primary School Garforth Leeds LS251NT Leeds
9093 139501 384 Carleton Community High School A Specialist Sc... Carleton Pontefract WF83NW Wakefield
6018 104481 340 St Margaret Mary's Catholic Infant School NaN Liverpool L140JG Knowsley
23687 123366 893 Gobowen Primary School Gobowen Oswestry SY113LD Shropshire

Read and clean census data

Load and clean some schools census data, including:

  • total number of pupils on roll
  • number of girls on roll
  • number of boys on roll
  • number of pupils which english is not first language
  • number of pupils which english is first language
  • number of pupils on free meals

In [5]:
def clean_census(file_path):
    census = pd.read_csv(
        file_path, usecols=['URN', 'NOR', 'NORG', 'NORB', 'NUMEAL', 'NUMENGFL', 'PNUMFSM'],
        converters={
            'NOR': to_int,
            'NORG': to_int,
            'NORB': to_int,
            'NUMEAL': to_int,
            'NUMENGFL': to_int,
            'PNUMFSM': to_float
        }
    )
    census['on free meal'] = (census['NORG']*census['PNUMFSM']) / 100
    census['on free meal'] = round(census['on free meal'])
    census.drop(inplace=True, columns=['PNUMFSM'])
    census.columns = ['urn', 'total pupils on roll', 
                      'girls on roll', 'boys on roll', 
                      'english not first language', 'english first language', 'on free meal']
    return census.drop_duplicates('urn')

In [6]:
clean_census('/project/uk-schools-clustering/data/raw/2016-2017_england_census.csv')


Out[6]:
urn total pupils on roll girls on roll boys on roll english not first language english first language on free meal
0 100000 276.0 136.0 140.0 105.0 72.0 16.0
1 100001 727.0 727.0 0.0 0.0 0.0 0.0
2 100002 248.0 103.0 145.0 0.0 0.0 0.0
3 100003 928.0 0.0 928.0 0.0 0.0 0.0
4 100008 396.0 198.0 198.0 275.0 22.0 56.0
5 100009 446.0 226.0 220.0 236.0 117.0 58.0
6 100010 390.0 177.0 213.0 213.0 117.0 55.0
7 100011 437.0 207.0 230.0 70.0 274.0 36.0
8 100012 359.0 160.0 199.0 229.0 67.0 62.0
9 100013 217.0 112.0 105.0 151.0 13.0 39.0
10 100014 228.0 136.0 92.0 92.0 82.0 29.0
11 100015 146.0 67.0 79.0 56.0 34.0 13.0
12 100018 439.0 189.0 250.0 301.0 41.0 70.0
13 100019 438.0 197.0 241.0 220.0 124.0 25.0
14 100020 455.0 201.0 254.0 233.0 123.0 45.0
15 100021 431.0 204.0 227.0 300.0 42.0 81.0
16 100022 446.0 222.0 224.0 323.0 16.0 67.0
17 100023 448.0 205.0 243.0 187.0 168.0 55.0
18 100025 468.0 238.0 230.0 170.0 187.0 58.0
19 100026 232.0 124.0 108.0 106.0 71.0 9.0
20 100027 236.0 119.0 117.0 53.0 127.0 15.0
21 100028 187.0 99.0 88.0 82.0 78.0 3.0
22 100029 223.0 113.0 110.0 108.0 64.0 28.0
23 100030 251.0 120.0 131.0 76.0 117.0 9.0
24 100031 233.0 115.0 118.0 91.0 113.0 11.0
25 100032 180.0 109.0 71.0 107.0 45.0 32.0
26 100033 208.0 114.0 94.0 64.0 115.0 16.0
27 100034 235.0 101.0 134.0 73.0 108.0 23.0
28 100035 334.0 167.0 167.0 117.0 134.0 33.0
29 100036 219.0 135.0 84.0 112.0 54.0 46.0
... ... ... ... ... ... ... ...
23479 143881 473.0 244.0 229.0 268.0 82.0 82.0
23480 143882 485.0 227.0 258.0 295.0 58.0 41.0
23481 143883 416.0 217.0 199.0 135.0 172.0 49.0
23482 143886 215.0 119.0 96.0 2.0 213.0 24.0
23483 143888 117.0 66.0 51.0 5.0 81.0 2.0
23484 143889 103.0 53.0 50.0 1.0 87.0 2.0
23485 143890 781.0 429.0 352.0 45.0 736.0 39.0
23486 143891 194.0 94.0 100.0 16.0 123.0 25.0
23487 143892 210.0 107.0 103.0 17.0 144.0 27.0
23488 143896 171.0 87.0 84.0 1.0 116.0 5.0
23489 143898 383.0 177.0 206.0 5.0 282.0 28.0
23490 143899 1323.0 661.0 662.0 10.0 1313.0 49.0
23491 143900 168.0 69.0 99.0 0.0 148.0 11.0
23492 143901 1421.0 687.0 734.0 70.0 1351.0 25.0
23493 143902 766.0 329.0 437.0 57.0 709.0 30.0
23494 143903 814.0 382.0 432.0 99.0 711.0 53.0
23495 143904 406.0 215.0 191.0 257.0 41.0 52.0
23496 143909 259.0 127.0 132.0 11.0 151.0 39.0
23497 143911 13.0 0.0 13.0 0.0 0.0 0.0
23498 143927 310.0 56.0 254.0 138.0 172.0 5.0
23499 143933 6.0 0.0 6.0 0.0 0.0 0.0
23500 143934 1148.0 526.0 622.0 49.0 1099.0 67.0
23501 143938 1247.0 615.0 632.0 197.0 1050.0 114.0
23502 143939 456.0 250.0 206.0 18.0 330.0 85.0
23503 143940 233.0 121.0 112.0 29.0 130.0 23.0
23504 143945 400.0 203.0 197.0 32.0 254.0 65.0
23505 143946 806.0 354.0 452.0 105.0 701.0 48.0
23506 143947 70.0 33.0 37.0 0.0 0.0 0.0
23507 144033 5.0 2.0 3.0 0.0 0.0 0.0
23508 NAT 4998768.0 2432086.0 2566682.0 835726.0 3184411.0 367245.0

23509 rows × 7 columns

Read and clean workforce data

Load and clean school workforce data, such as:

  • Total number of teachers (headcount)
  • Mean gross fulltime teacher salary

In [7]:
def clean_workforce(file_path):
    clean_salary = lambda x : x.replace('£', '').replace(',','.')
    workforce = pd.read_csv(
        file_path,
        usecols=['URN', 'Total Number of Teachers (Headcount)', 'Mean Gross FTE Salary of All Teachers'],
        dtype={'URN': object},
        converters={
            'Total Number of Teachers (Headcount)': to_int,
            'Mean Gross FTE Salary of All Teachers': lambda x: to_float(clean_salary(x))
        }
    )
    workforce.columns = ['urn', 'teacher headcount', 'mean salary fte']
    return workforce

In [8]:
clean_workforce('/project/uk-schools-clustering/data/raw/2016-2017_england_swf.csv')


Out[8]:
urn teacher headcount mean salary fte
0 100000 20.0 46.053
1 136807 6.0 45.589
2 139837 6.0 49.225
3 140686 0.0 NaN
4 100008 22.0 44.640
5 100009 27.0 45.328
6 100010 26.0 39.253
7 100011 28.0 42.305
8 100012 22.0 46.462
9 100013 13.0 43.912
10 100014 17.0 44.398
11 100015 8.0 46.046
12 100018 26.0 46.255
13 100019 24.0 46.384
14 100020 26.0 45.094
15 100021 28.0 44.441
16 100022 29.0 46.367
17 100023 26.0 41.158
18 100025 23.0 46.060
19 100026 12.0 45.149
20 100027 16.0 44.770
21 130342 13.0 44.118
22 132245 34.0 43.382
23 100028 11.0 38.976
24 100029 14.0 40.513
25 100030 14.0 40.841
26 100031 11.0 37.007
27 100032 14.0 38.779
28 100033 14.0 40.415
29 100034 17.0 45.310
... ... ... ...
20169 126068 107.0 37.553
20170 126069 51.0 38.646
20171 137263 135.0 37.649
20172 126071 71.0 41.690
20173 138620 73.0 41.441
20174 126080 83.0 41.156
20175 126081 107.0 39.264
20176 140713 101.0 38.020
20177 126085 74.0 40.778
20178 137416 93.0 40.144
20179 126087 67.0 40.738
20180 126088 113.0 40.572
20181 126089 102.0 40.496
20182 139109 56.0 41.552
20183 126092 138.0 39.045
20184 126093 76.0 40.281
20185 126094 55.0 39.291
20186 126095 69.0 39.747
20187 126096 44.0 41.362
20188 140472 91.0 39.779
20189 126098 92.0 40.022
20190 126101 82.0 36.836
20191 126102 66.0 39.184
20192 138492 21.0 36.747
20193 135744 61.0 38.246
20194 135745 89.0 38.596
20195 135760 58.0 38.414
20196 135962 110.0 38.180
20197 NaN NaN 37.119
20198 NaN NaN 39.483

20199 rows × 3 columns

Read and clean spending data

Load school spending data:

  • Total school income per pupil
  • Total school expenditure per pupil

In [9]:
def clean_spending(file_path):
    clean_value = lambda x : x.replace(',','.')
    to_float = lambda x: float(clean_value(x) if is_float(clean_value(x)) else np.nan)
    spending = pd.read_csv(
        file_path,
        usecols=['URN', 'TOTALINCOME', 'TOTALEXPENDITURE'],
        dtype={
            'URN': 'object'
        },
        converters={
            'TOTALINCOME': lambda x : to_float(x),
            'TOTALEXPENDITURE': lambda x : to_float(x)
        }
    )
    spending.columns = ['urn', 'total income pp', 'total expenditure pp']
    return spending

In [10]:
clean_spending('/project/uk-schools-clustering/data/raw/2016-2017_england_cfr.csv')


Out[10]:
urn total income pp total expenditure pp
0 100000 8.176 8.319
1 100005 12.646 12.280
2 100006 40.968 40.968
3 100007 84.399 81.137
4 100008 6.869 6.660
5 100009 6.444 6.293
6 100010 7.191 6.644
7 100011 5.630 5.700
8 100012 7.526 7.911
9 100013 7.162 7.108
10 100014 6.581 6.856
11 100015 9.759 9.682
12 100018 8.492 8.408
13 100019 5.539 5.649
14 100020 6.790 6.579
15 100021 7.313 7.253
16 100022 6.562 6.781
17 100023 7.180 7.090
18 100025 6.360 6.276
19 100026 6.564 6.487
20 100027 6.603 6.399
21 100028 6.310 6.103
22 100029 6.793 6.434
23 100030 6.578 6.870
24 100031 5.853 5.904
25 100032 7.238 6.923
26 100033 6.182 6.116
27 100034 8.618 8.593
28 100044 6.966 7.983
29 100045 7.043 7.059
... ... ... ...
15243 126095 5.202 5.198
15244 126096 5.355 5.476
15245 126098 5.209 5.304
15246 126101 5.850 5.831
15247 126102 5.231 5.296
15248 126155 15.879 15.638
15249 126156 25.110 26.799
15250 126157 18.677 18.839
15251 126159 20.438 19.615
15252 126160 25.698 26.710
15253 126161 19.551 19.838
15254 126162 18.662 18.383
15255 126163 22.392 22.902
15256 126169 19.998 19.498
15257 126170 23.280 22.443
15258 133971 4.378 4.425
15259 133967 4.589 4.431
15260 131130 4.144 4.098
15261 131602 4.855 4.834
15262 131603 3.897 4.016
15263 131518 4.032 3.974
15264 133321 3.967 4.072
15265 134042 5.448 5.438
15266 131001 3.939 3.920
15267 134776 3.983 4.031
15268 134896 4.443 4.514
15269 134791 4.421 4.369
15270 135607 4.037 4.021
15271 135813 3.824 3.812
15272 136114 19.976 19.021

15273 rows × 3 columns

Read and clean deprivation data

Load UK deprivation data:

  • Income score
  • Employment score
  • IDACI score

In [11]:
def clean_deprivation(file_path):
    deprivation = pd.read_csv(
        file_path,
        usecols=['Postcode', 'Income Score', 'Employment Score', 'IDACI Score'],
        converters={
            'Postcode' : lambda s : s.replace(' ', ''),
            'Income Score': lambda x : to_float(x),
            'Employment Score': lambda x : to_float(x),
            'IDACI Score': lambda x : to_float(x)
        }
    )
    deprivation.columns = ['postcode', 'income score', 'empl score', 'idaci score']
    return deprivation

In [12]:
clean_deprivation('/project/uk-schools-clustering/data/raw/deprivation-by-postcode-2015.csv')


Out[12]:
postcode income score empl score idaci score
0 EC3A5DE 0.014 0.011 0.051
1 EC2Y8BB 0.012 0.017 0.006
2 EC4M9AD 0.014 0.011 0.051
3 EC4V3AL 0.014 0.011 0.051
4 EC1M6EA 0.029 0.031 0.018
5 WC1H9EG 0.267 0.216 0.311
6 NW61QL 0.088 0.058 0.128
7 NW19AL 0.180 0.118 0.309
8 N195DH 0.225 0.175 0.284
9 NW54AX 0.367 0.247 0.507
10 NW11DN 0.345 0.204 0.523
11 NW32QT 0.121 0.083 0.192
12 NW18RP 0.266 0.167 0.392
13 NW13EX 0.168 0.100 0.281
14 NW31HU 0.120 0.072 0.090
15 NW18JL 0.065 0.052 0.085
16 NW53HB 0.222 0.149 0.372
17 NW10LL 0.286 0.193 0.424
18 NW52SJ 0.191 0.112 0.412
19 NW32JB 0.113 0.070 0.193
20 NW36NP 0.028 0.030 0.019
21 NW52JA 0.129 0.083 0.240
22 NW31JH 0.017 0.016 0.015
23 NW14BD 0.187 0.119 0.308
24 NW61TF 0.073 0.054 0.171
25 NW36TX 0.044 0.038 0.039
26 NW35SQ 0.067 0.031 0.087
27 NW18DE 0.266 0.167 0.392
28 NW52TU 0.129 0.083 0.240
29 NW32AE 0.133 0.080 0.170
... ... ... ... ...
22907 SN56HN 0.037 0.043 0.076
22908 SN27SH 0.037 0.044 0.040
22909 SN67AP 0.084 0.088 0.130
22910 SN31ER 0.078 0.089 0.037
22911 SN267AP 0.075 0.061 0.085
22912 SN35DA 0.170 0.122 0.272
22913 SN25DE 0.201 0.143 0.345
22914 SN56JU 0.048 0.054 0.062
22915 SN253BN 0.146 0.129 0.189
22916 SN33RD 0.063 0.065 0.081
22917 SN33EZ 0.354 0.265 0.442
22918 SN13AH 0.091 0.067 0.092
22919 SN31RA 0.089 0.065 0.165
22920 SN13LB 0.142 0.111 0.223
22921 SN27LL 0.099 0.092 0.128
22922 SN32PT 0.307 0.248 0.377
22923 SN58DR 0.273 0.204 0.349
22924 SN22JG 0.220 0.182 0.281
22925 SN58NU 0.243 0.179 0.288
22926 SN55DP 0.021 0.034 0.031
22927 SN55PU 0.066 0.075 0.065
22928 SN57BT 0.106 0.079 0.166
22929 SN252QS 0.075 0.061 0.085
22930 SN22NQ 0.220 0.182 0.281
22931 SN15ET 0.196 0.131 0.255
22932 SN251HT 0.131 0.138 0.141
22933 SN27NG 0.295 0.233 0.377
22934 SN21NX 0.083 0.076 0.079
22935 SN28BZ 0.152 0.126 0.141
22936 TR210NA 0.028 0.027 0.013

22937 rows × 4 columns

Read and clean key stage 2 final data

Load final key stage 2 data. Including:

  • math, writing, reading and grammar average scaled scores
  • percentage of pupils meeting math, writing, reading and grammar standards

In [13]:
def clean_k2final(file_path):
    def clean_percent(percent_str):
        percent_candidate = percent_str.replace('%', '')
        return to_float(percent_candidate) / 100

    k2final = pd.read_csv(
        file_path,
        usecols=['URN', 'PTREAD_EXP', 
                 'PTMAT_EXP', 'PTGPS_EXP', 'PTWRITTA_EXP', 
                 'READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE'
        ],
        converters={
            'PTREAD_EXP' : clean_percent,
            'PTMAT_EXP' : clean_percent,
            'PTGPS_EXP' : clean_percent,
            'PTWRITTA_EXP' : clean_percent,
            'READ_AVERAGE' : to_int,
            'GPS_AVERAGE' : to_int,
            'MAT_AVERAGE' : to_int
        }
    )
    k2final.rename(columns={
        'URN':'urn', 
        'PTREAD_EXP': 'perc pupils meeting reading standard',
        'PTMAT_EXP': 'perc pupils meeting math standard',
        'PTGPS_EXP': 'perc pupils meeting grammar standard',
        'PTWRITTA_EXP': 'perc pupils meeting writing standard',
        'READ_AVERAGE': 'avg reading scaled score',
        'GPS_AVERAGE': 'avg grammar scaled score',
        'MAT_AVERAGE': 'avg math scaled score'
    }, inplace=True)
    return k2final

In [14]:
clean_k2final('/project/uk-schools-clustering/data/raw/2016-2017_england_ks2final.csv')


/opt/anaconda/envs/Python3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2856: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
Out[14]:
urn perc pupils meeting reading standard avg reading scaled score perc pupils meeting grammar standard avg grammar scaled score perc pupils meeting math standard avg math scaled score perc pupils meeting writing standard
0 110705 0.54 99.0 0.63 101.0 0.67 100.0 0.75
1 117736 0.67 104.0 0.78 106.0 0.67 101.0 0.56
2 122638 0.86 106.0 0.90 110.0 0.71 104.0 0.86
3 141550 0.88 111.0 0.75 105.0 0.88 103.0 0.88
4 142999 0.73 106.0 0.73 106.0 0.78 106.0 0.80
5 106190 0.63 102.0 0.87 107.0 0.67 103.0 0.87
6 120819 0.75 104.0 0.56 99.0 0.75 102.0 0.81
7 120602 0.73 102.0 0.55 100.0 0.73 102.0 0.55
8 135197 0.87 105.0 0.87 105.0 0.91 105.0 0.91
9 105667 0.75 106.0 0.85 108.0 0.85 105.0 0.85
10 118160 0.77 105.0 0.79 105.0 0.74 103.0 0.79
11 136365 0.94 107.0 0.91 105.0 0.97 106.0 0.94
12 109024 0.86 107.0 0.96 110.0 0.96 109.0 0.93
13 134306 0.57 100.0 0.71 104.0 0.56 101.0 0.59
14 124313 0.50 100.0 0.47 101.0 0.47 99.0 0.48
15 125636 0.91 108.0 0.91 109.0 0.85 105.0 0.88
16 133255 0.95 110.0 0.95 114.0 0.95 109.0 0.88
17 102381 0.63 103.0 0.78 105.0 0.80 105.0 0.73
18 139905 0.74 104.0 0.88 107.0 0.90 105.0 0.88
19 143198 0.59 101.0 0.73 105.0 0.76 103.0 0.76
20 144648 NaN NaN NaN NaN NaN NaN NaN
21 120816 0.88 105.0 0.83 106.0 0.88 106.0 0.71
22 130968 0.90 107.0 0.86 107.0 0.90 106.0 0.86
23 139116 0.78 104.0 0.70 105.0 0.83 103.0 0.83
24 140135 0.64 102.0 0.72 104.0 0.72 102.0 0.72
25 109788 0.83 107.0 0.85 108.0 0.90 106.0 0.83
26 106441 0.71 102.0 0.77 106.0 0.87 103.0 0.81
27 118361 0.89 104.0 0.67 106.0 0.78 102.0 0.89
28 101286 0.69 104.0 0.70 104.0 0.66 104.0 0.61
29 137731 0.94 108.0 0.98 112.0 0.94 110.0 0.92
... ... ... ... ... ... ... ... ...
16437 121614 0.67 101.0 0.67 105.0 0.50 103.0 0.67
16438 141613 0.94 108.0 0.94 111.0 0.78 105.0 0.78
16439 100326 1.00 109.0 1.00 113.0 1.00 108.0 1.00
16440 107639 0.58 102.0 0.74 105.0 0.63 102.0 0.58
16441 104432 0.71 102.0 0.79 105.0 0.76 103.0 0.71
16442 122109 0.80 106.0 0.81 106.0 0.69 103.0 0.83
16443 114760 0.73 103.0 0.73 103.0 0.62 99.0 0.81
16444 101790 0.69 103.0 0.74 105.0 0.77 104.0 0.83
16445 113335 0.89 108.0 0.84 109.0 0.89 107.0 0.84
16446 108637 0.70 103.0 0.71 104.0 0.67 103.0 0.74
16447 130340 0.91 110.0 0.91 110.0 0.91 108.0 0.91
16448 134659 0.42 98.0 0.55 101.0 0.45 97.0 0.58
16449 110657 0.87 109.0 0.81 108.0 0.85 106.0 0.79
16450 119336 0.85 105.0 0.82 106.0 0.79 106.0 0.97
16451 123407 0.95 108.0 0.76 106.0 0.81 104.0 0.71
16452 143182 NaN NaN NaN NaN NaN NaN NaN
16453 140363 0.80 105.0 0.80 107.0 0.82 106.0 0.84
16454 105919 0.43 98.0 0.38 99.0 0.38 97.0 0.43
16455 112509 0.89 110.0 0.86 112.0 0.89 112.0 0.86
16456 126407 0.77 105.0 0.74 106.0 0.71 104.0 0.84
16457 109949 0.71 105.0 1.00 107.0 0.86 104.0 0.57
16458 140843 0.36 99.0 0.68 104.0 0.55 102.0 0.55
16459 102812 0.79 105.0 0.85 109.0 0.79 106.0 0.69
16460 118200 0.52 100.0 0.60 102.0 0.63 101.0 0.71
16461 108164 0.55 101.0 0.66 104.0 0.69 103.0 0.76
16462 124979 0.80 107.0 0.78 106.0 0.74 103.0 0.86
16463 117972 0.90 107.0 0.90 105.0 0.40 100.0 0.90
16464 106602 0.90 109.0 1.00 111.0 1.00 108.0 0.86
16465 123354 0.76 104.0 0.76 104.0 0.90 104.0 0.67
16466 112232 0.81 107.0 0.84 108.0 0.84 106.0 0.84

16467 rows × 8 columns

Merge datasets

This function, loads all different datasets, merges them together and eventually saves them on disk.


In [86]:
def get_data(save_to = None, columns = None):
    spine = clean_spine('/project/uk-schools-clustering/data/raw/2016-2017_england_spine.csv')
    census = clean_census('/project/uk-schools-clustering/data/raw/2016-2017_england_census.csv')
    workforce = clean_workforce('/project/uk-schools-clustering/data/raw/2016-2017_england_swf.csv')
    spending = clean_spending('/project/uk-schools-clustering/data/raw/2016-2017_england_cfr.csv')
    deprivation = clean_deprivation('/project/uk-schools-clustering/data/raw/deprivation-by-postcode-2015.csv')
    k2final = clean_k2final('/project/uk-schools-clustering/data/raw/2016-2017_england_ks2final.csv')
    result = pd.merge(spine, census, on=['urn'])
    result = pd.merge(result, deprivation, on=['postcode'])
    result = pd.merge(result, workforce, on=['urn'])
    result = pd.merge(result, spending, on=['urn'])
    result = pd.merge(result, k2final, on=['urn'])
    result.dropna(axis=0, subset=[
        'total income pp',
        'idaci score',
        'mean salary fte',
        'perc pupils meeting reading standard',
        'perc pupils meeting grammar standard',
        'perc pupils meeting math standard',
        'avg reading scaled score',
        'avg grammar scaled score',
        'avg math scaled score'
    ], how='any', inplace=True)
#     result.dropna(axis=0, how='all', inplace=True)
    if columns is None:
        columns_to_select = result.columns
    else:
        columns_to_select = columns
    if save_to is not None:
        result[columns_to_select].to_csv(save_to, index=False)
    return result[columns_to_select]

In [87]:
get_data(
    '/project/uk-schools-clustering/data/derived/2016-2017_england.csv',
    columns=['urn', 'name', 'english first language', 'girls on roll',
             'english not first language','total income pp', 'total pupils on roll', 'on free meal',
             'idaci score', 'teacher headcount','boys on roll', 'mean salary fte', 'total expenditure pp',
             'income score', 'empl score', 'perc pupils meeting reading standard',
             'perc pupils meeting math standard', 'perc pupils meeting grammar standard', 'perc pupils meeting writing standard',
             'avg reading scaled score','avg grammar scaled score','avg math scaled score']
)


/opt/anaconda/envs/Python3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2910: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[87]:
urn name english first language girls on roll english not first language total income pp total pupils on roll on free meal idaci score teacher headcount ... total expenditure pp income score empl score perc pupils meeting reading standard perc pupils meeting math standard perc pupils meeting grammar standard perc pupils meeting writing standard avg reading scaled score avg grammar scaled score avg math scaled score
0 100000 Sir John Cass's Foundation Primary School 72.0 136.0 105.0 8.176 276.0 16.0 0.051 20.0 ... 8.319 0.014 0.011 0.88 0.92 0.92 0.92 107.0 109.0 107.0
1 100009 Beckford Primary School 117.0 226.0 236.0 6.444 446.0 58.0 0.128 27.0 ... 6.293 0.088 0.058 0.64 0.75 0.76 0.75 101.0 105.0 104.0
2 100010 Brecknock Primary School 117.0 177.0 213.0 7.191 390.0 55.0 0.309 26.0 ... 6.644 0.180 0.118 0.78 0.93 0.87 0.82 106.0 112.0 110.0
3 100011 Brookfield Primary School 274.0 207.0 70.0 5.630 437.0 36.0 0.284 28.0 ... 5.700 0.225 0.175 0.82 0.78 0.69 0.67 106.0 106.0 105.0
4 100012 Carlton Primary School 67.0 160.0 229.0 7.526 359.0 62.0 0.507 22.0 ... 7.911 0.367 0.247 0.85 0.91 0.88 0.79 106.0 108.0 107.0
5 100014 Fleet Primary School 82.0 136.0 92.0 6.581 228.0 29.0 0.192 17.0 ... 6.856 0.121 0.083 0.81 0.81 0.78 0.70 107.0 105.0 107.0
6 100018 Netley Primary School & Centre for Autism 41.0 189.0 301.0 8.492 439.0 70.0 0.281 26.0 ... 8.408 0.168 0.100 0.68 0.77 0.77 0.77 103.0 107.0 104.0
7 100019 New End Primary School 124.0 197.0 220.0 5.539 438.0 25.0 0.090 24.0 ... 5.649 0.120 0.072 0.80 0.78 0.85 0.80 107.0 109.0 106.0
8 100023 Torriano Primary School 168.0 205.0 187.0 7.180 448.0 55.0 0.412 26.0 ... 7.090 0.191 0.112 0.93 0.93 0.92 0.86 111.0 112.0 109.0
9 100025 Gospel Oak Primary School 187.0 238.0 170.0 6.360 468.0 58.0 0.193 23.0 ... 6.276 0.113 0.070 0.75 0.85 0.93 0.88 106.0 110.0 106.0
10 100027 Eleanor Palmer Primary School 127.0 119.0 53.0 6.603 236.0 15.0 0.240 16.0 ... 6.399 0.129 0.083 0.83 0.87 0.90 0.77 111.0 109.0 107.0
11 100028 Christ Church Primary School, Hampstead 78.0 99.0 82.0 6.310 187.0 3.0 0.015 11.0 ... 6.103 0.017 0.016 1.00 0.96 0.96 1.00 111.0 113.0 110.0
12 100029 Christ Church School 64.0 113.0 108.0 6.793 223.0 28.0 0.308 14.0 ... 6.434 0.187 0.119 0.82 0.86 0.93 0.86 107.0 110.0 106.0
13 100030 Emmanuel Church of England Primary School 117.0 120.0 76.0 6.578 251.0 9.0 0.171 14.0 ... 6.870 0.073 0.054 0.86 0.93 0.86 0.71 109.0 110.0 109.0
14 100033 Holy Trinity and Saint Silas CofE Primary Scho... 115.0 114.0 64.0 6.182 208.0 16.0 0.392 14.0 ... 6.116 0.266 0.167 0.80 0.90 0.90 0.87 107.0 109.0 106.0
15 100036 St Alban's Church of England Primary School 54.0 135.0 112.0 7.240 219.0 46.0 0.423 12.0 ... 7.023 0.325 0.232 0.52 0.86 0.90 0.79 101.0 107.0 106.0
16 100037 St Aloysius Catholic Primary School 150.0 195.0 176.0 8.420 398.0 67.0 0.310 14.0 ... 8.271 0.248 0.181 0.51 0.71 0.64 0.89 100.0 103.0 103.0
17 100039 St Dominic's Catholic Primary School 130.0 125.0 78.0 6.725 278.0 51.0 0.251 17.0 ... 6.572 0.170 0.107 0.85 0.91 0.88 0.91 106.0 107.0 106.0
18 100040 St George the Martyr Church of England Primary... 67.0 110.0 108.0 6.753 217.0 23.0 0.193 16.0 ... 6.782 0.146 0.087 0.90 0.97 0.97 0.77 107.0 110.0 107.0
19 100041 St Josephs Catholic Primary School 91.0 103.0 75.0 6.432 220.0 24.0 0.235 14.0 ... 6.379 0.181 0.136 0.59 0.56 0.52 0.78 101.0 100.0 99.0
20 100042 St Mary's Kilburn Church of England Primary Sc... 55.0 110.0 118.0 7.186 228.0 29.0 0.344 14.0 ... 7.142 0.171 0.149 0.56 0.81 0.78 0.78 100.0 104.0 104.0
21 100043 St Mary and St Pancras Church of England Prima... 69.0 118.0 110.0 7.047 235.0 38.0 0.310 12.0 ... 7.170 0.248 0.181 0.66 0.76 0.79 0.62 103.0 107.0 104.0
22 100044 St Michael's Church of England Primary School 45.0 101.0 112.0 6.966 176.0 45.0 0.347 8.0 ... 7.983 0.198 0.139 0.69 0.96 0.96 0.77 104.0 110.0 107.0
23 100046 St Paul's Church of England Primary School 96.0 103.0 82.0 6.031 208.0 15.0 0.058 13.0 ... 6.052 0.053 0.041 0.70 0.73 0.70 0.63 106.0 105.0 105.0
24 100047 St Eugene de Mazenod Roman Catholic Primary Sc... 82.0 108.0 98.0 6.961 210.0 36.0 0.344 16.0 ... 7.036 0.171 0.149 0.87 0.97 0.93 0.80 107.0 112.0 109.0
25 100048 Our Lady Roman Catholic Primary School 71.0 113.0 101.0 7.595 222.0 50.0 0.347 11.0 ... 7.639 0.198 0.139 0.63 0.70 0.80 0.73 101.0 106.0 104.0
26 130342 Christopher Hatton Primary School 66.0 118.0 114.0 6.821 237.0 30.0 0.193 13.0 ... 7.026 0.146 0.087 0.87 0.83 0.83 0.87 107.0 108.0 107.0
27 100112 Alderwood Primary School 121.0 115.0 47.0 8.263 225.0 31.0 0.445 20.0 ... 8.042 0.204 0.113 0.65 0.74 0.71 0.74 102.0 105.0 103.0
28 100115 Cherry Orchard Primary School 64.0 129.0 119.0 7.027 252.0 21.0 0.243 15.0 ... 7.078 0.192 0.142 0.68 0.77 0.74 0.77 101.0 106.0 103.0
29 100125 Fossdene Primary School 109.0 230.0 229.0 5.707 426.0 46.0 0.113 21.0 ... 5.825 0.094 0.083 0.76 0.83 0.81 0.80 105.0 107.0 105.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8413 126486 Sutton Benger Church of England Aided Primary ... 136.0 84.0 3.0 4.424 166.0 2.0 0.065 14.0 ... 4.365 0.071 0.073 0.74 0.70 0.70 0.70 103.0 104.0 102.0
8414 126489 Ludgershall Castle Primary School 209.0 138.0 9.0 4.233 265.0 15.0 0.170 15.0 ... 4.236 0.155 0.108 0.58 0.46 0.62 0.69 100.0 103.0 96.0
8415 126490 Pitton Church of England Voluntary Aided Prima... 83.0 45.0 0.0 4.755 99.0 2.0 0.027 5.0 ... 4.579 0.028 0.035 0.71 0.64 0.79 0.57 107.0 108.0 103.0
8416 126492 Clarendon Junior School 296.0 157.0 43.0 4.265 339.0 7.0 0.144 21.0 ... 4.160 0.124 0.092 0.61 0.52 0.65 0.71 101.0 102.0 100.0
8417 130388 Churchfields, the Village School 123.0 65.0 2.0 4.822 137.0 4.0 0.055 10.0 ... 4.696 0.055 0.054 0.86 0.71 0.76 0.86 105.0 105.0 103.0
8418 131215 Amesbury Church of England Voluntary Controlle... 328.0 212.0 7.0 4.512 421.0 15.0 0.091 20.0 ... 4.064 0.079 0.085 0.76 0.60 0.68 0.76 102.0 104.0 102.0
8419 133233 Five Lanes Primary School 84.0 43.0 0.0 5.677 95.0 9.0 0.178 9.0 ... 5.757 0.099 0.074 0.68 0.74 0.68 0.74 102.0 102.0 102.0
8420 133775 Amesbury Archer Primary School 297.0 168.0 18.0 4.226 370.0 17.0 0.111 20.0 ... 4.192 0.070 0.063 0.68 0.76 0.81 0.73 103.0 106.0 104.0
8421 134788 Old Sarum Primary School 154.0 107.0 9.0 4.932 197.0 8.0 0.147 11.0 ... 4.908 0.112 0.079 0.57 0.46 0.46 0.64 100.0 100.0 98.0
8422 134977 Wylye Valley Church of England Voluntary Aided... 98.0 62.0 2.0 4.620 109.0 5.0 0.111 7.0 ... 4.799 0.072 0.075 0.41 0.82 0.71 0.65 101.0 104.0 104.0
8423 126207 Grange Junior School 347.0 167.0 21.0 3.764 368.0 12.0 0.131 16.0 ... 3.489 0.124 0.134 0.72 0.58 0.75 0.67 104.0 104.0 102.0
8424 126213 Even Swindon Primary School 324.0 325.0 149.0 4.858 652.0 34.0 0.101 35.0 ... 4.958 0.081 0.073 0.70 0.82 0.86 0.81 103.0 108.0 104.0
8425 126228 Lainesmead Primary School 226.0 202.0 108.0 4.978 436.0 29.0 0.148 24.0 ... 4.988 0.106 0.099 0.56 0.60 0.60 0.75 100.0 102.0 100.0
8426 126231 Wanborough Primary School 178.0 95.0 4.0 4.059 206.0 3.0 0.065 10.0 ... 4.166 0.057 0.031 0.83 0.87 0.83 0.93 109.0 108.0 106.0
8427 126240 Lawn Primary 320.0 187.0 32.0 4.128 412.0 17.0 0.068 20.0 ... 4.243 0.054 0.053 0.75 0.80 0.77 0.79 105.0 106.0 104.0
8428 126245 Wroughton Junior School 336.0 150.0 13.0 4.137 349.0 19.0 0.167 21.0 ... 4.005 0.124 0.091 0.89 0.79 0.87 0.77 107.0 108.0 106.0
8429 126252 Colebrook Junior School 143.0 93.0 16.0 4.214 159.0 2.0 0.139 10.0 ... 4.157 0.076 0.056 0.83 0.94 0.91 0.80 103.0 108.0 105.0
8430 126253 Westrop Primary School 214.0 118.0 9.0 5.027 271.0 9.0 0.178 13.0 ... 4.971 0.167 0.137 0.94 0.91 0.89 0.87 108.0 109.0 107.0
8431 126281 Robert Le Kyng Primary School 254.0 202.0 104.0 4.881 418.0 24.0 0.255 23.0 ... 4.702 0.196 0.131 0.65 0.76 0.80 0.74 103.0 107.0 104.0
8432 126283 Brook Field Primary School 315.0 213.0 42.0 3.946 417.0 12.0 0.073 20.0 ... 3.902 0.056 0.065 0.86 0.91 0.83 0.86 107.0 106.0 107.0
8433 126295 Haydonleigh Primary School 412.0 280.0 52.0 3.889 553.0 13.0 0.105 27.0 ... 3.895 0.066 0.057 0.76 0.83 0.86 0.78 105.0 109.0 104.0
8435 126440 Oliver Tomkins Church of England Junior School 173.0 103.0 29.0 5.170 203.0 20.0 0.327 10.0 ... 5.461 0.209 0.155 0.63 0.67 0.63 0.77 102.0 103.0 102.0
8436 131901 Liden Primary and Nursery School 224.0 219.0 53.0 5.061 417.0 16.0 0.119 16.0 ... 5.002 0.062 0.077 0.81 0.75 0.69 0.75 103.0 104.0 102.0
8437 131377 Red Oaks Primary School 369.0 246.0 63.0 6.042 543.0 15.0 0.063 37.0 ... 6.015 0.046 0.043 0.75 0.63 0.75 0.78 105.0 107.0 102.0
8438 131573 Abbey Meads Community Primary School 292.0 236.0 68.0 4.732 447.0 30.0 0.114 20.0 ... 4.864 0.071 0.069 0.71 0.77 0.76 0.76 104.0 105.0 104.0
8439 132023 Catherine Wayte Primary School 320.0 212.0 38.0 3.916 418.0 6.0 0.078 23.0 ... 3.887 0.047 0.031 0.78 0.83 0.83 0.78 107.0 107.0 105.0
8440 132064 Oak Tree Nursery and Primary School 217.0 199.0 70.0 5.665 400.0 57.0 0.406 23.0 ... 5.775 0.304 0.205 0.79 0.85 0.79 0.76 104.0 105.0 104.0
8441 132229 Eldene Nursery and Primary School 237.0 158.0 22.0 5.289 344.0 33.0 0.217 18.0 ... 5.054 0.212 0.199 0.51 0.43 0.54 0.63 100.0 101.0 99.0
8442 134134 Bridlewood Primary School 203.0 118.0 27.0 4.309 260.0 8.0 0.076 14.0 ... 4.348 0.048 0.036 0.53 0.53 0.70 0.67 102.0 102.0 100.0
8443 134151 St Francis CofE Primary School 336.0 217.0 23.0 3.816 467.0 8.0 0.138 27.0 ... 3.885 0.088 0.053 0.73 0.72 0.78 0.72 105.0 105.0 102.0

7940 rows × 22 columns


In [ ]: