Predicting Numerical Graduation Rate

Based on feedback from our TF on 12/7, we attempted several forms of regression to see how well we could predict numerical graduation rate with school district characteristics alone, then with previous years graduation rates, then with school district characteristics and previous years graduation rates.

One opportunity in this space is that the U.S. Department of Education typically delays making graduation rate data available. For instance, it is the 2015-2016 school year, and the most current graduation rate data available is for the year 2009-2010. If we could build a model to predict graduation rate for the years of missing data, organizations that rely on graduation rate data to provide schools services could use this graduation rate approximation until newer graduation rate data becomes available.

For this notebook, we pulled 3 previous years of graduation rate data (2006-2007, 2007-2008, and 2008-2009). First we built regression models using school district data alone and predicting numerical graduation rate, then we built regression models using historic graduation rate data alone and predicting numerical graduation rate, then we built regression models using school district data and historic graduation rate data and predicting numerical graduation rate, and lastly we built a regression model using 2006-2007 school district data and fed it new 2009-2010 school district data to see how well it would predict 2009-2010 numerical graduation rate.

We compared the models using mean squared error, with the lower the mean squared error, the better.

1. Predicting Numerical Graduation Rate
2. Table of Contents



In [24]:

    
%matplotlib inline 
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib import rcParams
import sklearn
import statsmodels.api as sm

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Acquiring graduation rate from previous years and adding it to schools districts dataset



In [3]:

    
grad0607 = pd.read_csv("data/rawdata/districts/prevyears/dr06p1b.txt", dtype=np.str, delim_whitespace=True)
grad0708 = pd.read_csv("data/rawdata/districts/prevyears/dr07p1b.txt", dtype=np.str, delim_whitespace=True)
grad0809 = pd.read_csv("data/rawdata/districts/prevyears/dr08p1a.txt", dtype=np.str, delim_whitespace=True)



In [4]:

    
grad0607 = grad0607.replace('-1', '')
grad0607 = grad0607.replace('-2', '')
grad0607 = grad0607.replace('-3', '')
grad0607 = grad0607.replace('-4', '')
grad0607 = grad0607.replace('-9', '')
grad0607 = grad0607.replace('-1.0', '')
grad0607 = grad0607.replace('-2.0', '')
grad0607 = grad0607.replace('-3.0', '')
grad0607 = grad0607.replace('-4.0', '')
grad0607 = grad0607.replace('-9.0', '')

grad0607.head()



In [5]:

    
grad0708 = grad0708.replace('-1', '')
grad0708 = grad0708.replace('-2', '')
grad0708 = grad0708.replace('-3', '')
grad0708 = grad0708.replace('-4', '')
grad0708 = grad0708.replace('-9', '')
grad0708 = grad0708.replace('-1.0', '')
grad0708 = grad0708.replace('-2.0', '')
grad0708 = grad0708.replace('-3.0', '')
grad0708 = grad0708.replace('-4.0', '')
grad0708 = grad0708.replace('-9.0', '')

grad0708.head()



In [6]:

    
grad0809 = grad0809.replace('-1', '')
grad0809 = grad0809.replace('-2', '')
grad0809 = grad0809.replace('-3', '')
grad0809 = grad0809.replace('-4', '')
grad0809 = grad0809.replace('-9', '')
grad0809 = grad0809.replace('-1.0', '')
grad0809 = grad0809.replace('-2.0', '')
grad0809 = grad0809.replace('-3.0', '')
grad0809 = grad0809.replace('-4.0', '')
grad0809 = grad0809.replace('-9.0', '')

grad0809.head()



In [7]:

    
grad0809.dtypes









    Out[7]:





SURVYEAR    object
FIPST       object
LEAID       object
TOTD912     object
EBS912      object
DRP912      object
TOTDPL      object
AFGEB       object
AFGR        object
TOTOHC      object
dtype: object



In [9]:

    
grad0607=grad0607.replace([np.inf, -np.inf], np.nan)
grad0708=grad0708.replace([np.inf, -np.inf], np.nan)
grad0809=grad0809.replace([np.inf, -np.inf], np.nan)

grad0607['AFGR'] = grad0607['AFGR'].replace('', np.nan)
grad0607['AFGR'] = grad0607['AFGR'].astype(float)
grad0607['LEAID'] = grad0607['LEAID'].astype(int)
grad0607 = grad0607.rename(columns={'LEAID': 'agency_id_nces'})
grad0607 = grad0607.rename(columns={'AFGR': 'gradrate0607'})
grad0607 = grad0607.drop(grad0607.columns[[0, 1, 3, 4, 5, 6, 7, 9]], 1)
grad0708['afgr'] = grad0708['afgr'].replace('', np.nan)
grad0708['afgr'] = grad0708['afgr'].astype(float)
grad0708['leaid'] = grad0708['leaid'].astype(int)
grad0708 = grad0708.rename(columns={'leaid': 'agency_id_nces'})
grad0708 = grad0708.rename(columns={'afgr': 'gradrate0708'})
grad0708 = grad0708.drop(grad0708.columns[[0, 1, 3, 4, 5, 6, 7, 9]], 1)
grad0809['AFGR'] = grad0809['AFGR'].replace('', np.nan)
grad0809['AFGR'] = grad0809['AFGR'].astype(float)
grad0809['LEAID'] = grad0809['LEAID'].astype(int)
grad0809 = grad0809.rename(columns={'LEAID': 'agency_id_nces'})
grad0809 = grad0809.rename(columns={'AFGR': 'gradrate0809'})
grad0809 = grad0809.drop(grad0809.columns[[0, 1, 3, 4, 5, 6, 7, 9]], 1)

grad0809.head()









    Out[9]:






  
    
      
      agency_id_nces
      gradrate0809
    
  
  
    
      0
      100002
      NaN
    
    
      1
      100005
      64.0
    
    
      2
      100006
      67.0
    
    
      3
      100007
      89.6
    
    
      4
      100008
      100.0



In [12]:

    
filtered = pd.read_csv("data/finaldata/filtered.csv")
print filtered.shape

filtered = filtered.merge(grad0607, 'left', 'agency_id_nces', suffixes=('', '_DEL'))
filtered = filtered.merge(grad0708, 'left', 'agency_id_nces', suffixes=('', '_DEL'))
filtered = filtered.merge(grad0809, 'left', 'agency_id_nces', suffixes=('', '_DEL'))
print filtered.shape

filtered.head()









    



(10785, 412)
(10785, 415)






    Out[12]:






  
    
      
      agency
      state
      state_abbr
      agency_id_nces
      county
      county_number
      num_schools
      num_charter_schools
      num_pub_schools
      report_years
      no_report_years
      address
      city
      add_state
      zipcode
      latitude
      longitude
      agency_id_state
      congressional_code
      census_id
      offered_g_lowest
      num_students
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      tlocrev_percent
      tsrev_percent
      fipst
      totd912
      ebs912
      drp912
      ...
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      gradrate0607
      gradrate0708
      gradrate0809
    
  
  
    
      0
      21ST CENTURY CHARTER SCH OF GARY
      Indiana
      IN
      1800046
      MARION COUNTY
      18097
      1
      1
      1
      2004-2013
      1986-2003
      556 WASHINGTON ST
      GARY
      IN
      46402
      39.771949
      -86.155184
      9545
      1807
      NaN
      Kindergarten
      360
      15.65
      17.14
      11111
      356
      7925
      2831
      3986
      4717
      436
      4133
      1028
      11197
      1986
      72
      9139
      3986
      11.2
      43.6
      4.8
      45.2
      51.6
      25.5
      3.2
      71.3
      18
      203.793066
      100
      9.676598
      ...
      0.011257
      0.004972
      0.769211
      0.070312
      0.390625
      0.000000
      0
      0.062500
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.078125
      0.007812
      0.000000
      0.390625
      0
      0.00000
      0.964599
      0.000000
      0.000000
      0.000000
      0.003155
      0.000000
      0
      0
      0.003155
      0.000000
      0.029092
      0.000000
      0.000000
      0
      0.606477
      0.000000
      0.156035
      0.000000
      0.000000
      0.000000
      0.176644
      0.060844
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      NaN
      NaN
      NaN
    
    
      1
      21ST CENTURY CYBER CS
      Pennsylvania
      PA
      4200091
      CHESTER COUNTY
      42029
      1
      1
      1
      2001-2013
      1986-2000
      805 SPRINGDALE DR
      EXTON
      PA
      19341
      40.005030
      -75.678564
      124150002
      4206
      NaN
      6th Grade
      594
      28.49
      28.49
      10557
      10285
      273
      0
      5104
      3003
      0
      4199
      1456
      8732
      554
      0
      8108
      5104
      18.0
      63.0
      0.0
      51.8
      37.0
      0.0
      97.4
      2.6
      42
      33.000000
      483
      6.800000
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.983958
      0.005893
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.010149
      0.000000
      0.000000
      0.000000
      0
      0.00000
      0.000000
      0.000000
      0.000000
      0.950617
      0.000000
      0.000000
      0
      0
      0.000000
      0.000000
      0.049383
      0.000000
      0.000000
      0
      0.208316
      0.166727
      0.177190
      0.024388
      0.003395
      0.005148
      0.309150
      0.048223
      0.015101
      0.003794
      0.002556
      0.036014
      1
      0
      100.0
      100.0
      100.0
    
    
      2
      A+ ACADEMY
      Texas
      TX
      4800203
      DALLAS COUNTY
      48113
      1
      1
      1
      2000-2013
      1986-1999
      8225 BRUTON RD
      DALLAS
      TX
      75217
      32.767535
      -96.660866
      057829
      4830
      NaN
      Prekindergarten
      1033
      16.61
      16.61
      11015
      68
      9164
      1784
      4890
      3500
      441
      5788
      477
      8864
      0
      1
      8832
      4890
      5.4
      55.4
      5.0
      65.5
      39.6
      16.2
      0.6
      83.2
      48
      203.793066
      228
      9.676598
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.000000
      0.000000
      0
      0.800000
      0.000000
      0.000000
      0.000000
      0
      0.142857
      0.000000
      0.000000
      0.057143
      0.000000
      0
      0.00000
      0.952145
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.000211
      0.006233
      0.040038
      0.001373
      0
      0.351058
      0.056430
      0.238741
      0.033641
      0.000000
      0.000000
      0.320130
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      75.6
      41.4
      NaN
    
    
      3
      A-C CENTRAL CUSD 262
      Illinois
      IL
      1700105
      CASS COUNTY
      17017
      3
      0
      3
      1989-2013
      1986-1988
      501 EAST BUCHANAN ST
      ASHLAND
      IL
      62612
      39.892187
      -90.016057
      46-009-2620-26
      1718
      1.450092e+13
      Kindergarten
      432
      11.86
      12.08
      11367
      4374
      5089
      1904
      4835
      4633
      498
      5741
      2112
      13200
      2204
      0
      9966
      4835
      21.2
      48.5
      5.0
      57.6
      46.5
      16.7
      38.5
      44.8
      17
      203.793066
      130
      9.676598
      ...
      0.000000
      0.000000
      0.769211
      0.000000
      0.001049
      0.000000
      0
      0.060829
      0.009963
      0.014683
      0.000000
      0
      0.005244
      0.052438
      0.027792
      0.004195
      0.000000
      0
      0.02517
      0.642181
      0.049572
      0.068049
      0.000000
      0.004056
      0.000901
      0
      0
      0.000000
      0.000901
      0.015773
      0.218567
      0.000000
      0
      0.162651
      0.306024
      0.087952
      0.038554
      0.002410
      0.000000
      0.402410
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      97.1
      81.6
      83.3
    
    
      4
      A-H-S-T COMM SCHOOL DISTRICT
      Iowa
      IA
      1904080
      POTTAWATTAMIE COUNTY
      19155
      2
      0
      2
      1986-2013
      NaN
      768 SOUTH MAPLE ST
      AVOCA
      IA
      51521
      41.471017
      -95.341001
      780441 000
      1905
      1.650780e+13
      Prekindergarten
      595
      15.64
      13.37
      10718
      5634
      4020
      1065
      4944
      2103
      336
      4703
      1438
      11586
      2769
      0
      7384
      4944
      19.5
      67.0
      4.6
      63.7
      28.5
      9.9
      52.6
      37.5
      19
      203.793066
      168
      9.676598
      ...
      0.061464
      0.000000
      0.769211
      0.000000
      0.043101
      0.000000
      0
      0.034175
      0.008926
      0.004336
      0.002805
      0
      0.004081
      0.066820
      0.000000
      0.000000
      0.004081
      0
      0.00000
      0.859543
      0.000000
      0.000000
      0.118656
      0.000000
      0.002144
      0
      0
      0.000000
      0.001072
      0.018585
      0.000000
      0.000000
      0
      0.095816
      0.107962
      0.148448
      0.043185
      0.008097
      0.005398
      0.554656
      0.000000
      0.000000
      0.000000
      0.000000
      0.036437
      1
      0
      75.0
      72.0
      69.4
    
  

5 rows × 415 columns



In [13]:

    
print filtered['gradrate0607'].isnull().sum()
print filtered['gradrate0708'].isnull().sum()
print filtered['gradrate0809'].isnull().sum()



In [14]:

    
filtered.to_csv("data/finaldata/filtered_withprevyears.csv", index=False)



In [15]:

    
dftouse = filtered.copy(deep=True)



In [16]:

    
#Drop identification data
dftouse.drop('agency', axis=1, inplace=True)
dftouse.drop('state', axis=1, inplace=True)
dftouse.drop('state_abbr', axis=1, inplace=True)
dftouse.drop('agency_id_nces', axis=1, inplace=True)
dftouse.drop('county', axis=1, inplace=True)
dftouse.drop('county_number', axis=1, inplace=True)
dftouse.drop('report_years', axis=1, inplace=True)
dftouse.drop('no_report_years', axis=1, inplace=True)
dftouse.drop('address', axis=1, inplace=True)
dftouse.drop('city', axis=1, inplace=True)
dftouse.drop('add_state', axis=1, inplace=True)
dftouse.drop('zipcode', axis=1, inplace=True)
dftouse.drop('latitude', axis=1, inplace=True)
dftouse.drop('longitude', axis=1, inplace=True)
dftouse.drop('agency_id_state', axis=1, inplace=True)
dftouse.drop('congressional_code', axis=1, inplace=True)
dftouse.drop('census_id', axis=1, inplace=True)
dftouse.drop('offered_g_lowest', axis=1, inplace=True)

#Drop indicator columns that are all 1 or 0 - found through KDE
dftouse.drop('i_agency_type_sup_union_admin', axis=1, inplace=True)
dftouse.drop('i_agency_type_state_operated_institution', axis=1, inplace=True)
dftouse.drop('i_agency_type_other_education_agency', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_elem', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_nonop', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_ed_serv', axis=1, inplace=True)
dftouse.drop('i_lgo_10', axis=1, inplace=True)
dftouse.drop('i_lgo_11', axis=1, inplace=True)
dftouse.drop('i_lgo_12', axis=1, inplace=True)
dftouse.drop('i_lgo_1', axis=1, inplace=True)
dftouse.drop('i_lgo_2', axis=1, inplace=True)
dftouse.drop('i_lgo_3', axis=1, inplace=True)
dftouse.drop('i_lgo_4', axis=1, inplace=True)
dftouse.drop('i_lgo_5', axis=1, inplace=True)
dftouse.drop('i_lgo_6', axis=1, inplace=True)
dftouse.drop('i_lgo_7', axis=1, inplace=True)
dftouse.drop('i_lgo_8', axis=1, inplace=True)
dftouse.drop('i_lgo_9', axis=1, inplace=True)
dftouse.drop('i_lgo_U', axis=1, inplace=True)

#Drop columns related to graduation rate
dftouse.drop('fipst', axis=1, inplace=True)
dftouse.drop('totd912', axis=1, inplace=True)
dftouse.drop('ebs912', axis=1, inplace=True)
dftouse.drop('drp912', axis=1, inplace=True)
dftouse.drop('totdpl', axis=1, inplace=True)
dftouse.drop('afgeb', axis=1, inplace=True)
dftouse.drop('totohc', axis=1, inplace=True)



In [17]:

    
#Drop gender/race information for non 12th grade for now
dftouse.drop(['r_stud_reg_PK_AIAN_M','r_stud_reg_PK_AIAN_F','r_stud_reg_PK_AAP_M','r_stud_reg_PK_AAP_F','r_stud_reg_PK_H_M','r_stud_reg_PK_H_F','r_stud_reg_PK_B_M','r_stud_reg_PK_B_F','r_stud_reg_PK_W_M','r_stud_reg_PK_W_F','r_stud_reg_PK_HNPI_M','r_stud_reg_PK_HNPI_F','r_stud_reg_PK_Two_M','r_stud_reg_PK_Two_F','r_stud_reg_K_AIAN_M','r_stud_reg_K_AIAN_F','r_stud_reg_K_AAP_M','r_stud_reg_K_AAP_F','r_stud_reg_K_H_M','r_stud_reg_K_H_F','r_stud_reg_K_B_M','r_stud_reg_K_B_F','r_stud_reg_K_W_M','r_stud_reg_K_W_F','r_stud_reg_K_HNPI_M','r_stud_reg_K_HNPI_F','r_stud_reg_K_Two_M','r_stud_reg_K_Two_F','r_stud_reg_1_AIAN_M','r_stud_reg_1_AIAN_F','r_stud_reg_1_AAP_M','r_stud_reg_1_AAP_F','r_stud_reg_1_H_M','r_stud_reg_1_H_F','r_stud_reg_1_B_M','r_stud_reg_1_B_F','r_stud_reg_1_W_M','r_stud_reg_1_W_F','r_stud_reg_1_HNPI_M','r_stud_reg_1_HNPI_F','r_stud_reg_1_Two_M','r_stud_reg_1_Two_F','r_stud_reg_2_AIAN_M','r_stud_reg_2_AIAN_F','r_stud_reg_2_AAP_M','r_stud_reg_2_AAP_F','r_stud_reg_2_H_M','r_stud_reg_2_H_F','r_stud_reg_2_B_M','r_stud_reg_2_B_F','r_stud_reg_2_W_M','r_stud_reg_2_W_F','r_stud_reg_2_HNPI_M','r_stud_reg_2_HNPI_F','r_stud_reg_2_Two_M','r_stud_reg_2_Two_F','r_stud_reg_3_AIAN_M','r_stud_reg_3_AIAN_F','r_stud_reg_3_AAP_M','r_stud_reg_3_AAP_F','r_stud_reg_3_H_M','r_stud_reg_3_H_F','r_stud_reg_3_B_M','r_stud_reg_3_B_F','r_stud_reg_3_W_M','r_stud_reg_3_W_F','r_stud_reg_3_HNPI_M','r_stud_reg_3_HNPI_F','r_stud_reg_3_Two_M','r_stud_reg_3_Two_F','r_stud_reg_4_AIAN_M','r_stud_reg_4_AIAN_F','r_stud_reg_4_AAP_M','r_stud_reg_4_AAP_F','r_stud_reg_4_H_M','r_stud_reg_4_H_F','r_stud_reg_4_B_M','r_stud_reg_4_B_F','r_stud_reg_4_W_M','r_stud_reg_4_W_F','r_stud_reg_4_HNPI_M','r_stud_reg_4_HNPI_F','r_stud_reg_4_Two_M','r_stud_reg_4_Two_F','r_stud_reg_5_AIAN_M','r_stud_reg_5_AIAN_F','r_stud_reg_5_AAP_M','r_stud_reg_5_AAP_F','r_stud_reg_5_H_M','r_stud_reg_5_H_F','r_stud_reg_5_B_M','r_stud_reg_5_B_F','r_stud_reg_5_W_M','r_stud_reg_5_W_F','r_stud_reg_5_HNPI_M','r_stud_reg_5_HNPI_F','r_stud_reg_5_Two_M','r_stud_reg_5_Two_F','r_stud_reg_6_AIAN_M','r_stud_reg_6_AIAN_F','r_stud_reg_6_AAP_M','r_stud_reg_6_AAP_F','r_stud_reg_6_H_M','r_stud_reg_6_H_F','r_stud_reg_6_B_M','r_stud_reg_6_B_F','r_stud_reg_6_W_M','r_stud_reg_6_W_F','r_stud_reg_6_HNPI_M','r_stud_reg_6_HNPI_F','r_stud_reg_6_Two_M','r_stud_reg_6_Two_F','r_stud_reg_7_AIAN_M','r_stud_reg_7_AIAN_F','r_stud_reg_7_AAP_M','r_stud_reg_7_AAP_F','r_stud_reg_7_H_M','r_stud_reg_7_H_F','r_stud_reg_7_B_M','r_stud_reg_7_B_F','r_stud_reg_7_W_M','r_stud_reg_7_W_F','r_stud_reg_7_HNPI_M','r_stud_reg_7_HNPI_F','r_stud_reg_7_Two_M','r_stud_reg_7_Two_F','r_stud_reg_8_AIAN_M','r_stud_reg_8_AIAN_F','r_stud_reg_8_AAP_M','r_stud_reg_8_AAP_F','r_stud_reg_8_H_M','r_stud_reg_8_H_F','r_stud_reg_8_B_M','r_stud_reg_8_B_F','r_stud_reg_8_W_M','r_stud_reg_8_W_F','r_stud_reg_8_HNPI_M','r_stud_reg_8_HNPI_F','r_stud_reg_8_Two_M','r_stud_reg_8_Two_F','r_stud_reg_9_AIAN_M','r_stud_reg_9_AIAN_F','r_stud_reg_9_AAP_M','r_stud_reg_9_AAP_F','r_stud_reg_9_H_M','r_stud_reg_9_H_F','r_stud_reg_9_B_M','r_stud_reg_9_B_F','r_stud_reg_9_W_M','r_stud_reg_9_W_F','r_stud_reg_9_HNPI_M','r_stud_reg_9_HNPI_F','r_stud_reg_9_Two_M','r_stud_reg_9_Two_F','r_stud_reg_10_AIAN_M','r_stud_reg_10_AIAN_F','r_stud_reg_10_AAP_M','r_stud_reg_10_AAP_F','r_stud_reg_10_H_M','r_stud_reg_10_H_F','r_stud_reg_10_B_M','r_stud_reg_10_B_F','r_stud_reg_10_W_M','r_stud_reg_10_W_F','r_stud_reg_10_HNPI_M','r_stud_reg_10_HNPI_F','r_stud_reg_10_Two_M','r_stud_reg_10_Two_F','r_stud_reg_11_AIAN_M','r_stud_reg_11_AIAN_F','r_stud_reg_11_AAP_M','r_stud_reg_11_AAP_F','r_stud_reg_11_H_M','r_stud_reg_11_H_F','r_stud_reg_11_B_M','r_stud_reg_11_B_F','r_stud_reg_11_W_M','r_stud_reg_11_W_F','r_stud_reg_11_HNPI_M','r_stud_reg_11_HNPI_F','r_stud_reg_11_Two_M','r_stud_reg_11_Two_F','r_stud_reg_U_AIAN_M','r_stud_reg_U_AIAN_F','r_stud_reg_U_AAP_M','r_stud_reg_U_AAP_F','r_stud_reg_U_H_M','r_stud_reg_U_H_F','r_stud_reg_U_B_M','r_stud_reg_U_B_F','r_stud_reg_U_W_M','r_stud_reg_U_W_F','r_stud_reg_U_HNPI_M','r_stud_reg_U_HNPI_F','r_stud_reg_U_Two_M','r_stud_reg_U_Two_F'], axis=1, inplace=True)

#Drop rates of students in PK-12th grade
dftouse.drop(['r_stud_PK', 'r_stud_K', 'r_stud_1', 'r_stud_2', 'r_stud_3', 'r_stud_4', 'r_stud_5', 'r_stud_6', 'r_stud_7', 'r_stud_8', 'r_stud_9', 'r_stud_10', 'r_stud_11', 'r_stud_12', 'r_stud_U'], axis=1, inplace=True)



In [18]:

    
#This is the fix for replacing NaN with Mean to fix a replace with 0 issue.  There is the potential that variability could be reduced, yet we already checked for invalid columns in a prior step due to NaN, so we do not find this to be of great concern.
#This is much better than the previous mistake of replacing with 0.
#https://www.quora.com/Are-we-doing-justice-with-our-data-set-when-we-replace-NaN-values-with-mean-median-or-mode-most-frequent-value
for col in dftouse.columns:
    if(dftouse[col].dtype == np.float64):
        dftouse[col].fillna(value=np.mean(dftouse[col]), inplace=True)



In [19]:

    
print dftouse.shape
dftouse.head()









    



(10785, 160)






    Out[19]:






  
    
      
      num_schools
      num_charter_schools
      num_pub_schools
      num_students
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      tlocrev_percent
      tsrev_percent
      afgr
      i_agency_type_local_school_district
      i_agency_type_local_school_district_sup_union
      i_agency_type_regional_education_services
      i_agency_type_charter_school_agency
      i_fin_sdlc_sec
      i_fin_sdlc_elem_sec
      i_fin_sdlc_voc
      i_ucl_city_large
      i_ucl_city_mid
      i_ucl_city_small
      i_ucl_suburb_large
      i_ucl_suburb_mid
      i_ucl_suburb_small
      i_ucl_town_fringe
      i_ucl_town_distant
      i_ucl_town_remote
      i_ucl_rural_fringe
      i_ucl_rural_distant
      i_ucl_rural_remote
      i_cs_all_charter
      i_cs_charter_noncharter
      ...
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      gradrate0607
      gradrate0708
      gradrate0809
    
  
  
    
      0
      1
      1
      1
      360
      15.65
      17.14
      11111
      356
      7925
      2831
      3986
      4717
      436
      4133
      1028
      11197
      1986
      72
      9139
      3986
      11.2
      43.6
      4.8
      45.2
      51.6
      25.5
      3.2
      71.3
      30.2
      0
      0
      0
      1
      0
      1
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.070312
      0.390625
      0.000000
      0
      0.062500
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.078125
      0.007812
      0.000000
      0.390625
      0
      0.00000
      0.964599
      0.000000
      0.000000
      0.000000
      0.003155
      0.000000
      0
      0
      0.003155
      0.000000
      0.029092
      0.000000
      0.000000
      0
      0.606477
      0.000000
      0.156035
      0.000000
      0.000000
      0.000000
      0.176644
      0.060844
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      80.772552
      81.268948
      81.805709
    
    
      1
      1
      1
      1
      594
      28.49
      28.49
      10557
      10285
      273
      0
      5104
      3003
      0
      4199
      1456
      8732
      554
      0
      8108
      5104
      18.0
      63.0
      0.0
      51.8
      37.0
      0.0
      97.4
      2.6
      100.0
      0
      0
      0
      1
      0
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.983958
      0.005893
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.010149
      0.000000
      0.000000
      0.000000
      0
      0.00000
      0.000000
      0.000000
      0.000000
      0.950617
      0.000000
      0.000000
      0
      0
      0.000000
      0.000000
      0.049383
      0.000000
      0.000000
      0
      0.208316
      0.166727
      0.177190
      0.024388
      0.003395
      0.005148
      0.309150
      0.048223
      0.015101
      0.003794
      0.002556
      0.036014
      1
      0
      100.000000
      100.000000
      100.000000
    
    
      2
      1
      1
      1
      1033
      16.61
      16.61
      11015
      68
      9164
      1784
      4890
      3500
      441
      5788
      477
      8864
      0
      1
      8832
      4890
      5.4
      55.4
      5.0
      65.5
      39.6
      16.2
      0.6
      83.2
      55.7
      0
      0
      0
      1
      0
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.000000
      0.000000
      0
      0.800000
      0.000000
      0.000000
      0.000000
      0
      0.142857
      0.000000
      0.000000
      0.057143
      0.000000
      0
      0.00000
      0.952145
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.000211
      0.006233
      0.040038
      0.001373
      0
      0.351058
      0.056430
      0.238741
      0.033641
      0.000000
      0.000000
      0.320130
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      75.600000
      41.400000
      81.805709
    
    
      3
      3
      0
      3
      432
      11.86
      12.08
      11367
      4374
      5089
      1904
      4835
      4633
      498
      5741
      2112
      13200
      2204
      0
      9966
      4835
      21.2
      48.5
      5.0
      57.6
      46.5
      16.7
      38.5
      44.8
      70.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      ...
      0.000000
      0.000000
      0.769211
      0.000000
      0.001049
      0.000000
      0
      0.060829
      0.009963
      0.014683
      0.000000
      0
      0.005244
      0.052438
      0.027792
      0.004195
      0.000000
      0
      0.02517
      0.642181
      0.049572
      0.068049
      0.000000
      0.004056
      0.000901
      0
      0
      0.000000
      0.000901
      0.015773
      0.218567
      0.000000
      0
      0.162651
      0.306024
      0.087952
      0.038554
      0.002410
      0.000000
      0.402410
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      97.100000
      81.600000
      83.300000
    
    
      4
      2
      0
      2
      595
      15.64
      13.37
      10718
      5634
      4020
      1065
      4944
      2103
      336
      4703
      1438
      11586
      2769
      0
      7384
      4944
      19.5
      67.0
      4.6
      63.7
      28.5
      9.9
      52.6
      37.5
      95.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      ...
      0.061464
      0.000000
      0.769211
      0.000000
      0.043101
      0.000000
      0
      0.034175
      0.008926
      0.004336
      0.002805
      0
      0.004081
      0.066820
      0.000000
      0.000000
      0.004081
      0
      0.00000
      0.859543
      0.000000
      0.000000
      0.118656
      0.000000
      0.002144
      0
      0
      0.000000
      0.001072
      0.018585
      0.000000
      0.000000
      0
      0.095816
      0.107962
      0.148448
      0.043185
      0.008097
      0.005398
      0.554656
      0.000000
      0.000000
      0.000000
      0.000000
      0.036437
      1
      0
      75.000000
      72.000000
      69.400000
    
  

5 rows × 160 columns



In [20]:

    
dftouse.to_csv("data/finaldata/dftouse_withprevyears.csv", index=False)

These are the steps to take the 06-07 info and append the 09-10 grad info.



In [21]:

    
filtered0607 = pd.read_csv("data/finaldata/filtered_0607.csv")
filtered0607.head()









    Out[21]:






  
    
      
      agency
      state
      state_abbr
      agency_id_nces
      county
      county_number
      num_schools
      num_charter_schools
      num_pub_schools
      report_years
      no_report_years
      address
      city
      add_state
      zipcode
      Agency Name_DEL
      latitude
      longitude
      agency_id_state
      Metro Status Code [District] 2009-10
      congressional_code
      census_id
      offered_g_lowest
      num_students
      Total Students - Calculated Sum of Reported Grade Totals [Public School] 2009-10
      Migrant Students in Summer Programs (**) [District] 2009-10
      Migrant Students (**) [Public School] 2009-10
      Gender Unknown Students [Public School] 2009-10
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      ...
      r_lrev_pt
      r_lrev_gst
      r_lrev_put
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
    
  
  
    
      0
      21ST CENTURY CYBER CS
      Pennsylvania
      PA
      4200091
      CHESTER
      42029
      1
      1
      1
      2001-2013
      1986-2000
      805 SPRINGDALE DR
      EXTON
      PA
      19341
      21ST CENTURY CYBER CS
      40.002611
      -75.689137
      124150002
      2-CBSA/MSA- not primarily its principal/centra...
      4206
      NaN
      6th Grade
      467
      467
      NaN
      NaN
      0
      29.3
      28.7
      9011
      8676
      245
      90
      4426
      2898
      0
      3394
      1196
      7785
      461
      0
      7324
      4426
      16.3
      60.4
      0.0
      46.3
      39.6
      1.0
      ...
      0.000000
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.999017
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000737
      0
      0.000000
      0.000246
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.939130
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.060870
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0
      0
      1.000000
      1
      0
    
    
      1
      A+ ACADEMY
      Texas
      TX
      4800203
      DALLAS
      48113
      1
      1
      1
      2000-2013
      1986-1999
      8225 BRUTON RD
      DALLAS
      TX
      75217
      A+ ACADEMY
      32.690593
      -96.643398
      057829
      1-Principal/central city of CBSA/MSA
      4830
      NaN
      Prekindergarten
      994
      994
      NaN
      NaN
      0
      13.5
      13.5
      9040
      462
      7798
      781
      4751
      2960
      362
      5179
      523
      8072
      0
      0
      8072
      4751
      6.5
      58.8
      4.5
      64.2
      36.7
      8.6
      ...
      0.000000
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.000000
      0.000000
      0
      0.172113
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.827887
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.916398
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000516
      0.035221
      0.041414
      0.006451
      0
      0.000000
      0.199742
      0.355670
      0.068299
      0.001289
      0.000000
      0.375000
      0
      0
      0
      0
      0.000000
      0
      0
    
    
      2
      A-C CENTRAL CUSD 262
      Illinois
      IL
      1700105
      CASS
      17017
      3
      0
      3
      1989-2013
      1986-1988
      501 EAST BUCHANAN ST
      ASHLAND
      IL
      62612
      A-C CENTRAL CUSD 262
      39.891262
      -90.010711
      46-009-2620-26
      3-Does not serve a CBSA/MSA
      1718
      1.450092e+13
      Prekindergarten
      483
      483
      NaN
      0
      0
      10.3
      10.3
      9327
      4292
      4532
      503
      3950
      3280
      383
      4826
      1133
      7857
      27
      0
      7613
      3950
      14.9
      51.9
      5.0
      63.4
      43.1
      5.4
      ...
      0.865412
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.017366
      0.000000
      0
      0.045345
      0.010613
      0.011577
      0.000000
      0
      0.007718
      0.010130
      0.031838
      0.000000
      0.000000
      0
      0.000000
      0.704888
      0.031521
      0.160347
      0.000000
      0.000000
      0.005025
      0.000000
      0
      0
      0.000914
      0.010964
      0.081316
      0.005025
      0
      0.263374
      0.000000
      0.238683
      0.000000
      0.004115
      0.000000
      0.288066
      0
      0
      0
      0
      0.205761
      1
      0
    
    
      3
      A-H-S-T COMM SCHOOL DISTRICT
      Iowa
      IA
      1904080
      POTTAWATTAMIE
      19155
      2
      0
      2
      1986-2013
      NaN
      768 SOUTH MAPLE ST
      AVOCA
      IA
      51521
      A-H-S-T COMM SCHOOL DISTRICT
      41.471423
      -95.341045
      780441 000
      2-CBSA/MSA- not primarily its principal/centra...
      1905
      1.650780e+13
      Prekindergarten
      605
      605
      NaN
      0
      0
      13.5
      13.5
      10681
      5056
      5246
      379
      5028
      2192
      309
      4864
      1258
      8712
      1106
      31
      7529
      5028
      16.7
      66.8
      4.1
      64.6
      29.1
      3.5
      ...
      0.626022
      0.154299
      0
      0.073227
      0
      0
      0.00000
      0.059170
      0.002288
      0
      0.039555
      0.004904
      0.001308
      0.000000
      0
      0.016672
      0.022229
      0.000000
      0.000000
      0.000327
      0
      0.000000
      0.904852
      0.000000
      0.000000
      0.076875
      0.000000
      0.001890
      0.000000
      0
      0
      0.000945
      0.015438
      0.000000
      0.000000
      0
      0.218341
      0.152838
      0.353712
      0.000000
      0.008734
      0.004367
      0.174672
      0
      0
      0
      0
      0.087336
      0
      0
    
    
      4
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      Minnesota
      MN
      2700106
      MEEKER
      27093
      5
      0
      5
      1995-2013
      1986-1994
      27250 MN HWY 4
      GROVE CITY
      MN
      56243
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      45.142843
      -94.685310
      012396
      3-Does not serve a CBSA/MSA
      2707
      2.450473e+13
      Prekindergarten
      849
      849
      NaN
      NaN
      0
      13.5
      13.9
      11204
      2880
      7687
      637
      5508
      3046
      478
      5550
      1329
      10167
      203
      399
      9032
      5508
      14.7
      61.0
      5.3
      61.5
      33.7
      5.7
      ...
      0.628630
      0.000000
      0
      0.000000
      0
      0
      0.01227
      0.005726
      0.063804
      0
      0.082618
      0.000000
      0.016769
      0.017587
      0
      0.078119
      0.058487
      0.000000
      0.000409
      0.000000
      0
      0.035583
      0.813975
      0.083972
      0.001379
      0.000000
      0.049341
      0.000613
      0.022832
      0
      0
      0.004903
      0.022985
      0.000000
      0.000000
      0
      0.312384
      0.000000
      0.316081
      0.101664
      0.001848
      0.000000
      0.266174
      0
      0
      0
      0
      0.000000
      1
      0
    
  

5 rows × 416 columns



In [22]:

    
filtered = pd.read_csv("data/finaldata/filtered.csv")
print filtered.shape

gradonly = filtered[['agency_id_nces', 'afgr']]
filtered0607 = filtered0607.merge(gradonly, 'left', 'agency_id_nces', suffixes=('', '_DEL'))
filtered0607.head()









    



(10785, 412)






    Out[22]:






  
    
      
      agency
      state
      state_abbr
      agency_id_nces
      county
      county_number
      num_schools
      num_charter_schools
      num_pub_schools
      report_years
      no_report_years
      address
      city
      add_state
      zipcode
      Agency Name_DEL
      latitude
      longitude
      agency_id_state
      Metro Status Code [District] 2009-10
      congressional_code
      census_id
      offered_g_lowest
      num_students
      Total Students - Calculated Sum of Reported Grade Totals [Public School] 2009-10
      Migrant Students in Summer Programs (**) [District] 2009-10
      Migrant Students (**) [Public School] 2009-10
      Gender Unknown Students [Public School] 2009-10
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      ...
      r_lrev_gst
      r_lrev_put
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      afgr_DEL
    
  
  
    
      0
      21ST CENTURY CYBER CS
      Pennsylvania
      PA
      4200091
      CHESTER
      42029
      1
      1
      1
      2001-2013
      1986-2000
      805 SPRINGDALE DR
      EXTON
      PA
      19341
      21ST CENTURY CYBER CS
      40.002611
      -75.689137
      124150002
      2-CBSA/MSA- not primarily its principal/centra...
      4206
      NaN
      6th Grade
      467
      467
      NaN
      NaN
      0
      29.3
      28.7
      9011
      8676
      245
      90
      4426
      2898
      0
      3394
      1196
      7785
      461
      0
      7324
      4426
      16.3
      60.4
      0.0
      46.3
      39.6
      1.0
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.999017
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000737
      0
      0.000000
      0.000246
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.939130
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.060870
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0
      0
      1.000000
      1
      0
      100.0
    
    
      1
      A+ ACADEMY
      Texas
      TX
      4800203
      DALLAS
      48113
      1
      1
      1
      2000-2013
      1986-1999
      8225 BRUTON RD
      DALLAS
      TX
      75217
      A+ ACADEMY
      32.690593
      -96.643398
      057829
      1-Principal/central city of CBSA/MSA
      4830
      NaN
      Prekindergarten
      994
      994
      NaN
      NaN
      0
      13.5
      13.5
      9040
      462
      7798
      781
      4751
      2960
      362
      5179
      523
      8072
      0
      0
      8072
      4751
      6.5
      58.8
      4.5
      64.2
      36.7
      8.6
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.000000
      0.000000
      0
      0.172113
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.827887
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.916398
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000516
      0.035221
      0.041414
      0.006451
      0
      0.000000
      0.199742
      0.355670
      0.068299
      0.001289
      0.000000
      0.375000
      0
      0
      0
      0
      0.000000
      0
      0
      55.7
    
    
      2
      A-C CENTRAL CUSD 262
      Illinois
      IL
      1700105
      CASS
      17017
      3
      0
      3
      1989-2013
      1986-1988
      501 EAST BUCHANAN ST
      ASHLAND
      IL
      62612
      A-C CENTRAL CUSD 262
      39.891262
      -90.010711
      46-009-2620-26
      3-Does not serve a CBSA/MSA
      1718
      1.450092e+13
      Prekindergarten
      483
      483
      NaN
      0
      0
      10.3
      10.3
      9327
      4292
      4532
      503
      3950
      3280
      383
      4826
      1133
      7857
      27
      0
      7613
      3950
      14.9
      51.9
      5.0
      63.4
      43.1
      5.4
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.017366
      0.000000
      0
      0.045345
      0.010613
      0.011577
      0.000000
      0
      0.007718
      0.010130
      0.031838
      0.000000
      0.000000
      0
      0.000000
      0.704888
      0.031521
      0.160347
      0.000000
      0.000000
      0.005025
      0.000000
      0
      0
      0.000914
      0.010964
      0.081316
      0.005025
      0
      0.263374
      0.000000
      0.238683
      0.000000
      0.004115
      0.000000
      0.288066
      0
      0
      0
      0
      0.205761
      1
      0
      70.7
    
    
      3
      A-H-S-T COMM SCHOOL DISTRICT
      Iowa
      IA
      1904080
      POTTAWATTAMIE
      19155
      2
      0
      2
      1986-2013
      NaN
      768 SOUTH MAPLE ST
      AVOCA
      IA
      51521
      A-H-S-T COMM SCHOOL DISTRICT
      41.471423
      -95.341045
      780441 000
      2-CBSA/MSA- not primarily its principal/centra...
      1905
      1.650780e+13
      Prekindergarten
      605
      605
      NaN
      0
      0
      13.5
      13.5
      10681
      5056
      5246
      379
      5028
      2192
      309
      4864
      1258
      8712
      1106
      31
      7529
      5028
      16.7
      66.8
      4.1
      64.6
      29.1
      3.5
      ...
      0.154299
      0
      0.073227
      0
      0
      0.00000
      0.059170
      0.002288
      0
      0.039555
      0.004904
      0.001308
      0.000000
      0
      0.016672
      0.022229
      0.000000
      0.000000
      0.000327
      0
      0.000000
      0.904852
      0.000000
      0.000000
      0.076875
      0.000000
      0.001890
      0.000000
      0
      0
      0.000945
      0.015438
      0.000000
      0.000000
      0
      0.218341
      0.152838
      0.353712
      0.000000
      0.008734
      0.004367
      0.174672
      0
      0
      0
      0
      0.087336
      0
      0
      95.7
    
    
      4
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      Minnesota
      MN
      2700106
      MEEKER
      27093
      5
      0
      5
      1995-2013
      1986-1994
      27250 MN HWY 4
      GROVE CITY
      MN
      56243
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      45.142843
      -94.685310
      012396
      3-Does not serve a CBSA/MSA
      2707
      2.450473e+13
      Prekindergarten
      849
      849
      NaN
      NaN
      0
      13.5
      13.9
      11204
      2880
      7687
      637
      5508
      3046
      478
      5550
      1329
      10167
      203
      399
      9032
      5508
      14.7
      61.0
      5.3
      61.5
      33.7
      5.7
      ...
      0.000000
      0
      0.000000
      0
      0
      0.01227
      0.005726
      0.063804
      0
      0.082618
      0.000000
      0.016769
      0.017587
      0
      0.078119
      0.058487
      0.000000
      0.000409
      0.000000
      0
      0.035583
      0.813975
      0.083972
      0.001379
      0.000000
      0.049341
      0.000613
      0.022832
      0
      0
      0.004903
      0.022985
      0.000000
      0.000000
      0
      0.312384
      0.000000
      0.316081
      0.101664
      0.001848
      0.000000
      0.266174
      0
      0
      0
      0
      0.000000
      1
      0
      89.6
    
  

5 rows × 417 columns



In [23]:

    
filtered0607 = filtered0607.rename(columns={'afgr_DEL': 'afgr0910'})
filtered0607.to_csv("data/finaldata/dftouse_0607_with0910grad.csv", index=False)

Data Analysis



In [25]:

    
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
# Generic classification and optimization functions from cs019 lab7
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
# clf - original classifier
# parameters - grid to search over
# X - usually your training X matrix
# y - usually your training y 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
# Important parameters
# indf - Input dataframe
# featurenames - vector of names of predictors
# targetname - name of column you want to predict (e.g. 0 or 1, 'M' or 'F', 
#              'yes' or 'no')
# target1val - particular value you want to have as a 1 in the target
# mask - boolean vector indicating test set (~mask is training set)
# reuse_split - dictionary that contains traning and testing dataframes 
#              (we'll use this to test different classifiers on the same 
#              test-train splits)
# score_func - we've used the accuracy as a way of scoring algorithms but 
#              this can be more general later on
# n_folds - Number of folds for cross validation ()
# n_jobs - used for parallelization
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #

def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=indf[targetname].values
    if mask !=None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split !=None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    #training_accuracy = clf.score(Xtrain, ytrain)
    #test_accuracy = clf.score(Xtest, ytest)
    #print "############# based on standard predict ################"
    #print "Accuracy on training data: %0.2f" % (training_accuracy)
    #print "Accuracy on test data:     %0.2f" % (test_accuracy)
    #print confusion_matrix(ytest, clf.predict(Xtest))
    #print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
# Make ROC Plots

from sklearn.metrics import roc_curve, auc
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe!=None:
        for k in xrange(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #

Baseline Comparison :

We first compare current years graduation rate with last years graduation rate and use that as our baseline



In [26]:

    
df_yr=pd.read_csv("data/finaldata/dftouse_withprevyears.csv")
print df_yr.shape
df_yr.head()









    



(10785, 160)






    Out[26]:






  
    
      
      num_schools
      num_charter_schools
      num_pub_schools
      num_students
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      tlocrev_percent
      tsrev_percent
      afgr
      i_agency_type_local_school_district
      i_agency_type_local_school_district_sup_union
      i_agency_type_regional_education_services
      i_agency_type_charter_school_agency
      i_fin_sdlc_sec
      i_fin_sdlc_elem_sec
      i_fin_sdlc_voc
      i_ucl_city_large
      i_ucl_city_mid
      i_ucl_city_small
      i_ucl_suburb_large
      i_ucl_suburb_mid
      i_ucl_suburb_small
      i_ucl_town_fringe
      i_ucl_town_distant
      i_ucl_town_remote
      i_ucl_rural_fringe
      i_ucl_rural_distant
      i_ucl_rural_remote
      i_cs_all_charter
      i_cs_charter_noncharter
      ...
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      gradrate0607
      gradrate0708
      gradrate0809
    
  
  
    
      0
      1
      1
      1
      360
      15.65
      17.14
      11111
      356
      7925
      2831
      3986
      4717
      436
      4133
      1028
      11197
      1986
      72
      9139
      3986
      11.2
      43.6
      4.8
      45.2
      51.6
      25.5
      3.2
      71.3
      30.2
      0
      0
      0
      1
      0
      1
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.070312
      0.390625
      0.000000
      0
      0.062500
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.078125
      0.007812
      0.000000
      0.390625
      0
      0.00000
      0.964599
      0.000000
      0.000000
      0.000000
      0.003155
      0.000000
      0
      0
      0.003155
      0.000000
      0.029092
      0.000000
      0.000000
      0
      0.606477
      0.000000
      0.156035
      0.000000
      0.000000
      0.000000
      0.176644
      0.060844
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      80.772552
      81.268948
      81.805709
    
    
      1
      1
      1
      1
      594
      28.49
      28.49
      10557
      10285
      273
      0
      5104
      3003
      0
      4199
      1456
      8732
      554
      0
      8108
      5104
      18.0
      63.0
      0.0
      51.8
      37.0
      0.0
      97.4
      2.6
      100.0
      0
      0
      0
      1
      0
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.983958
      0.005893
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.010149
      0.000000
      0.000000
      0.000000
      0
      0.00000
      0.000000
      0.000000
      0.000000
      0.950617
      0.000000
      0.000000
      0
      0
      0.000000
      0.000000
      0.049383
      0.000000
      0.000000
      0
      0.208316
      0.166727
      0.177190
      0.024388
      0.003395
      0.005148
      0.309150
      0.048223
      0.015101
      0.003794
      0.002556
      0.036014
      1
      0
      100.000000
      100.000000
      100.000000
    
    
      2
      1
      1
      1
      1033
      16.61
      16.61
      11015
      68
      9164
      1784
      4890
      3500
      441
      5788
      477
      8864
      0
      1
      8832
      4890
      5.4
      55.4
      5.0
      65.5
      39.6
      16.2
      0.6
      83.2
      55.7
      0
      0
      0
      1
      0
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.000000
      0.000000
      0
      0.800000
      0.000000
      0.000000
      0.000000
      0
      0.142857
      0.000000
      0.000000
      0.057143
      0.000000
      0
      0.00000
      0.952145
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.000211
      0.006233
      0.040038
      0.001373
      0
      0.351058
      0.056430
      0.238741
      0.033641
      0.000000
      0.000000
      0.320130
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      75.600000
      41.400000
      81.805709
    
    
      3
      3
      0
      3
      432
      11.86
      12.08
      11367
      4374
      5089
      1904
      4835
      4633
      498
      5741
      2112
      13200
      2204
      0
      9966
      4835
      21.2
      48.5
      5.0
      57.6
      46.5
      16.7
      38.5
      44.8
      70.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      ...
      0.000000
      0.000000
      0.769211
      0.000000
      0.001049
      0.000000
      0
      0.060829
      0.009963
      0.014683
      0.000000
      0
      0.005244
      0.052438
      0.027792
      0.004195
      0.000000
      0
      0.02517
      0.642181
      0.049572
      0.068049
      0.000000
      0.004056
      0.000901
      0
      0
      0.000000
      0.000901
      0.015773
      0.218567
      0.000000
      0
      0.162651
      0.306024
      0.087952
      0.038554
      0.002410
      0.000000
      0.402410
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      97.100000
      81.600000
      83.300000
    
    
      4
      2
      0
      2
      595
      15.64
      13.37
      10718
      5634
      4020
      1065
      4944
      2103
      336
      4703
      1438
      11586
      2769
      0
      7384
      4944
      19.5
      67.0
      4.6
      63.7
      28.5
      9.9
      52.6
      37.5
      95.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      ...
      0.061464
      0.000000
      0.769211
      0.000000
      0.043101
      0.000000
      0
      0.034175
      0.008926
      0.004336
      0.002805
      0
      0.004081
      0.066820
      0.000000
      0.000000
      0.004081
      0
      0.00000
      0.859543
      0.000000
      0.000000
      0.118656
      0.000000
      0.002144
      0
      0
      0.000000
      0.001072
      0.018585
      0.000000
      0.000000
      0
      0.095816
      0.107962
      0.148448
      0.043185
      0.008097
      0.005398
      0.554656
      0.000000
      0.000000
      0.000000
      0.000000
      0.036437
      1
      0
      75.000000
      72.000000
      69.400000
    
  

5 rows × 160 columns



In [27]:

    
# Removing rows containing NaN for the model that will make use of previous years grad rate.
df_yr = df_yr[np.isfinite(df_yr['gradrate0607'])]
df_yr = df_yr[np.isfinite(df_yr['gradrate0708'])]
df_yr = df_yr[np.isfinite(df_yr['gradrate0809'])]
df_yr = df_yr[np.isfinite(df_yr['afgr'])]
df_yr.fillna(value=np.nan, inplace=True)

print df_yr.shape









    



(10785, 160)



In [28]:

    
# Baseline Comparisons
from sklearn.metrics import mean_squared_error

y_0607 = df_yr['gradrate0607'].astype(int)
y_0708 = df_yr['gradrate0708'].astype(int)
y_0809 = df_yr['gradrate0809'].astype(int)
y_0910 = df_yr['afgr'].astype(int)

print "Mean Squared Error : "
rscore =  mean_squared_error(y_0910, y_0809)
print "Prediction for 2009-10 using 2009-09 : ", rscore

rscore =  mean_squared_error(y_0809, y_0708)
print "Prediction for 2008-09 using 2007-08 : ", rscore

rscore =  mean_squared_error(y_0708, y_0607)
print "Prediction for 2007-08 using 2006-07 : ", rscore

colors=sns.color_palette()
p1 = plt.scatter(y_0910, y_0809, alpha=0.2, color=colors[0])
p2 = plt.scatter(y_0809, y_0708, alpha=0.2, color=colors[1])
p3 = plt.scatter(y_0708, y_0607, alpha=0.2, color=colors[2])
    
plt.xlabel("Grad Rate: Current Year")
plt.ylabel("Grad Rate: Previous Year")
plt.title("Grad Rate: Current Year vs Previous Year")
plt.xlim(0,100)
plt.ylim(0,100)
plt.legend((p1, p2, p3),('2009-10', '2009-08', '2008-07'),
           scatterpoints=100,
           loc='lower left',
           ncol=3,
           fontsize=20)
plt.show()









    



Mean Squared Error : 
Prediction for 2009-10 using 2009-09 :  143.477515067
Prediction for 2008-09 using 2007-08 :  138.296152063
Prediction for 2007-08 using 2006-07 :  132.4472879






    



/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Predictive Modelling using the current school districts datasets.

Here we try to build a model using only the school districts. The goal is that this model can help school asses what they could do to improve their perfomance. So focus is only on the indiciators over which schools like some control.



In [205]:

    
fullprevyears = pd.read_csv("data/finaldata/dftouse_withprevyears.csv")



In [206]:

    
dftouse = fullprevyears.copy(deep=True)



In [207]:

    
dftouse.shape









    Out[207]:





(10785, 160)



In [208]:

    
dftouse.head()









    Out[208]:






  
    
      
      num_schools
      num_charter_schools
      num_pub_schools
      num_students
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      tlocrev_percent
      tsrev_percent
      afgr
      i_agency_type_local_school_district
      i_agency_type_local_school_district_sup_union
      i_agency_type_regional_education_services
      i_agency_type_charter_school_agency
      i_fin_sdlc_sec
      i_fin_sdlc_elem_sec
      i_fin_sdlc_voc
      i_ucl_city_large
      i_ucl_city_mid
      i_ucl_city_small
      i_ucl_suburb_large
      i_ucl_suburb_mid
      i_ucl_suburb_small
      i_ucl_town_fringe
      i_ucl_town_distant
      i_ucl_town_remote
      i_ucl_rural_fringe
      i_ucl_rural_distant
      i_ucl_rural_remote
      i_cs_all_charter
      i_cs_charter_noncharter
      ...
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      gradrate0607
      gradrate0708
      gradrate0809
    
  
  
    
      0
      1
      1
      1
      360
      15.65
      17.14
      11111
      356
      7925
      2831
      3986
      4717
      436
      4133
      1028
      11197
      1986
      72
      9139
      3986
      11.2
      43.6
      4.8
      45.2
      51.6
      25.5
      3.2
      71.3
      30.2
      0
      0
      0
      1
      0
      1
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.070312
      0.390625
      0.000000
      0
      0.062500
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.078125
      0.007812
      0.000000
      0.390625
      0
      0.00000
      0.964599
      0.000000
      0.000000
      0.000000
      0.003155
      0.000000
      0
      0
      0.003155
      0.000000
      0.029092
      0.000000
      0.000000
      0
      0.606477
      0.000000
      0.156035
      0.000000
      0.000000
      0.000000
      0.176644
      0.060844
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      80.772552
      81.268948
      81.805709
    
    
      1
      1
      1
      1
      594
      28.49
      28.49
      10557
      10285
      273
      0
      5104
      3003
      0
      4199
      1456
      8732
      554
      0
      8108
      5104
      18.0
      63.0
      0.0
      51.8
      37.0
      0.0
      97.4
      2.6
      100.0
      0
      0
      0
      1
      0
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.983958
      0.005893
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.010149
      0.000000
      0.000000
      0.000000
      0
      0.00000
      0.000000
      0.000000
      0.000000
      0.950617
      0.000000
      0.000000
      0
      0
      0.000000
      0.000000
      0.049383
      0.000000
      0.000000
      0
      0.208316
      0.166727
      0.177190
      0.024388
      0.003395
      0.005148
      0.309150
      0.048223
      0.015101
      0.003794
      0.002556
      0.036014
      1
      0
      100.000000
      100.000000
      100.000000
    
    
      2
      1
      1
      1
      1033
      16.61
      16.61
      11015
      68
      9164
      1784
      4890
      3500
      441
      5788
      477
      8864
      0
      1
      8832
      4890
      5.4
      55.4
      5.0
      65.5
      39.6
      16.2
      0.6
      83.2
      55.7
      0
      0
      0
      1
      0
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      ...
      0.011257
      0.004972
      0.769211
      0.000000
      0.000000
      0.000000
      0
      0.800000
      0.000000
      0.000000
      0.000000
      0
      0.142857
      0.000000
      0.000000
      0.057143
      0.000000
      0
      0.00000
      0.952145
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.000211
      0.006233
      0.040038
      0.001373
      0
      0.351058
      0.056430
      0.238741
      0.033641
      0.000000
      0.000000
      0.320130
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      75.600000
      41.400000
      81.805709
    
    
      3
      3
      0
      3
      432
      11.86
      12.08
      11367
      4374
      5089
      1904
      4835
      4633
      498
      5741
      2112
      13200
      2204
      0
      9966
      4835
      21.2
      48.5
      5.0
      57.6
      46.5
      16.7
      38.5
      44.8
      70.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      ...
      0.000000
      0.000000
      0.769211
      0.000000
      0.001049
      0.000000
      0
      0.060829
      0.009963
      0.014683
      0.000000
      0
      0.005244
      0.052438
      0.027792
      0.004195
      0.000000
      0
      0.02517
      0.642181
      0.049572
      0.068049
      0.000000
      0.004056
      0.000901
      0
      0
      0.000000
      0.000901
      0.015773
      0.218567
      0.000000
      0
      0.162651
      0.306024
      0.087952
      0.038554
      0.002410
      0.000000
      0.402410
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      1
      97.100000
      81.600000
      83.300000
    
    
      4
      2
      0
      2
      595
      15.64
      13.37
      10718
      5634
      4020
      1065
      4944
      2103
      336
      4703
      1438
      11586
      2769
      0
      7384
      4944
      19.5
      67.0
      4.6
      63.7
      28.5
      9.9
      52.6
      37.5
      95.7
      1
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      ...
      0.061464
      0.000000
      0.769211
      0.000000
      0.043101
      0.000000
      0
      0.034175
      0.008926
      0.004336
      0.002805
      0
      0.004081
      0.066820
      0.000000
      0.000000
      0.004081
      0
      0.00000
      0.859543
      0.000000
      0.000000
      0.118656
      0.000000
      0.002144
      0
      0
      0.000000
      0.001072
      0.018585
      0.000000
      0.000000
      0
      0.095816
      0.107962
      0.148448
      0.043185
      0.008097
      0.005398
      0.554656
      0.000000
      0.000000
      0.000000
      0.000000
      0.036437
      1
      0
      75.000000
      72.000000
      69.400000
    
  

5 rows × 160 columns



In [209]:

    
STANDARDIZABLE = ['num_students', 'num_schools','num_charter_schools','num_pub_schools','tcuresal_percent','pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp','tlocrev_pp','tsrev_pp','tfedrev_pp','tcurinst_pp','tcurssv_pp','tcuroth_pp','tcursalary_pp','tcurbenefits_pp','totalexp_pp','tcapout_pp','tnonelse_pp','tcurelsc_pp','instexp_pp','tcurinst_percent','tcuroth_percent','tcurelsc_percent','tcurssvc_percent','tfedrev_percent','tlocrev_percent','tsrev_percent','r_ELL','r_IEP','r_lunch_free','r_lunch_reduced','r_stud_PKK','r_stud_18','r_stud_912','r_stud_re_M','r_stud_re_F','r_stud_re_AIAN','r_stud_re_AAP','r_stud_re_H','r_stud_re_B','r_stud_re_W','r_stud_re_HNPI','r_stud_re_Two','r_stud_re_Total','r_stud_reg_12_AIAN_M','r_stud_reg_12_AIAN_F','r_stud_reg_12_AAP_M','r_stud_reg_12_AAP_F','r_stud_reg_12_H_M','r_stud_reg_12_H_F','r_stud_reg_12_B_M','r_stud_reg_12_B_F','r_stud_reg_12_W_M','r_stud_reg_12_W_F','r_stud_reg_12_HNPI_M','r_stud_reg_12_HNPI_F','r_stud_reg_12_Two_M','r_stud_reg_12_Two_F','r_st_PKT','r_st_KT','r_st_ET','r_st_ST','r_st_UT','r_st_TS','r_st_IA','r_st_IC','r_st_EGC','r_st_SGC','r_st_OGC','r_st_TGC','r_st_LMS','r_st_LMSS','r_st_LEA','r_st_LEASS','r_st_SA','r_st_SASS','r_st_SSSS','r_st_OSSS','r_lrev_pt','r_lrev_gst','r_lrev_put','r_lrev_it','r_lrev_aot','r_lrev_pgc','r_lrev_cc','r_lrev_oss','r_lrev_tui','r_lrev_trans','r_lrev_slr','r_lrev_ts','r_lrev_sar','r_lrev_osalserv','r_lrev_sfns','r_lrev_ie','r_lrev_molr','r_lrev_sp','r_lrev_rr','r_lrev_sale','r_lrev_ff','r_lrev_pc','r_srev_gfa','r_srev_sep','r_srev_trans','r_srev_sip','r_srev_cbsp','r_srev_vep','r_srev_codsp','r_srev_bep','r_srev_gt','r_srev_slp','r_srev_aor','r_srev_splea','r_srev_osp','r_srev_ns','r_frev_title1','r_frev_dis','r_frev_cna','r_frev_ems','r_frev_dfs','r_frev_voc','r_frev_ao','r_frev_ns','r_frev_ia','r_frev_be','r_frev_na','r_frev_aofed']
print STANDARDIZABLE









    



['num_students', 'num_schools', 'num_charter_schools', 'num_pub_schools', 'tcuresal_percent', 'pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp', 'tlocrev_pp', 'tsrev_pp', 'tfedrev_pp', 'tcurinst_pp', 'tcurssv_pp', 'tcuroth_pp', 'tcursalary_pp', 'tcurbenefits_pp', 'totalexp_pp', 'tcapout_pp', 'tnonelse_pp', 'tcurelsc_pp', 'instexp_pp', 'tcurinst_percent', 'tcuroth_percent', 'tcurelsc_percent', 'tcurssvc_percent', 'tfedrev_percent', 'tlocrev_percent', 'tsrev_percent', 'r_ELL', 'r_IEP', 'r_lunch_free', 'r_lunch_reduced', 'r_stud_PKK', 'r_stud_18', 'r_stud_912', 'r_stud_re_M', 'r_stud_re_F', 'r_stud_re_AIAN', 'r_stud_re_AAP', 'r_stud_re_H', 'r_stud_re_B', 'r_stud_re_W', 'r_stud_re_HNPI', 'r_stud_re_Two', 'r_stud_re_Total', 'r_stud_reg_12_AIAN_M', 'r_stud_reg_12_AIAN_F', 'r_stud_reg_12_AAP_M', 'r_stud_reg_12_AAP_F', 'r_stud_reg_12_H_M', 'r_stud_reg_12_H_F', 'r_stud_reg_12_B_M', 'r_stud_reg_12_B_F', 'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F', 'r_stud_reg_12_HNPI_M', 'r_stud_reg_12_HNPI_F', 'r_stud_reg_12_Two_M', 'r_stud_reg_12_Two_F', 'r_st_PKT', 'r_st_KT', 'r_st_ET', 'r_st_ST', 'r_st_UT', 'r_st_TS', 'r_st_IA', 'r_st_IC', 'r_st_EGC', 'r_st_SGC', 'r_st_OGC', 'r_st_TGC', 'r_st_LMS', 'r_st_LMSS', 'r_st_LEA', 'r_st_LEASS', 'r_st_SA', 'r_st_SASS', 'r_st_SSSS', 'r_st_OSSS', 'r_lrev_pt', 'r_lrev_gst', 'r_lrev_put', 'r_lrev_it', 'r_lrev_aot', 'r_lrev_pgc', 'r_lrev_cc', 'r_lrev_oss', 'r_lrev_tui', 'r_lrev_trans', 'r_lrev_slr', 'r_lrev_ts', 'r_lrev_sar', 'r_lrev_osalserv', 'r_lrev_sfns', 'r_lrev_ie', 'r_lrev_molr', 'r_lrev_sp', 'r_lrev_rr', 'r_lrev_sale', 'r_lrev_ff', 'r_lrev_pc', 'r_srev_gfa', 'r_srev_sep', 'r_srev_trans', 'r_srev_sip', 'r_srev_cbsp', 'r_srev_vep', 'r_srev_codsp', 'r_srev_bep', 'r_srev_gt', 'r_srev_slp', 'r_srev_aor', 'r_srev_splea', 'r_srev_osp', 'r_srev_ns', 'r_frev_title1', 'r_frev_dis', 'r_frev_cna', 'r_frev_ems', 'r_frev_dfs', 'r_frev_voc', 'r_frev_ao', 'r_frev_ns', 'r_frev_ia', 'r_frev_be', 'r_frev_na', 'r_frev_aofed']



In [210]:

    
INDICATORS = []
for v in dftouse.columns:
    l=np.unique(dftouse[v])
    if len(l) <= 10:
        INDICATORS.append(v)
        
INDICATORS.remove('RESP_High_Graduation')        
INDICATORS.remove('RESP_Low_Graduation')  
print INDICATORS









    



['i_agency_type_local_school_district', 'i_agency_type_local_school_district_sup_union', 'i_agency_type_regional_education_services', 'i_agency_type_charter_school_agency', 'i_fin_sdlc_sec', 'i_fin_sdlc_elem_sec', 'i_fin_sdlc_voc', 'i_ucl_city_large', 'i_ucl_city_mid', 'i_ucl_city_small', 'i_ucl_suburb_large', 'i_ucl_suburb_mid', 'i_ucl_suburb_small', 'i_ucl_town_fringe', 'i_ucl_town_distant', 'i_ucl_town_remote', 'i_ucl_rural_fringe', 'i_ucl_rural_distant', 'i_ucl_rural_remote', 'i_cs_all_charter', 'i_cs_charter_noncharter', 'i_cs_all_noncharter', 'i_ma_ne_nr', 'i_ma_metropolitan', 'i_ma_micropolitan', 'i_lgo_K', 'i_lgo_PK']



In [211]:

    
#CITATION: From HW3
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(dftouse.shape[0]), train_size=0.7)



In [212]:

    
#CITATION: From HW3
mask=np.ones(dftouse.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)



In [213]:

    
# make sure we didn't get unlucky in our mask selection
print "% Graduation in Training:", np.mean(dftouse['afgr'][mask])
print "% Graduation in Testing:", np.mean(dftouse['afgr'][~mask])









    



% Graduation in Training: 82.5862895748
% Graduation in Testing: 82.9390605686



In [214]:

    
#CITATION: From HW3
mask.shape, mask.sum()









    Out[214]:





((10785,), 7549)



In [215]:

    
#CITATION: From HW3
from sklearn.preprocessing import StandardScaler

for col in STANDARDIZABLE:
    #print col
    valstrain=dftouse[col].values[mask]
    valstest=dftouse[col].values[~mask]
    scaler=StandardScaler().fit(valstrain)
    outtrain=scaler.transform(valstrain)
    outtest=scaler.fit_transform(valstest)
    out=np.empty(mask.shape[0])
    out[mask]=outtrain
    out[~mask]=outtest
    dftouse[col]=out



In [216]:

    
#CITATION: From HW3
lcols=list(dftouse.columns)
lcols.remove('RESP_High_Graduation')
lcols.remove('RESP_Low_Graduation')
lcols.remove('afgr')
lcols.remove('gradrate0607')
lcols.remove('gradrate0708')
lcols.remove('gradrate0809')
###Check for Index Column if exixts in data and remove.
indexcol='Unnamed: 0'
if indexcol in list(dftouse.columns): lcols.remove(indexcol)
print len(lcols)



In [217]:

    
#CITATION: From HW3
ccols=[]
for c in lcols:
    if c not in INDICATORS:
        ccols.append(c)
print len(ccols), len(INDICATORS)
ccols









    



127 27






    Out[217]:





['num_schools',
 'num_charter_schools',
 'num_pub_schools',
 'num_students',
 'pupil_teacher_ratio_dist',
 'pupil_teacher_ratio_ps',
 'totalrev_pp',
 'tlocrev_pp',
 'tsrev_pp',
 'tfedrev_pp',
 'tcurinst_pp',
 'tcurssv_pp',
 'tcuroth_pp',
 'tcursalary_pp',
 'tcurbenefits_pp',
 'totalexp_pp',
 'tcapout_pp',
 'tnonelse_pp',
 'tcurelsc_pp',
 'instexp_pp',
 'tcurelsc_percent',
 'tcurinst_percent',
 'tcuroth_percent',
 'tcuresal_percent',
 'tcurssvc_percent',
 'tfedrev_percent',
 'tlocrev_percent',
 'tsrev_percent',
 'r_ELL',
 'r_IEP',
 'r_lunch_free',
 'r_lunch_reduced',
 'r_stud_PKK',
 'r_stud_18',
 'r_stud_912',
 'r_stud_re_M',
 'r_stud_re_F',
 'r_stud_re_AIAN',
 'r_stud_re_AAP',
 'r_stud_re_H',
 'r_stud_re_B',
 'r_stud_re_W',
 'r_stud_re_HNPI',
 'r_stud_re_Two',
 'r_stud_re_Total',
 'r_stud_reg_12_AIAN_M',
 'r_stud_reg_12_AIAN_F',
 'r_stud_reg_12_AAP_M',
 'r_stud_reg_12_AAP_F',
 'r_stud_reg_12_H_M',
 'r_stud_reg_12_H_F',
 'r_stud_reg_12_B_M',
 'r_stud_reg_12_B_F',
 'r_stud_reg_12_W_M',
 'r_stud_reg_12_W_F',
 'r_stud_reg_12_HNPI_M',
 'r_stud_reg_12_HNPI_F',
 'r_stud_reg_12_Two_M',
 'r_stud_reg_12_Two_F',
 'r_st_PKT',
 'r_st_KT',
 'r_st_ET',
 'r_st_ST',
 'r_st_UT',
 'r_st_TS',
 'r_st_IA',
 'r_st_IC',
 'r_st_EGC',
 'r_st_SGC',
 'r_st_OGC',
 'r_st_TGC',
 'r_st_LMS',
 'r_st_LMSS',
 'r_st_LEA',
 'r_st_LEASS',
 'r_st_SA',
 'r_st_SASS',
 'r_st_SSSS',
 'r_st_OSSS',
 'r_lrev_pt',
 'r_lrev_gst',
 'r_lrev_put',
 'r_lrev_it',
 'r_lrev_aot',
 'r_lrev_pgc',
 'r_lrev_cc',
 'r_lrev_oss',
 'r_lrev_tui',
 'r_lrev_trans',
 'r_lrev_slr',
 'r_lrev_ts',
 'r_lrev_sar',
 'r_lrev_osalserv',
 'r_lrev_sfns',
 'r_lrev_ie',
 'r_lrev_molr',
 'r_lrev_sp',
 'r_lrev_rr',
 'r_lrev_sale',
 'r_lrev_ff',
 'r_lrev_pc',
 'r_srev_gfa',
 'r_srev_sep',
 'r_srev_trans',
 'r_srev_sip',
 'r_srev_cbsp',
 'r_srev_vep',
 'r_srev_codsp',
 'r_srev_bep',
 'r_srev_gt',
 'r_srev_slp',
 'r_srev_aor',
 'r_srev_splea',
 'r_srev_osp',
 'r_srev_ns',
 'r_frev_title1',
 'r_frev_dis',
 'r_frev_cna',
 'r_frev_ems',
 'r_frev_dfs',
 'r_frev_voc',
 'r_frev_ao',
 'r_frev_ns',
 'r_frev_ia',
 'r_frev_be',
 'r_frev_na',
 'r_frev_aofed']



In [218]:

    
Xmatrix=dftouse[lcols].values
Yresp=dftouse['afgr'].values



In [219]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]

Linear Regression

With so many features, we expect the performance of linear regression to be terrible. We include this to justify why we need to use more sophisticated techniques such as lasso and elastic net to select the key features.



In [220]:

    
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[220]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [221]:

    
print lm.intercept_









    



163836.034619

Top Factors - Positive



In [222]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:25]









    Out[222]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      128
      r_srev_gfa
      6.846079e+10
    
    
      138
      r_srev_aor
      4.107836e+10
    
    
      139
      r_srev_splea
      2.919768e+10
    
    
      129
      r_srev_sep
      2.685793e+10
    
    
      131
      r_srev_sip
      1.867126e+10
    
    
      134
      r_srev_codsp
      1.629000e+10
    
    
      141
      r_srev_ns
      1.311669e+10
    
    
      130
      r_srev_trans
      1.234336e+10
    
    
      132
      r_srev_cbsp
      8.097579e+09
    
    
      133
      r_srev_vep
      3.646229e+09
    
    
      140
      r_srev_osp
      3.129522e+09
    
    
      135
      r_srev_bep
      2.664747e+09
    
    
      137
      r_srev_slp
      1.643682e+09
    
    
      136
      r_srev_gt
      9.053896e+08
    
    
      19
      instexp_pp
      2.884532e+05
    
    
      50
      i_ma_ne_nr
      1.241564e+05
    
    
      51
      i_ma_metropolitan
      1.241558e+05
    
    
      52
      i_ma_micropolitan
      1.241554e+05
    
    
      41
      i_ucl_town_fringe
      3.808005e+04
    
    
      46
      i_ucl_rural_remote
      3.807999e+04
    
    
      45
      i_ucl_rural_distant
      3.807883e+04
    
    
      40
      i_ucl_suburb_small
      3.807849e+04
    
    
      38
      i_ucl_suburb_large
      3.807848e+04
    
    
      43
      i_ucl_town_remote
      3.807846e+04
    
    
      42
      i_ucl_town_distant
      3.807844e+04

Top Factors - Negative



In [223]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:25]









    Out[223]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      148
      r_frev_ao
      -2.851683e+08
    
    
      143
      r_frev_dis
      -2.145319e+08
    
    
      142
      r_frev_title1
      -1.651628e+08
    
    
      149
      r_frev_ns
      -1.566859e+08
    
    
      144
      r_frev_cna
      -1.112821e+08
    
    
      153
      r_frev_aofed
      -1.086761e+08
    
    
      150
      r_frev_ia
      -9.347220e+07
    
    
      145
      r_frev_ems
      -3.841922e+07
    
    
      151
      r_frev_be
      -3.007251e+07
    
    
      147
      r_frev_voc
      -2.416929e+07
    
    
      152
      r_frev_na
      -1.904637e+07
    
    
      146
      r_frev_dfs
      -1.766999e+07
    
    
      48
      i_cs_charter_noncharter
      -3.260136e+05
    
    
      49
      i_cs_all_noncharter
      -3.260118e+05
    
    
      10
      tcurinst_pp
      -2.878319e+05
    
    
      47
      i_cs_all_charter
      -2.877958e+05
    
    
      31
      i_agency_type_charter_school_agency
      -3.822082e+04
    
    
      6
      totalrev_pp
      -1.454550e+03
    
    
      18
      tcurelsc_pp
      -1.137150e+03
    
    
      91
      r_st_TS
      -2.583916e+01
    
    
      24
      tcurssvc_percent
      -1.375403e+01
    
    
      21
      tcurinst_percent
      -1.342402e+01
    
    
      2
      num_pub_schools
      -1.252104e+01
    
    
      30
      i_agency_type_regional_education_services
      -4.815589e+00
    
    
      22
      tcuroth_percent
      -4.177093e+00



In [224]:

    
from sklearn import cross_validation
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[224]:





array([ 0.36469313,  0.35761986,  0.38660675])



In [225]:

    
from sklearn.metrics import mean_squared_error

ypredict = lm.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Linear Regression : Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: School District Data")
plt.xlim(0,105)
#plt.ylim(0,105)
plt.show()









    



Linear Regression : Mean Squared Error :  2.11897145556e+19

Lasso Regression

Here we have a huge dataset, and lasso goes after the signficant fields so that we don't self select.



In [226]:

    
lasso = linear_model.Lasso()



In [227]:

    
lasso.fit(Xtrain, ytrain)









    Out[227]:





Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)



In [228]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:10]









    Out[228]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      81
      r_stud_reg_12_W_F
      1.541761
    
    
      80
      r_stud_reg_12_W_M
      1.165075
    
    
      26
      tlocrev_percent
      0.467191
    
    
      0
      num_schools
      -0.000000
    
    
      107
      r_lrev_gst
      -0.000000
    
    
      101
      r_st_LEASS
      -0.000000
    
    
      102
      r_st_SA
      -0.000000
    
    
      103
      r_st_SASS
      -0.000000
    
    
      104
      r_st_SSSS
      -0.000000
    
    
      105
      r_st_OSSS
      0.000000



In [229]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:10]









    Out[229]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      57
      r_lunch_free
      -3.312818
    
    
      67
      r_stud_re_B
      -1.548774
    
    
      8
      tsrev_pp
      -0.561664
    
    
      25
      tfedrev_percent
      -0.414278
    
    
      9
      tfedrev_pp
      -0.068704
    
    
      100
      r_st_LEA
      -0.000000
    
    
      101
      r_st_LEASS
      -0.000000
    
    
      102
      r_st_SA
      -0.000000
    
    
      103
      r_st_SASS
      -0.000000
    
    
      104
      r_st_SSSS
      -0.000000



In [230]:

    
from sklearn import cross_validation
cross_validation.cross_val_score(lasso, Xtrain, ytrain, scoring='r2')









    Out[230]:





array([ 0.31777824,  0.30340675,  0.28960471])



In [231]:

    
ypredict = lasso.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Lasso Regression : Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Lasso Regression: School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Lasso Regression : Mean Squared Error :  128.682382592

We hit a wall at actual graduation rate of 100%, with some of our predictions varying from 70% to 100%.

Elastic Net Regression



In [232]:

    
elasticnet = linear_model.ElasticNet()
elasticnet.fit(Xtrain, ytrain)









    Out[232]:





ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)



In [233]:

    
pd.DataFrame(zip(dftouse[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:20]









    Out[233]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      81
      r_stud_reg_12_W_F
      1.239553
    
    
      80
      r_stud_reg_12_W_M
      1.098755
    
    
      68
      r_stud_re_W
      0.795457
    
    
      26
      tlocrev_percent
      0.693129
    
    
      77
      r_stud_reg_12_H_F
      0.351569
    
    
      23
      tcuresal_percent
      0.262755
    
    
      75
      r_stud_reg_12_AAP_F
      0.227409
    
    
      74
      r_stud_reg_12_AAP_M
      0.158365
    
    
      63
      r_stud_re_F
      0.113068
    
    
      59
      r_stud_PKK
      0.105005
    
    
      73
      r_stud_reg_12_AIAN_F
      0.056997
    
    
      109
      r_lrev_it
      0.039117
    
    
      116
      r_lrev_slr
      0.030626
    
    
      99
      r_st_LMSS
      0.028520
    
    
      76
      r_stud_reg_12_H_M
      0.022879
    
    
      96
      r_st_OGC
      0.013479
    
    
      138
      r_srev_aor
      0.000000
    
    
      88
      r_st_ET
      -0.000000
    
    
      101
      r_st_LEASS
      -0.000000
    
    
      100
      r_st_LEA
      -0.000000



In [234]:

    
pd.DataFrame(zip(dftouse[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:20]









    Out[234]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      57
      r_lunch_free
      -1.950453
    
    
      67
      r_stud_re_B
      -1.240224
    
    
      25
      tfedrev_percent
      -0.800888
    
    
      8
      tsrev_pp
      -0.500096
    
    
      60
      r_stud_18
      -0.444123
    
    
      9
      tfedrev_pp
      -0.343794
    
    
      142
      r_frev_title1
      -0.343191
    
    
      90
      r_st_UT
      -0.334007
    
    
      107
      r_lrev_gst
      -0.321746
    
    
      24
      tcurssvc_percent
      -0.263275
    
    
      0
      num_schools
      -0.197457
    
    
      2
      num_pub_schools
      -0.193220
    
    
      27
      tsrev_percent
      -0.173343
    
    
      122
      r_lrev_molr
      -0.104637
    
    
      125
      r_lrev_sale
      -0.100227
    
    
      56
      r_IEP
      -0.095247
    
    
      55
      r_ELL
      -0.065675
    
    
      136
      r_srev_gt
      -0.058401
    
    
      132
      r_srev_cbsp
      -0.055461
    
    
      11
      tcurssv_pp
      -0.014056



In [235]:

    
cross_validation.cross_val_score(elasticnet, Xtrain, ytrain, scoring='r2')









    Out[235]:





array([ 0.32812438,  0.3043909 ,  0.31874612])



In [236]:

    
ypredict = elasticnet.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Elastic Net : Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Elastic Net Regression: School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Elastic Net : Mean Squared Error :  125.889829557

We hit a wall at actual graduation rate of 100% again, with predictions between 70% and 100%. Of our three models of school district data only, elastic net provides the best results.

Decision Tree & Random Forests



In [368]:

    
dfdr=pd.read_csv("data/finaldata/dftouse.csv")
print dfdr.shape

# Create test/train mask
itrain_dr1, itest_dr1 = train_test_split(xrange(dfdr.shape[0]), train_size=0.7)
mask_dr1=np.ones(dfdr.shape[0], dtype='int')
mask_dr1[itrain_dr1]=1
mask_dr1[itest_dr1]=0
mask_dr1 = (mask_dr1==1)

# make sure we didn't get unlucky in our mask selection
print "% Graduation in Training:", np.mean(dfdr['afgr'][mask_dr1])
print "% Graduation in Testing:", np.mean(dfdr['afgr'][~mask_dr1])

# Indicators used : Funding/Expenditure/Location/School Types (no Race)
Xnames2 = [
            'pupil_teacher_ratio_dist',
            'totalrev_pp',
            'tcurinst_pp',
            'tcurssv_pp',
            'tcursalary_pp',
            'tcurbenefits_pp',
            'totalexp_pp',
            'tcapout_pp',
            'tnonelse_pp',
            'tcurelsc_pp',
            'instexp_pp',
            'i_agency_type_local_school_district',
            'i_agency_type_local_school_district_sup_union',
            'i_agency_type_regional_education_services',
            'i_agency_type_charter_school_agency',
            'i_fin_sdlc_sec',
            'i_fin_sdlc_elem_sec',
            'i_fin_sdlc_voc',
            'i_ucl_city_large',
            'i_ucl_city_mid',
            'i_ucl_city_small',
            'i_ucl_suburb_large',
            'i_ucl_suburb_mid',
            'i_ucl_suburb_small',
            'i_ucl_town_fringe',
            'i_ucl_town_distant',
            'i_ucl_town_remote',
            'i_ucl_rural_fringe',
            'i_ucl_rural_distant',
            'i_ucl_rural_remote',
            'i_cs_all_charter',
            'i_cs_charter_noncharter',
            'i_cs_all_noncharter',
            'i_ma_ne_nr',
            'i_ma_metropolitan',
            'i_ma_micropolitan',            
            'r_ELL',
            'r_IEP',
            'r_lunch_free',
            'r_lunch_reduced'
]

# Target :
# Adding Target column :
dfdr['target'] = dfdr['afgr'].astype(int)
target2 = 'target'









    



(10785, 157)
% Graduation in Training: 82.5769770831
% Graduation in Testing: 82.9607849197



In [370]:

    
# Decision Trees
from sklearn import tree
clfTree2 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [10, 11, 12, 13, 14, 15], 'min_samples_leaf': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]}
clfTree2, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree2, parameters, dfdr, 
                                                     Xnames2, target2, 1, 
                                                     mask=mask_dr1, n_jobs = 4, score_func = 'r2')

importance_list = clfTree2.feature_importances_
name_list = dfdr[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Descision Trees 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()









    



using mask






    



/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.






    



BEST {'max_depth': 13, 'min_samples_leaf': 20} -0.222224145502 [mean: -0.31357, std: 0.08894, params: {'max_depth': 10, 'min_samples_leaf': 10}, mean: -0.31278, std: 0.04026, params: {'max_depth': 10, 'min_samples_leaf': 11}, mean: -0.29852, std: 0.05052, params: {'max_depth': 10, 'min_samples_leaf': 12}, mean: -0.29272, std: 0.05331, params: {'max_depth': 10, 'min_samples_leaf': 13}, mean: -0.30353, std: 0.06196, params: {'max_depth': 10, 'min_samples_leaf': 14}, mean: -0.29367, std: 0.06694, params: {'max_depth': 10, 'min_samples_leaf': 15}, mean: -0.28253, std: 0.03369, params: {'max_depth': 10, 'min_samples_leaf': 16}, mean: -0.29391, std: 0.02013, params: {'max_depth': 10, 'min_samples_leaf': 17}, mean: -0.33733, std: 0.06939, params: {'max_depth': 10, 'min_samples_leaf': 18}, mean: -0.29145, std: 0.03723, params: {'max_depth': 10, 'min_samples_leaf': 19}, mean: -0.29651, std: 0.03978, params: {'max_depth': 10, 'min_samples_leaf': 20}, mean: -0.29572, std: 0.05237, params: {'max_depth': 11, 'min_samples_leaf': 10}, mean: -0.27075, std: 0.04740, params: {'max_depth': 11, 'min_samples_leaf': 11}, mean: -0.25353, std: 0.05613, params: {'max_depth': 11, 'min_samples_leaf': 12}, mean: -0.25997, std: 0.07588, params: {'max_depth': 11, 'min_samples_leaf': 13}, mean: -0.26119, std: 0.07428, params: {'max_depth': 11, 'min_samples_leaf': 14}, mean: -0.24715, std: 0.07758, params: {'max_depth': 11, 'min_samples_leaf': 15}, mean: -0.24481, std: 0.04529, params: {'max_depth': 11, 'min_samples_leaf': 16}, mean: -0.24848, std: 0.02854, params: {'max_depth': 11, 'min_samples_leaf': 17}, mean: -0.27187, std: 0.04976, params: {'max_depth': 11, 'min_samples_leaf': 18}, mean: -0.24906, std: 0.04204, params: {'max_depth': 11, 'min_samples_leaf': 19}, mean: -0.24833, std: 0.03129, params: {'max_depth': 11, 'min_samples_leaf': 20}, mean: -0.26851, std: 0.06931, params: {'max_depth': 12, 'min_samples_leaf': 10}, mean: -0.25200, std: 0.05054, params: {'max_depth': 12, 'min_samples_leaf': 11}, mean: -0.24486, std: 0.05990, params: {'max_depth': 12, 'min_samples_leaf': 12}, mean: -0.24695, std: 0.06849, params: {'max_depth': 12, 'min_samples_leaf': 13}, mean: -0.24654, std: 0.06775, params: {'max_depth': 12, 'min_samples_leaf': 14}, mean: -0.24049, std: 0.07025, params: {'max_depth': 12, 'min_samples_leaf': 15}, mean: -0.23912, std: 0.06094, params: {'max_depth': 12, 'min_samples_leaf': 16}, mean: -0.24294, std: 0.05050, params: {'max_depth': 12, 'min_samples_leaf': 17}, mean: -0.25461, std: 0.04269, params: {'max_depth': 12, 'min_samples_leaf': 18}, mean: -0.24539, std: 0.04351, params: {'max_depth': 12, 'min_samples_leaf': 19}, mean: -0.23765, std: 0.03562, params: {'max_depth': 12, 'min_samples_leaf': 20}, mean: -0.27330, std: 0.10805, params: {'max_depth': 13, 'min_samples_leaf': 10}, mean: -0.25870, std: 0.07133, params: {'max_depth': 13, 'min_samples_leaf': 11}, mean: -0.23674, std: 0.09325, params: {'max_depth': 13, 'min_samples_leaf': 12}, mean: -0.23011, std: 0.06606, params: {'max_depth': 13, 'min_samples_leaf': 13}, mean: -0.23544, std: 0.07363, params: {'max_depth': 13, 'min_samples_leaf': 14}, mean: -0.22365, std: 0.07199, params: {'max_depth': 13, 'min_samples_leaf': 15}, mean: -0.22449, std: 0.03957, params: {'max_depth': 13, 'min_samples_leaf': 16}, mean: -0.22625, std: 0.03366, params: {'max_depth': 13, 'min_samples_leaf': 17}, mean: -0.23698, std: 0.04312, params: {'max_depth': 13, 'min_samples_leaf': 18}, mean: -0.23305, std: 0.03976, params: {'max_depth': 13, 'min_samples_leaf': 19}, mean: -0.22222, std: 0.03142, params: {'max_depth': 13, 'min_samples_leaf': 20}, mean: -0.28608, std: 0.07610, params: {'max_depth': 14, 'min_samples_leaf': 10}, mean: -0.24254, std: 0.06110, params: {'max_depth': 14, 'min_samples_leaf': 11}, mean: -0.23615, std: 0.08666, params: {'max_depth': 14, 'min_samples_leaf': 12}, mean: -0.22384, std: 0.06266, params: {'max_depth': 14, 'min_samples_leaf': 13}, mean: -0.23393, std: 0.07595, params: {'max_depth': 14, 'min_samples_leaf': 14}, mean: -0.22243, std: 0.07934, params: {'max_depth': 14, 'min_samples_leaf': 15}, mean: -0.23548, std: 0.04813, params: {'max_depth': 14, 'min_samples_leaf': 16}, mean: -0.23616, std: 0.04333, params: {'max_depth': 14, 'min_samples_leaf': 17}, mean: -0.26082, std: 0.06617, params: {'max_depth': 14, 'min_samples_leaf': 18}, mean: -0.25501, std: 0.04981, params: {'max_depth': 14, 'min_samples_leaf': 19}, mean: -0.24177, std: 0.03891, params: {'max_depth': 14, 'min_samples_leaf': 20}, mean: -0.30148, std: 0.08120, params: {'max_depth': 15, 'min_samples_leaf': 10}, mean: -0.25023, std: 0.05434, params: {'max_depth': 15, 'min_samples_leaf': 11}, mean: -0.23151, std: 0.09105, params: {'max_depth': 15, 'min_samples_leaf': 12}, mean: -0.22998, std: 0.06898, params: {'max_depth': 15, 'min_samples_leaf': 13}, mean: -0.23717, std: 0.06946, params: {'max_depth': 15, 'min_samples_leaf': 14}, mean: -0.23157, std: 0.07177, params: {'max_depth': 15, 'min_samples_leaf': 15}, mean: -0.23620, std: 0.04815, params: {'max_depth': 15, 'min_samples_leaf': 16}, mean: -0.23564, std: 0.03771, params: {'max_depth': 15, 'min_samples_leaf': 17}, mean: -0.27290, std: 0.05155, params: {'max_depth': 15, 'min_samples_leaf': 18}, mean: -0.26485, std: 0.04995, params: {'max_depth': 15, 'min_samples_leaf': 19}, mean: -0.25470, std: 0.03460, params: {'max_depth': 15, 'min_samples_leaf': 20}]



In [371]:

    
ypredict = clfTree2.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Decision Tree : School District")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Mean Squared Error :  218.606613103



In [372]:

    
# Random Forests
from sklearn.ensemble import RandomForestClassifier
clfForest2 = RandomForestClassifier()

parameters = {"n_estimators": range(1, 20)}
clfForest2, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest2, parameters, 
                                                       dfdr, Xnames2, target2, 1, mask=mask_dr1, 
                                                       n_jobs = 4, score_func='r2')

importance_list = clfForest2.feature_importances_
name_list = dfdr[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()









    



using mask






    



/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.






    



BEST {'n_estimators': 18} -0.310739515573 [mean: -0.53765, std: 0.15353, params: {'n_estimators': 1}, mean: -0.64786, std: 0.29067, params: {'n_estimators': 2}, mean: -0.91929, std: 0.22069, params: {'n_estimators': 3}, mean: -1.00916, std: 0.27087, params: {'n_estimators': 4}, mean: -1.08944, std: 0.29275, params: {'n_estimators': 5}, mean: -1.13645, std: 0.24979, params: {'n_estimators': 6}, mean: -0.96025, std: 0.22346, params: {'n_estimators': 7}, mean: -0.95468, std: 0.25764, params: {'n_estimators': 8}, mean: -0.76561, std: 0.17615, params: {'n_estimators': 9}, mean: -0.63919, std: 0.12012, params: {'n_estimators': 10}, mean: -0.58850, std: 0.11964, params: {'n_estimators': 11}, mean: -0.46967, std: 0.11754, params: {'n_estimators': 12}, mean: -0.42887, std: 0.08869, params: {'n_estimators': 13}, mean: -0.40435, std: 0.06041, params: {'n_estimators': 14}, mean: -0.37523, std: 0.07093, params: {'n_estimators': 15}, mean: -0.38568, std: 0.09091, params: {'n_estimators': 16}, mean: -0.33643, std: 0.07983, params: {'n_estimators': 17}, mean: -0.31074, std: 0.05411, params: {'n_estimators': 18}, mean: -0.34656, std: 0.03337, params: {'n_estimators': 19}]



In [373]:

    
ypredict = clfForest2.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Random Forest : School District ")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Mean Squared Error :  241.701792336

Predicting graduation based on previous year graduation

We now add previous years of graduation data into our models to see if they outperform the model of school district data alone. We expect that adding in previous years of graduation data will outperform the models of school district data alone.



In [238]:

    
dftouse.fillna(value=0,inplace=True)

As we add in previous years of data, some of those previous years do not have data for some of the school districts. Therefore, the row counts will drop.



In [239]:

    
dftouse = dftouse[dftouse['gradrate0607']>0]
print 'Has 0607 Grad Data: ' + str(len(dftouse))
dftouse = dftouse[dftouse['gradrate0708']>0]
print 'Has 0708 Grad Data: ' + str(len(dftouse))
dftouse = dftouse[dftouse['gradrate0809']>0]
print 'Has 0809 Grad Data: ' + str(len(dftouse))









    



Has 0607 Grad Data: 10785
Has 0708 Grad Data: 10784
Has 0809 Grad Data: 10783



In [240]:

    
dftouse.shape









    Out[240]:





(10783, 160)



In [241]:

    
gradpred = dftouse.copy(deep=True)



In [242]:

    
itrain, itest = train_test_split(xrange(dftouse.shape[0]), train_size=0.7)



In [243]:

    
#CITATION: From HW3
mask=np.ones(dftouse.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)



In [244]:

    
# make sure we didn't get unlucky in our mask selection
print "% Graduation in Training:", np.mean(dftouse['afgr'][mask])
print "% Graduation in Testing:", np.mean(dftouse['afgr'][~mask])









    



% Graduation in Training: 82.6549947006
% Graduation in Testing: 82.7872642968



In [245]:

    
Xmatrix=dftouse[['gradrate0607', 'gradrate0708', 'gradrate0809']].values
Yresp=dftouse['afgr'].values



In [246]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]

Linear Regression - 0607, 0708, and 0809

You have multiple previous years of school district graduation data available to you, and you want to add all of them into the model to predict graduation outcome. This would be like saying, I want to predict school year 2014-2015 graduation rates, and I have 2013-2014, 2012-2013, and 2011-2012 school years of graduation rates available. This is not a realistic scenario for the general public - the latest school year of graduation rate data publically available is 2009-2010. That means there is a 5 year lapse in data availability, which would not make this regression possible to the general public if it we were the year 2009-2010.



In [247]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[247]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [248]:

    
pd.DataFrame(zip(dftouse[['gradrate0607', 'gradrate0708', 'gradrate0809']].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[248]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      gradrate0607
      0.188040
    
    
      1
      gradrate0708
      0.254905
    
    
      2
      gradrate0809
      0.354630



In [249]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[249]:





array([ 0.45364231,  0.4480428 ,  0.48609109])



In [250]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[250]:





96.259827122680122



In [251]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 06-07, 07-08, 08-09 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

This model outperforms all other models created thus far. It hits the same wall at 100% graduation as the other models. As noted earlier, this is not a realistic scenario because the most recent data that would have been available during school year 2009-2010 would have been 2004-2005.

Linear Regression - 0708 and 0809

This model assumes you have the two most current previous years of data available to you.



In [252]:

    
Xmatrix=dftouse[['gradrate0708', 'gradrate0809']].values
Yresp=dftouse['afgr'].values



In [253]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [254]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[254]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [255]:

    
print lm.intercept_
print lm.coef_









    



20.444005339
[ 0.34651434  0.41593984]



In [256]:

    
pd.DataFrame(zip(dftouse[['gradrate0708', 'gradrate0809']].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[256]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      gradrate0708
      0.346514
    
    
      1
      gradrate0809
      0.415940



In [257]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[257]:





array([ 0.42819447,  0.43726416,  0.4676011 ])



In [258]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[258]:





99.637853083123559



In [259]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 07-08, 08-09 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

This model is less accurate than having 3 years of graduation data available, yet still more accurate than all of our previous models.

Linear Regression - 0809 Only

This model assumes you have the most recent previous year available.



In [260]:

    
Xmatrix=dftouse[['gradrate0809']].values
Yresp=dftouse['afgr'].values



In [261]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [262]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[262]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [263]:

    
print lm.intercept_
print lm.coef_









    



29.687441265
[ 0.64703357]



In [264]:

    
pd.DataFrame(zip(dftouse[['gradrate0809']].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[264]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      gradrate0809
      0.647034



In [265]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[265]:





array([ 0.36060434,  0.36953741,  0.37925459])



In [266]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[266]:





114.70927641743916



In [267]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 0809 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

This model performs worse than the two previous. It hits a wall at both actual grad rate and predicted grad rate.

Linear Regression - 0708 Only

This model assumes two years prior data available only.



In [268]:

    
Xmatrix=dftouse[['gradrate0708']].values
Yresp=dftouse['afgr'].values



In [269]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [270]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[270]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [271]:

    
print lm.intercept_
print lm.coef_









    



34.7738087429
[ 0.58916098]



In [272]:

    
pd.DataFrame(zip(dftouse[['gradrate0708']].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[272]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      gradrate0708
      0.589161



In [273]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[273]:





array([ 0.32280937,  0.34751562,  0.38107179])



In [274]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[274]:





115.27305941837383



In [275]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 0708 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Linear Regression - 0607 Only

This model assumes three years prior data available only.



In [276]:

    
Xmatrix=dftouse[['gradrate0607']].values
Yresp=dftouse['afgr'].values



In [277]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [278]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[278]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [279]:

    
print lm.intercept_
print lm.coef_









    



38.0255534965
[ 0.55324569]



In [280]:

    
pd.DataFrame(zip(dftouse[['gradrate0607']].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[280]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      gradrate0607
      0.553246



In [281]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[281]:





array([ 0.34062551,  0.29943651,  0.31813816])



In [282]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[282]:





124.37565458009647



In [283]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 0607 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

If you only have graduation data available for 3 years prior, the performance is the worst. Again, walls at actual and predicted 100% graduation rate. This model overpredicts, telling a school district that it is doing better than it is, which is not a good scenario.

Predicting graduation based on previous year graduation and all other factors

If we married the previous years of graduation data and the school district data, could we create the best model? Is it better than just graduation rate alone?



In [284]:

    
lcols=list(dftouse.columns)
lcols.remove('RESP_High_Graduation')
lcols.remove('RESP_Low_Graduation')
lcols.remove('afgr')



In [285]:

    
Xmatrix=dftouse[lcols].values
Yresp=dftouse['afgr'].values



In [286]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]

Linear Regression



In [287]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[287]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [288]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:20]









    Out[288]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      34
      i_fin_sdlc_voc
      10.833789
    
    
      18
      tcurelsc_pp
      9.617015
    
    
      33
      i_fin_sdlc_elem_sec
      9.372606
    
    
      54
      i_lgo_PK
      7.170665
    
    
      53
      i_lgo_K
      6.893866
    
    
      0
      num_schools
      6.641498
    
    
      101
      r_st_LEASS
      5.879462
    
    
      148
      r_frev_ao
      5.565641
    
    
      32
      i_fin_sdlc_sec
      5.094310
    
    
      128
      r_srev_gfa
      4.702548
    
    
      143
      r_frev_dis
      4.335033
    
    
      149
      r_frev_ns
      3.546220
    
    
      142
      r_frev_title1
      3.440538
    
    
      92
      r_st_IA
      3.325286
    
    
      7
      tlocrev_pp
      3.096447
    
    
      138
      r_srev_aor
      2.824535
    
    
      81
      r_stud_reg_12_W_F
      2.263718
    
    
      8
      tsrev_pp
      2.077341
    
    
      153
      r_frev_aofed
      2.004041
    
    
      77
      r_stud_reg_12_H_F
      1.996720



In [289]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:20]









    Out[289]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      26
      tlocrev_percent
      -38.035606
    
    
      27
      tsrev_percent
      -34.018567
    
    
      25
      tfedrev_percent
      -14.698286
    
    
      91
      r_st_TS
      -10.003194
    
    
      68
      r_stud_re_W
      -9.298882
    
    
      21
      tcurinst_percent
      -7.825261
    
    
      2
      num_pub_schools
      -7.714817
    
    
      24
      tcurssvc_percent
      -7.695972
    
    
      30
      i_agency_type_regional_education_services
      -7.089685
    
    
      67
      r_stud_re_B
      -6.779637
    
    
      66
      r_stud_re_H
      -6.759184
    
    
      64
      r_stud_re_AIAN
      -4.411344
    
    
      6
      totalrev_pp
      -3.691741
    
    
      11
      tcurssv_pp
      -3.441055
    
    
      29
      i_agency_type_local_school_district_sup_union
      -3.146080
    
    
      13
      tcursalary_pp
      -2.965253
    
    
      100
      r_st_LEA
      -2.824958
    
    
      35
      i_ucl_city_large
      -2.683151
    
    
      22
      tcuroth_percent
      -2.312685
    
    
      15
      totalexp_pp
      -1.717676



In [290]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[290]:





array([ 0.48996459,  0.43592357,  0.51999977])



In [291]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[291]:





89.679583680189253



In [292]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: School District Data + 0607, 0708, 0809 Grad Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Lasso Regression



In [293]:

    
lasso = linear_model.Lasso()



In [294]:

    
lasso.fit(Xtrain, ytrain)









    Out[294]:





Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)



In [295]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:10]









    Out[295]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      81
      r_stud_reg_12_W_F
      0.880563
    
    
      156
      gradrate0809
      0.336472
    
    
      155
      gradrate0708
      0.240823
    
    
      154
      gradrate0607
      0.171633
    
    
      108
      r_lrev_put
      0.000000
    
    
      101
      r_st_LEASS
      -0.000000
    
    
      102
      r_st_SA
      -0.000000
    
    
      103
      r_st_SASS
      -0.000000
    
    
      104
      r_st_SSSS
      -0.000000
    
    
      105
      r_st_OSSS
      0.000000



In [296]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:10]









    Out[296]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      57
      r_lunch_free
      -0.185954
    
    
      100
      r_st_LEA
      -0.000000
    
    
      101
      r_st_LEASS
      -0.000000
    
    
      102
      r_st_SA
      -0.000000
    
    
      103
      r_st_SASS
      -0.000000
    
    
      104
      r_st_SSSS
      -0.000000
    
    
      105
      r_st_OSSS
      0.000000
    
    
      106
      r_lrev_pt
      0.000000
    
    
      107
      r_lrev_gst
      -0.000000
    
    
      108
      r_lrev_put
      0.000000



In [297]:

    
cross_validation.cross_val_score(lasso, Xtrain, ytrain, scoring='r2')









    Out[297]:





array([ 0.46741886,  0.46351771,  0.4990405 ])



In [298]:

    
yp = lasso.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[298]:





93.677188958100984



In [299]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Lasso Regression: School District Data + 0607, 0708, 0809 Grad")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Elastic Net Regression



In [300]:

    
elasticnet = linear_model.ElasticNet()
elasticnet.fit(Xtrain, ytrain)









    Out[300]:





ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)



In [301]:

    
pd.DataFrame(zip(dftouse[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:10]









    Out[301]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      81
      r_stud_reg_12_W_F
      0.724249
    
    
      80
      r_stud_reg_12_W_M
      0.390046
    
    
      156
      gradrate0809
      0.321462
    
    
      77
      r_stud_reg_12_H_F
      0.263953
    
    
      155
      gradrate0708
      0.231649
    
    
      23
      tcuresal_percent
      0.206815
    
    
      154
      gradrate0607
      0.159676
    
    
      26
      tlocrev_percent
      0.113949
    
    
      74
      r_stud_reg_12_AAP_M
      0.079391
    
    
      106
      r_lrev_pt
      0.008552



In [302]:

    
pd.DataFrame(zip(dftouse[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:10]









    Out[302]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      57
      r_lunch_free
      -0.386189
    
    
      67
      r_stud_re_B
      -0.253313
    
    
      8
      tsrev_pp
      -0.224147
    
    
      132
      r_srev_cbsp
      -0.136606
    
    
      119
      r_lrev_osalserv
      -0.099034
    
    
      9
      tfedrev_pp
      -0.080264
    
    
      25
      tfedrev_percent
      -0.064001
    
    
      60
      r_stud_18
      -0.056252
    
    
      17
      tnonelse_pp
      -0.053719
    
    
      122
      r_lrev_molr
      -0.009340



In [303]:

    
cross_validation.cross_val_score(elasticnet, Xtrain, ytrain, scoring='r2')









    Out[303]:





array([ 0.48120317,  0.47189244,  0.50903098])



In [304]:

    
yp = elasticnet.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[304]:





91.59125369683909



In [305]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Elastic Net Regression: School District Data + 0607, 0708, 0809 Grad")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Decision Trees and Random Forests



In [306]:

    
# Create test/train mask
itrain_dr, itest_dr = train_test_split(xrange(df_yr.shape[0]), train_size=0.7)
mask_dr=np.ones(df_yr.shape[0], dtype='int')
mask_dr[itrain_dr]=1
mask_dr[itest_dr]=0
mask_dr = (mask_dr==1)

# Indicators used : Funding/Expenditure/Location/School Types and Race/Sex
Xnames1 = [
            'pupil_teacher_ratio_dist',
            'totalrev_pp',
            'tcurinst_pp',
            'tcurssv_pp',
            'tcursalary_pp',
            'tcurbenefits_pp',
            'totalexp_pp',
            'tcapout_pp',
            'tnonelse_pp',
            'tcurelsc_pp',
            'instexp_pp',
            'i_agency_type_local_school_district',
            'i_agency_type_local_school_district_sup_union',
            'i_agency_type_regional_education_services',
            'i_agency_type_charter_school_agency',
            'i_fin_sdlc_sec',
            'i_fin_sdlc_elem_sec',
            'i_fin_sdlc_voc',
            'i_ucl_city_large',
            'i_ucl_city_mid',
            'i_ucl_city_small',
            'i_ucl_suburb_large',
            'i_ucl_suburb_mid',
            'i_ucl_suburb_small',
            'i_ucl_town_fringe',
            'i_ucl_town_distant',
            'i_ucl_town_remote',
            'i_ucl_rural_fringe',
            'i_ucl_rural_distant',
            'i_ucl_rural_remote',
            'i_cs_all_charter',
            'i_cs_charter_noncharter',
            'i_cs_all_noncharter',
            'i_ma_ne_nr',
            'i_ma_metropolitan',
            'i_ma_micropolitan',            
            'r_ELL',
            'r_IEP',
            'r_stud_re_M',
            'r_stud_re_F',
            'r_stud_re_AIAN',
            'r_stud_re_AAP',
            'r_stud_re_H',
            'r_stud_re_B',
            'r_stud_re_W',
            'r_stud_re_HNPI',
            'r_stud_re_Two',
            'gradrate0607',
            'gradrate0708',
            'gradrate0809',
            'r_lunch_free',
            'r_lunch_reduced'

]
# Adding Target column :
df_yr['target'] = df_yr['afgr'].astype(int)
target1 = 'target'



In [307]:

    
# Descision Trees
from sklearn import tree

clfTree1 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [5, 6, 7, 9, 10, 11, 12], 'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
clfTree1, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree1, parameters, df_yr, 
                                                     Xnames1, target1, 1, 
                                                     mask=mask_dr, n_jobs = 4, score_func = 'r2')

importance_list = clfTree1.feature_importances_
name_list = df_yr[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Decision Trees')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()









    



using mask






    



/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.






    



BEST {'max_depth': 10, 'min_samples_leaf': 13} 0.201065709555 [mean: -0.00919, std: 0.08034, params: {'max_depth': 5, 'min_samples_leaf': 5}, mean: -0.01106, std: 0.07881, params: {'max_depth': 5, 'min_samples_leaf': 6}, mean: -0.01130, std: 0.07901, params: {'max_depth': 5, 'min_samples_leaf': 7}, mean: -0.01072, std: 0.07880, params: {'max_depth': 5, 'min_samples_leaf': 8}, mean: -0.01072, std: 0.07880, params: {'max_depth': 5, 'min_samples_leaf': 9}, mean: -0.01006, std: 0.07990, params: {'max_depth': 5, 'min_samples_leaf': 10}, mean: -0.00869, std: 0.07787, params: {'max_depth': 5, 'min_samples_leaf': 11}, mean: -0.00865, std: 0.07789, params: {'max_depth': 5, 'min_samples_leaf': 12}, mean: -0.00950, std: 0.07703, params: {'max_depth': 5, 'min_samples_leaf': 13}, mean: -0.00529, std: 0.07135, params: {'max_depth': 5, 'min_samples_leaf': 14}, mean: -0.00624, std: 0.07267, params: {'max_depth': 5, 'min_samples_leaf': 15}, mean: 0.05577, std: 0.13439, params: {'max_depth': 6, 'min_samples_leaf': 5}, mean: 0.06259, std: 0.14490, params: {'max_depth': 6, 'min_samples_leaf': 6}, mean: 0.06110, std: 0.14703, params: {'max_depth': 6, 'min_samples_leaf': 7}, mean: 0.06098, std: 0.13721, params: {'max_depth': 6, 'min_samples_leaf': 8}, mean: 0.06128, std: 0.13685, params: {'max_depth': 6, 'min_samples_leaf': 9}, mean: 0.06084, std: 0.13319, params: {'max_depth': 6, 'min_samples_leaf': 10}, mean: 0.06147, std: 0.13228, params: {'max_depth': 6, 'min_samples_leaf': 11}, mean: 0.06258, std: 0.12102, params: {'max_depth': 6, 'min_samples_leaf': 12}, mean: 0.06105, std: 0.12173, params: {'max_depth': 6, 'min_samples_leaf': 13}, mean: 0.05931, std: 0.12224, params: {'max_depth': 6, 'min_samples_leaf': 14}, mean: 0.05817, std: 0.12338, params: {'max_depth': 6, 'min_samples_leaf': 15}, mean: 0.07573, std: 0.15990, params: {'max_depth': 7, 'min_samples_leaf': 5}, mean: 0.08116, std: 0.17058, params: {'max_depth': 7, 'min_samples_leaf': 6}, mean: 0.07528, std: 0.16521, params: {'max_depth': 7, 'min_samples_leaf': 7}, mean: 0.08111, std: 0.16431, params: {'max_depth': 7, 'min_samples_leaf': 8}, mean: 0.07806, std: 0.16624, params: {'max_depth': 7, 'min_samples_leaf': 9}, mean: 0.08500, std: 0.14659, params: {'max_depth': 7, 'min_samples_leaf': 10}, mean: 0.08310, std: 0.14718, params: {'max_depth': 7, 'min_samples_leaf': 11}, mean: 0.08899, std: 0.14026, params: {'max_depth': 7, 'min_samples_leaf': 12}, mean: 0.10079, std: 0.12469, params: {'max_depth': 7, 'min_samples_leaf': 13}, mean: 0.09863, std: 0.12859, params: {'max_depth': 7, 'min_samples_leaf': 14}, mean: 0.10080, std: 0.13185, params: {'max_depth': 7, 'min_samples_leaf': 15}, mean: 0.14356, std: 0.05292, params: {'max_depth': 9, 'min_samples_leaf': 5}, mean: 0.16126, std: 0.03750, params: {'max_depth': 9, 'min_samples_leaf': 6}, mean: 0.15416, std: 0.04785, params: {'max_depth': 9, 'min_samples_leaf': 7}, mean: 0.15219, std: 0.04550, params: {'max_depth': 9, 'min_samples_leaf': 8}, mean: 0.13500, std: 0.04792, params: {'max_depth': 9, 'min_samples_leaf': 9}, mean: 0.16175, std: 0.04744, params: {'max_depth': 9, 'min_samples_leaf': 10}, mean: 0.18032, std: 0.04446, params: {'max_depth': 9, 'min_samples_leaf': 11}, mean: 0.17917, std: 0.03361, params: {'max_depth': 9, 'min_samples_leaf': 12}, mean: 0.19689, std: 0.02277, params: {'max_depth': 9, 'min_samples_leaf': 13}, mean: 0.19111, std: 0.02068, params: {'max_depth': 9, 'min_samples_leaf': 14}, mean: 0.17967, std: 0.01756, params: {'max_depth': 9, 'min_samples_leaf': 15}, mean: 0.08979, std: 0.01998, params: {'max_depth': 10, 'min_samples_leaf': 5}, mean: 0.10100, std: 0.06944, params: {'max_depth': 10, 'min_samples_leaf': 6}, mean: 0.12372, std: 0.06998, params: {'max_depth': 10, 'min_samples_leaf': 7}, mean: 0.12294, std: 0.05174, params: {'max_depth': 10, 'min_samples_leaf': 8}, mean: 0.12183, std: 0.05767, params: {'max_depth': 10, 'min_samples_leaf': 9}, mean: 0.13545, std: 0.05025, params: {'max_depth': 10, 'min_samples_leaf': 10}, mean: 0.15340, std: 0.04715, params: {'max_depth': 10, 'min_samples_leaf': 11}, mean: 0.16182, std: 0.03063, params: {'max_depth': 10, 'min_samples_leaf': 12}, mean: 0.20107, std: 0.03033, params: {'max_depth': 10, 'min_samples_leaf': 13}, mean: 0.19139, std: 0.03080, params: {'max_depth': 10, 'min_samples_leaf': 14}, mean: 0.18628, std: 0.02217, params: {'max_depth': 10, 'min_samples_leaf': 15}, mean: 0.08795, std: 0.05861, params: {'max_depth': 11, 'min_samples_leaf': 5}, mean: 0.08070, std: 0.07213, params: {'max_depth': 11, 'min_samples_leaf': 6}, mean: 0.10042, std: 0.08105, params: {'max_depth': 11, 'min_samples_leaf': 7}, mean: 0.10388, std: 0.05543, params: {'max_depth': 11, 'min_samples_leaf': 8}, mean: 0.09836, std: 0.05118, params: {'max_depth': 11, 'min_samples_leaf': 9}, mean: 0.11819, std: 0.05698, params: {'max_depth': 11, 'min_samples_leaf': 10}, mean: 0.15358, std: 0.04036, params: {'max_depth': 11, 'min_samples_leaf': 11}, mean: 0.15387, std: 0.03301, params: {'max_depth': 11, 'min_samples_leaf': 12}, mean: 0.18467, std: 0.03263, params: {'max_depth': 11, 'min_samples_leaf': 13}, mean: 0.17695, std: 0.02800, params: {'max_depth': 11, 'min_samples_leaf': 14}, mean: 0.16990, std: 0.02649, params: {'max_depth': 11, 'min_samples_leaf': 15}, mean: 0.05326, std: 0.07996, params: {'max_depth': 12, 'min_samples_leaf': 5}, mean: 0.03280, std: 0.10257, params: {'max_depth': 12, 'min_samples_leaf': 6}, mean: 0.05396, std: 0.09998, params: {'max_depth': 12, 'min_samples_leaf': 7}, mean: 0.06648, std: 0.08743, params: {'max_depth': 12, 'min_samples_leaf': 8}, mean: 0.06643, std: 0.08030, params: {'max_depth': 12, 'min_samples_leaf': 9}, mean: 0.08815, std: 0.07159, params: {'max_depth': 12, 'min_samples_leaf': 10}, mean: 0.10995, std: 0.07582, params: {'max_depth': 12, 'min_samples_leaf': 11}, mean: 0.12846, std: 0.05597, params: {'max_depth': 12, 'min_samples_leaf': 12}, mean: 0.15468, std: 0.04369, params: {'max_depth': 12, 'min_samples_leaf': 13}, mean: 0.15824, std: 0.04126, params: {'max_depth': 12, 'min_samples_leaf': 14}, mean: 0.14935, std: 0.03267, params: {'max_depth': 12, 'min_samples_leaf': 15}]



In [308]:

    
ypredict = clfTree1.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Decision Tree : School District + Previous Years")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Mean Squared Error :  153.386279357



In [309]:

    
# Random Forests
from sklearn.ensemble import RandomForestClassifier

clfForest1 = RandomForestClassifier()

parameters = {"n_estimators": range(25, 40)}
clfForest1, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest1, parameters, 
                                                       df_yr, Xnames1, target1, 1, mask=mask_dr, 
                                                       n_jobs = 4, score_func='r2')

importance_list = clfForest1.feature_importances_
name_list = df_yr[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()









    



using mask






    



/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.






    



BEST {'n_estimators': 30} 0.136262380312 [mean: 0.11188, std: 0.02741, params: {'n_estimators': 25}, mean: 0.10349, std: 0.03705, params: {'n_estimators': 26}, mean: 0.08091, std: 0.03741, params: {'n_estimators': 27}, mean: 0.06603, std: 0.03665, params: {'n_estimators': 28}, mean: 0.07589, std: 0.01211, params: {'n_estimators': 29}, mean: 0.13626, std: 0.03215, params: {'n_estimators': 30}, mean: 0.07856, std: 0.02747, params: {'n_estimators': 31}, mean: 0.09177, std: 0.04194, params: {'n_estimators': 32}, mean: 0.06913, std: 0.02878, params: {'n_estimators': 33}, mean: 0.10540, std: 0.01953, params: {'n_estimators': 34}, mean: 0.08838, std: 0.02288, params: {'n_estimators': 35}, mean: 0.07140, std: 0.02246, params: {'n_estimators': 36}, mean: 0.08011, std: 0.02063, params: {'n_estimators': 37}, mean: 0.08605, std: 0.02341, params: {'n_estimators': 38}, mean: 0.09952, std: 0.03631, params: {'n_estimators': 39}]



In [310]:

    
ypredict = clfForest1.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Random Forest : School District + Previous Years")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()









    



Mean Squared Error :  167.071075402

Creating a best model

We use the columns selected by our best model (elastic net) to inform what features we should select for a linear regression and see if we can beat any of the models with graduation rate in them.



In [311]:

    
lcols=['r_stud_reg_12_W_F', 'r_stud_reg_12_W_M', 'tlocrev_percent', 'r_stud_re_W', 'r_lrev_pt', 'r_lunch_free', 'r_stud_re_B', 'tfedrev_percent', 'r_stud_18', 'tsrev_pp', 'tcuresal_percent', 'r_frev_title1', 'tcurinst_percent', 'r_lrev_molr']



In [312]:

    
Xmatrix=dftouse[lcols].values
Yresp=dftouse['afgr'].values



In [313]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [314]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[314]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [315]:

    
pd.DataFrame(zip(dftouse[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[315]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      r_stud_reg_12_W_F
      1.944409
    
    
      1
      r_stud_reg_12_W_M
      1.214658
    
    
      2
      tlocrev_percent
      0.257012
    
    
      3
      r_stud_re_W
      -0.173361
    
    
      4
      r_lrev_pt
      0.146182
    
    
      5
      r_lunch_free
      -2.896304
    
    
      6
      r_stud_re_B
      -2.116497
    
    
      7
      tfedrev_percent
      -1.158260
    
    
      8
      r_stud_18
      -1.422072
    
    
      9
      tsrev_pp
      -1.629074
    
    
      10
      tcuresal_percent
      0.973579
    
    
      11
      r_frev_title1
      -0.543912
    
    
      12
      tcurinst_percent
      0.025455
    
    
      13
      r_lrev_molr
      -0.731097



In [316]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[316]:





array([ 0.3038443 ,  0.31224806,  0.35683494])



In [317]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[317]:





126.19517764740569



In [318]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: Filtered School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

The mean squared error for the 0607 graduation rate only model and the filtered school district model are nearly identical. Therefore, if we only have school district information, its predictive ability is akin to having a graduation rate 3 years back. We could posit that in the scenario of not receiving public graduation data until 5 years later, we could use school district characteristics to predict the graduation rates 3 years in advance until such time that better data is available.

Findings

Linear Regression - 0607 Only and Best Model have very similar MSE. The 0607 Only model overpredicts. The Best Model underpredicts. Underprediction is better than overprediction in this scenario - it is disadvantageous if the school district thinks it is performing better than it is.

This brings to mind an interesting scenario. We started this project with 09-10 graduation data because it was the newest data available. Given the similarity in the 0607 Only and Best model, if all we have is school district information, it may be worthwhile checking if we can predict with the same quality as the 0607 Only model. If quality is similar, we might be able to predict 12-13 graduation rates several years before they are released to the public.

We go back to ELSI to download 0607 school district information to test this out.

Trying 2009-2010 predictions using 2006-2007 data



In [319]:

    
dftouse0607 = pd.read_csv("data/finaldata/dftouse_0607_with0910grad.csv")



In [320]:

    
dftouse0607.shape









    Out[320]:





(9582, 417)



In [321]:

    
STANDARDIZABLE = ['num_students', 'num_schools','num_charter_schools','num_pub_schools','tcuresal_percent','pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp','tlocrev_pp','tsrev_pp','tfedrev_pp','tcurinst_pp','tcurssv_pp','tcuroth_pp','tcursalary_pp','tcurbenefits_pp','totalexp_pp','tcapout_pp','tnonelse_pp','tcurelsc_pp','instexp_pp','tcurinst_percent','tcuroth_percent','tcurelsc_percent','tcurssvc_percent','tfedrev_percent','tlocrev_percent','tsrev_percent','r_ELL','r_IEP','r_lunch_free','r_lunch_reduced','r_stud_PKK','r_stud_18','r_stud_912','r_stud_re_M','r_stud_re_F','r_stud_re_AIAN','r_stud_re_AAP','r_stud_re_H','r_stud_re_B','r_stud_re_W','r_stud_re_Total','r_stud_reg_12_AIAN_M','r_stud_reg_12_AIAN_F','r_stud_reg_12_AAP_M','r_stud_reg_12_AAP_F','r_stud_reg_12_H_M','r_stud_reg_12_H_F','r_stud_reg_12_B_M','r_stud_reg_12_B_F','r_stud_reg_12_W_M','r_stud_reg_12_W_F','r_stud_reg_12_HNPI_M','r_stud_reg_12_HNPI_F','r_stud_reg_12_Two_M','r_stud_reg_12_Two_F','r_st_PKT','r_st_KT','r_st_ET','r_st_ST','r_st_UT','r_st_TS','r_st_IA','r_st_IC','r_st_EGC','r_st_SGC','r_st_TGC','r_st_LMS','r_st_LMSS','r_st_LEA','r_st_LEASS','r_st_SA','r_st_SASS','r_st_SSSS','r_st_OSSS','r_lrev_pt','r_lrev_gst','r_lrev_put','r_lrev_it','r_lrev_aot','r_lrev_pgc','r_lrev_cc','r_lrev_oss','r_lrev_tui','r_lrev_trans','r_lrev_slr','r_lrev_ts','r_lrev_sar','r_lrev_osalserv','r_lrev_sfns','r_lrev_ie','r_lrev_molr','r_lrev_sp','r_lrev_rr','r_lrev_sale','r_lrev_ff','r_lrev_pc','r_srev_gfa','r_srev_sep','r_srev_trans','r_srev_sip','r_srev_cbsp','r_srev_vep','r_srev_codsp','r_srev_bep','r_srev_gt','r_srev_slp','r_srev_aor','r_srev_splea','r_srev_osp','r_srev_ns','r_frev_title1','r_frev_dis','r_frev_cna','r_frev_ems','r_frev_dfs','r_frev_voc','r_frev_ao','r_frev_ns','r_frev_ia','r_frev_be','r_frev_na','r_frev_aofed']
print STANDARDIZABLE









    



['num_students', 'num_schools', 'num_charter_schools', 'num_pub_schools', 'tcuresal_percent', 'pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp', 'tlocrev_pp', 'tsrev_pp', 'tfedrev_pp', 'tcurinst_pp', 'tcurssv_pp', 'tcuroth_pp', 'tcursalary_pp', 'tcurbenefits_pp', 'totalexp_pp', 'tcapout_pp', 'tnonelse_pp', 'tcurelsc_pp', 'instexp_pp', 'tcurinst_percent', 'tcuroth_percent', 'tcurelsc_percent', 'tcurssvc_percent', 'tfedrev_percent', 'tlocrev_percent', 'tsrev_percent', 'r_ELL', 'r_IEP', 'r_lunch_free', 'r_lunch_reduced', 'r_stud_PKK', 'r_stud_18', 'r_stud_912', 'r_stud_re_M', 'r_stud_re_F', 'r_stud_re_AIAN', 'r_stud_re_AAP', 'r_stud_re_H', 'r_stud_re_B', 'r_stud_re_W', 'r_stud_re_Total', 'r_stud_reg_12_AIAN_M', 'r_stud_reg_12_AIAN_F', 'r_stud_reg_12_AAP_M', 'r_stud_reg_12_AAP_F', 'r_stud_reg_12_H_M', 'r_stud_reg_12_H_F', 'r_stud_reg_12_B_M', 'r_stud_reg_12_B_F', 'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F', 'r_stud_reg_12_HNPI_M', 'r_stud_reg_12_HNPI_F', 'r_stud_reg_12_Two_M', 'r_stud_reg_12_Two_F', 'r_st_PKT', 'r_st_KT', 'r_st_ET', 'r_st_ST', 'r_st_UT', 'r_st_TS', 'r_st_IA', 'r_st_IC', 'r_st_EGC', 'r_st_SGC', 'r_st_TGC', 'r_st_LMS', 'r_st_LMSS', 'r_st_LEA', 'r_st_LEASS', 'r_st_SA', 'r_st_SASS', 'r_st_SSSS', 'r_st_OSSS', 'r_lrev_pt', 'r_lrev_gst', 'r_lrev_put', 'r_lrev_it', 'r_lrev_aot', 'r_lrev_pgc', 'r_lrev_cc', 'r_lrev_oss', 'r_lrev_tui', 'r_lrev_trans', 'r_lrev_slr', 'r_lrev_ts', 'r_lrev_sar', 'r_lrev_osalserv', 'r_lrev_sfns', 'r_lrev_ie', 'r_lrev_molr', 'r_lrev_sp', 'r_lrev_rr', 'r_lrev_sale', 'r_lrev_ff', 'r_lrev_pc', 'r_srev_gfa', 'r_srev_sep', 'r_srev_trans', 'r_srev_sip', 'r_srev_cbsp', 'r_srev_vep', 'r_srev_codsp', 'r_srev_bep', 'r_srev_gt', 'r_srev_slp', 'r_srev_aor', 'r_srev_splea', 'r_srev_osp', 'r_srev_ns', 'r_frev_title1', 'r_frev_dis', 'r_frev_cna', 'r_frev_ems', 'r_frev_dfs', 'r_frev_voc', 'r_frev_ao', 'r_frev_ns', 'r_frev_ia', 'r_frev_be', 'r_frev_na', 'r_frev_aofed']



In [322]:

    
INDICATORS = []
for v in dftouse0607.columns:
    l=np.unique(dftouse0607[v])
    if len(l) <= 10:
        INDICATORS.append(v)
        
INDICATORS.remove('RESP_High_Graduation')        
INDICATORS.remove('RESP_Low_Graduation')  
INDICATORS.remove('Metro Status Code [District] 2009-10')
INDICATORS.remove('Gender Unknown Students [Public School] 2009-10')
print INDICATORS









    



['i_agency_type_local_school_district', 'i_agency_type_local_school_district_sup_union', 'i_agency_type_sup_union_admin', 'i_agency_type_regional_education_services', 'i_agency_type_state_operated_institution', 'i_agency_type_charter_school_agency', 'i_agency_type_other_education_agency', 'i_fin_sdlc_elem', 'i_fin_sdlc_sec', 'i_fin_sdlc_elem_sec', 'i_fin_sdlc_voc', 'i_fin_sdlc_nonop', 'i_fin_sdlc_ed_serv', 'i_ucl_city_large', 'i_ucl_city_mid', 'i_ucl_city_small', 'i_ucl_suburb_large', 'i_ucl_suburb_mid', 'i_ucl_suburb_small', 'i_ucl_town_fringe', 'i_ucl_town_distant', 'i_ucl_town_remote', 'i_ucl_rural_fringe', 'i_ucl_rural_distant', 'i_ucl_rural_remote', 'i_cs_all_charter', 'i_cs_charter_noncharter', 'i_cs_all_noncharter', 'i_ma_ne_nr', 'i_ma_metropolitan', 'i_ma_micropolitan', 'i_lgo_10', 'i_lgo_11', 'i_lgo_12', 'i_lgo_1', 'i_lgo_2', 'i_lgo_3', 'i_lgo_4', 'i_lgo_5', 'i_lgo_6', 'i_lgo_7', 'i_lgo_8', 'i_lgo_9', 'i_lgo_K', 'i_lgo_PK', 'i_lgo_U']



In [323]:

    
itrain, itest = train_test_split(xrange(dftouse0607.shape[0]), train_size=0.7)



In [324]:

    
#CITATION: From HW3
mask=np.ones(dftouse0607.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)



In [325]:

    
# make sure we didn't get unlucky in our mask selection
print "% Graduation in Training:", np.mean(dftouse0607['afgr'][mask])
print "% Graduation in Testing:", np.mean(dftouse0607['afgr'][~mask])









    



% Graduation in Training: 81.510436857
% Graduation in Testing: 81.2183304348



In [326]:

    
#CITATION: From HW3
mask









    Out[326]:





array([ True,  True, False, ...,  True,  True, False], dtype=bool)



In [327]:

    
#CITATION: From HW3
mask.shape, mask.sum()









    Out[327]:





((9582,), 6707)



In [328]:

    
dftouse0607.head()









    Out[328]:






  
    
      
      agency
      state
      state_abbr
      agency_id_nces
      county
      county_number
      num_schools
      num_charter_schools
      num_pub_schools
      report_years
      no_report_years
      address
      city
      add_state
      zipcode
      Agency Name_DEL
      latitude
      longitude
      agency_id_state
      Metro Status Code [District] 2009-10
      congressional_code
      census_id
      offered_g_lowest
      num_students
      Total Students - Calculated Sum of Reported Grade Totals [Public School] 2009-10
      Migrant Students in Summer Programs (**) [District] 2009-10
      Migrant Students (**) [Public School] 2009-10
      Gender Unknown Students [Public School] 2009-10
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      ...
      r_lrev_gst
      r_lrev_put
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      afgr0910
    
  
  
    
      0
      21ST CENTURY CYBER CS
      Pennsylvania
      PA
      4200091
      CHESTER
      42029
      1
      1
      1
      2001-2013
      1986-2000
      805 SPRINGDALE DR
      EXTON
      PA
      19341
      21ST CENTURY CYBER CS
      40.002611
      -75.689137
      124150002
      2-CBSA/MSA- not primarily its principal/centra...
      4206
      NaN
      6th Grade
      467
      467
      NaN
      NaN
      0
      29.3
      28.7
      9011
      8676
      245
      90
      4426
      2898
      0
      3394
      1196
      7785
      461
      0
      7324
      4426
      16.3
      60.4
      0.0
      46.3
      39.6
      1.0
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.999017
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000737
      0
      0.000000
      0.000246
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.939130
      0.000000
      0.000000
      0.000000
      0
      0
      0.000000
      0.060870
      0.000000
      0.000000
      0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0
      0
      1.000000
      1
      0
      100.0
    
    
      1
      A+ ACADEMY
      Texas
      TX
      4800203
      DALLAS
      48113
      1
      1
      1
      2000-2013
      1986-1999
      8225 BRUTON RD
      DALLAS
      TX
      75217
      A+ ACADEMY
      32.690593
      -96.643398
      057829
      1-Principal/central city of CBSA/MSA
      4830
      NaN
      Prekindergarten
      994
      994
      NaN
      NaN
      0
      13.5
      13.5
      9040
      462
      7798
      781
      4751
      2960
      362
      5179
      523
      8072
      0
      0
      8072
      4751
      6.5
      58.8
      4.5
      64.2
      36.7
      8.6
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.000000
      0.000000
      0
      0.172113
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.827887
      0.000000
      0.000000
      0.000000
      0
      0.000000
      0.916398
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0
      0
      0.000516
      0.035221
      0.041414
      0.006451
      0
      0.000000
      0.199742
      0.355670
      0.068299
      0.001289
      0.000000
      0.375000
      0
      0
      0
      0
      0.000000
      0
      0
      55.7
    
    
      2
      A-C CENTRAL CUSD 262
      Illinois
      IL
      1700105
      CASS
      17017
      3
      0
      3
      1989-2013
      1986-1988
      501 EAST BUCHANAN ST
      ASHLAND
      IL
      62612
      A-C CENTRAL CUSD 262
      39.891262
      -90.010711
      46-009-2620-26
      3-Does not serve a CBSA/MSA
      1718
      1.450092e+13
      Prekindergarten
      483
      483
      NaN
      0
      0
      10.3
      10.3
      9327
      4292
      4532
      503
      3950
      3280
      383
      4826
      1133
      7857
      27
      0
      7613
      3950
      14.9
      51.9
      5.0
      63.4
      43.1
      5.4
      ...
      0.000000
      0
      0.000000
      0
      0
      0.00000
      0.017366
      0.000000
      0
      0.045345
      0.010613
      0.011577
      0.000000
      0
      0.007718
      0.010130
      0.031838
      0.000000
      0.000000
      0
      0.000000
      0.704888
      0.031521
      0.160347
      0.000000
      0.000000
      0.005025
      0.000000
      0
      0
      0.000914
      0.010964
      0.081316
      0.005025
      0
      0.263374
      0.000000
      0.238683
      0.000000
      0.004115
      0.000000
      0.288066
      0
      0
      0
      0
      0.205761
      1
      0
      70.7
    
    
      3
      A-H-S-T COMM SCHOOL DISTRICT
      Iowa
      IA
      1904080
      POTTAWATTAMIE
      19155
      2
      0
      2
      1986-2013
      NaN
      768 SOUTH MAPLE ST
      AVOCA
      IA
      51521
      A-H-S-T COMM SCHOOL DISTRICT
      41.471423
      -95.341045
      780441 000
      2-CBSA/MSA- not primarily its principal/centra...
      1905
      1.650780e+13
      Prekindergarten
      605
      605
      NaN
      0
      0
      13.5
      13.5
      10681
      5056
      5246
      379
      5028
      2192
      309
      4864
      1258
      8712
      1106
      31
      7529
      5028
      16.7
      66.8
      4.1
      64.6
      29.1
      3.5
      ...
      0.154299
      0
      0.073227
      0
      0
      0.00000
      0.059170
      0.002288
      0
      0.039555
      0.004904
      0.001308
      0.000000
      0
      0.016672
      0.022229
      0.000000
      0.000000
      0.000327
      0
      0.000000
      0.904852
      0.000000
      0.000000
      0.076875
      0.000000
      0.001890
      0.000000
      0
      0
      0.000945
      0.015438
      0.000000
      0.000000
      0
      0.218341
      0.152838
      0.353712
      0.000000
      0.008734
      0.004367
      0.174672
      0
      0
      0
      0
      0.087336
      0
      0
      95.7
    
    
      4
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      Minnesota
      MN
      2700106
      MEEKER
      27093
      5
      0
      5
      1995-2013
      1986-1994
      27250 MN HWY 4
      GROVE CITY
      MN
      56243
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      45.142843
      -94.685310
      012396
      3-Does not serve a CBSA/MSA
      2707
      2.450473e+13
      Prekindergarten
      849
      849
      NaN
      NaN
      0
      13.5
      13.9
      11204
      2880
      7687
      637
      5508
      3046
      478
      5550
      1329
      10167
      203
      399
      9032
      5508
      14.7
      61.0
      5.3
      61.5
      33.7
      5.7
      ...
      0.000000
      0
      0.000000
      0
      0
      0.01227
      0.005726
      0.063804
      0
      0.082618
      0.000000
      0.016769
      0.017587
      0
      0.078119
      0.058487
      0.000000
      0.000409
      0.000000
      0
      0.035583
      0.813975
      0.083972
      0.001379
      0.000000
      0.049341
      0.000613
      0.022832
      0
      0
      0.004903
      0.022985
      0.000000
      0.000000
      0
      0.312384
      0.000000
      0.316081
      0.101664
      0.001848
      0.000000
      0.266174
      0
      0
      0
      0
      0.000000
      1
      0
      89.6
    
  

5 rows × 417 columns



In [329]:

    
#CITATION: From HW3
from sklearn.preprocessing import StandardScaler

for col in STANDARDIZABLE:
    #print col
    valstrain=dftouse0607[col].values[mask]
    valstest=dftouse0607[col].values[~mask]
    scaler=StandardScaler().fit(valstrain)
    outtrain=scaler.transform(valstrain)
    outtest=scaler.fit_transform(valstest)
    out=np.empty(mask.shape[0])
    out[mask]=outtrain
    out[~mask]=outtest
    dftouse0607[col]=out



In [330]:

    
dftouse0607.head()









    Out[330]:






  
    
      
      agency
      state
      state_abbr
      agency_id_nces
      county
      county_number
      num_schools
      num_charter_schools
      num_pub_schools
      report_years
      no_report_years
      address
      city
      add_state
      zipcode
      Agency Name_DEL
      latitude
      longitude
      agency_id_state
      Metro Status Code [District] 2009-10
      congressional_code
      census_id
      offered_g_lowest
      num_students
      Total Students - Calculated Sum of Reported Grade Totals [Public School] 2009-10
      Migrant Students in Summer Programs (**) [District] 2009-10
      Migrant Students (**) [Public School] 2009-10
      Gender Unknown Students [Public School] 2009-10
      pupil_teacher_ratio_dist
      pupil_teacher_ratio_ps
      totalrev_pp
      tlocrev_pp
      tsrev_pp
      tfedrev_pp
      tcurinst_pp
      tcurssv_pp
      tcuroth_pp
      tcursalary_pp
      tcurbenefits_pp
      totalexp_pp
      tcapout_pp
      tnonelse_pp
      tcurelsc_pp
      instexp_pp
      tcurelsc_percent
      tcurinst_percent
      tcuroth_percent
      tcuresal_percent
      tcurssvc_percent
      tfedrev_percent
      ...
      r_lrev_gst
      r_lrev_put
      r_lrev_it
      r_lrev_aot
      r_lrev_pgc
      r_lrev_cc
      r_lrev_oss
      r_lrev_tui
      r_lrev_trans
      r_lrev_slr
      r_lrev_ts
      r_lrev_sar
      r_lrev_osalserv
      r_lrev_sfns
      r_lrev_ie
      r_lrev_molr
      r_lrev_sp
      r_lrev_rr
      r_lrev_sale
      r_lrev_ff
      r_lrev_pc
      r_srev_gfa
      r_srev_sep
      r_srev_trans
      r_srev_sip
      r_srev_cbsp
      r_srev_vep
      r_srev_codsp
      r_srev_bep
      r_srev_gt
      r_srev_slp
      r_srev_aor
      r_srev_splea
      r_srev_osp
      r_srev_ns
      r_frev_title1
      r_frev_dis
      r_frev_cna
      r_frev_ems
      r_frev_dfs
      r_frev_voc
      r_frev_ao
      r_frev_ns
      r_frev_ia
      r_frev_be
      r_frev_na
      r_frev_aofed
      RESP_High_Graduation
      RESP_Low_Graduation
      afgr0910
    
  
  
    
      0
      21ST CENTURY CYBER CS
      Pennsylvania
      PA
      4200091
      CHESTER
      42029
      -0.368162
      0.570198
      -0.367500
      2001-2013
      1986-2000
      805 SPRINGDALE DR
      EXTON
      PA
      19341
      21ST CENTURY CYBER CS
      40.002611
      -75.689137
      124150002
      2-CBSA/MSA- not primarily its principal/centra...
      4206
      NaN
      6th Grade
      -0.275248
      467
      NaN
      NaN
      0
      1.717670
      2.322175
      -0.529282
      0.834055
      -1.901441
      -0.842605
      -0.680230
      -0.348066
      -1.476067
      -1.170481
      -0.747834
      -0.729336
      -0.279010
      -0.417063
      -0.689399
      -0.682458
      -0.519068
      0.024708
      -2.305626
      -1.518833
      0.908616
      -1.109752
      ...
      -0.217190
      -0.164834
      -0.270559
      -0.284281
      -0.260499
      -0.295710
      9.035203
      -0.205884
      -0.138538
      -1.167917
      -0.309708
      -0.572910
      -0.215268
      -0.056536
      -0.771206
      -0.499542
      -0.269646
      -0.191627
      -0.120023
      -0.290242
      -0.232545
      -3.283950
      -0.575660
      -0.516646
      16.809805
      -0.279095
      -0.282149
      -0.307282
      -0.162959
      -0.194673
      -0.520860
      -0.111070
      -0.462796
      -0.362187
      -0.195845
      -1.482877
      -0.960988
      -1.940120
      -0.674718
      -0.343426
      -0.295042
      -0.751240
      -0.414830
      -0.192561
      -0.150392
      -0.184885
      8.672353
      1
      0
      100.0
    
    
      1
      A+ ACADEMY
      Texas
      TX
      4800203
      DALLAS
      48113
      -0.368162
      0.570198
      -0.367500
      2000-2013
      1986-1999
      8225 BRUTON RD
      DALLAS
      TX
      75217
      A+ ACADEMY
      32.690593
      -96.643398
      057829
      1-Principal/central city of CBSA/MSA
      4830
      NaN
      Prekindergarten
      -0.231156
      994
      NaN
      NaN
      0
      -0.110098
      0.033154
      -0.522985
      -1.094637
      0.838477
      -0.064328
      -0.523902
      -0.302160
      -0.214486
      -0.283463
      -1.427182
      -0.666339
      -0.523042
      -0.417063
      -0.465952
      -0.530922
      -2.148434
      -0.189517
      -0.011538
      0.592377
      0.403585
      0.156193
      ...
      -0.217190
      -0.164834
      -0.270559
      -0.284281
      -0.260499
      -0.295710
      -0.357599
      -0.205884
      -0.138538
      3.467785
      -0.309708
      -0.572910
      -0.244358
      -0.056536
      -0.771206
      9.604452
      -0.269646
      -0.191627
      -0.120023
      -0.290242
      -0.232545
      0.796006
      -0.575660
      -0.516646
      -0.329062
      -0.279095
      -0.282149
      -0.307282
      -0.162959
      -0.194673
      -0.364413
      -0.323327
      -0.002865
      0.613502
      -0.195845
      -1.482877
      0.039311
      0.825576
      0.526829
      -0.273565
      -0.295042
      1.627902
      -0.414830
      -0.192561
      -0.150392
      -0.184885
      -0.475592
      0
      0
      55.7
    
    
      2
      A-C CENTRAL CUSD 262
      Illinois
      IL
      1700105
      CASS
      17017
      -0.292271
      -0.124187
      -0.293305
      1989-2013
      1986-1988
      501 EAST BUCHANAN ST
      ASHLAND
      IL
      62612
      A-C CENTRAL CUSD 262
      39.891262
      -90.010711
      46-009-2620-26
      3-Does not serve a CBSA/MSA
      1718
      1.450092e+13
      Prekindergarten
      -0.329984
      483
      NaN
      0
      0
      -0.589401
      -0.518477
      -0.444107
      -0.177789
      -0.367629
      -0.379624
      -0.866648
      -0.060703
      -0.180856
      -0.442913
      -0.742925
      -0.692694
      -0.549522
      -0.364310
      -0.574035
      -0.861341
      -0.689564
      -1.006805
      0.268683
      0.473690
      1.464749
      -0.370299
      ...
      -0.203752
      -0.152773
      -0.259900
      -0.280982
      -0.277667
      -0.291643
      -0.183283
      -0.303265
      -0.135550
      0.023945
      1.923160
      -0.295996
      -0.316217
      -0.064721
      -0.618526
      -0.373178
      0.601105
      -0.205212
      -0.101924
      -0.329525
      -0.207304
      -0.151469
      -0.217553
      3.536441
      -0.304002
      -0.296213
      0.154117
      -0.319822
      -0.155163
      -0.174539
      -0.060359
      -0.498004
      0.458276
      0.155612
      -0.205340
      0.211055
      -0.946307
      -0.097024
      -0.707140
      -0.122872
      -0.316114
      1.093938
      -0.405704
      -0.178486
      -0.134252
      -0.190202
      1.394174
      1
      0
      70.7
    
    
      3
      A-H-S-T COMM SCHOOL DISTRICT
      Iowa
      IA
      1904080
      POTTAWATTAMIE
      19155
      -0.311599
      -0.097106
      -0.311607
      1986-2013
      NaN
      768 SOUTH MAPLE ST
      AVOCA
      IA
      51521
      A-H-S-T COMM SCHOOL DISTRICT
      41.471423
      -95.341045
      780441 000
      2-CBSA/MSA- not primarily its principal/centra...
      1905
      1.650780e+13
      Prekindergarten
      -0.263702
      605
      NaN
      0
      0
      -0.110098
      0.033154
      -0.166672
      -0.015940
      -0.087284
      -0.517103
      -0.390662
      -0.870806
      -0.399193
      -0.439996
      -0.685250
      -0.525856
      0.062422
      -0.235886
      -0.628160
      -0.401767
      -0.452563
      0.881608
      -0.215457
      0.639555
      -0.919945
      -0.693323
      ...
      2.439304
      -0.164834
      1.645480
      -0.284281
      -0.260499
      -0.295710
      0.198717
      -0.099406
      -0.138538
      -0.102531
      0.839990
      -0.548196
      -0.244358
      -0.056536
      -0.482323
      -0.231160
      -0.269646
      -0.191627
      -0.103081
      -0.290242
      -0.232545
      0.744602
      -0.575660
      -0.516646
      1.073878
      -0.279095
      -0.141618
      -0.307282
      -0.162959
      -0.194673
      -0.234324
      -0.487047
      -0.462796
      -0.362187
      -0.195845
      -0.067969
      -0.195581
      0.810348
      -0.674718
      0.130045
      -0.142662
      0.356948
      -0.414830
      -0.192561
      -0.150392
      -0.184885
      0.323355
      0
      0
      95.7
    
    
      4
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      Minnesota
      MN
      2700106
      MEEKER
      27093
      -0.141909
      -0.097106
      -0.143928
      1995-2013
      1986-1994
      27250 MN HWY 4
      GROVE CITY
      MN
      56243
      A.C.G.C. PUBLIC SCHOOL DISTRICT
      45.142843
      -94.685310
      012396
      3-Does not serve a CBSA/MSA
      2707
      2.450473e+13
      Prekindergarten
      -0.243288
      849
      NaN
      NaN
      0
      -0.110098
      0.093392
      -0.053112
      -0.526877
      0.798210
      -0.226516
      -0.159777
      -0.238483
      0.189778
      -0.099102
      -0.613580
      -0.206479
      -0.415583
      1.914858
      -0.179175
      -0.177960
      -0.785087
      0.105042
      0.396300
      0.273926
      -0.118861
      -0.326865
      ...
      -0.217190
      -0.164834
      -0.270559
      -0.284281
      -0.260499
      -0.169805
      -0.303763
      2.762962
      -0.138538
      1.057307
      -0.309708
      -0.255979
      0.449549
      -0.056536
      0.582380
      0.211475
      -0.269646
      -0.166551
      -0.120023
      -0.290242
      1.115569
      0.340002
      0.324563
      -0.480757
      -0.329062
      1.166302
      -0.236583
      0.064732
      -0.162959
      -0.194673
      0.965647
      -0.424590
      -0.462796
      -0.362187
      -0.195845
      0.541461
      -0.960988
      0.517733
      1.113795
      -0.243219
      -0.295042
      0.937467
      -0.414830
      -0.192561
      -0.150392
      -0.184885
      -0.475592
      1
      0
      89.6
    
  

5 rows × 417 columns



In [331]:

    
#CITATION: From HW3
lcols=STANDARDIZABLE+INDICATORS
print len(lcols)



In [332]:

    
#CITATION: From HW3
ccols=[]
for c in lcols:
    if c not in INDICATORS:
        ccols.append(c)
print len(ccols), len(INDICATORS)
ccols









    



124 46






    Out[332]:





['num_students',
 'num_schools',
 'num_charter_schools',
 'num_pub_schools',
 'tcuresal_percent',
 'pupil_teacher_ratio_dist',
 'pupil_teacher_ratio_ps',
 'totalrev_pp',
 'tlocrev_pp',
 'tsrev_pp',
 'tfedrev_pp',
 'tcurinst_pp',
 'tcurssv_pp',
 'tcuroth_pp',
 'tcursalary_pp',
 'tcurbenefits_pp',
 'totalexp_pp',
 'tcapout_pp',
 'tnonelse_pp',
 'tcurelsc_pp',
 'instexp_pp',
 'tcurinst_percent',
 'tcuroth_percent',
 'tcurelsc_percent',
 'tcurssvc_percent',
 'tfedrev_percent',
 'tlocrev_percent',
 'tsrev_percent',
 'r_ELL',
 'r_IEP',
 'r_lunch_free',
 'r_lunch_reduced',
 'r_stud_PKK',
 'r_stud_18',
 'r_stud_912',
 'r_stud_re_M',
 'r_stud_re_F',
 'r_stud_re_AIAN',
 'r_stud_re_AAP',
 'r_stud_re_H',
 'r_stud_re_B',
 'r_stud_re_W',
 'r_stud_re_Total',
 'r_stud_reg_12_AIAN_M',
 'r_stud_reg_12_AIAN_F',
 'r_stud_reg_12_AAP_M',
 'r_stud_reg_12_AAP_F',
 'r_stud_reg_12_H_M',
 'r_stud_reg_12_H_F',
 'r_stud_reg_12_B_M',
 'r_stud_reg_12_B_F',
 'r_stud_reg_12_W_M',
 'r_stud_reg_12_W_F',
 'r_stud_reg_12_HNPI_M',
 'r_stud_reg_12_HNPI_F',
 'r_stud_reg_12_Two_M',
 'r_stud_reg_12_Two_F',
 'r_st_PKT',
 'r_st_KT',
 'r_st_ET',
 'r_st_ST',
 'r_st_UT',
 'r_st_TS',
 'r_st_IA',
 'r_st_IC',
 'r_st_EGC',
 'r_st_SGC',
 'r_st_TGC',
 'r_st_LMS',
 'r_st_LMSS',
 'r_st_LEA',
 'r_st_LEASS',
 'r_st_SA',
 'r_st_SASS',
 'r_st_SSSS',
 'r_st_OSSS',
 'r_lrev_pt',
 'r_lrev_gst',
 'r_lrev_put',
 'r_lrev_it',
 'r_lrev_aot',
 'r_lrev_pgc',
 'r_lrev_cc',
 'r_lrev_oss',
 'r_lrev_tui',
 'r_lrev_trans',
 'r_lrev_slr',
 'r_lrev_ts',
 'r_lrev_sar',
 'r_lrev_osalserv',
 'r_lrev_sfns',
 'r_lrev_ie',
 'r_lrev_molr',
 'r_lrev_sp',
 'r_lrev_rr',
 'r_lrev_sale',
 'r_lrev_ff',
 'r_lrev_pc',
 'r_srev_gfa',
 'r_srev_sep',
 'r_srev_trans',
 'r_srev_sip',
 'r_srev_cbsp',
 'r_srev_vep',
 'r_srev_codsp',
 'r_srev_bep',
 'r_srev_gt',
 'r_srev_slp',
 'r_srev_aor',
 'r_srev_splea',
 'r_srev_osp',
 'r_srev_ns',
 'r_frev_title1',
 'r_frev_dis',
 'r_frev_cna',
 'r_frev_ems',
 'r_frev_dfs',
 'r_frev_voc',
 'r_frev_ao',
 'r_frev_ns',
 'r_frev_ia',
 'r_frev_be',
 'r_frev_na',
 'r_frev_aofed']



In [333]:

    
Xmatrix=dftouse0607[lcols].values
Yresp=dftouse0607['afgr'].values



In [334]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]

Linear Regression

Performance expected to be poor - use as a justification for lasso and elastic net.



In [335]:

    
import sklearn.linear_model
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[335]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [336]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:25]









    Out[336]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      62
      r_st_TS
      3.864551e+11
    
    
      41
      r_stud_re_W
      7.257401e+10
    
    
      40
      r_stud_re_B
      4.937619e+10
    
    
      39
      r_stud_re_H
      4.656368e+10
    
    
      37
      r_stud_re_AIAN
      2.732111e+10
    
    
      38
      r_stud_re_AAP
      8.635449e+09
    
    
      148
      i_ucl_rural_remote
      1.371368e+07
    
    
      137
      i_ucl_city_large
      1.371368e+07
    
    
      147
      i_ucl_rural_distant
      1.371368e+07
    
    
      144
      i_ucl_town_distant
      1.371368e+07
    
    
      146
      i_ucl_rural_fringe
      1.371368e+07
    
    
      145
      i_ucl_town_remote
      1.371368e+07
    
    
      141
      i_ucl_suburb_mid
      1.371368e+07
    
    
      140
      i_ucl_suburb_large
      1.371368e+07
    
    
      143
      i_ucl_town_fringe
      1.371368e+07
    
    
      138
      i_ucl_city_mid
      1.371368e+07
    
    
      142
      i_ucl_suburb_small
      1.371368e+07
    
    
      139
      i_ucl_city_small
      1.371368e+07
    
    
      125
      i_agency_type_local_school_district_sup_union
      1.135006e+06
    
    
      124
      i_agency_type_local_school_district
      1.135005e+06
    
    
      130
      i_agency_type_other_education_agency
      1.134998e+06
    
    
      126
      i_agency_type_sup_union_admin
      1.134994e+06
    
    
      127
      i_agency_type_regional_education_services
      1.134994e+06
    
    
      128
      i_agency_type_state_operated_institution
      1.134972e+06
    
    
      7
      totalrev_pp
      1.580506e+03



In [337]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:25]









    Out[337]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      61
      r_st_UT
      -7.935545e+10
    
    
      60
      r_st_ST
      -7.853948e+10
    
    
      75
      r_st_OSSS
      -7.125960e+10
    
    
      73
      r_st_SASS
      -5.356324e+10
    
    
      74
      r_st_SSSS
      -4.925571e+10
    
    
      64
      r_st_IC
      -4.432454e+10
    
    
      63
      r_st_IA
      -4.252249e+10
    
    
      72
      r_st_SA
      -1.783860e+10
    
    
      71
      r_st_LEASS
      -1.046188e+10
    
    
      59
      r_st_ET
      -1.019717e+10
    
    
      69
      r_st_LMSS
      -9.968358e+09
    
    
      42
      r_stud_re_Total
      -9.824563e+09
    
    
      70
      r_st_LEA
      -8.634636e+09
    
    
      35
      r_stud_re_M
      -4.814197e+09
    
    
      36
      r_stud_re_F
      -4.672918e+09
    
    
      67
      r_st_TGC
      -4.488305e+09
    
    
      58
      r_st_KT
      -2.207510e+09
    
    
      57
      r_st_PKT
      -1.370151e+09
    
    
      68
      r_st_LMS
      -8.418198e+08
    
    
      150
      i_cs_charter_noncharter
      -1.888661e+07
    
    
      151
      i_cs_all_noncharter
      -1.888660e+07
    
    
      154
      i_ma_micropolitan
      -1.561338e+07
    
    
      153
      i_ma_metropolitan
      -1.561338e+07
    
    
      152
      i_ma_ne_nr
      -1.561338e+07
    
    
      149
      i_cs_all_charter
      -1.253634e+07



In [338]:

    
from sklearn import cross_validation
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[338]:





array([ -4.32504211e+06,  -1.66601642e-01,  -2.16785453e+00])



In [339]:

    
ypredict = lm.predict(Xtest)
rscore =  mean_squared_error(ytest, ypredict)
print "Mean Squared Error : ", rscore

plt.scatter(ytest, ypredict, alpha=0.5)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Grad Rate vs Predicted Grad Rate: $Y_i$ vs $\hat{Y}_i$")
plt.xlim(0,105)
#plt.ylim(0,105)
plt.show()









    



Mean Squared Error :  3.47809671345e+22

Lasso Regression



In [340]:

    
lasso = linear_model.Lasso()



In [341]:

    
lasso.fit(Xtrain, ytrain)









    Out[341]:





Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)



In [342]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:10]









    Out[342]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      41
      r_stud_re_W
      1.496449
    
    
      34
      r_stud_912
      1.148622
    
    
      26
      tlocrev_percent
      0.842331
    
    
      42
      r_stud_re_Total
      0.834020
    
    
      51
      r_stud_reg_12_W_M
      0.542936
    
    
      36
      r_stud_re_F
      0.033100
    
    
      0
      num_students
      -0.000000
    
    
      120
      r_frev_ia
      0.000000
    
    
      113
      r_frev_dis
      0.000000
    
    
      114
      r_frev_cna
      0.000000



In [343]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, lasso.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:10]









    Out[343]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      30
      r_lunch_free
      -3.016324
    
    
      40
      r_stud_re_B
      -1.922189
    
    
      25
      tfedrev_percent
      -0.671879
    
    
      3
      num_pub_schools
      -0.114687
    
    
      106
      r_srev_gt
      -0.036813
    
    
      118
      r_frev_ao
      -0.034284
    
    
      112
      r_frev_title1
      -0.000000
    
    
      113
      r_frev_dis
      0.000000
    
    
      114
      r_frev_cna
      0.000000
    
    
      115
      r_frev_ems
      0.000000



In [344]:

    
from sklearn import cross_validation
cross_validation.cross_val_score(lasso, Xtrain, ytrain, scoring='r2')









    Out[344]:





array([ 0.3156014 ,  0.31176424,  0.31801381])



In [345]:

    
yp = lasso.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[345]:





132.41102601659034



In [346]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Lasso Regression: 2006-2007 School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Elastic Net



In [347]:

    
elasticnet = linear_model.ElasticNet()
elasticnet.fit(Xtrain, ytrain)









    Out[347]:





ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)



In [348]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=False)[:20]









    Out[348]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      41
      r_stud_re_W
      1.292664
    
    
      34
      r_stud_912
      1.060088
    
    
      26
      tlocrev_percent
      0.816008
    
    
      51
      r_stud_reg_12_W_M
      0.775423
    
    
      42
      r_stud_re_Total
      0.608449
    
    
      36
      r_stud_re_F
      0.346064
    
    
      52
      r_stud_reg_12_W_F
      0.336655
    
    
      76
      r_lrev_pt
      0.316851
    
    
      38
      r_stud_re_AAP
      0.291935
    
    
      4
      tcuresal_percent
      0.210040
    
    
      21
      tcurinst_percent
      0.205878
    
    
      57
      r_st_PKT
      0.180802
    
    
      113
      r_frev_dis
      0.165260
    
    
      79
      r_lrev_it
      0.154029
    
    
      86
      r_lrev_slr
      0.151288
    
    
      46
      r_stud_reg_12_AAP_F
      0.144122
    
    
      8
      tlocrev_pp
      0.136161
    
    
      88
      r_lrev_sar
      0.133059
    
    
      80
      r_lrev_aot
      0.127004
    
    
      37
      r_stud_re_AIAN
      0.092383



In [349]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, elasticnet.coef_), columns = ['features', 'estimatedCoefficients']).sort(['estimatedCoefficients'], ascending=True)[:20]









    Out[349]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      30
      r_lunch_free
      -1.886635
    
    
      40
      r_stud_re_B
      -1.520982
    
    
      25
      tfedrev_percent
      -0.921166
    
    
      29
      r_IEP
      -0.331453
    
    
      10
      tfedrev_pp
      -0.304889
    
    
      106
      r_srev_gt
      -0.304483
    
    
      118
      r_frev_ao
      -0.296672
    
    
      3
      num_pub_schools
      -0.265479
    
    
      28
      r_ELL
      -0.218137
    
    
      1
      num_schools
      -0.210946
    
    
      39
      r_stud_re_H
      -0.205766
    
    
      119
      r_frev_ns
      -0.205559
    
    
      50
      r_stud_reg_12_B_F
      -0.182480
    
    
      77
      r_lrev_gst
      -0.159556
    
    
      27
      tsrev_percent
      -0.118256
    
    
      101
      r_srev_sip
      -0.104726
    
    
      103
      r_srev_vep
      -0.077543
    
    
      9
      tsrev_pp
      -0.045933
    
    
      91
      r_lrev_ie
      -0.019656
    
    
      92
      r_lrev_molr
      -0.017349



In [350]:

    
cross_validation.cross_val_score(elasticnet, Xtrain, ytrain, scoring='r2')









    Out[350]:





array([ 0.34497655,  0.32429677,  0.33163301])



In [351]:

    
yp = elasticnet.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[351]:





127.56221558575209



In [352]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Elastic Net Regression: 2006-2007 School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Creating a best model

Again, creating a best model based off of features that we learned from elastic net.



In [353]:

    
lcols=['r_stud_re_W','r_stud_912','r_stud_reg_12_W_M','tlocrev_percent','r_stud_re_Total','r_lunch_free','r_stud_re_B','tfedrev_percent','tfedrev_pp','r_frev_ao','r_stud_re_F','r_stud_reg_12_W_F','r_srev_gt','num_pub_schools']



In [354]:

    
Xmatrix=dftouse0607[lcols].values
Yresp=dftouse0607['afgr'].values



In [355]:

    
Xtrain=Xmatrix[mask]
Xtest=Xmatrix[~mask]
ytrain=Yresp[mask]
ytest=Yresp[~mask]



In [356]:

    
lm = linear_model.LinearRegression()
lm.fit(Xtrain, ytrain)









    Out[356]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [357]:

    
print lm.intercept_
print lm.coef_









    



81.510436857
[ 1.11285622  1.70684375  1.15819179  1.70798608  1.15038508 -3.219141
 -2.52898564 -0.61038643 -0.38387926 -0.84810477  0.88431232 -0.34292184
 -0.84591516 -0.73519575]



In [358]:

    
pd.DataFrame(zip(dftouse0607[lcols].columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])









    Out[358]:






  
    
      
      features
      estimatedCoefficients
    
  
  
    
      0
      r_stud_re_W
      1.112856
    
    
      1
      r_stud_912
      1.706844
    
    
      2
      r_stud_reg_12_W_M
      1.158192
    
    
      3
      tlocrev_percent
      1.707986
    
    
      4
      r_stud_re_Total
      1.150385
    
    
      5
      r_lunch_free
      -3.219141
    
    
      6
      r_stud_re_B
      -2.528986
    
    
      7
      tfedrev_percent
      -0.610386
    
    
      8
      tfedrev_pp
      -0.383879
    
    
      9
      r_frev_ao
      -0.848105
    
    
      10
      r_stud_re_F
      0.884312
    
    
      11
      r_stud_reg_12_W_F
      -0.342922
    
    
      12
      r_srev_gt
      -0.845915
    
    
      13
      num_pub_schools
      -0.735196



In [359]:

    
cross_validation.cross_val_score(lm, Xtrain, ytrain, scoring='r2')









    Out[359]:





array([ 0.37248643,  0.31631476,  0.32508989])



In [360]:

    
yp = lm.predict(Xtest)
mean_squared_error(yp, ytest)









    Out[360]:





127.02979473602149



In [361]:

    
plt.scatter(ytest, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: 2006-2007 Filtered School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

Predicting 2009-2010 graduation rate from 2006-2007 data.

We have our model of 2006-2007 graduation, and we try plugging in 2009-2010 data. This is the scenario of having school district data available, but no graduation rates.



In [362]:

    
ActualX=dftouse[lcols].values
Actualy=dftouse['afgr'].values



In [363]:

    
yp = lm.predict(ActualX)
mean_squared_error(yp, Actualy)









    Out[363]:





133.91167388132399



In [364]:

    
plt.scatter(Actualy, yp, alpha=.3)
plt.xlabel("Grad Rate: Actual")
plt.ylabel("Grad Rate: Predicted")
plt.title("Linear Regression: Predicting 2009-2010 Graduation from School District Data")
plt.xlim(0,105)
plt.ylim(0,105)
plt.show()

The mean squared error is comparable to our earlier models. Absent any available graduation rate data for 5 years, one could fill in the missing data using a predictive model until such graduation rate data becomes publically available.

	SURVYEAR	FIPST	LEAID	TOTD912	EBS912	DRP912	TOTDPL	AFGEB	AFGR	TOTOHC
0	2006-07	01	0100002	0	0
1	2006-07	01	0100005	30	749	4.0	192	263	73.0	14
2	2006-07	01	0100006	46	1209	3.8	216	456	47.4	28
3	2006-07	01	0100007	45	2726	1.7	779	878	88.7	10
4	2006-07	01	0100008	8	1968	0.4	562	576	97.6	10

	survyear	fipst	leaid	totd912	ebs912	drp912	totdpl	afgeb	afgr	totohc
0	2007-08	01	0100002	0	47	0.0		14
1	2007-08	01	0100005	29	939	3.1	172	247	69.6	5
2	2007-08	01	0100006	41	1612	2.5	276	450	61.3	13
3	2007-08	01	0100007	38	3817	1.0	899	978	91.9	5
4	2007-08	01	0100008	27	2715	1.0	624	584	100.0	9

	SURVYEAR	FIPST	LEAID	TOTD912	EBS912	DRP912	TOTDPL	AFGEB	AFGR	TOTOHC
0	2008-09	01	0100002					4
1	2008-09	01	0100005	21	976	2.1	171	267	64.0	10
2	2008-09	01	0100006	21	1671	1.3	325	485	67.0	12
3	2008-09	01	0100007	30	3823	0.8	775	865	89.6	21
4	2008-09	01	0100008	19	2820	0.7	663	654	100.0	5

	agency	state	state_abbr	agency_id_nces	county	county_number	num_schools	num_charter_schools	num_pub_schools	report_years	no_report_years	address	city	add_state	zipcode	latitude	longitude	agency_id_state	congressional_code	census_id	offered_g_lowest	num_students	pupil_teacher_ratio_dist	pupil_teacher_ratio_ps	totalrev_pp	tlocrev_pp	tsrev_pp	tfedrev_pp	tcurinst_pp	tcurssv_pp	tcuroth_pp	tcursalary_pp	tcurbenefits_pp	totalexp_pp	tcapout_pp	tnonelse_pp	tcurelsc_pp	instexp_pp	tcurelsc_percent	tcurinst_percent	tcuroth_percent	tcuresal_percent	tcurssvc_percent	tfedrev_percent	tlocrev_percent	tsrev_percent	fipst	totd912	ebs912	drp912	...	r_lrev_it	r_lrev_aot	r_lrev_pgc	r_lrev_cc	r_lrev_oss	r_lrev_tui	r_lrev_slr	r_lrev_ts	r_lrev_sar	r_lrev_osalserv	r_lrev_ie	r_lrev_molr	r_lrev_sp	r_lrev_rr	r_lrev_sale	r_lrev_pc	r_srev_gfa	r_srev_sep	r_srev_trans	r_srev_sip	r_srev_cbsp	r_srev_vep	r_srev_gt	r_srev_slp	r_srev_aor	r_srev_splea	r_srev_osp	r_frev_title1	r_frev_dis	r_frev_cna	r_frev_ems	r_frev_dfs	r_frev_voc	r_frev_ao	r_frev_ns	r_frev_ia	r_frev_be	r_frev_na	r_frev_aofed	RESP_High_Graduation	RESP_Low_Graduation	gradrate0607	gradrate0708	gradrate0809
0	21ST CENTURY CHARTER SCH OF GARY	Indiana	IN	1800046	MARION COUNTY	18097	1	1	1	2004-2013	1986-2003	556 WASHINGTON ST	GARY	IN	46402	39.771949	-86.155184	9545	1807	NaN	Kindergarten	360	15.65	17.14	11111	356	7925	2831	3986	4717	436	4133	1028	11197	1986	72	9139	3986	11.2	43.6	4.8	45.2	51.6	25.5	3.2	71.3	18	203.793066	100	9.676598	...	0.011257	0.004972	0.769211	0.070312	0.390625	0.000000	0.062500	0.000000	0.000000	0.000000	0.000000	0.078125	0.007812	0.000000	0.390625	0.00000	0.964599	0.000000	0.000000	0.000000	0.003155	0.000000	0.003155	0.000000	0.029092	0.000000	0.000000	0.606477	0.000000	0.156035	0.000000	0.000000	0.000000	0.176644	0.060844	0.000000	0.000000	0.000000	0.000000	0	1	NaN	NaN	NaN
1	21ST CENTURY CYBER CS	Pennsylvania	PA	4200091	CHESTER COUNTY	42029	1	1	1	2001-2013	1986-2000	805 SPRINGDALE DR	EXTON	PA	19341	40.005030	-75.678564	124150002	4206	NaN	6th Grade	594	28.49	28.49	10557	10285	273	0	5104	3003	0	4199	1456	8732	554	0	8108	5104	18.0	63.0	0.0	51.8	37.0	0.0	97.4	2.6	42	33.000000	483	6.800000	...	0.011257	0.004972	0.769211	0.000000	0.983958	0.005893	0.000000	0.000000	0.000000	0.000000	0.000000	0.010149	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.950617	0.000000	0.000000	0.000000	0.000000	0.049383	0.000000	0.000000	0.208316	0.166727	0.177190	0.024388	0.003395	0.005148	0.309150	0.048223	0.015101	0.003794	0.002556	0.036014	1	0	100.0	100.0	100.0
2	A+ ACADEMY	Texas	TX	4800203	DALLAS COUNTY	48113	1	1	1	2000-2013	1986-1999	8225 BRUTON RD	DALLAS	TX	75217	32.767535	-96.660866	057829	4830	NaN	Prekindergarten	1033	16.61	16.61	11015	68	9164	1784	4890	3500	441	5788	477	8864	0	1	8832	4890	5.4	55.4	5.0	65.5	39.6	16.2	0.6	83.2	48	203.793066	228	9.676598	...	0.011257	0.004972	0.769211	0.000000	0.000000	0.000000	0.800000	0.000000	0.000000	0.000000	0.142857	0.000000	0.000000	0.057143	0.000000	0.00000	0.952145	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000211	0.006233	0.040038	0.001373	0.351058	0.056430	0.238741	0.033641	0.000000	0.000000	0.320130	0.000000	0.000000	0.000000	0.000000	0.000000	0	1	75.6	41.4	NaN
3	A-C CENTRAL CUSD 262	Illinois	IL	1700105	CASS COUNTY	17017	3	0	3	1989-2013	1986-1988	501 EAST BUCHANAN ST	ASHLAND	IL	62612	39.892187	-90.016057	46-009-2620-26	1718	1.450092e+13	Kindergarten	432	11.86	12.08	11367	4374	5089	1904	4835	4633	498	5741	2112	13200	2204	0	9966	4835	21.2	48.5	5.0	57.6	46.5	16.7	38.5	44.8	17	203.793066	130	9.676598	...	0.000000	0.000000	0.769211	0.000000	0.001049	0.000000	0.060829	0.009963	0.014683	0.000000	0.005244	0.052438	0.027792	0.004195	0.000000	0.02517	0.642181	0.049572	0.068049	0.000000	0.004056	0.000901	0.000000	0.000901	0.015773	0.218567	0.000000	0.162651	0.306024	0.087952	0.038554	0.002410	0.000000	0.402410	0.000000	0.000000	0.000000	0.000000	0.000000	0	1	97.1	81.6	83.3
4	A-H-S-T COMM SCHOOL DISTRICT	Iowa	IA	1904080	POTTAWATTAMIE COUNTY	19155	2	0	2	1986-2013	NaN	768 SOUTH MAPLE ST	AVOCA	IA	51521	41.471017	-95.341001	780441 000	1905	1.650780e+13	Prekindergarten	595	15.64	13.37	10718	5634	4020	1065	4944	2103	336	4703	1438	11586	2769	0	7384	4944	19.5	67.0	4.6	63.7	28.5	9.9	52.6	37.5	19	203.793066	168	9.676598	...	0.061464	0.000000	0.769211	0.000000	0.043101	0.000000	0.034175	0.008926	0.004336	0.002805	0.004081	0.066820	0.000000	0.000000	0.004081	0.00000	0.859543	0.000000	0.000000	0.118656	0.000000	0.002144	0.000000	0.001072	0.018585	0.000000	0.000000	0.095816	0.107962	0.148448	0.043185	0.008097	0.005398	0.554656	0.000000	0.000000	0.000000	0.000000	0.036437	1	0	75.0	72.0	69.4

	features	estimatedCoefficients
128	r_srev_gfa	6.846079e+10
138	r_srev_aor	4.107836e+10
139	r_srev_splea	2.919768e+10
129	r_srev_sep	2.685793e+10
131	r_srev_sip	1.867126e+10
134	r_srev_codsp	1.629000e+10
141	r_srev_ns	1.311669e+10
130	r_srev_trans	1.234336e+10
132	r_srev_cbsp	8.097579e+09
133	r_srev_vep	3.646229e+09
140	r_srev_osp	3.129522e+09
135	r_srev_bep	2.664747e+09
137	r_srev_slp	1.643682e+09
136	r_srev_gt	9.053896e+08
19	instexp_pp	2.884532e+05
50	i_ma_ne_nr	1.241564e+05
51	i_ma_metropolitan	1.241558e+05
52	i_ma_micropolitan	1.241554e+05
41	i_ucl_town_fringe	3.808005e+04
46	i_ucl_rural_remote	3.807999e+04
45	i_ucl_rural_distant	3.807883e+04
40	i_ucl_suburb_small	3.807849e+04
38	i_ucl_suburb_large	3.807848e+04
43	i_ucl_town_remote	3.807846e+04
42	i_ucl_town_distant	3.807844e+04

	features	estimatedCoefficients
148	r_frev_ao	-2.851683e+08
143	r_frev_dis	-2.145319e+08
142	r_frev_title1	-1.651628e+08
149	r_frev_ns	-1.566859e+08
144	r_frev_cna	-1.112821e+08
153	r_frev_aofed	-1.086761e+08
150	r_frev_ia	-9.347220e+07
145	r_frev_ems	-3.841922e+07
151	r_frev_be	-3.007251e+07
147	r_frev_voc	-2.416929e+07
152	r_frev_na	-1.904637e+07
146	r_frev_dfs	-1.766999e+07
48	i_cs_charter_noncharter	-3.260136e+05
49	i_cs_all_noncharter	-3.260118e+05
10	tcurinst_pp	-2.878319e+05
47	i_cs_all_charter	-2.877958e+05
31	i_agency_type_charter_school_agency	-3.822082e+04
6	totalrev_pp	-1.454550e+03
18	tcurelsc_pp	-1.137150e+03
91	r_st_TS	-2.583916e+01
24	tcurssvc_percent	-1.375403e+01
21	tcurinst_percent	-1.342402e+01
2	num_pub_schools	-1.252104e+01
30	i_agency_type_regional_education_services	-4.815589e+00
22	tcuroth_percent	-4.177093e+00

	features	estimatedCoefficients
81	r_stud_reg_12_W_F	1.541761
80	r_stud_reg_12_W_M	1.165075
26	tlocrev_percent	0.467191
0	num_schools	-0.000000
107	r_lrev_gst	-0.000000
101	r_st_LEASS	-0.000000
102	r_st_SA	-0.000000
103	r_st_SASS	-0.000000
104	r_st_SSSS	-0.000000
105	r_st_OSSS	0.000000

	features	estimatedCoefficients
57	r_lunch_free	-3.312818
67	r_stud_re_B	-1.548774
8	tsrev_pp	-0.561664
25	tfedrev_percent	-0.414278
9	tfedrev_pp	-0.068704
100	r_st_LEA	-0.000000
101	r_st_LEASS	-0.000000
102	r_st_SA	-0.000000
103	r_st_SASS	-0.000000
104	r_st_SSSS	-0.000000

	features	estimatedCoefficients
81	r_stud_reg_12_W_F	1.239553
80	r_stud_reg_12_W_M	1.098755
68	r_stud_re_W	0.795457
26	tlocrev_percent	0.693129
77	r_stud_reg_12_H_F	0.351569
23	tcuresal_percent	0.262755
75	r_stud_reg_12_AAP_F	0.227409
74	r_stud_reg_12_AAP_M	0.158365
63	r_stud_re_F	0.113068
59	r_stud_PKK	0.105005
73	r_stud_reg_12_AIAN_F	0.056997
109	r_lrev_it	0.039117
116	r_lrev_slr	0.030626
99	r_st_LMSS	0.028520
76	r_stud_reg_12_H_M	0.022879
96	r_st_OGC	0.013479
138	r_srev_aor	0.000000
88	r_st_ET	-0.000000
101	r_st_LEASS	-0.000000
100	r_st_LEA	-0.000000

	features	estimatedCoefficients
57	r_lunch_free	-1.950453
67	r_stud_re_B	-1.240224
25	tfedrev_percent	-0.800888
8	tsrev_pp	-0.500096
60	r_stud_18	-0.444123
9	tfedrev_pp	-0.343794
142	r_frev_title1	-0.343191
90	r_st_UT	-0.334007
107	r_lrev_gst	-0.321746
24	tcurssvc_percent	-0.263275
0	num_schools	-0.197457
2	num_pub_schools	-0.193220
27	tsrev_percent	-0.173343
122	r_lrev_molr	-0.104637
125	r_lrev_sale	-0.100227
56	r_IEP	-0.095247
55	r_ELL	-0.065675
136	r_srev_gt	-0.058401
132	r_srev_cbsp	-0.055461
11	tcurssv_pp	-0.014056

	features	estimatedCoefficients
0	gradrate0607	0.188040
1	gradrate0708	0.254905
2	gradrate0809	0.354630

	features	estimatedCoefficients
34	i_fin_sdlc_voc	10.833789
18	tcurelsc_pp	9.617015
33	i_fin_sdlc_elem_sec	9.372606
54	i_lgo_PK	7.170665
53	i_lgo_K	6.893866
0	num_schools	6.641498
101	r_st_LEASS	5.879462
148	r_frev_ao	5.565641
32	i_fin_sdlc_sec	5.094310
128	r_srev_gfa	4.702548
143	r_frev_dis	4.335033
149	r_frev_ns	3.546220
142	r_frev_title1	3.440538
92	r_st_IA	3.325286
7	tlocrev_pp	3.096447
138	r_srev_aor	2.824535
81	r_stud_reg_12_W_F	2.263718
8	tsrev_pp	2.077341
153	r_frev_aofed	2.004041
77	r_stud_reg_12_H_F	1.996720

	features	estimatedCoefficients
26	tlocrev_percent	-38.035606
27	tsrev_percent	-34.018567
25	tfedrev_percent	-14.698286
91	r_st_TS	-10.003194
68	r_stud_re_W	-9.298882
21	tcurinst_percent	-7.825261
2	num_pub_schools	-7.714817
24	tcurssvc_percent	-7.695972
30	i_agency_type_regional_education_services	-7.089685
67	r_stud_re_B	-6.779637
66	r_stud_re_H	-6.759184
64	r_stud_re_AIAN	-4.411344
6	totalrev_pp	-3.691741
11	tcurssv_pp	-3.441055
29	i_agency_type_local_school_district_sup_union	-3.146080
13	tcursalary_pp	-2.965253
100	r_st_LEA	-2.824958
35	i_ucl_city_large	-2.683151
22	tcuroth_percent	-2.312685
15	totalexp_pp	-1.717676

	features	estimatedCoefficients
81	r_stud_reg_12_W_F	0.880563
156	gradrate0809	0.336472
155	gradrate0708	0.240823
154	gradrate0607	0.171633
108	r_lrev_put	0.000000
101	r_st_LEASS	-0.000000
102	r_st_SA	-0.000000
103	r_st_SASS	-0.000000
104	r_st_SSSS	-0.000000
105	r_st_OSSS	0.000000

	features	estimatedCoefficients
57	r_lunch_free	-0.185954
100	r_st_LEA	-0.000000
101	r_st_LEASS	-0.000000
102	r_st_SA	-0.000000
103	r_st_SASS	-0.000000
104	r_st_SSSS	-0.000000
105	r_st_OSSS	0.000000
106	r_lrev_pt	0.000000
107	r_lrev_gst	-0.000000
108	r_lrev_put	0.000000

	features	estimatedCoefficients
81	r_stud_reg_12_W_F	0.724249
80	r_stud_reg_12_W_M	0.390046
156	gradrate0809	0.321462
77	r_stud_reg_12_H_F	0.263953
155	gradrate0708	0.231649
23	tcuresal_percent	0.206815
154	gradrate0607	0.159676
26	tlocrev_percent	0.113949
74	r_stud_reg_12_AAP_M	0.079391
106	r_lrev_pt	0.008552

	features	estimatedCoefficients
57	r_lunch_free	-0.386189
67	r_stud_re_B	-0.253313
8	tsrev_pp	-0.224147
132	r_srev_cbsp	-0.136606
119	r_lrev_osalserv	-0.099034
9	tfedrev_pp	-0.080264
25	tfedrev_percent	-0.064001
60	r_stud_18	-0.056252
17	tnonelse_pp	-0.053719
122	r_lrev_molr	-0.009340

	features	estimatedCoefficients
0	r_stud_reg_12_W_F	1.944409
1	r_stud_reg_12_W_M	1.214658
2	tlocrev_percent	0.257012
3	r_stud_re_W	-0.173361
4	r_lrev_pt	0.146182
5	r_lunch_free	-2.896304
6	r_stud_re_B	-2.116497
7	tfedrev_percent	-1.158260
8	r_stud_18	-1.422072
9	tsrev_pp	-1.629074
10	tcuresal_percent	0.973579
11	r_frev_title1	-0.543912
12	tcurinst_percent	0.025455
13	r_lrev_molr	-0.731097

	features	estimatedCoefficients
62	r_st_TS	3.864551e+11
41	r_stud_re_W	7.257401e+10
40	r_stud_re_B	4.937619e+10
39	r_stud_re_H	4.656368e+10
37	r_stud_re_AIAN	2.732111e+10
38	r_stud_re_AAP	8.635449e+09
148	i_ucl_rural_remote	1.371368e+07
137	i_ucl_city_large	1.371368e+07
147	i_ucl_rural_distant	1.371368e+07
144	i_ucl_town_distant	1.371368e+07
146	i_ucl_rural_fringe	1.371368e+07
145	i_ucl_town_remote	1.371368e+07
141	i_ucl_suburb_mid	1.371368e+07
140	i_ucl_suburb_large	1.371368e+07
143	i_ucl_town_fringe	1.371368e+07
138	i_ucl_city_mid	1.371368e+07
142	i_ucl_suburb_small	1.371368e+07
139	i_ucl_city_small	1.371368e+07
125	i_agency_type_local_school_district_sup_union	1.135006e+06
124	i_agency_type_local_school_district	1.135005e+06
130	i_agency_type_other_education_agency	1.134998e+06
126	i_agency_type_sup_union_admin	1.134994e+06
127	i_agency_type_regional_education_services	1.134994e+06
128	i_agency_type_state_operated_institution	1.134972e+06
7	totalrev_pp	1.580506e+03

	features	estimatedCoefficients
61	r_st_UT	-7.935545e+10
60	r_st_ST	-7.853948e+10
75	r_st_OSSS	-7.125960e+10
73	r_st_SASS	-5.356324e+10
74	r_st_SSSS	-4.925571e+10
64	r_st_IC	-4.432454e+10
63	r_st_IA	-4.252249e+10
72	r_st_SA	-1.783860e+10
71	r_st_LEASS	-1.046188e+10
59	r_st_ET	-1.019717e+10
69	r_st_LMSS	-9.968358e+09
42	r_stud_re_Total	-9.824563e+09
70	r_st_LEA	-8.634636e+09
35	r_stud_re_M	-4.814197e+09
36	r_stud_re_F	-4.672918e+09
67	r_st_TGC	-4.488305e+09
58	r_st_KT	-2.207510e+09
57	r_st_PKT	-1.370151e+09
68	r_st_LMS	-8.418198e+08
150	i_cs_charter_noncharter	-1.888661e+07
151	i_cs_all_noncharter	-1.888660e+07
154	i_ma_micropolitan	-1.561338e+07
153	i_ma_metropolitan	-1.561338e+07
152	i_ma_ne_nr	-1.561338e+07
149	i_cs_all_charter	-1.253634e+07

Predicting Numerical Graduation Rate

Table of Contents

Acquiring graduation rate from previous years and adding it to schools districts dataset

Data Analysis

Baseline Comparison :

Predictive Modelling using the current school districts datasets.

Linear Regression

Lasso Regression

Elastic Net Regression

Decision Tree & Random Forests

Predicting graduation based on previous year graduation

Linear Regression - 0607, 0708, and 0809

Linear Regression - 0708 and 0809

Linear Regression - 0809 Only

Linear Regression - 0708 Only

Linear Regression - 0607 Only

Predicting graduation based on previous year graduation and all other factors

Linear Regression

Lasso Regression

Elastic Net Regression

Decision Trees and Random Forests

Creating a best model

Findings

Trying 2009-2010 predictions using 2006-2007 data

Linear Regression

Lasso Regression

Elastic Net

Creating a best model

Predicting 2009-2010 graduation rate from 2006-2007 data.

	features	estimatedCoefficients
41	r_stud_re_W	1.496449
34	r_stud_912	1.148622
26	tlocrev_percent	0.842331
42	r_stud_re_Total	0.834020
51	r_stud_reg_12_W_M	0.542936
36	r_stud_re_F	0.033100
0	num_students	-0.000000
120	r_frev_ia	0.000000
113	r_frev_dis	0.000000
114	r_frev_cna	0.000000

	features	estimatedCoefficients
30	r_lunch_free	-3.016324
40	r_stud_re_B	-1.922189
25	tfedrev_percent	-0.671879
3	num_pub_schools	-0.114687
106	r_srev_gt	-0.036813
118	r_frev_ao	-0.034284
112	r_frev_title1	-0.000000
113	r_frev_dis	0.000000
114	r_frev_cna	0.000000
115	r_frev_ems	0.000000

	features	estimatedCoefficients
41	r_stud_re_W	1.292664
34	r_stud_912	1.060088
26	tlocrev_percent	0.816008
51	r_stud_reg_12_W_M	0.775423
42	r_stud_re_Total	0.608449
36	r_stud_re_F	0.346064
52	r_stud_reg_12_W_F	0.336655
76	r_lrev_pt	0.316851
38	r_stud_re_AAP	0.291935
4	tcuresal_percent	0.210040
21	tcurinst_percent	0.205878
57	r_st_PKT	0.180802
113	r_frev_dis	0.165260
79	r_lrev_it	0.154029
86	r_lrev_slr	0.151288
46	r_stud_reg_12_AAP_F	0.144122
8	tlocrev_pp	0.136161
88	r_lrev_sar	0.133059
80	r_lrev_aot	0.127004
37	r_stud_re_AIAN	0.092383

	features	estimatedCoefficients
30	r_lunch_free	-1.886635
40	r_stud_re_B	-1.520982
25	tfedrev_percent	-0.921166
29	r_IEP	-0.331453
10	tfedrev_pp	-0.304889
106	r_srev_gt	-0.304483
118	r_frev_ao	-0.296672
3	num_pub_schools	-0.265479
28	r_ELL	-0.218137
1	num_schools	-0.210946
39	r_stud_re_H	-0.205766
119	r_frev_ns	-0.205559
50	r_stud_reg_12_B_F	-0.182480
77	r_lrev_gst	-0.159556
27	tsrev_percent	-0.118256
101	r_srev_sip	-0.104726
103	r_srev_vep	-0.077543
9	tsrev_pp	-0.045933
91	r_lrev_ie	-0.019656
92	r_lrev_molr	-0.017349

	features	estimatedCoefficients
0	r_stud_re_W	1.112856
1	r_stud_912	1.706844
2	r_stud_reg_12_W_M	1.158192
3	tlocrev_percent	1.707986
4	r_stud_re_Total	1.150385
5	r_lunch_free	-3.219141
6	r_stud_re_B	-2.528986
7	tfedrev_percent	-0.610386
8	tfedrev_pp	-0.383879
9	r_frev_ao	-0.848105
10	r_stud_re_F	0.884312
11	r_stud_reg_12_W_F	-0.342922
12	r_srev_gt	-0.845915
13	num_pub_schools	-0.735196