Predicting School District Performance

The Data Schoolers - Ashwin Deo, Aasta Frascati-Robinson, Bhanu Kanna, Brendan Law

Overview and Motivation

What factors have an impact on school district performance? We seek to learn if we can predict graduation rate based upon numerous school district characteristics. We look to understand what factors have little or no impact on performance. We also strive to classify school districts by custom team-built peer school district grouping rather than solely geographical grouping by nation, state, and school district, which would include factors like total students, student/teacher ratio, percent of children in poverty, district type, location, etc. We used the most current national graduation data found, which was for the school year 2009-2010. We have kept the dataset years consistent across data sources.

The goal of predicting school district performance based on school environment is to inform parents and interested citizens of what factors in school districts influence key success indicators such as graduation rate. Identifying these factors would help school districts look at potential opportunities to improve. This topic was selected because of a passion for using technology to enhance education and desire to give back to the education communities that have helped shape us. One team member would love to work in educational data science in the future.

Open education data is now being provided via several national, state, and local government portals. It is often up to the end user to piece together datasets across these portals to answer their questions, which is not something that a typical parent or interested citizen has the time or expertise to pursue. Instead, the data science community can support these users by melding these datasets and answering important education questions.

Dekker, Pechenizkiy, and Vleeshouwers built multiple models to predict Eindhoven University of Technology freshman dropout (2009, Educational Data Mining). We referenced this work to identify what types of models might be applicable for interpreting education data.

Data Sources

Schools

We did all of our processing on school data in a separate process book from the school district data.

We cleaned school data so that we could show the school detail underneath the school districts in our final visualization.
Link: School Data Cleanup
Link: School Exploratory Data Analysis

School Districts

Start with the standard imports we have used for every notebook in this class.


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Data Loading

Each of the datasheets downloaded from ELSI had download metadata on the top of them and total and key information on the bottom of them that were not data rows. This metadata, total, and key information was manually deleted before import. Some of the files had ="0" instead of 0 in the cells. This was found/replaced before import using the sed -i '' 's/="0"/0/g' *.csv command from the terminal.


In [2]:
#CITATION: This is the data from National Center for Education Statistics on Schools
#School districts for all 50 states and Washington, D.C.
#http://nces.ed.gov/ccd/elsi/
#Data Source: U.S. Department of Education National Center for Education Statistics Common Core of Data (CCD) "Local Education Agency (School District) Universe Survey" 2009-10 v.2a  2013-14 v.1a; "Public Elementary/Secondary School Universe Survey" 2009-10 v.2a; "Survey of Local Government Finances School Systems (F-33)" 2009-10 (FY 2010) v.1a.
#KEY:
#† indicates that the data are not applicable.
#– indicates that the data are missing.
#‡ indicates that the data do not meet NCES data quality standards.

districtinformation = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Information Tab.csv", dtype=np.str)
districtcharacteristicsa = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Characteristics Tab.csv", dtype=np.str)
districtenrollments = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Enrollments Tab.csv", dtype=np.str)
districtenrollmentK3 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS EnrollK3 Tab.csv", dtype=np.str)
districtenrollment48 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Enroll48 Tab.csv", dtype=np.str)
districtenrollment912 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Enroll912 Tab.csv", dtype=np.str)
districtteacherstaff = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS TeacherStaff Tab.csv", dtype=np.str)
districtgeneralfinance = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS GeneralFinance Tab.csv", dtype=np.str)
districtrevenue = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Revenue Tab.csv", dtype=np.str)
districtexpenditures = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Expenditures Tab.csv", dtype=np.str)

#Data Source: Local Education Agency (School District) Universe Survey Dropout and Completion Data: 2009-10 v.1a.
#KEY:
#-1 indicates that data is missing
#-2 indicates that data is not applicable
#-3 indicates that data was suppressed because of low count disclosure protection
#-4 indicates that data was supporessed because of high count disclosure protection
#-9 indicates that data was supporessed because data quality was poor

#SURVYEAR     1        AN       School year
#FIPST        2        AN       Two Digit American National Standards Institute (ANSI) State Code.
#
#                             	01  =  Alabama        02  =  Alaska          04  =  Arizona
#                             	05  =  Arkansas       06  =  California      08  =  Colorado
#                             	09  =  Connecticut    10  =  Delaware        11  =  District of Columbia
#                             	12  =  Florida        13  =  Georgia         15  =  Hawaii
#                             	16  =  Idaho          17  =  Illinois        18  =  Indiana
#                             	19  =  Iowa           20  =  Kansas          21  =  Kentucky
#                             	22  =  Louisiana      23  =  Maine           24  =  Maryland
#                             	25  =  Massachusetts  26  =  Michigan        27  =  Minnesota
#                             	28  =  Mississippi    29  =  Missouri        30  =  Montana
#                             	31  =  Nebraska       32  =  Nevada          33  =  New Hampshire
#                             	34  =  New Jersey     35  =  New Mexico      36  =  New York
#                             	37  =  North Carolina 38  =  North Dakota    39  =  Ohio
#                             	40  =  Oklahoma       41  =  Oregon          42  =  Pennsylvania
#                             	44  =  Rhode Island   45  =  South Carolina  46  =  South Dakota
#                             	47  =  Tennessee      48  =  Texas           49  =  Utah
#                             	50  =  Vermont        51  =  Virginia        53  =  Washington
#                             	54  =  West Virginia  55  =  Wisconsin       56  =  Wyoming
#                             	58  =  DOD Dependents Schools-Overseas    
#                             	59  =  Bureau of Indian Education
#                             	60  =  American Samoa 61  =  DOD Dependents School-Domestic
#                             	66  =  Guam           69  =  Northern Marianas
#                             	72  =  Puerto Rico    78  =  Virgin Islands
#
#LEAID        3        AN       NCES Assigned Local Education Agency Identification Number
#TOTD912      4        N        Total Dropouts, Grades 9–12
#EBS912       5        N        Dropout Enrollment Base, Grades 9–12
#DRP912       6*       N        Dropout Rate, Grades 9–12
#TOTDPL       7        N        Total Diploma Count
#AFGEB        8        N        Total Averaged Freshman Graduation Rate (AFGR) Enrollment Base
#AFGR         9*       N        Total Averaged Freshmen Graduation Rate (AFGR)
#TOTOHC      10        N        Total Other High School Completion Certificate (OHC) Recipients

districtdropoutscompleters = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS DropoutsCompleters.txt", dtype=np.str, delim_whitespace=True)

In [63]:
############
#If you need to come back and create a dftouse of old data, use this code.
#This code is purposely referencing 2009-2010 enrollment files so those later steps do not fail.  We later learned we didn't need the data, and it's not worth redownloading.
#We purposely changed the column names in the file to 2009-2010 so we could reuse our code.  It is 2006-2007 data.
districtinformation = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS Information Tab.csv", dtype=np.str)
districtcharacteristicsa = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS Characteristics Tab.csv", dtype=np.str)
districtenrollments = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS Enrollments Tab.csv", dtype=np.str)
districtenrollmentK3 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS EnrollK3 Tab.csv", dtype=np.str)
districtenrollment48 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Enroll48 Tab.csv", dtype=np.str)
districtenrollment912 = pd.read_csv("data/rawdata/districts/2009-2010 DISTRICTS Enroll912 Tab.csv", dtype=np.str)
districtteacherstaff = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS TeacherStaff Tab.csv", dtype=np.str)
districtgeneralfinance = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS GeneralFinance Tab.csv", dtype=np.str)
districtrevenue = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS Revenue Tab.csv", dtype=np.str)
districtexpenditures = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS Expenditures Tab.csv", dtype=np.str)

districtdropoutscompleters = pd.read_csv("data/rawdata/districts/prevyears/0607/2006-2007 DISTRICTS DropoutsCompleters.txt", dtype=np.str, delim_whitespace=True)

Check the lengths of the datasets to see if we have a row for every school district. We have more school district IDs in districtinformation than we have school district characteristics, and we have more rows of graduation information than we have school district characteristics. Rows without school district characteristics will later be dropped.


In [3]:
print len(districtinformation)
print len(districtcharacteristicsa)
print len(districtenrollments)
print len(districtenrollmentK3)
print len(districtenrollment48)
print len(districtenrollment912)
print len(districtteacherstaff)
print len(districtgeneralfinance)
print len(districtrevenue)
print len(districtexpenditures)
print len(districtdropoutscompleters)


19023
17916
17916
17916
17916
17916
17916
17916
17916
17916
18439

Drop all of the duplicate columns.


In [4]:
#Duplicate columns are:
#Agency Name
#State Name [District] Latest available year
#Agency ID - NCES Assigned [District] Latest available year
districtenrollments = districtenrollments.drop(districtenrollments.columns[[0, 1, 2]], 1)
districtenrollmentK3 = districtenrollmentK3.drop(districtenrollmentK3.columns[[0, 1, 2]], 1)
districtenrollment48 = districtenrollment48.drop(districtenrollment48.columns[[0, 1, 2]], 1)
districtenrollment912 = districtenrollment912.drop(districtenrollment912.columns[[0, 1, 2]], 1)
districtteacherstaff = districtteacherstaff.drop(districtteacherstaff.columns[[0, 1, 2]], 1)
districtgeneralfinance = districtgeneralfinance.drop(districtgeneralfinance.columns[[0, 1, 2]], 1)
districtrevenue = districtrevenue.drop(districtrevenue.columns[[0, 1, 2]], 1)
districtexpenditures = districtexpenditures.drop(districtexpenditures.columns[[0, 1, 2]], 1)

In [6]:
#FOR OLD DATA RELOAD ONLY
districtenrollments = districtenrollments.drop(districtenrollments.columns[[0, 1]], 1)
districtenrollmentK3 = districtenrollmentK3.drop(districtenrollmentK3.columns[[0, 1]], 1)
districtenrollment48 = districtenrollment48.drop(districtenrollment48.columns[[0, 1]], 1)
districtenrollment912 = districtenrollment912.drop(districtenrollment912.columns[[0, 1]], 1)
districtteacherstaff = districtteacherstaff.drop(districtteacherstaff.columns[[0, 1]], 1)
districtgeneralfinance = districtgeneralfinance.drop(districtgeneralfinance.columns[[0, 1]], 1)
districtrevenue = districtrevenue.drop(districtrevenue.columns[[0, 1]], 1)
districtexpenditures = districtexpenditures.drop(districtexpenditures.columns[[0, 1]], 1)

Join all of the school district datasets. The datasets districtinformation and districtdropoutscompleters need special treatment, as they have more rows for the school districts than the other datasets. All of the other datasets can be joined by ID without issue.


In [5]:
#Join the datasets that can be joined without issue.
joineddistrict = districtcharacteristicsa.join([districtenrollments, districtenrollmentK3, districtenrollment48, districtenrollment912, districtteacherstaff, districtgeneralfinance, districtrevenue, districtexpenditures])

#Clean up an extra hidden character in the Agency Name column
joineddistrict = joineddistrict.rename(columns={'Agency Name': 'Agency Name'})
districtinformation = districtinformation.rename(columns={'Agency Name': 'Agency Name'})

#Merge to the districtinformation dataset
joineddistrict = districtinformation.merge(joineddistrict, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))

#Need to get rid of Excel syntax ="" from the school district ID column so it can be joined successfully
joineddistrict['Agency ID - NCES Assigned [District] Latest available year'] = joineddistrict['Agency ID - NCES Assigned [District] Latest available year'].map(lambda x: str(x).lstrip('="').rstrip('"'))

#Rename the LEAID column so it can be merged with the joineddistrict dataset
districtdropoutscompleters = districtdropoutscompleters.rename(columns={'LEAID': 'Agency ID - NCES Assigned [District] Latest available year'})

#Merge to the joineddistrict dataset
joineddistrict = joineddistrict.merge(districtdropoutscompleters, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))

#If by chance any rows have NaN, replace with the ELSI standard for missing data '–'
joineddistrict = joineddistrict.fillna('–')
joineddistrict = joineddistrict.replace('nan', '–')

joineddistrict.head()


Out[5]:
Agency Name State Name [District] Latest available year State Name [District] 2009-10 State Abbr [District] Latest available year Agency Name [District] 2009-10 Agency ID - NCES Assigned [District] Latest available year County Name [District] 2009-10 County Number [District] 2009-10 Race/Ethnicity Category [District] 2009-10 ANSI/FIPS State Code [District] Latest available year Total Number Operational Schools [Public School] 2009-10 Total Number Operational Charter Schools [Public School] 2009-10 Total Number of Public Schools [Public School] 2009-10 Years District Reported Data [District] Latest available year Years District Did Not Report Data [District] Latest available year Location Address [District] 2013-14 Location City [District] 2013-14 Location State Abbr [District] 2013-14 Location ZIP [District] 2013-14 Location ZIP4 [District] 2013-14 Mailing Address [District] 2013-14 Mailing City [District] 2013-14 Mailing State Abbr [District] 2013-14 Mailing ZIP [District] 2013-14 Mailing ZIP4 [District] 2013-14 Phone Number [District] 2013-14 Agency Name_DEL State Name [District] Latest available year_DEL Agency Type [District] 2009-10 School District Level Code (SCHLEV) [District Finance] 2009-10 Urban-centric Locale [District] 2009-10 Boundary Change Indicator Flag [District] 2009-10 CBSA Name [District] 2009-10 CBSA ID [District] 2009-10 CSA Name [District] 2009-10 CSA ID [District] 2009-10 Latitude [District] 2009-10 Longitude [District] 2009-10 State Agency ID [District] 2009-10 Supervisory Union (ID) Number [District] 2009-10 Agency Charter Status [District] 2009-10 Metro Micro Area Code [District] 2009-10 Congressional Code [District] 2009-10 Census ID (CENSUSID) [District Finance] 2009-10 Lowest Grade Offered [District] 2009-10 Highest Grade Offered [District] 2009-10 Total Students (UG PK-12) [District] 2009-10 PK thru 12th Students [District] 2009-10 Ungraded Students [District] 2009-10 Total Students [Public School] 2009-10 ... Total - School Admin.- Supp. Serv. Exp. (E09) [District Finance] 2009-10 Total - Student Transp.- Supp. Serv. Exp. (V45) [District Finance] 2009-10 Total - Students- Supp. Serv. Exp. (E17) [District Finance] 2009-10 Salary - Instruction Expenditures (Z33) [District Finance] 2009-10 Salary - Students- Supp. Serv. Exp. (V11) [District Finance] 2009-10 Salary - Instruct. Staff- Supp. Serv. Exp. (V13) [District Finance] 2009-10 Salary - General Admin.- Supp. Serv. Exp. (V15) [District Finance] 2009-10 Salary - School Admin.- Supp. Serv. Exp. (V17) [District Finance] 2009-10 Salary - Ops. & Mainten.- Supp. Serv. Exp. (V21) [District Finance] 2009-10 Salary - Student Transp.- Supp. Serv. Exp. (V23) [District Finance] 2009-10 Salary - Other Supp. Serv.- Supp. Serv. Exp. (V37) [District Finance] 2009-10 Salary - Food Services- Non-Instruction (V29) [District Finance] 2009-10 Employee Benefits - Instruction Expend. (V10) [District Finance] 2009-10 Empl. Benefits - Students- Supp. Serv. Exp. (V12) [District Finance] 2009-10 Empl. Benefits - Instruction- Supp. Serv. Exp. (V14) [District Finance] 2009-10 Empl. Benefits - Gen. Adm.- Supp. Serv. Exp. (V16) [District Finance] 2009-10 Empl. Benefits - Sch. Adm.- Supp. Serv. Exp. (V18) [District Finance] 2009-10 Empl. Benefits - Ops. & Maint.- Supp. Serv. Exp. (V22) [District Finance] 2009-10 Empl. Benefits - Student Trans.- Supp. Serv. Exp. (V24) [District Finance] 2009-10 Empl. Benefits - Other Supp Serv- Supp. Serv. Exp. (V38) [District Finance] 2009-10 Empl. Benefits - Food Services- Non-Instruction (V30) [District Finance] 2009-10 Empl. Benefits - Enterp. Oper.- Non-Instruction (V32) [District Finance] 2009-10 Current Spending - Private Schools (V91) [District Finance] 2009-10 Current Spending - Public Charter Schools (V92) [District Finance] 2009-10 Teacher Salaries - Regular Education Programs (Z35) [District Finance] 2009-10 Teacher Salaries - Special Education Programs (Z36) [District Finance] 2009-10 Teacher Salaries - Vocational Education Programs (Z37) [District Finance] 2009-10 Teacher Salaries - Other Education Programs (Z38) [District Finance] 2009-10 Textbooks for Instruction (V93) [District Finance] 2009-10 Community Services - Non El-Sec (V70) [District Finance] 2009-10 Adult Education - Non El-Sec (V75) [District Finance] 2009-10 Other Expenditures - Non El-Sec (V80) [District Finance] 2009-10 Construction - Capital Outlay (F12) [District Finance] 2009-10 Instructional Equipment - Capital Outlay (K09) [District Finance] 2009-10 Other Equipment - Capital Outlay (K10) [District Finance] 2009-10 Non-specified - Equipment Expenditures (K11) [District Finance] 2009-10 Land & Existing Structures - Capital Outlay (G15) [District Finance] 2009-10 Payments to Local Governments (M12) [District Finance] 2009-10 Payments to State Governments (L12) [District Finance] 2009-10 Interest on School System Indebtedness (I86) [District Finance] 2009-10 Payments to Other School Systems (Q11) [District Finance] 2009-10 SURVYEAR FIPST TOTD912 EBS912 DRP912 TOTDPL AFGEB AFGR TOTOHC
0 100 LEGACY ACADEMY CHARTER SCHOOL New Jersey NJ 3400743 34 2010-2013 1986-2009 75 MORTON ST NEWARK NJ ="07103" 75 MORTON STREET NEWARK NJ ="07103" 9733178361 ...
1 21ST CENTURY CHARTER SCH OF GARY Indiana Indiana IN 21ST CENTURY CHARTER SCH OF GARY 1800046 MARION COUNTY 18097 Reported 5 categories 18 1 1 1 2004-2013 1986-2003 556 WASHINGTON ST GARY IN 46402 333 N PENNSYLVANIA SUITE 1000 INDIANAPOLIS IN 46202 3175361027 21ST CENTURY CHARTER SCH OF GARY Indiana 7-Charter school agency ="03-Elementary/secondary school system" 13-City: Small 1-No change since last report Indianapolis IN 26900 Indianapolis-Anderson-Columbus IN 294 39.771949 -86.155184 9545 ="000" 1-All associated schools are charter schools 1-Metropolitan Area 1807 Kindergarten 12th Grade 360 360 360 ... 318000 9000 18000 1173000 ="0" ="0" ="0" 315000 ="0" ="0" ="0" ="0" 12000 ="0" ="0" ="0" ="0" ="0" ="0" 358000 ="0" ="0" ="0" ="0" ="0" ="0" ="0" ="0" 77000 13000 13000 ="0" 37000 86000 72000 146000 374000 ="0" ="0" ="0" ="0" 2009-10 18 -3 100 -3.0 13 43 30.2 -2
2 21ST CENTURY CYBER CS Pennsylvania Pennsylvania PA 21ST CENTURY CYBER CS 4200091 CHESTER COUNTY 42029 Reported 5 categories 42 1 1 1 2001-2013 1986-2000 805 SPRINGDALE DR EXTON PA 19341 3043 805 SPRINGDALE DR. EXTON PA 19341 3043 4848755400 21ST CENTURY CYBER CS Pennsylvania 7-Charter school agency ="03-Elementary/secondary school system" 21-Suburb: Large 1-No change since last report Philadelphia-Camden-Wilmington PA-NJ-DE-MD 37980 Philadelphia-Camden-Vineland PA-NJ-DE-MD 428 40.005030 -75.678564 124150002 ="000" 1-All associated schools are charter schools 1-Metropolitan Area 4206 6th Grade 12th Grade 594 594 594 ... 220000 ="0" 337000 1652000 252000 224000 190000 174000 ="0" ="0" ="0" ="0" 644000 74000 61000 47000 38000 ="0" ="0" ="0" ="0" ="0" 18000 ="0" 1512000 140000 ="0" ="0" 236000 ="0" ="0" ="0" ="0" 317000 12000 ="0" ="0" ="0" ="0" ="0" 24000 2009-10 42 33 483 6.8 102 77 100.0 -2
3 21ST CENTURY PREPARATORY SCHOOL AGENCY Wisconsin Wisconsin WI 21ST CENTURY PREPARATORY SCHOOL AGENCY 5500045 RACINE COUNTY 55101 Reported 5 categories 55 1 1 1 2002-2013 1986-2001 1220 MOUND AVE RACINE WI 53404 1220 MOUND AVE RACINE WI 53404 2625980026 21ST CENTURY PREPARATORY SCHOOL AGENCY Wisconsin 7-Charter school agency ="01-Elementary school system only" 13-City: Small 1-No change since last report Racine WI 39540 Milwaukee-Racine-Waukesha WI 376 42.729073 -87.797014 8110 ="000" 1-All associated schools are charter schools 1-Metropolitan Area 5501 Prekindergarten 8th Grade 507 507 507 ... 2009-10 55 -2 -2 -2.0 -2 -2 -2.0 -2
4 4-WINDS ACADEMY INCORPORATED DBA 4-WINDS ACADEMY Arizona Arizona AZ 4-WINDS ACADEMY INCORPORATED DBA 4-WINDS ACADEMY 0400380 APACHE COUNTY ="04001" Reported 5 categories ="04" ="0" ="0" 1 2002-2013 1986-2001 725 EAST MAIN SPRINGERVILLE AZ 85938 P. O. BOX 1210 EAGAR AZ 85925 9283331060 4-WINDS ACADEMY INCORPORATED DBA 4-WINDS ACADEMY Arizona 7-Charter school agency 33-Town: Remote 6-Agency temporarily closed 34.132862 -109.276602 79998 1-All associated schools are charter schools ="0-New England (NECTA) or not reported" ="0401" Kindergarten 8th Grade ... 2009-10 04 -2 -2 -2.0 -2 -2 -2.0 -2

5 rows × 472 columns


In [6]:
#FOR OLD DATA LOAD ONLY
#joineddistrict = districtinformation.merge(districtcharacteristicsa, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtenrollments, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtenrollmentK3, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtenrollment48, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtenrollment912, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtteacherstaff, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtgeneralfinance, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtrevenue, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.merge(districtexpenditures, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))
#joineddistrict = joineddistrict.rename(columns={'Agency Name': 'Agency Name'})

#Rename the LEAID column so it can be merged with the joineddistrict dataset
#districtdropoutscompleters = districtdropoutscompleters.rename(columns={'LEAID': 'Agency ID - NCES Assigned [District] Latest available year'})

#Merge to the joineddistrict dataset
#joineddistrict = joineddistrict.merge(districtdropoutscompleters, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_DEL'))

#If by chance any rows have NaN, replace with the ELSI standard for missing data '–'
#joineddistrict = joineddistrict.fillna('–')
#joineddistrict = joineddistrict.replace('nan', '–')

#joineddistrict.head()

If we did this correctly, we should still have 19023 rows as we did from the previous step.


In [7]:
print len(joineddistrict)


19023

Data Cleaning

Now we start cleaning the data.


In [8]:
cleaneddistrict = joineddistrict.copy(deep=True)

The dropouts and completers dataset has flags for different types of missing data (NAs, missing, etc.), including -1, -2, -3, -4, -9. Set all of these to missing data flag instead.


In [9]:
cleaneddistrict['TOTD912'] = cleaneddistrict['TOTD912'].replace('-1', '–')
cleaneddistrict['TOTD912'] = cleaneddistrict['TOTD912'].replace('-2', '–')
cleaneddistrict['TOTD912'] = cleaneddistrict['TOTD912'].replace('-3', '–')
cleaneddistrict['TOTD912'] = cleaneddistrict['TOTD912'].replace('-4', '–')
cleaneddistrict['TOTD912'] = cleaneddistrict['TOTD912'].replace('-9', '–')

cleaneddistrict['EBS912'] = cleaneddistrict['EBS912'].replace('-2', '–')

cleaneddistrict['DRP912'] = cleaneddistrict['DRP912'].replace('-1.0', '–')
cleaneddistrict['DRP912'] = cleaneddistrict['DRP912'].replace('-2.0', '–')
cleaneddistrict['DRP912'] = cleaneddistrict['DRP912'].replace('-3.0', '–')
cleaneddistrict['DRP912'] = cleaneddistrict['DRP912'].replace('-4.0', '–')
cleaneddistrict['DRP912'] = cleaneddistrict['DRP912'].replace('-9.0', '–')

cleaneddistrict['TOTDPL'] = cleaneddistrict['TOTDPL'].replace('-1', '–')
cleaneddistrict['TOTDPL'] = cleaneddistrict['TOTDPL'].replace('-2', '–')
cleaneddistrict['TOTDPL'] = cleaneddistrict['TOTDPL'].replace('-9', '–')

cleaneddistrict['AFGEB'] = cleaneddistrict['AFGEB'].replace('-1', '–')
cleaneddistrict['AFGEB'] = cleaneddistrict['AFGEB'].replace('-2', '–')

cleaneddistrict['AFGR'] = cleaneddistrict['AFGR'].replace('-1.0', '–')
cleaneddistrict['AFGR'] = cleaneddistrict['AFGR'].replace('-2.0', '–')
cleaneddistrict['AFGR'] = cleaneddistrict['AFGR'].replace('-9.0', '–')

cleaneddistrict['TOTOHC'] = cleaneddistrict['TOTOHC'].replace('-1', '–')
cleaneddistrict['TOTOHC'] = cleaneddistrict['TOTOHC'].replace('-2', '–')
cleaneddistrict['TOTOHC'] = cleaneddistrict['TOTOHC'].replace('-3', '–')
cleaneddistrict['TOTOHC'] = cleaneddistrict['TOTOHC'].replace('-9', '–')

Some of the columns had Excel style syntax in them ="". We need to get rid of it.


In [10]:
#Need to get rid of Excel syntax ="" from some of the columns
for i, col in enumerate(cleaneddistrict.columns):
    cleaneddistrict[col] = cleaneddistrict[col].map(lambda x: str(x).lstrip('="').rstrip('"'))

We need to replace the flags for missing, NA, and bad quality data with blanks that can later be turned into NaN for float columns.


In [11]:
# Replacing Missing Data / NA / Bad Quality data with blank, later to be turned into NaN for float columns
# CITATION : http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html

cleaneddistrict = cleaneddistrict.replace('\xe2\x80\x93', '') # Replace "-" (Missing Data) with blank
cleaneddistrict = cleaneddistrict.replace('\xe2\x80\xa0', '') # Replace "†" (Not Applicable) with blank
cleaneddistrict = cleaneddistrict.replace('\xe2\x80\xa1', '') # Replace "‡" (Bad Quality) with blank

Turn all of the numerical columns into floats. Replace the blanks from the previous step with NaN.


In [12]:
countcolumns = ['Local Rev. - Individual & Corp. Income Taxes (T40) [District Finance] 2009-10','Local Rev. - All Other Taxes (T99) [District Finance] 2009-10','Local Rev. - Parent Government Contributions (T02) [District Finance] 2009-10','Local Rev. - Revenue- Cities and Counties (D23) [District Finance] 2009-10','Local Rev. - Revenue- Other School Systems (D11) [District Finance] 2009-10','Local Rev. - Tuition Fees- Pupils and Parents (A07) [District Finance] 2009-10','Local Rev. - Transp. Fees- Pupils and Parents (A08) [District Finance] 2009-10','Local Rev. - School Lunch Revenues (A09) [District Finance] 2009-10','Local Rev. - Textbook Sales and Rentals (A11) [District Finance] 2009-10','Local Rev. - Student Activity Receipts (A13) [District Finance] 2009-10','Local Rev. - Other Sales and Service Rev. (A20) [District Finance] 2009-10','Local Rev. - Student Fees Non-Specified (A15) [District Finance] 2009-10','Local Rev. - Interest Earnings (U22) [District Finance] 2009-10','Local Rev. - Miscellaneous Other Local Rev. (U97) [District Finance] 2009-10','Local Rev. - Special Processing (C24) [District Finance] 2009-10','Local Rev. - Rents and Royalties (A40) [District Finance] 2009-10','Local Rev. - Sale of Property (U11) [District Finance] 2009-10','Local Rev. - Fines and Forfeits (U30) [District Finance] 2009-10','Local Rev. - Private Contributions (U50) [District Finance] 2009-10','State Rev. - General Formula Assistance (C01) [District Finance] 2009-10','State Rev. - Special Education Programs (C05) [District Finance] 2009-10','State Rev. - Transportation Programs (C12) [District Finance] 2009-10','State Rev. - Staff Improvement Programs (C04) [District Finance] 2009-10','State Rev. - Compensat. and Basic Skills Prog. (C06) [District Finance] 2009-10','State Rev. - Vocational Education Programs (C09) [District Finance] 2009-10','State Rev. - Capital Outlay and Debt Serv. Prog. (C11) [District Finance] 2009-10','State Rev. - Bilingual Education Programs (C07) [District Finance] 2009-10','State Rev. - Gifted and Talented Programs (C08) [District Finance] 2009-10','State Rev. - School Lunch Programs (C10) [District Finance] 2009-10','State Rev. - All Other Rev.- State Sources (C13) [District Finance] 2009-10','State Rev. - State Payment for LEA Empl. Benefits (C38) [District Finance] 2009-10','State Rev. - Other State Payments (C39) [District Finance] 2009-10','State Rev. - Non-Specified (C35) [District Finance] 2009-10','Federal Rev. - Federal Title I Revenue (C14) [District Finance] 2009-10','Federal Rev. - Children with Disabilities (C15) [District Finance] 2009-10','Federal Rev. - Child Nutrition Act (C25) [District Finance] 2009-10','Federal Rev. - Eisenhower Math and Science (C16) [District Finance] 2009-10','Federal Rev. - Drug-Free Schools (C17) [District Finance] 2009-10','Federal Rev. - Vocational Education (C19) [District Finance] 2009-10','Federal Rev. - All Other Fed. Aid Through State (C20) [District Finance] 2009-10','Federal Rev. - Nonspecified (C36) [District Finance] 2009-10','Federal Rev. - Impact Aid (PL 815 and 874) (B10) [District Finance] 2009-10','Federal Rev. - Bilingual Education (B11) [District Finance] 2009-10','Federal Rev. - Native American (Ind.) Educ. (B12) [District Finance] 2009-10','Federal Rev. - All Other Federal Aid (B13) [District Finance] 2009-10','Enterprise Operations - Non Instructional (V60) [District Finance] 2009-10','Food Services - Non Instuctional (E11) [District Finance] 2009-10','Instruction Expenditures - Total (E13) [District Finance] 2009-10','Non-Specified - Supp. Serv. Exp. (V85) [District Finance] 2009-10','Other Non Instructional (V65) [District Finance] 2009-10','Total - Gen. Admin.- Supp. Serv. Exp. (E08) [District Finance] 2009-10','Total - Instruct. Staff- Supp. Serv. Exp. (E07) [District Finance] 2009-10','Total - Ops. & Mainten.- Supp. Serv. Exp. (V40) [District Finance] 2009-10','Total - Other Supp. Serv.- Supp. Serv. Exp. (V90) [District Finance] 2009-10','Total - School Admin.- Supp. Serv. Exp. (E09) [District Finance] 2009-10','Total - Student Transp.- Supp. Serv. Exp. (V45) [District Finance] 2009-10','Total - Students- Supp. Serv. Exp. (E17) [District Finance] 2009-10','Salary - Instruction Expenditures (Z33) [District Finance] 2009-10','Salary - Students- Supp. Serv. Exp. (V11) [District Finance] 2009-10','Salary - Instruct. Staff- Supp. Serv. Exp. (V13) [District Finance] 2009-10','Salary - General Admin.- Supp. Serv. Exp. (V15) [District Finance] 2009-10','Salary - School Admin.- Supp. Serv. Exp. (V17) [District Finance] 2009-10','Salary - Ops. & Mainten.- Supp. Serv. Exp. (V21) [District Finance] 2009-10','Salary - Student Transp.- Supp. Serv. Exp. (V23) [District Finance] 2009-10','Salary - Other Supp. Serv.- Supp. Serv. Exp. (V37) [District Finance] 2009-10','Salary - Food Services- Non-Instruction (V29) [District Finance] 2009-10','Employee Benefits - Instruction Expend. (V10) [District Finance] 2009-10','Empl. Benefits - Students- Supp. Serv. Exp. (V12) [District Finance] 2009-10','Empl. Benefits - Instruction- Supp. Serv. Exp. (V14) [District Finance] 2009-10','Empl. Benefits - Gen. Adm.- Supp. Serv. Exp. (V16) [District Finance] 2009-10','Empl. Benefits - Sch. Adm.- Supp. Serv. Exp. (V18) [District Finance] 2009-10','Empl. Benefits - Ops. & Maint.- Supp. Serv. Exp. (V22) [District Finance] 2009-10','Empl. Benefits - Student Trans.- Supp. Serv. Exp. (V24) [District Finance] 2009-10','Empl. Benefits - Other Supp Serv- Supp. Serv. Exp. (V38) [District Finance] 2009-10','Empl. Benefits - Food Services- Non-Instruction (V30) [District Finance] 2009-10','Empl. Benefits - Enterp. Oper.- Non-Instruction (V32) [District Finance] 2009-10','Current Spending - Private Schools (V91) [District Finance] 2009-10','Current Spending - Public Charter Schools (V92) [District Finance] 2009-10','Teacher Salaries - Regular Education Programs (Z35) [District Finance] 2009-10','Teacher Salaries - Special Education Programs (Z36) [District Finance] 2009-10','Teacher Salaries - Vocational Education Programs (Z37) [District Finance] 2009-10','Teacher Salaries - Other Education Programs (Z38) [District Finance] 2009-10','Textbooks for Instruction (V93) [District Finance] 2009-10','Community Services - Non El-Sec (V70) [District Finance] 2009-10','Adult Education - Non El-Sec (V75) [District Finance] 2009-10','Other Expenditures - Non El-Sec (V80) [District Finance] 2009-10','Construction - Capital Outlay (F12) [District Finance] 2009-10','Instructional Equipment - Capital Outlay (K09) [District Finance] 2009-10','Other Equipment - Capital Outlay (K10) [District Finance] 2009-10','Non-specified - Equipment Expenditures (K11) [District Finance] 2009-10','Land & Existing Structures - Capital Outlay (G15) [District Finance] 2009-10','Payments to Local Governments (M12) [District Finance] 2009-10','Payments to State Governments (L12) [District Finance] 2009-10','Interest on School System Indebtedness (I86) [District Finance] 2009-10','Payments to Other School Systems (Q11) [District Finance] 2009-10','FIPST','TOTD912','EBS912','DRP912','TOTDPL','AFGEB','AFGR','TOTOHC','Total Number Operational Schools [Public School] 2009-10', 'Total Number Operational Charter Schools [Public School] 2009-10', 'Total Number of Public Schools [Public School] 2009-10', 'Total Students (UG PK-12) [District] 2009-10', 'PK thru 12th Students [District] 2009-10', 'Ungraded Students [District] 2009-10', 'Total Students [Public School] 2009-10', 'Limited English Proficient (LEP) / English Language Learners (ELL) [District] 2009-10', 'Individualized Education Program Students [District] 2009-10', 'Free Lunch Eligible [Public School] 2009-10', 'Reduced-price Lunch Eligible Students [Public School] 2009-10', 'Total Free and Reduced Lunch Students [Public School] 2009-10', 'Prekindergarten and Kindergarten Students [Public School] 2009-10', 'Grades 1-8 Students [Public School] 2009-10','Grades 9-12 Students [Public School] 2009-10','Prekindergarten Students [Public School] 2009-10','Kindergarten Students [Public School] 2009-10','Grade 1 Students [Public School] 2009-10','Grade 2 Students [Public School] 2009-10','Grade 3 Students [Public School] 2009-10','Grade 4 Students [Public School] 2009-10','Grade 5 Students [Public School] 2009-10','Grade 6 Students [Public School] 2009-10','Grade 7 Students [Public School] 2009-10','Grade 8 Students [Public School] 2009-10','Grade 9 Students [Public School] 2009-10','Grade 10 Students [Public School] 2009-10','Grade 11 Students [Public School] 2009-10','Grade 12 Students [Public School] 2009-10','Ungraded Students [Public School] 2009-10','Male Students [Public School] 2009-10','Female Students [Public School] 2009-10','American Indian/Alaska Native Students [Public School] 2009-10','Asian or Asian/Pacific Islander Students [Public School] 2009-10','Hispanic Students [Public School] 2009-10','Black Students [Public School] 2009-10','White Students [Public School] 2009-10','Hawaiian Nat./Pacific Isl. Students [Public School] 2009-10','Two or More Races Students [Public School] 2009-10','Total Race/Ethnicity [Public School] 2009-10','Prekindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10','Prekindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10','Prekindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Prekindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Prekindergarten Students - Hispanic - male [Public School] 2009-10','Prekindergarten Students - Hispanic - female [Public School] 2009-10','Prekindergarten Students - Black - male [Public School] 2009-10','Prekindergarten Students - Black - female [Public School] 2009-10','Prekindergarten Students - White - male [Public School] 2009-10','Prekindergarten Students - White - female [Public School] 2009-10','Prekindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Prekindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Prekindergarten Students - Two or More Races - male [Public School] 2009-10','Prekindergarten Students - Two or More Races - female [Public School] 2009-10','Kindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10','Kindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10','Kindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Kindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Kindergarten Students - Hispanic - male [Public School] 2009-10','Kindergarten Students - Hispanic - female [Public School] 2009-10','Kindergarten Students - Black - male [Public School] 2009-10','Kindergarten Students - Black - female [Public School] 2009-10','Kindergarten Students - White - male [Public School] 2009-10','Kindergarten Students - White - female [Public School] 2009-10','Kindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Kindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Kindergarten Students - Two or More Races - male [Public School] 2009-10','Kindergarten Students - Two or More Races - female [Public School] 2009-10','Grade 1 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 1 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 1 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 1 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 1 Students - Hispanic - male [Public School] 2009-10','Grade 1 Students - Hispanic - female [Public School] 2009-10','Grade 1 Students - Black - male [Public School] 2009-10','Grade 1 Students - Black - female [Public School] 2009-10','Grade 1 Students - White - male [Public School] 2009-10','Grade 1 Students - White - female [Public School] 2009-10','Grade 1 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 1 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 1 Students - Two or More Races - male [Public School] 2009-10','Grade 1 Students - Two or More Races - female [Public School] 2009-10','Grade 2 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 2 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 2 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 2 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 2 Students - Hispanic - male [Public School] 2009-10','Grade 2 Students - Hispanic - female [Public School] 2009-10','Grade 2 Students - Black - male [Public School] 2009-10','Grade 2 Students - Black - female [Public School] 2009-10','Grade 2 Students - White - male [Public School] 2009-10','Grade 2 Students - White - female [Public School] 2009-10','Grade 2 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 2 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 2 Students - Two or More Races - male [Public School] 2009-10','Grade 2 Students - Two or More Races - female [Public School] 2009-10','Grade 3 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 3 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 3 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 3 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 3 Students - Hispanic - male [Public School] 2009-10','Grade 3 Students - Hispanic - female [Public School] 2009-10','Grade 3 Students - Black - male [Public School] 2009-10','Grade 3 Students - Black - female [Public School] 2009-10','Grade 3 Students - White - male [Public School] 2009-10','Grade 3 Students - White - female [Public School] 2009-10','Grade 3 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 3 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 3 Students - Two or More Races - male [Public School] 2009-10','Grade 3 Students - Two or More Races - female [Public School] 2009-10','Grade 4 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 4 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 4 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 4 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 4  Students - Hispanic - male [Public School] 2009-10','Grade 4 Students - Hispanic - female [Public School] 2009-10','Grade 4 Students - Black - male [Public School] 2009-10','Grade 4 Students - Black - female [Public School] 2009-10','Grade 4 Students - White - male [Public School] 2009-10','Grade 4 Students - White - female [Public School] 2009-10','Grade 4 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 4 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 4 Students - Two or More Races - male [Public School] 2009-10','Grade 4 Students - Two or More Races - female [Public School] 2009-10','Grade 5 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 5 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 5 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 5 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 5 Students - Hispanic - male [Public School] 2009-10','Grade 5 Students - Hispanic - female [Public School] 2009-10','Grade 5 Students - Black - male [Public School] 2009-10','Grade 5 Students - Black - female [Public School] 2009-10','Grade 5 Students - White - male [Public School] 2009-10','Grade 5 Students - White - female [Public School] 2009-10','Grade 5 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 5 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 5 Students - Two or More Races - male [Public School] 2009-10','Grade 5 Students - Two or More Races - female [Public School] 2009-10','Grade 6 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 6 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 6 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 6 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 6 Students - Hispanic - male [Public School] 2009-10','Grade 6 Students - Hispanic - female [Public School] 2009-10','Grade 6 Students - Black - male [Public School] 2009-10','Grade 6 Students - Black - female [Public School] 2009-10','Grade 6 Students - White - male [Public School] 2009-10','Grade 6 Students - White - female [Public School] 2009-10','Grade 6 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 6 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 6  Students- Two or More Races - male [Public School] 2009-10','Grade 6 Students - Two or More Races - female [Public School] 2009-10','Grade 7 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 7 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 7 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 7 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 7 Students - Hispanic - male [Public School] 2009-10','Grade 7 Students - Hispanic - female [Public School] 2009-10','Grade 7 Students - Black - male [Public School] 2009-10','Grade 7 Students - Black - female [Public School] 2009-10', 'Grade 7 Students - White - male [Public School] 2009-10','Grade 7 Students - White - female [Public School] 2009-10','Grade 7 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 7 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 7 Students - Two or More Races - male [Public School] 2009-10','Grade 7 Students - Two or More Races - female [Public School] 2009-10','Grade 8 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 8 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 8 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 8 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 8  Students- Hispanic - male [Public School] 2009-10','Grade 8 Students - Hispanic - female [Public School] 2009-10','Grade 8 Students - Black - male [Public School] 2009-10','Grade 8 Students - Black - female [Public School] 2009-10','Grade 8 Students - White - male [Public School] 2009-10','Grade 8 Students - White - female [Public School] 2009-10','Grade 8 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 8 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 8 Students - Two or More Races - male [Public School] 2009-10','Grade 8 Students - Two or More Races - female [Public School] 2009-10','Grade 9 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 9 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 9 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 9 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 9 Students - Hispanic - male [Public School] 2009-10','Grade 9 Students - Hispanic - female [Public School] 2009-10','Grade 9 Students - Black - male [Public School] 2009-10','Grade 9 Students - Black - female [Public School] 2009-10','Grade 9 Students - White - male [Public School] 2009-10','Grade 9 Students - White - female [Public School] 2009-10','Grade 9 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 9 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 9 Students - Two or More Races - male [Public School] 2009-10','Grade 9 Students - Two or More Races - female [Public School] 2009-10','Grade 10 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 10 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 10 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 10 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 10 Students - Hispanic - male [Public School] 2009-10','Grade 10 Students - Hispanic - female [Public School] 2009-10','Grade 10 Students - Black - male [Public School] 2009-10','Grade 10 Students - Black - female [Public School] 2009-10','Grade 10 Students - White - male [Public School] 2009-10','Grade 10 Students - White - female [Public School] 2009-10','Grade 10 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 10 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 10 Students - Two or More Races - male [Public School] 2009-10','Grade 10 Students - Two or More Races - female [Public School] 2009-10','Grade 11 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 11 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 11 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 11 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 11 Students - Hispanic - male [Public School] 2009-10','Grade 11 Students - Hispanic - female [Public School] 2009-10','Grade 11 Students - Black - male [Public School] 2009-10','Grade 11 Students - Black - female [Public School] 2009-10','Grade 11 Students - White - male [Public School] 2009-10','Grade 11 Students - White - female [Public School] 2009-10','Grade 11 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 11 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 11 Students - Two or More Races - male [Public School] 2009-10','Grade 11 Students - Two or More Races - female [Public School] 2009-10','Grade 12 Students - American Indian/Alaska Native - male [Public School] 2009-10','Grade 12 Students - American Indian/Alaska Native - female [Public School] 2009-10','Grade 12 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Grade 12 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Grade 12 Students - Hispanic - male [Public School] 2009-10','Grade 12 Students - Hispanic - female [Public School] 2009-10','Grade 12 Students - Black - male [Public School] 2009-10','Grade 12 Students - Black - female [Public School] 2009-10','Grade 12 Students - White - male [Public School] 2009-10','Grade 12 Students - White - female [Public School] 2009-10','Grade 12 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Grade 12 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Grade 12 Students - Two or More Races - male [Public School] 2009-10','Grade 12 Students - Two or More Races - female [Public School] 2009-10','Ungraded Students - American Indian/Alaska Native - male [Public School] 2009-10','Ungraded  Students- American Indian/Alaska Native - female [Public School] 2009-10','Ungraded Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10','Ungraded Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10','Ungraded Students - Hispanic - male [Public School] 2009-10','Ungraded Students - Hispanic - female [Public School] 2009-10','Ungraded Students - Black - male [Public School] 2009-10','Ungraded Students - Black - female [Public School] 2009-10','Ungraded Students - White - male [Public School] 2009-10','Ungraded Students - White - female [Public School] 2009-10','Ungraded Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10','Ungraded Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10','Ungraded Students - Two or More Races - male [Public School] 2009-10','Ungraded Students - Two or More Races - female [Public School] 2009-10','Full-Time Equivalent (FTE) Teachers [District] 2009-10','Full-Time Equivalent (FTE) Teachers [Public School] 2009-10','Pupil/Teacher Ratio [District] 2009-10','Pupil/Teacher Ratio [Public School] 2009-10','Prekindergarten Teachers [District] 2009-10','Kindergarten Teachers [District] 2009-10','Elementary Teachers [District] 2009-10','Secondary Teachers [District] 2009-10','Ungraded Teachers [District] 2009-10','Total Staff [District] 2009-10','Instructional Aides [District] 2009-10','Instructional Coordinators [District] 2009-10','Elementary Guidance Counselors [District] 2009-10','Secondary Guidance Counselors [District] 2009-10','Other Guidance Counselors [District] 2009-10','Total Guidance Counselors [District] 2009-10','Librarians/Media Specialists [District] 2009-10','Library Media Support Staff [District] 2009-10','LEA Administrators [District] 2009-10','LEA Administrative Support Staff [District] 2009-10','School Administrators [District] 2009-10','School Administrative Support Staff [District] 2009-10','Student Support Services Staff [District] 2009-10','Other Support Services Staff [District] 2009-10','Fall Membership (V33) [District Finance] 2009-10','Total General Revenue (TOTALREV) [District Finance] 2009-10','Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10','Total Revenue - State Sources (TSTREV) [District Finance] 2009-10','Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10','Total Current Expenditures - El-Sec Education (TCURELSC) [District Finance] 2009-10','Total Current Expenditures - Instruction (TCURINST) [District Finance] 2009-10','Total Current Expenditures - Support Services (TCURSSVC) [District Finance] 2009-10','Total Current Expenditures - Other El-Sec Programs (TCUROTH) [District Finance] 2009-10','Total Current Expenditures - Salary (Z32) [District Finance] 2009-10','Total Current Expenditures - Benefits (Z34) [District Finance] 2009-10','Total Expenditures (TOTALEXP) [District Finance] 2009-10','Total Expenditures - Capital Outlay (TCAPOUT) [District Finance] 2009-10','Total Current Expenditures - Non El-Sec Programs (TNONELSE) [District Finance] 2009-10','ARRA Revenues - Title I (HR1) [District Finance] 2009-10','Current Expenditures - ARRA (HE1) [District Finance] 2009-10','Capital Outlay - ARRA (HE2) [District Finance] 2009-10','Total Revenue (TOTALREV) per Pupil (V33) [District Finance] 2009-10','Total Revenue - Local Sources (TLOCREV) per Pupil (V33) [District Finance] 2009-10','Total Revenue - State Sources (TSTREV) per Pupil (V33) [District Finance] 2009-10','Total Revenue - Federal Sources (TFEDREV) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Instruction (TCURINST) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Support Services (TCURSSVC) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Other El-Sec Programs (TCUROTH) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Salary (Z32) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Benefits (Z34) per Pupil (V33) [District Finance] 2009-10','Total Expenditures (TOTALEXP) per Pupil (V33) [District Finance] 2009-10','Total Expenditures - Capital Outlay (TCAPOUT) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Non El-Sec Programs (TNONELSE) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures (TCURELSC) per Pupil (V33) [District Finance] 2009-10','Instructional Expenditures (E13) per Pupil (V33) [District Finance] 2009-10','Total Current Expenditures - Benefits (Z34) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10','Total Current Expenditures - Instruction (TCURINST) as Percentage of Curr El-SEC (TCURELSC) [District Finance] 2009-10','Total Current Expenditures - Other El-Sec Prog (TCUROTH) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10','Total Current Expenditures - Salary (Z32) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10','Total Current Expenditures - Support Services (TCURSSVC) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10','Total Revenue - Federal Sources (TFEDREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10','Total Revenue - Local Sources (TLOCREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10','Total Revenue - State Sources (TSTREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10','Long Term Debt - Outstanding Beginning of FY (_19H) [District Finance] 2009-10','Long Term Debt - Issued During FY (_21F) [District Finance] 2009-10','Long Term Debt - Retired During FY (_31F) [District Finance] 2009-10','Long Term Debt - Outstanding at End of FY (_41F) [District Finance] 2009-10','Short Term Debt - Outstanding Beginning of FY (_61V) [District Finance] 2009-10','Short Term Debt - Outstanding at End of FY (_66V) [District Finance] 2009-10','Debt Service Funds (W01) [District Finance] 2009-10','Bond Funds (W31) [District Finance] 2009-10','Other Funds (W61) [District Finance] 2009-10','Local Rev. - Property Taxes (T06) [District Finance] 2009-10','Local Rev. - General Sales Taxes (T09) [District Finance] 2009-10','Local Rev. - Public Utility Taxes (T15) [District Finance] 2009-10',]

for col in countcolumns:
    try:
        cleaneddistrict[col] = cleaneddistrict[col].replace('', np.nan)
        cleaneddistrict[col] = cleaneddistrict[col].astype(float)
    except:
        pass

Display all of the columns and datatypes to check that they are correct.


In [13]:
for i, col in enumerate(cleaneddistrict.columns):
    print i," : ", col, " : ", cleaneddistrict[col].dtype


0  :  Agency Name  :  object
1  :  State Name [District] Latest available year  :  object
2  :  State Name [District] 2009-10  :  object
3  :  State Abbr [District] Latest available year  :  object
4  :  Agency Name [District] 2009-10  :  object
5  :  Agency ID - NCES Assigned [District] Latest available year  :  object
6  :  County Name [District] 2009-10  :  object
7  :  County Number [District] 2009-10  :  object
8  :  Race/Ethnicity Category [District] 2009-10  :  object
9  :  ANSI/FIPS State Code [District] Latest available year  :  object
10  :  Total Number Operational Schools [Public School] 2009-10  :  float64
11  :  Total Number Operational Charter Schools [Public School] 2009-10  :  float64
12  :  Total Number of Public Schools [Public School] 2009-10  :  float64
13  :  Years District Reported Data [District] Latest available year  :  object
14  :  Years District Did Not Report Data [District] Latest available year  :  object
15  :  Location Address [District] 2013-14  :  object
16  :  Location City [District] 2013-14  :  object
17  :  Location State Abbr [District] 2013-14  :  object
18  :  Location ZIP [District] 2013-14  :  object
19  :  Location ZIP4 [District] 2013-14  :  object
20  :  Mailing Address [District] 2013-14  :  object
21  :  Mailing City [District] 2013-14  :  object
22  :  Mailing State Abbr [District] 2013-14  :  object
23  :  Mailing ZIP [District] 2013-14  :  object
24  :  Mailing ZIP4 [District] 2013-14  :  object
25  :  Phone Number [District] 2013-14  :  object
26  :  Agency Name_DEL  :  object
27  :  State Name [District] Latest available year_DEL  :  object
28  :  Agency Type [District] 2009-10  :  object
29  :  School District Level Code (SCHLEV) [District Finance] 2009-10  :  object
30  :  Urban-centric Locale [District] 2009-10  :  object
31  :  Boundary Change Indicator Flag [District] 2009-10  :  object
32  :  CBSA Name [District] 2009-10  :  object
33  :  CBSA ID [District] 2009-10  :  object
34  :  CSA Name [District] 2009-10  :  object
35  :  CSA ID [District] 2009-10  :  object
36  :  Latitude [District] 2009-10  :  object
37  :  Longitude [District] 2009-10  :  object
38  :  State Agency ID [District] 2009-10  :  object
39  :  Supervisory Union (ID) Number [District] 2009-10  :  object
40  :  Agency Charter Status [District] 2009-10  :  object
41  :  Metro Micro Area Code [District] 2009-10  :  object
42  :  Congressional Code [District] 2009-10  :  object
43  :  Census ID (CENSUSID) [District Finance] 2009-10  :  object
44  :  Lowest Grade Offered [District] 2009-10  :  object
45  :  Highest Grade Offered [District] 2009-10  :  object
46  :  Total Students (UG PK-12) [District] 2009-10  :  float64
47  :  PK thru 12th Students [District] 2009-10  :  float64
48  :  Ungraded Students [District] 2009-10  :  float64
49  :  Total Students [Public School] 2009-10  :  float64
50  :  Limited English Proficient (LEP) / English Language Learners (ELL) [District] 2009-10  :  float64
51  :  Individualized Education Program Students [District] 2009-10  :  float64
52  :  Free Lunch Eligible [Public School] 2009-10  :  float64
53  :  Reduced-price Lunch Eligible Students [Public School] 2009-10  :  float64
54  :  Total Free and Reduced Lunch Students [Public School] 2009-10  :  float64
55  :  Prekindergarten and Kindergarten Students [Public School] 2009-10  :  float64
56  :  Grades 1-8 Students [Public School] 2009-10  :  float64
57  :  Grades 9-12 Students [Public School] 2009-10  :  float64
58  :  Prekindergarten Students [Public School] 2009-10  :  float64
59  :  Kindergarten Students [Public School] 2009-10  :  float64
60  :  Grade 1 Students [Public School] 2009-10  :  float64
61  :  Grade 2 Students [Public School] 2009-10  :  float64
62  :  Grade 3 Students [Public School] 2009-10  :  float64
63  :  Grade 4 Students [Public School] 2009-10  :  float64
64  :  Grade 5 Students [Public School] 2009-10  :  float64
65  :  Grade 6 Students [Public School] 2009-10  :  float64
66  :  Grade 7 Students [Public School] 2009-10  :  float64
67  :  Grade 8 Students [Public School] 2009-10  :  float64
68  :  Grade 9 Students [Public School] 2009-10  :  float64
69  :  Grade 10 Students [Public School] 2009-10  :  float64
70  :  Grade 11 Students [Public School] 2009-10  :  float64
71  :  Grade 12 Students [Public School] 2009-10  :  float64
72  :  Ungraded Students [Public School] 2009-10  :  float64
73  :  Male Students [Public School] 2009-10  :  float64
74  :  Female Students [Public School] 2009-10  :  float64
75  :  American Indian/Alaska Native Students [Public School] 2009-10  :  float64
76  :  Asian or Asian/Pacific Islander Students [Public School] 2009-10  :  float64
77  :  Hispanic Students [Public School] 2009-10  :  float64
78  :  Black Students [Public School] 2009-10  :  float64
79  :  White Students [Public School] 2009-10  :  float64
80  :  Hawaiian Nat./Pacific Isl. Students [Public School] 2009-10  :  float64
81  :  Two or More Races Students [Public School] 2009-10  :  float64
82  :  Total Race/Ethnicity [Public School] 2009-10  :  float64
83  :  Prekindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
84  :  Prekindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
85  :  Prekindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
86  :  Prekindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
87  :  Prekindergarten Students - Hispanic - male [Public School] 2009-10  :  float64
88  :  Prekindergarten Students - Hispanic - female [Public School] 2009-10  :  float64
89  :  Prekindergarten Students - Black - male [Public School] 2009-10  :  float64
90  :  Prekindergarten Students - Black - female [Public School] 2009-10  :  float64
91  :  Prekindergarten Students - White - male [Public School] 2009-10  :  float64
92  :  Prekindergarten Students - White - female [Public School] 2009-10  :  float64
93  :  Prekindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
94  :  Prekindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
95  :  Prekindergarten Students - Two or More Races - male [Public School] 2009-10  :  float64
96  :  Prekindergarten Students - Two or More Races - female [Public School] 2009-10  :  float64
97  :  Kindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
98  :  Kindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
99  :  Kindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
100  :  Kindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
101  :  Kindergarten Students - Hispanic - male [Public School] 2009-10  :  float64
102  :  Kindergarten Students - Hispanic - female [Public School] 2009-10  :  float64
103  :  Kindergarten Students - Black - male [Public School] 2009-10  :  float64
104  :  Kindergarten Students - Black - female [Public School] 2009-10  :  float64
105  :  Kindergarten Students - White - male [Public School] 2009-10  :  float64
106  :  Kindergarten Students - White - female [Public School] 2009-10  :  float64
107  :  Kindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
108  :  Kindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
109  :  Kindergarten Students - Two or More Races - male [Public School] 2009-10  :  float64
110  :  Kindergarten Students - Two or More Races - female [Public School] 2009-10  :  float64
111  :  Grade 1 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
112  :  Grade 1 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
113  :  Grade 1 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
114  :  Grade 1 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
115  :  Grade 1 Students - Hispanic - male [Public School] 2009-10  :  float64
116  :  Grade 1 Students - Hispanic - female [Public School] 2009-10  :  float64
117  :  Grade 1 Students - Black - male [Public School] 2009-10  :  float64
118  :  Grade 1 Students - Black - female [Public School] 2009-10  :  float64
119  :  Grade 1 Students - White - male [Public School] 2009-10  :  float64
120  :  Grade 1 Students - White - female [Public School] 2009-10  :  float64
121  :  Grade 1 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
122  :  Grade 1 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
123  :  Grade 1 Students - Two or More Races - male [Public School] 2009-10  :  float64
124  :  Grade 1 Students - Two or More Races - female [Public School] 2009-10  :  float64
125  :  Grade 2 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
126  :  Grade 2 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
127  :  Grade 2 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
128  :  Grade 2 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
129  :  Grade 2 Students - Hispanic - male [Public School] 2009-10  :  float64
130  :  Grade 2 Students - Hispanic - female [Public School] 2009-10  :  float64
131  :  Grade 2 Students - Black - male [Public School] 2009-10  :  float64
132  :  Grade 2 Students - Black - female [Public School] 2009-10  :  float64
133  :  Grade 2 Students - White - male [Public School] 2009-10  :  float64
134  :  Grade 2 Students - White - female [Public School] 2009-10  :  float64
135  :  Grade 2 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
136  :  Grade 2 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
137  :  Grade 2 Students - Two or More Races - male [Public School] 2009-10  :  float64
138  :  Grade 2 Students - Two or More Races - female [Public School] 2009-10  :  float64
139  :  Grade 3 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
140  :  Grade 3 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
141  :  Grade 3 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
142  :  Grade 3 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
143  :  Grade 3 Students - Hispanic - male [Public School] 2009-10  :  float64
144  :  Grade 3 Students - Hispanic - female [Public School] 2009-10  :  float64
145  :  Grade 3 Students - Black - male [Public School] 2009-10  :  float64
146  :  Grade 3 Students - Black - female [Public School] 2009-10  :  float64
147  :  Grade 3 Students - White - male [Public School] 2009-10  :  float64
148  :  Grade 3 Students - White - female [Public School] 2009-10  :  float64
149  :  Grade 3 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
150  :  Grade 3 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
151  :  Grade 3 Students - Two or More Races - male [Public School] 2009-10  :  float64
152  :  Grade 3 Students - Two or More Races - female [Public School] 2009-10  :  float64
153  :  Grade 4 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
154  :  Grade 4 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
155  :  Grade 4 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
156  :  Grade 4 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
157  :  Grade 4  Students - Hispanic - male [Public School] 2009-10  :  float64
158  :  Grade 4 Students - Hispanic - female [Public School] 2009-10  :  float64
159  :  Grade 4 Students - Black - male [Public School] 2009-10  :  float64
160  :  Grade 4 Students - Black - female [Public School] 2009-10  :  float64
161  :  Grade 4 Students - White - male [Public School] 2009-10  :  float64
162  :  Grade 4 Students - White - female [Public School] 2009-10  :  float64
163  :  Grade 4 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
164  :  Grade 4 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
165  :  Grade 4 Students - Two or More Races - male [Public School] 2009-10  :  float64
166  :  Grade 4 Students - Two or More Races - female [Public School] 2009-10  :  float64
167  :  Grade 5 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
168  :  Grade 5 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
169  :  Grade 5 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
170  :  Grade 5 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
171  :  Grade 5 Students - Hispanic - male [Public School] 2009-10  :  float64
172  :  Grade 5 Students - Hispanic - female [Public School] 2009-10  :  float64
173  :  Grade 5 Students - Black - male [Public School] 2009-10  :  float64
174  :  Grade 5 Students - Black - female [Public School] 2009-10  :  float64
175  :  Grade 5 Students - White - male [Public School] 2009-10  :  float64
176  :  Grade 5 Students - White - female [Public School] 2009-10  :  float64
177  :  Grade 5 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
178  :  Grade 5 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
179  :  Grade 5 Students - Two or More Races - male [Public School] 2009-10  :  float64
180  :  Grade 5 Students - Two or More Races - female [Public School] 2009-10  :  float64
181  :  Grade 6 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
182  :  Grade 6 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
183  :  Grade 6 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
184  :  Grade 6 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
185  :  Grade 6 Students - Hispanic - male [Public School] 2009-10  :  float64
186  :  Grade 6 Students - Hispanic - female [Public School] 2009-10  :  float64
187  :  Grade 6 Students - Black - male [Public School] 2009-10  :  float64
188  :  Grade 6 Students - Black - female [Public School] 2009-10  :  float64
189  :  Grade 6 Students - White - male [Public School] 2009-10  :  float64
190  :  Grade 6 Students - White - female [Public School] 2009-10  :  float64
191  :  Grade 6 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
192  :  Grade 6 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
193  :  Grade 6  Students- Two or More Races - male [Public School] 2009-10  :  float64
194  :  Grade 6 Students - Two or More Races - female [Public School] 2009-10  :  float64
195  :  Grade 7 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
196  :  Grade 7 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
197  :  Grade 7 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
198  :  Grade 7 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
199  :  Grade 7 Students - Hispanic - male [Public School] 2009-10  :  float64
200  :  Grade 7 Students - Hispanic - female [Public School] 2009-10  :  float64
201  :  Grade 7 Students - Black - male [Public School] 2009-10  :  float64
202  :  Grade 7 Students - Black - female [Public School] 2009-10  :  float64
203  :  Grade 7 Students - White - male [Public School] 2009-10  :  float64
204  :  Grade 7 Students - White - female [Public School] 2009-10  :  float64
205  :  Grade 7 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
206  :  Grade 7 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
207  :  Grade 7 Students - Two or More Races - male [Public School] 2009-10  :  float64
208  :  Grade 7 Students - Two or More Races - female [Public School] 2009-10  :  float64
209  :  Grade 8 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
210  :  Grade 8 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
211  :  Grade 8 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
212  :  Grade 8 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
213  :  Grade 8  Students- Hispanic - male [Public School] 2009-10  :  float64
214  :  Grade 8 Students - Hispanic - female [Public School] 2009-10  :  float64
215  :  Grade 8 Students - Black - male [Public School] 2009-10  :  float64
216  :  Grade 8 Students - Black - female [Public School] 2009-10  :  float64
217  :  Grade 8 Students - White - male [Public School] 2009-10  :  float64
218  :  Grade 8 Students - White - female [Public School] 2009-10  :  float64
219  :  Grade 8 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
220  :  Grade 8 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
221  :  Grade 8 Students - Two or More Races - male [Public School] 2009-10  :  float64
222  :  Grade 8 Students - Two or More Races - female [Public School] 2009-10  :  float64
223  :  Grade 9 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
224  :  Grade 9 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
225  :  Grade 9 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
226  :  Grade 9 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
227  :  Grade 9 Students - Hispanic - male [Public School] 2009-10  :  float64
228  :  Grade 9 Students - Hispanic - female [Public School] 2009-10  :  float64
229  :  Grade 9 Students - Black - male [Public School] 2009-10  :  float64
230  :  Grade 9 Students - Black - female [Public School] 2009-10  :  float64
231  :  Grade 9 Students - White - male [Public School] 2009-10  :  float64
232  :  Grade 9 Students - White - female [Public School] 2009-10  :  float64
233  :  Grade 9 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
234  :  Grade 9 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
235  :  Grade 9 Students - Two or More Races - male [Public School] 2009-10  :  float64
236  :  Grade 9 Students - Two or More Races - female [Public School] 2009-10  :  float64
237  :  Grade 10 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
238  :  Grade 10 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
239  :  Grade 10 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
240  :  Grade 10 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
241  :  Grade 10 Students - Hispanic - male [Public School] 2009-10  :  float64
242  :  Grade 10 Students - Hispanic - female [Public School] 2009-10  :  float64
243  :  Grade 10 Students - Black - male [Public School] 2009-10  :  float64
244  :  Grade 10 Students - Black - female [Public School] 2009-10  :  float64
245  :  Grade 10 Students - White - male [Public School] 2009-10  :  float64
246  :  Grade 10 Students - White - female [Public School] 2009-10  :  float64
247  :  Grade 10 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
248  :  Grade 10 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
249  :  Grade 10 Students - Two or More Races - male [Public School] 2009-10  :  float64
250  :  Grade 10 Students - Two or More Races - female [Public School] 2009-10  :  float64
251  :  Grade 11 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
252  :  Grade 11 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
253  :  Grade 11 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
254  :  Grade 11 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
255  :  Grade 11 Students - Hispanic - male [Public School] 2009-10  :  float64
256  :  Grade 11 Students - Hispanic - female [Public School] 2009-10  :  float64
257  :  Grade 11 Students - Black - male [Public School] 2009-10  :  float64
258  :  Grade 11 Students - Black - female [Public School] 2009-10  :  float64
259  :  Grade 11 Students - White - male [Public School] 2009-10  :  float64
260  :  Grade 11 Students - White - female [Public School] 2009-10  :  float64
261  :  Grade 11 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
262  :  Grade 11 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
263  :  Grade 11 Students - Two or More Races - male [Public School] 2009-10  :  float64
264  :  Grade 11 Students - Two or More Races - female [Public School] 2009-10  :  float64
265  :  Grade 12 Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
266  :  Grade 12 Students - American Indian/Alaska Native - female [Public School] 2009-10  :  float64
267  :  Grade 12 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
268  :  Grade 12 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
269  :  Grade 12 Students - Hispanic - male [Public School] 2009-10  :  float64
270  :  Grade 12 Students - Hispanic - female [Public School] 2009-10  :  float64
271  :  Grade 12 Students - Black - male [Public School] 2009-10  :  float64
272  :  Grade 12 Students - Black - female [Public School] 2009-10  :  float64
273  :  Grade 12 Students - White - male [Public School] 2009-10  :  float64
274  :  Grade 12 Students - White - female [Public School] 2009-10  :  float64
275  :  Grade 12 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
276  :  Grade 12 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
277  :  Grade 12 Students - Two or More Races - male [Public School] 2009-10  :  float64
278  :  Grade 12 Students - Two or More Races - female [Public School] 2009-10  :  float64
279  :  Ungraded Students - American Indian/Alaska Native - male [Public School] 2009-10  :  float64
280  :  Ungraded  Students- American Indian/Alaska Native - female [Public School] 2009-10  :  float64
281  :  Ungraded Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10  :  float64
282  :  Ungraded Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10  :  float64
283  :  Ungraded Students - Hispanic - male [Public School] 2009-10  :  float64
284  :  Ungraded Students - Hispanic - female [Public School] 2009-10  :  float64
285  :  Ungraded Students - Black - male [Public School] 2009-10  :  float64
286  :  Ungraded Students - Black - female [Public School] 2009-10  :  float64
287  :  Ungraded Students - White - male [Public School] 2009-10  :  float64
288  :  Ungraded Students - White - female [Public School] 2009-10  :  float64
289  :  Ungraded Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10  :  float64
290  :  Ungraded Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10  :  float64
291  :  Ungraded Students - Two or More Races - male [Public School] 2009-10  :  float64
292  :  Ungraded Students - Two or More Races - female [Public School] 2009-10  :  float64
293  :  Full-Time Equivalent (FTE) Teachers [District] 2009-10  :  float64
294  :  Full-Time Equivalent (FTE) Teachers [Public School] 2009-10  :  float64
295  :  Pupil/Teacher Ratio [District] 2009-10  :  float64
296  :  Pupil/Teacher Ratio [Public School] 2009-10  :  float64
297  :  Prekindergarten Teachers [District] 2009-10  :  float64
298  :  Kindergarten Teachers [District] 2009-10  :  float64
299  :  Elementary Teachers [District] 2009-10  :  float64
300  :  Secondary Teachers [District] 2009-10  :  float64
301  :  Ungraded Teachers [District] 2009-10  :  float64
302  :  Total Staff [District] 2009-10  :  float64
303  :  Instructional Aides [District] 2009-10  :  float64
304  :  Instructional Coordinators [District] 2009-10  :  float64
305  :  Elementary Guidance Counselors [District] 2009-10  :  float64
306  :  Secondary Guidance Counselors [District] 2009-10  :  float64
307  :  Other Guidance Counselors [District] 2009-10  :  float64
308  :  Total Guidance Counselors [District] 2009-10  :  float64
309  :  Librarians/Media Specialists [District] 2009-10  :  float64
310  :  Library Media Support Staff [District] 2009-10  :  float64
311  :  LEA Administrators [District] 2009-10  :  float64
312  :  LEA Administrative Support Staff [District] 2009-10  :  float64
313  :  School Administrators [District] 2009-10  :  float64
314  :  School Administrative Support Staff [District] 2009-10  :  float64
315  :  Student Support Services Staff [District] 2009-10  :  float64
316  :  Other Support Services Staff [District] 2009-10  :  float64
317  :  Fall Membership (V33) [District Finance] 2009-10  :  float64
318  :  Total General Revenue (TOTALREV) [District Finance] 2009-10  :  float64
319  :  Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10  :  float64
320  :  Total Revenue - State Sources (TSTREV) [District Finance] 2009-10  :  float64
321  :  Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10  :  float64
322  :  Total Current Expenditures - El-Sec Education (TCURELSC) [District Finance] 2009-10  :  float64
323  :  Total Current Expenditures - Instruction (TCURINST) [District Finance] 2009-10  :  float64
324  :  Total Current Expenditures - Support Services (TCURSSVC) [District Finance] 2009-10  :  float64
325  :  Total Current Expenditures - Other El-Sec Programs (TCUROTH) [District Finance] 2009-10  :  float64
326  :  Total Current Expenditures - Salary (Z32) [District Finance] 2009-10  :  float64
327  :  Total Current Expenditures - Benefits (Z34) [District Finance] 2009-10  :  float64
328  :  Total Expenditures (TOTALEXP) [District Finance] 2009-10  :  float64
329  :  Total Expenditures - Capital Outlay (TCAPOUT) [District Finance] 2009-10  :  float64
330  :  Total Current Expenditures - Non El-Sec Programs (TNONELSE) [District Finance] 2009-10  :  float64
331  :  ARRA Revenues - Title I (HR1) [District Finance] 2009-10  :  float64
332  :  Current Expenditures - ARRA (HE1) [District Finance] 2009-10  :  float64
333  :  Capital Outlay - ARRA (HE2) [District Finance] 2009-10  :  float64
334  :  Total Revenue (TOTALREV) per Pupil (V33) [District Finance] 2009-10  :  float64
335  :  Total Revenue - Local Sources (TLOCREV) per Pupil (V33) [District Finance] 2009-10  :  float64
336  :  Total Revenue - State Sources (TSTREV) per Pupil (V33) [District Finance] 2009-10  :  float64
337  :  Total Revenue - Federal Sources (TFEDREV) per Pupil (V33) [District Finance] 2009-10  :  float64
338  :  Total Current Expenditures - Instruction (TCURINST) per Pupil (V33) [District Finance] 2009-10  :  float64
339  :  Total Current Expenditures - Support Services (TCURSSVC) per Pupil (V33) [District Finance] 2009-10  :  float64
340  :  Total Current Expenditures - Other El-Sec Programs (TCUROTH) per Pupil (V33) [District Finance] 2009-10  :  float64
341  :  Total Current Expenditures - Salary (Z32) per Pupil (V33) [District Finance] 2009-10  :  float64
342  :  Total Current Expenditures - Benefits (Z34) per Pupil (V33) [District Finance] 2009-10  :  float64
343  :  Total Expenditures (TOTALEXP) per Pupil (V33) [District Finance] 2009-10  :  float64
344  :  Total Expenditures - Capital Outlay (TCAPOUT) per Pupil (V33) [District Finance] 2009-10  :  float64
345  :  Total Current Expenditures - Non El-Sec Programs (TNONELSE) per Pupil (V33) [District Finance] 2009-10  :  float64
346  :  Total Current Expenditures (TCURELSC) per Pupil (V33) [District Finance] 2009-10  :  float64
347  :  Instructional Expenditures (E13) per Pupil (V33) [District Finance] 2009-10  :  float64
348  :  Total Current Expenditures - Benefits (Z34) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10  :  float64
349  :  Total Current Expenditures - Instruction (TCURINST) as Percentage of Curr El-SEC (TCURELSC) [District Finance] 2009-10  :  float64
350  :  Total Current Expenditures - Other El-Sec Prog (TCUROTH) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10  :  float64
351  :  Total Current Expenditures - Salary (Z32) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10  :  float64
352  :  Total Current Expenditures - Support Services (TCURSSVC) as Percentage of Curr El-Sec (TCURELSC) [District Finance] 2009-10  :  float64
353  :  Total Revenue - Federal Sources (TFEDREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10  :  float64
354  :  Total Revenue - Local Sources (TLOCREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10  :  float64
355  :  Total Revenue - State Sources (TSTREV) as Percentage of Total Revenue (TOTALREV) [District Finance] 2009-10  :  float64
356  :  Long Term Debt - Outstanding Beginning of FY (_19H) [District Finance] 2009-10  :  float64
357  :  Long Term Debt - Issued During FY (_21F) [District Finance] 2009-10  :  float64
358  :  Long Term Debt - Retired During FY (_31F) [District Finance] 2009-10  :  float64
359  :  Long Term Debt - Outstanding at End of FY (_41F) [District Finance] 2009-10  :  float64
360  :  Short Term Debt - Outstanding Beginning of FY (_61V) [District Finance] 2009-10  :  float64
361  :  Short Term Debt - Outstanding at End of FY (_66V) [District Finance] 2009-10  :  float64
362  :  Debt Service Funds (W01) [District Finance] 2009-10  :  float64
363  :  Bond Funds (W31) [District Finance] 2009-10  :  float64
364  :  Other Funds (W61) [District Finance] 2009-10  :  float64
365  :  Local Rev. - Property Taxes (T06) [District Finance] 2009-10  :  float64
366  :  Local Rev. - General Sales Taxes (T09) [District Finance] 2009-10  :  float64
367  :  Local Rev. - Public Utility Taxes (T15) [District Finance] 2009-10  :  float64
368  :  Local Rev. - Individual & Corp. Income Taxes (T40) [District Finance] 2009-10  :  float64
369  :  Local Rev. - All Other Taxes (T99) [District Finance] 2009-10  :  float64
370  :  Local Rev. - Parent Government Contributions (T02) [District Finance] 2009-10  :  float64
371  :  Local Rev. - Revenue- Cities and Counties (D23) [District Finance] 2009-10  :  float64
372  :  Local Rev. - Revenue- Other School Systems (D11) [District Finance] 2009-10  :  float64
373  :  Local Rev. - Tuition Fees- Pupils and Parents (A07) [District Finance] 2009-10  :  float64
374  :  Local Rev. - Transp. Fees- Pupils and Parents (A08) [District Finance] 2009-10  :  float64
375  :  Local Rev. - School Lunch Revenues (A09) [District Finance] 2009-10  :  float64
376  :  Local Rev. - Textbook Sales and Rentals (A11) [District Finance] 2009-10  :  float64
377  :  Local Rev. - Student Activity Receipts (A13) [District Finance] 2009-10  :  float64
378  :  Local Rev. - Other Sales and Service Rev. (A20) [District Finance] 2009-10  :  float64
379  :  Local Rev. - Student Fees Non-Specified (A15) [District Finance] 2009-10  :  float64
380  :  Local Rev. - Interest Earnings (U22) [District Finance] 2009-10  :  float64
381  :  Local Rev. - Miscellaneous Other Local Rev. (U97) [District Finance] 2009-10  :  float64
382  :  Local Rev. - Special Processing (C24) [District Finance] 2009-10  :  float64
383  :  Local Rev. - Rents and Royalties (A40) [District Finance] 2009-10  :  float64
384  :  Local Rev. - Sale of Property (U11) [District Finance] 2009-10  :  float64
385  :  Local Rev. - Fines and Forfeits (U30) [District Finance] 2009-10  :  float64
386  :  Local Rev. - Private Contributions (U50) [District Finance] 2009-10  :  float64
387  :  State Rev. - General Formula Assistance (C01) [District Finance] 2009-10  :  float64
388  :  State Rev. - Special Education Programs (C05) [District Finance] 2009-10  :  float64
389  :  State Rev. - Transportation Programs (C12) [District Finance] 2009-10  :  float64
390  :  State Rev. - Staff Improvement Programs (C04) [District Finance] 2009-10  :  float64
391  :  State Rev. - Compensat. and Basic Skills Prog. (C06) [District Finance] 2009-10  :  float64
392  :  State Rev. - Vocational Education Programs (C09) [District Finance] 2009-10  :  float64
393  :  State Rev. - Capital Outlay and Debt Serv. Prog. (C11) [District Finance] 2009-10  :  float64
394  :  State Rev. - Bilingual Education Programs (C07) [District Finance] 2009-10  :  float64
395  :  State Rev. - Gifted and Talented Programs (C08) [District Finance] 2009-10  :  float64
396  :  State Rev. - School Lunch Programs (C10) [District Finance] 2009-10  :  float64
397  :  State Rev. - All Other Rev.- State Sources (C13) [District Finance] 2009-10  :  float64
398  :  State Rev. - State Payment for LEA Empl. Benefits (C38) [District Finance] 2009-10  :  float64
399  :  State Rev. - Other State Payments (C39) [District Finance] 2009-10  :  float64
400  :  State Rev. - Non-Specified (C35) [District Finance] 2009-10  :  float64
401  :  Federal Rev. - Federal Title I Revenue (C14) [District Finance] 2009-10  :  float64
402  :  Federal Rev. - Children with Disabilities (C15) [District Finance] 2009-10  :  float64
403  :  Federal Rev. - Child Nutrition Act (C25) [District Finance] 2009-10  :  float64
404  :  Federal Rev. - Eisenhower Math and Science (C16) [District Finance] 2009-10  :  float64
405  :  Federal Rev. - Drug-Free Schools (C17) [District Finance] 2009-10  :  float64
406  :  Federal Rev. - Vocational Education (C19) [District Finance] 2009-10  :  float64
407  :  Federal Rev. - All Other Fed. Aid Through State (C20) [District Finance] 2009-10  :  float64
408  :  Federal Rev. - Nonspecified (C36) [District Finance] 2009-10  :  float64
409  :  Federal Rev. - Impact Aid (PL 815 and 874) (B10) [District Finance] 2009-10  :  float64
410  :  Federal Rev. - Bilingual Education (B11) [District Finance] 2009-10  :  float64
411  :  Federal Rev. - Native American (Ind.) Educ. (B12) [District Finance] 2009-10  :  float64
412  :  Federal Rev. - All Other Federal Aid (B13) [District Finance] 2009-10  :  float64
413  :  Enterprise Operations - Non Instructional (V60) [District Finance] 2009-10  :  float64
414  :  Food Services - Non Instuctional (E11) [District Finance] 2009-10  :  float64
415  :  Instruction Expenditures - Total (E13) [District Finance] 2009-10  :  float64
416  :  Non-Specified - Supp. Serv. Exp. (V85) [District Finance] 2009-10  :  float64
417  :  Other Non Instructional (V65) [District Finance] 2009-10  :  float64
418  :  Total - Gen. Admin.- Supp. Serv. Exp. (E08) [District Finance] 2009-10  :  float64
419  :  Total - Instruct. Staff- Supp. Serv. Exp. (E07) [District Finance] 2009-10  :  float64
420  :  Total - Ops. & Mainten.- Supp. Serv. Exp. (V40) [District Finance] 2009-10  :  float64
421  :  Total - Other Supp. Serv.- Supp. Serv. Exp. (V90) [District Finance] 2009-10  :  float64
422  :  Total - School Admin.- Supp. Serv. Exp. (E09) [District Finance] 2009-10  :  float64
423  :  Total - Student Transp.- Supp. Serv. Exp. (V45) [District Finance] 2009-10  :  float64
424  :  Total - Students- Supp. Serv. Exp. (E17) [District Finance] 2009-10  :  float64
425  :  Salary - Instruction Expenditures (Z33) [District Finance] 2009-10  :  float64
426  :  Salary - Students- Supp. Serv. Exp. (V11) [District Finance] 2009-10  :  float64
427  :  Salary - Instruct. Staff- Supp. Serv. Exp. (V13) [District Finance] 2009-10  :  float64
428  :  Salary - General Admin.- Supp. Serv. Exp. (V15) [District Finance] 2009-10  :  float64
429  :  Salary - School Admin.- Supp. Serv. Exp. (V17) [District Finance] 2009-10  :  float64
430  :  Salary - Ops. & Mainten.- Supp. Serv. Exp. (V21) [District Finance] 2009-10  :  float64
431  :  Salary - Student Transp.- Supp. Serv. Exp. (V23) [District Finance] 2009-10  :  float64
432  :  Salary - Other Supp. Serv.- Supp. Serv. Exp. (V37) [District Finance] 2009-10  :  float64
433  :  Salary - Food Services- Non-Instruction (V29) [District Finance] 2009-10  :  float64
434  :  Employee Benefits - Instruction Expend. (V10) [District Finance] 2009-10  :  float64
435  :  Empl. Benefits - Students- Supp. Serv. Exp. (V12) [District Finance] 2009-10  :  float64
436  :  Empl. Benefits - Instruction- Supp. Serv. Exp. (V14) [District Finance] 2009-10  :  float64
437  :  Empl. Benefits - Gen. Adm.- Supp. Serv. Exp. (V16) [District Finance] 2009-10  :  float64
438  :  Empl. Benefits - Sch. Adm.- Supp. Serv. Exp. (V18) [District Finance] 2009-10  :  float64
439  :  Empl. Benefits - Ops. & Maint.- Supp. Serv. Exp. (V22) [District Finance] 2009-10  :  float64
440  :  Empl. Benefits - Student Trans.- Supp. Serv. Exp. (V24) [District Finance] 2009-10  :  float64
441  :  Empl. Benefits - Other Supp Serv- Supp. Serv. Exp. (V38) [District Finance] 2009-10  :  float64
442  :  Empl. Benefits - Food Services- Non-Instruction (V30) [District Finance] 2009-10  :  float64
443  :  Empl. Benefits - Enterp. Oper.- Non-Instruction (V32) [District Finance] 2009-10  :  float64
444  :  Current Spending - Private Schools (V91) [District Finance] 2009-10  :  float64
445  :  Current Spending - Public Charter Schools (V92) [District Finance] 2009-10  :  float64
446  :  Teacher Salaries - Regular Education Programs (Z35) [District Finance] 2009-10  :  float64
447  :  Teacher Salaries - Special Education Programs (Z36) [District Finance] 2009-10  :  float64
448  :  Teacher Salaries - Vocational Education Programs (Z37) [District Finance] 2009-10  :  float64
449  :  Teacher Salaries - Other Education Programs (Z38) [District Finance] 2009-10  :  float64
450  :  Textbooks for Instruction (V93) [District Finance] 2009-10  :  float64
451  :  Community Services - Non El-Sec (V70) [District Finance] 2009-10  :  float64
452  :  Adult Education - Non El-Sec (V75) [District Finance] 2009-10  :  float64
453  :  Other Expenditures - Non El-Sec (V80) [District Finance] 2009-10  :  float64
454  :  Construction - Capital Outlay (F12) [District Finance] 2009-10  :  float64
455  :  Instructional Equipment - Capital Outlay (K09) [District Finance] 2009-10  :  float64
456  :  Other Equipment - Capital Outlay (K10) [District Finance] 2009-10  :  float64
457  :  Non-specified - Equipment Expenditures (K11) [District Finance] 2009-10  :  float64
458  :  Land & Existing Structures - Capital Outlay (G15) [District Finance] 2009-10  :  float64
459  :  Payments to Local Governments (M12) [District Finance] 2009-10  :  float64
460  :  Payments to State Governments (L12) [District Finance] 2009-10  :  float64
461  :  Interest on School System Indebtedness (I86) [District Finance] 2009-10  :  float64
462  :  Payments to Other School Systems (Q11) [District Finance] 2009-10  :  float64
463  :  SURVYEAR  :  object
464  :  FIPST  :  float64
465  :  TOTD912  :  float64
466  :  EBS912  :  float64
467  :  DRP912  :  float64
468  :  TOTDPL  :  float64
469  :  AFGEB  :  float64
470  :  AFGR  :  float64
471  :  TOTOHC  :  float64

Data Derivation

We need to turn many type columns into indicator columns.


In [14]:
#School District Types
cleaneddistrict['i_agency_type_local_school_district'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='1-Local school district', 1, 0)
cleaneddistrict['i_agency_type_local_school_district_sup_union'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='2-Local school district component of supervisory union', 1, 0)
cleaneddistrict['i_agency_type_sup_union_admin'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='3-Supervisory union administrative center', 1, 0)
cleaneddistrict['i_agency_type_regional_education_services'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='4-Regional education services agency', 1, 0)
cleaneddistrict['i_agency_type_state_operated_institution'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='5-State-operated institution', 1, 0)
cleaneddistrict['i_agency_type_charter_school_agency'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='7-Charter school agency', 1, 0)
cleaneddistrict['i_agency_type_other_education_agency'] = np.where(cleaneddistrict['Agency Type [District] 2009-10']=='8-Other education agency', 1, 0)

#School District Level Code
cleaneddistrict['i_fin_sdlc_elem'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='01-Elementary school system only', 1, 0)
cleaneddistrict['i_fin_sdlc_sec'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='02-Secondary school system only', 1, 0)
cleaneddistrict['i_fin_sdlc_elem_sec'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='03-Elementary/secondary school system', 1, 0)
cleaneddistrict['i_fin_sdlc_voc'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='05-Vocational or special education school system', 1, 0)
cleaneddistrict['i_fin_sdlc_nonop'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='06-Nonoperating school system', 1, 0)
cleaneddistrict['i_fin_sdlc_ed_serv'] = np.where(cleaneddistrict['School District Level Code (SCHLEV) [District Finance] 2009-10']=='07-Educational service agency', 1, 0)

#Urban Centric Locale
cleaneddistrict['i_ucl_city_large'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='11-City: Large', 1, 0)
cleaneddistrict['i_ucl_city_mid'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='12-City: Mid-size', 1, 0)
cleaneddistrict['i_ucl_city_small'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='13-City: Small', 1, 0)
cleaneddistrict['i_ucl_suburb_large'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='21-Suburb: Large', 1, 0)
cleaneddistrict['i_ucl_suburb_mid'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='22-Suburb: Mid-size', 1, 0)
cleaneddistrict['i_ucl_suburb_small'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='23-Suburb: Small', 1, 0)
cleaneddistrict['i_ucl_town_fringe'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='31-Town: Fringe', 1, 0)
cleaneddistrict['i_ucl_town_distant'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='32-Town: Distant', 1, 0)
cleaneddistrict['i_ucl_town_remote'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='33-Town: Remote', 1, 0)
cleaneddistrict['i_ucl_rural_fringe'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='41-Rural: Fringe', 1, 0)
cleaneddistrict['i_ucl_rural_distant'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='42-Rural: Distant', 1, 0)
cleaneddistrict['i_ucl_rural_remote'] = np.where(cleaneddistrict['Urban-centric Locale [District] 2009-10']=='43-Rural: Remote', 1, 0)

#School District Charter Status
cleaneddistrict['i_cs_all_charter'] = np.where(cleaneddistrict['Agency Charter Status [District] 2009-10']=='1-All associated schools are charter schools', 1, 0)
cleaneddistrict['i_cs_charter_noncharter'] = np.where(cleaneddistrict['Agency Charter Status [District] 2009-10']=='2-All associated schools are charter and noncharter', 1, 0)
cleaneddistrict['i_cs_all_noncharter'] = np.where(cleaneddistrict['Agency Charter Status [District] 2009-10']=='3-All associated schools are noncharter', 1, 0)

#Metro Micro Area Code
cleaneddistrict['i_ma_ne_nr'] = np.where(cleaneddistrict['Metro Micro Area Code [District] 2009-10']=='0-New England (NECTA) or not reported', 1, 0)
cleaneddistrict['i_ma_metropolitan'] = np.where(cleaneddistrict['Metro Micro Area Code [District] 2009-10']=='1-Metropolitan Area', 1, 0)
cleaneddistrict['i_ma_micropolitan'] = np.where(cleaneddistrict['Metro Micro Area Code [District] 2009-10']=='2-Micropolitan Area', 1, 0)

#Lowest Grade Offered
cleaneddistrict['i_lgo_10'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='10th Grade', 1, 0)
cleaneddistrict['i_lgo_11'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='11th Grade', 1, 0)
cleaneddistrict['i_lgo_12'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='12th Grade', 1, 0)
cleaneddistrict['i_lgo_1'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='1st Grade', 1, 0)
cleaneddistrict['i_lgo_2'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='2nd Grade', 1, 0)
cleaneddistrict['i_lgo_3'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='3rd Grade', 1, 0)
cleaneddistrict['i_lgo_4'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='4th Grade', 1, 0)
cleaneddistrict['i_lgo_5'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='5th Grade', 1, 0)
cleaneddistrict['i_lgo_6'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='6th Grade', 1, 0)
cleaneddistrict['i_lgo_7'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='7th Grade', 1, 0)
cleaneddistrict['i_lgo_8'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='8th Grade', 1, 0)
cleaneddistrict['i_lgo_9'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='9th Grade', 1, 0)
cleaneddistrict['i_lgo_K'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='Kindergarten', 1, 0)
cleaneddistrict['i_lgo_PK'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='Prekindergarten', 1, 0)
cleaneddistrict['i_lgo_U'] = np.where(cleaneddistrict['Lowest Grade Offered [District] 2009-10']=='Ungraded', 1, 0)

#Highest Grade Offered
cleaneddistrict['i_hgo_10'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='10th Grade', 1, 0)
cleaneddistrict['i_hgo_11'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='11th Grade', 1, 0)
cleaneddistrict['i_hgo_12'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='12th Grade', 1, 0)
cleaneddistrict['i_hgo_1'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='1st Grade', 1, 0)
cleaneddistrict['i_hgo_2'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='2nd Grade', 1, 0)
cleaneddistrict['i_hgo_3'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='3rd Grade', 1, 0)
cleaneddistrict['i_hgo_4'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='4th Grade', 1, 0)
cleaneddistrict['i_hgo_5'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='5th Grade', 1, 0)
cleaneddistrict['i_hgo_6'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='6th Grade', 1, 0)
cleaneddistrict['i_hgo_7'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='7th Grade', 1, 0)
cleaneddistrict['i_hgo_8'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='8th Grade', 1, 0)
cleaneddistrict['i_hgo_9'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='9th Grade', 1, 0)
cleaneddistrict['i_hgo_K'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='Kindergarten', 1, 0)
cleaneddistrict['i_hgo_PK'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='Prekindergarten', 1, 0)
cleaneddistrict['i_hgo_U'] = np.where(cleaneddistrict['Highest Grade Offered [District] 2009-10']=='Ungraded', 1, 0)

Need to turn counts into ratios.


In [15]:
cleaneddistrict['r_ELL'] = cleaneddistrict['Limited English Proficient (LEP) / English Language Learners (ELL) [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_IEP'] = cleaneddistrict['Individualized Education Program Students [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_lunch_free'] = cleaneddistrict['Free Lunch Eligible [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_lunch_reduced'] = cleaneddistrict['Reduced-price Lunch Eligible Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_PKK'] = cleaneddistrict['Prekindergarten and Kindergarten Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_18'] = cleaneddistrict['Grades 1-8 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_912'] = cleaneddistrict['Grades 9-12 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_PK'] = cleaneddistrict['Prekindergarten Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_K'] = cleaneddistrict['Kindergarten Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_1'] = cleaneddistrict['Grade 1 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_2'] = cleaneddistrict['Grade 2 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_3'] = cleaneddistrict['Grade 3 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_4'] = cleaneddistrict['Grade 4 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_5'] = cleaneddistrict['Grade 5 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_6'] = cleaneddistrict['Grade 6 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_7'] = cleaneddistrict['Grade 7 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_8'] = cleaneddistrict['Grade 8 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_9'] = cleaneddistrict['Grade 9 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_10'] = cleaneddistrict['Grade 10 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_11'] = cleaneddistrict['Grade 11 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_12'] = cleaneddistrict['Grade 12 Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_U'] = cleaneddistrict['Ungraded Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_M'] = cleaneddistrict['Male Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_F'] = cleaneddistrict['Female Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_AIAN'] = cleaneddistrict['American Indian/Alaska Native Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_AAP'] = cleaneddistrict['Asian or Asian/Pacific Islander Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_H'] = cleaneddistrict['Hispanic Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_B'] = cleaneddistrict['Black Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_re_W'] = cleaneddistrict['White Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
try:
    cleaneddistrict['r_stud_re_HNPI'] = cleaneddistrict['Hawaiian Nat./Pacific Isl. Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
except:
    pass
try:
    cleaneddistrict['r_stud_re_Two'] = cleaneddistrict['Two or More Races Students [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
except:
    pass
cleaneddistrict['r_stud_re_Total'] = cleaneddistrict['Total Race/Ethnicity [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_AIAN_M'] = cleaneddistrict['Prekindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_AIAN_F'] = cleaneddistrict['Prekindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_AAP_M'] = cleaneddistrict['Prekindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_AAP_F'] = cleaneddistrict['Prekindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_H_M'] = cleaneddistrict['Prekindergarten Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_H_F'] = cleaneddistrict['Prekindergarten Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_B_M'] = cleaneddistrict['Prekindergarten Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_B_F'] = cleaneddistrict['Prekindergarten Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_W_M'] = cleaneddistrict['Prekindergarten Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_W_F'] = cleaneddistrict['Prekindergarten Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_HNPI_M'] = cleaneddistrict['Kindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_HNPI_F'] = cleaneddistrict['Kindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_Two_M'] = cleaneddistrict['Kindergarten Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_PK_Two_F'] = cleaneddistrict['Kindergarten Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_AIAN_M'] = cleaneddistrict['Kindergarten Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_AIAN_F'] = cleaneddistrict['Kindergarten Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_AAP_M'] = cleaneddistrict['Kindergarten Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_AAP_F'] = cleaneddistrict['Kindergarten Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_H_M'] = cleaneddistrict['Kindergarten Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_H_F'] = cleaneddistrict['Kindergarten Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_B_M'] = cleaneddistrict['Kindergarten Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_B_F'] = cleaneddistrict['Kindergarten Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_W_M'] = cleaneddistrict['Kindergarten Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_W_F'] = cleaneddistrict['Kindergarten Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_HNPI_M'] = cleaneddistrict['Kindergarten Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_HNPI_F'] = cleaneddistrict['Kindergarten Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_Two_M'] = cleaneddistrict['Kindergarten Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_K_Two_F'] = cleaneddistrict['Kindergarten Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_AIAN_M'] = cleaneddistrict['Grade 1 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_AIAN_F'] = cleaneddistrict['Grade 1 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_AAP_M'] = cleaneddistrict['Grade 1 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_AAP_F'] = cleaneddistrict['Grade 1 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_H_M'] = cleaneddistrict['Grade 1 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_H_F'] = cleaneddistrict['Grade 1 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_B_M'] = cleaneddistrict['Grade 1 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_B_F'] = cleaneddistrict['Grade 1 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_W_M'] = cleaneddistrict['Grade 1 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_W_F'] = cleaneddistrict['Grade 1 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_HNPI_M'] = cleaneddistrict['Grade 1 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_HNPI_F'] = cleaneddistrict['Grade 1 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_Two_M'] = cleaneddistrict['Grade 1 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_1_Two_F'] = cleaneddistrict['Grade 1 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_AIAN_M'] = cleaneddistrict['Grade 2 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_AIAN_F'] = cleaneddistrict['Grade 2 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_AAP_M'] = cleaneddistrict['Grade 2 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_AAP_F'] = cleaneddistrict['Grade 2 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_H_M'] = cleaneddistrict['Grade 2 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_H_F'] = cleaneddistrict['Grade 2 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_B_M'] = cleaneddistrict['Grade 2 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_B_F'] = cleaneddistrict['Grade 2 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_W_M'] = cleaneddistrict['Grade 2 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_W_F'] = cleaneddistrict['Grade 2 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_HNPI_M'] = cleaneddistrict['Grade 2 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_HNPI_F'] = cleaneddistrict['Grade 2 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_Two_M'] = cleaneddistrict['Grade 2 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_2_Two_F'] = cleaneddistrict['Grade 2 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_AIAN_M'] = cleaneddistrict['Grade 3 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_AIAN_F'] = cleaneddistrict['Grade 3 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_AAP_M'] = cleaneddistrict['Grade 3 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_AAP_F'] = cleaneddistrict['Grade 3 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_H_M'] = cleaneddistrict['Grade 3 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_H_F'] = cleaneddistrict['Grade 3 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_B_M'] = cleaneddistrict['Grade 3 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_B_F'] = cleaneddistrict['Grade 3 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_W_M'] = cleaneddistrict['Grade 3 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_W_F'] = cleaneddistrict['Grade 3 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_HNPI_M'] = cleaneddistrict['Grade 3 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_HNPI_F'] = cleaneddistrict['Grade 3 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_Two_M'] = cleaneddistrict['Grade 3 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_3_Two_F'] = cleaneddistrict['Grade 3 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_AIAN_M'] = cleaneddistrict['Grade 4 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_AIAN_F'] = cleaneddistrict['Grade 4 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_AAP_M'] = cleaneddistrict['Grade 4 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_AAP_F'] = cleaneddistrict['Grade 4 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_H_M'] = cleaneddistrict['Grade 4  Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_H_F'] = cleaneddistrict['Grade 4 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_B_M'] = cleaneddistrict['Grade 4 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_B_F'] = cleaneddistrict['Grade 4 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_W_M'] = cleaneddistrict['Grade 4 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_W_F'] = cleaneddistrict['Grade 4 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_HNPI_M'] = cleaneddistrict['Grade 4 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_HNPI_F'] = cleaneddistrict['Grade 4 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_Two_M'] = cleaneddistrict['Grade 4 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_4_Two_F'] = cleaneddistrict['Grade 4 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_AIAN_M'] = cleaneddistrict['Grade 5 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_AIAN_F'] = cleaneddistrict['Grade 5 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_AAP_M'] = cleaneddistrict['Grade 5 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_AAP_F'] = cleaneddistrict['Grade 5 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_H_M'] = cleaneddistrict['Grade 5 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_H_F'] = cleaneddistrict['Grade 5 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_B_M'] = cleaneddistrict['Grade 5 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_B_F'] = cleaneddistrict['Grade 5 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_W_M'] = cleaneddistrict['Grade 5 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_W_F'] = cleaneddistrict['Grade 5 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_HNPI_M'] = cleaneddistrict['Grade 5 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_HNPI_F'] = cleaneddistrict['Grade 5 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_Two_M'] = cleaneddistrict['Grade 5 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_5_Two_F'] = cleaneddistrict['Grade 5 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_AIAN_M'] = cleaneddistrict['Grade 6 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_AIAN_F'] = cleaneddistrict['Grade 6 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_AAP_M'] = cleaneddistrict['Grade 6 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_AAP_F'] = cleaneddistrict['Grade 6 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_H_M'] = cleaneddistrict['Grade 6 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_H_F'] = cleaneddistrict['Grade 6 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_B_M'] = cleaneddistrict['Grade 6 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_B_F'] = cleaneddistrict['Grade 6 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_W_M'] = cleaneddistrict['Grade 6 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_W_F'] = cleaneddistrict['Grade 6 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_HNPI_M'] = cleaneddistrict['Grade 6 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_HNPI_F'] = cleaneddistrict['Grade 6 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_Two_M'] = cleaneddistrict['Grade 6  Students- Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_6_Two_F'] = cleaneddistrict['Grade 6 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_AIAN_M'] = cleaneddistrict['Grade 7 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_AIAN_F'] = cleaneddistrict['Grade 7 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_AAP_M'] = cleaneddistrict['Grade 7 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_AAP_F'] = cleaneddistrict['Grade 7 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_H_M'] = cleaneddistrict['Grade 7 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_H_F'] = cleaneddistrict['Grade 7 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_B_M'] = cleaneddistrict['Grade 7 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_B_F'] = cleaneddistrict['Grade 7 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_W_M'] = cleaneddistrict['Grade 7 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_W_F'] = cleaneddistrict['Grade 7 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_HNPI_M'] = cleaneddistrict['Grade 7 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_HNPI_F'] = cleaneddistrict['Grade 7 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_Two_M'] = cleaneddistrict['Grade 7 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_7_Two_F'] = cleaneddistrict['Grade 7 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_AIAN_M'] = cleaneddistrict['Grade 8 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_AIAN_F'] = cleaneddistrict['Grade 8 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_AAP_M'] = cleaneddistrict['Grade 8 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_AAP_F'] = cleaneddistrict['Grade 8 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_H_M'] = cleaneddistrict['Grade 8  Students- Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_H_F'] = cleaneddistrict['Grade 8 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_B_M'] = cleaneddistrict['Grade 8 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_B_F'] = cleaneddistrict['Grade 8 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_W_M'] = cleaneddistrict['Grade 8 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_W_F'] = cleaneddistrict['Grade 8 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_HNPI_M'] = cleaneddistrict['Grade 8 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_HNPI_F'] = cleaneddistrict['Grade 8 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_Two_M'] = cleaneddistrict['Grade 8 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_8_Two_F'] = cleaneddistrict['Grade 8 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_AIAN_M'] = cleaneddistrict['Grade 9 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_AIAN_F'] = cleaneddistrict['Grade 9 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_AAP_M'] = cleaneddistrict['Grade 9 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_AAP_F'] = cleaneddistrict['Grade 9 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_H_M'] = cleaneddistrict['Grade 9 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_H_F'] = cleaneddistrict['Grade 9 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_B_M'] = cleaneddistrict['Grade 9 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_B_F'] = cleaneddistrict['Grade 9 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_W_M'] = cleaneddistrict['Grade 9 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_W_F'] = cleaneddistrict['Grade 9 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_HNPI_M'] = cleaneddistrict['Grade 9 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_HNPI_F'] = cleaneddistrict['Grade 9 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_Two_M'] = cleaneddistrict['Grade 9 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_9_Two_F'] = cleaneddistrict['Grade 9 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_AIAN_M'] = cleaneddistrict['Grade 10 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_AIAN_F'] = cleaneddistrict['Grade 10 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_AAP_M'] = cleaneddistrict['Grade 10 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_AAP_F'] = cleaneddistrict['Grade 10 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_H_M'] = cleaneddistrict['Grade 10 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_H_F'] = cleaneddistrict['Grade 10 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_B_M'] = cleaneddistrict['Grade 10 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_B_F'] = cleaneddistrict['Grade 10 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_W_M'] = cleaneddistrict['Grade 10 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_W_F'] = cleaneddistrict['Grade 10 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_HNPI_M'] = cleaneddistrict['Grade 10 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_HNPI_F'] = cleaneddistrict['Grade 10 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_Two_M'] = cleaneddistrict['Grade 10 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_10_Two_F'] = cleaneddistrict['Grade 10 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_AIAN_M'] = cleaneddistrict['Grade 11 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_AIAN_F'] = cleaneddistrict['Grade 11 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_AAP_M'] = cleaneddistrict['Grade 11 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_AAP_F'] = cleaneddistrict['Grade 11 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_H_M'] = cleaneddistrict['Grade 11 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_H_F'] = cleaneddistrict['Grade 11 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_B_M'] = cleaneddistrict['Grade 11 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_B_F'] = cleaneddistrict['Grade 11 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_W_M'] = cleaneddistrict['Grade 11 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_W_F'] = cleaneddistrict['Grade 11 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_HNPI_M'] = cleaneddistrict['Grade 11 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_HNPI_F'] = cleaneddistrict['Grade 11 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_Two_M'] = cleaneddistrict['Grade 11 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_11_Two_F'] = cleaneddistrict['Grade 11 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_AIAN_M'] = cleaneddistrict['Grade 12 Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_AIAN_F'] = cleaneddistrict['Grade 12 Students - American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_AAP_M'] = cleaneddistrict['Grade 12 Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_AAP_F'] = cleaneddistrict['Grade 12 Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_H_M'] = cleaneddistrict['Grade 12 Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_H_F'] = cleaneddistrict['Grade 12 Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_B_M'] = cleaneddistrict['Grade 12 Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_B_F'] = cleaneddistrict['Grade 12 Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_W_M'] = cleaneddistrict['Grade 12 Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_W_F'] = cleaneddistrict['Grade 12 Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_HNPI_M'] = cleaneddistrict['Grade 12 Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_HNPI_F'] = cleaneddistrict['Grade 12 Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_Two_M'] = cleaneddistrict['Grade 12 Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_12_Two_F'] = cleaneddistrict['Grade 12 Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_AIAN_M'] = cleaneddistrict['Ungraded Students - American Indian/Alaska Native - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_AIAN_F'] = cleaneddistrict['Ungraded  Students- American Indian/Alaska Native - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_AAP_M'] = cleaneddistrict['Ungraded Students - Asian or Asian/Pacific Islander - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_AAP_F'] = cleaneddistrict['Ungraded Students - Asian or Asian/Pacific Islander - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_H_M'] = cleaneddistrict['Ungraded Students - Hispanic - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_H_F'] = cleaneddistrict['Ungraded Students - Hispanic - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_B_M'] = cleaneddistrict['Ungraded Students - Black - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_B_F'] = cleaneddistrict['Ungraded Students - Black - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_W_M'] = cleaneddistrict['Ungraded Students - White - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_W_F'] = cleaneddistrict['Ungraded Students - White - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_HNPI_M'] = cleaneddistrict['Ungraded Students - Hawaiian Nat./Pacific Isl. - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_HNPI_F'] = cleaneddistrict['Ungraded Students - Hawaiian Nat./Pacific Isl. - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_Two_M'] = cleaneddistrict['Ungraded Students - Two or More Races - male [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_stud_reg_U_Two_F'] = cleaneddistrict['Ungraded Students - Two or More Races - female [Public School] 2009-10']/cleaneddistrict['Total Students [Public School] 2009-10']
cleaneddistrict['r_st_PKT'] = cleaneddistrict['Prekindergarten Teachers [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_KT'] = cleaneddistrict['Kindergarten Teachers [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_ET'] = cleaneddistrict['Elementary Teachers [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_ST'] = cleaneddistrict['Secondary Teachers [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_UT'] = cleaneddistrict['Ungraded Teachers [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_TS'] = cleaneddistrict['Total Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_IA'] = cleaneddistrict['Instructional Aides [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_IC'] = cleaneddistrict['Instructional Coordinators [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_EGC'] = cleaneddistrict['Elementary Guidance Counselors [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_SGC'] = cleaneddistrict['Secondary Guidance Counselors [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
try:
    cleaneddistrict['r_st_OGC'] = cleaneddistrict['Other Guidance Counselors [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
except:
    pass
cleaneddistrict['r_st_TGC'] = cleaneddistrict['Total Guidance Counselors [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_LMS'] = cleaneddistrict['Librarians/Media Specialists [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_LMSS'] = cleaneddistrict['Library Media Support Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_LEA'] = cleaneddistrict['LEA Administrators [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_LEASS'] = cleaneddistrict['LEA Administrative Support Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_SA'] = cleaneddistrict['School Administrators [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_SASS'] = cleaneddistrict['School Administrative Support Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_SSSS'] = cleaneddistrict['Student Support Services Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_st_OSSS'] = cleaneddistrict['Other Support Services Staff [District] 2009-10']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
cleaneddistrict['r_lrev_pt'] = cleaneddistrict['Local Rev. - Property Taxes (T06) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_gst'] = cleaneddistrict['Local Rev. - General Sales Taxes (T09) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_put'] = cleaneddistrict['Local Rev. - Public Utility Taxes (T15) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_it'] = cleaneddistrict['Local Rev. - Individual & Corp. Income Taxes (T40) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_aot'] = cleaneddistrict['Local Rev. - All Other Taxes (T99) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_pgc'] = cleaneddistrict['Local Rev. - Parent Government Contributions (T02) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_cc'] = cleaneddistrict['Local Rev. - Revenue- Cities and Counties (D23) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_oss'] = cleaneddistrict['Local Rev. - Revenue- Other School Systems (D11) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_tui'] = cleaneddistrict['Local Rev. - Tuition Fees- Pupils and Parents (A07) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_trans'] = cleaneddistrict['Local Rev. - Transp. Fees- Pupils and Parents (A08) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_slr'] = cleaneddistrict['Local Rev. - School Lunch Revenues (A09) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_ts'] = cleaneddistrict['Local Rev. - Textbook Sales and Rentals (A11) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_sar'] = cleaneddistrict['Local Rev. - Student Activity Receipts (A13) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_osalserv'] = cleaneddistrict['Local Rev. - Other Sales and Service Rev. (A20) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_sfns'] = cleaneddistrict['Local Rev. - Student Fees Non-Specified (A15) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_ie'] = cleaneddistrict['Local Rev. - Interest Earnings (U22) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_molr'] = cleaneddistrict['Local Rev. - Miscellaneous Other Local Rev. (U97) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_sp'] = cleaneddistrict['Local Rev. - Special Processing (C24) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_rr'] = cleaneddistrict['Local Rev. - Rents and Royalties (A40) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_sale'] = cleaneddistrict['Local Rev. - Sale of Property (U11) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_ff'] = cleaneddistrict['Local Rev. - Fines and Forfeits (U30) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_lrev_pc'] = cleaneddistrict['Local Rev. - Private Contributions (U50) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_gfa'] = cleaneddistrict['State Rev. - General Formula Assistance (C01) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_sep'] = cleaneddistrict['State Rev. - Special Education Programs (C05) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_trans'] = cleaneddistrict['State Rev. - Transportation Programs (C12) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_sip'] = cleaneddistrict['State Rev. - Staff Improvement Programs (C04) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_cbsp'] = cleaneddistrict['State Rev. - Compensat. and Basic Skills Prog. (C06) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_vep'] = cleaneddistrict['State Rev. - Vocational Education Programs (C09) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_codsp'] = cleaneddistrict['State Rev. - Capital Outlay and Debt Serv. Prog. (C11) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_bep'] = cleaneddistrict['State Rev. - Bilingual Education Programs (C07) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_gt'] = cleaneddistrict['State Rev. - Gifted and Talented Programs (C08) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_slp'] = cleaneddistrict['State Rev. - School Lunch Programs (C10) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_aor'] = cleaneddistrict['State Rev. - All Other Rev.- State Sources (C13) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_splea'] = cleaneddistrict['State Rev. - State Payment for LEA Empl. Benefits (C38) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_osp'] = cleaneddistrict['State Rev. - Other State Payments (C39) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_srev_ns'] = cleaneddistrict['State Rev. - Non-Specified (C35) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_title1'] = cleaneddistrict['Federal Rev. - Federal Title I Revenue (C14) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_dis'] = cleaneddistrict['Federal Rev. - Children with Disabilities (C15) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_cna'] = cleaneddistrict['Federal Rev. - Child Nutrition Act (C25) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_ems'] = cleaneddistrict['Federal Rev. - Eisenhower Math and Science (C16) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_dfs'] = cleaneddistrict['Federal Rev. - Drug-Free Schools (C17) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_voc'] = cleaneddistrict['Federal Rev. - Vocational Education (C19) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_ao'] = cleaneddistrict['Federal Rev. - All Other Fed. Aid Through State (C20) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_ns'] = cleaneddistrict['Federal Rev. - Nonspecified (C36) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_ia'] = cleaneddistrict['Federal Rev. - Impact Aid (PL 815 and 874) (B10) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_be'] = cleaneddistrict['Federal Rev. - Bilingual Education (B11) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_na'] = cleaneddistrict['Federal Rev. - Native American (Ind.) Educ. (B12) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']
cleaneddistrict['r_frev_aofed'] = cleaneddistrict['Federal Rev. - All Other Federal Aid (B13) [District Finance] 2009-10']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']

#If the metric is a district metric, this is the shell
#cleaneddistrict['r_'] = cleaneddistrict['']/cleaneddistrict['Total Students (UG PK-12) [District] 2009-10']
#If the metric is a district public school metric, this is the shell
#cleaneddistrict['r_'] = cleaneddistrict['']/cleaneddistrict['Total Students [Public School] 2009-10']
#If the metric is related to local revenue, this is the shell
#cleaneddistrict['r_'] = cleaneddistrict['']/cleaneddistrict['Total Revenue - Local Sources (TLOCREV) [District Finance] 2009-10']
#If the metric is related to state revenue, this is the shell
#cleaneddistrict['r_'] = cleaneddistrict['']/cleaneddistrict['Total Revenue - State Sources (TSTREV) [District Finance] 2009-10']
#If the metrics is related to federal revenue, this is the shell
#cleaneddistrict['r_'] = cleaneddistrict['']/cleaneddistrict['Total Revenue - Federal Sources (TFEDREV) [District Finance] 2009-10']

In [16]:
cleaneddistrict.to_csv("data/finaldata/cleaned.csv", index=False)

In [44]:
#Use this command if creating old years of data
#cleaneddistrict.to_csv("data/finaldata/cleaned_0607.csv", index=False)

Data Filtering

Now we start filtering to the columns that we need and the rows to school districts that would potentially have graduation information.


In [17]:
filtereddistrict = cleaneddistrict.copy(deep=True)

Trim the rows we can't use. 2,170 of 12,955 school districts with high schools (~17%) could not be used because they did not have valid graduation data.


In [18]:
import math
print 'Total number of unique school districts: ' + str(len(np.unique(filtereddistrict['Agency ID - NCES Assigned [District] Latest available year'])))
filtereddistrict = filtereddistrict[filtereddistrict['Highest Grade Offered [District] 2009-10']=='12th Grade']
print 'Total number of school districts that have high schools: ' + str(len(filtereddistrict))
filtereddistrict = filtereddistrict[filtereddistrict['SURVYEAR']!='–']
print 'Total number of school districts that have a row on raw graduation data: ' + str(len(filtereddistrict))
filtereddistrict = filtereddistrict[filtereddistrict['AFGR']>=0]
print 'Total number of school districts with valid graduation data: ' + str(len(filtereddistrict))


Total number of unique school districts: 19023
Total number of school districts that have high schools: 12955
Total number of school districts that have a row on raw graduation data: 12955
Total number of school districts with valid graduation data: 10785

Trim the columns we don't need and rename the columns to shorter versions.


In [19]:
sc_drdf = pd.read_csv("data/columnlookup/districts_rc.csv")

In [20]:
for index, row in sc_drdf.iterrows():
    current_colname = str(row['Raw Column Name'])
    new_colname     = str(row['New Column Name'])
    if new_colname == "drop":
        #print "Dropping : ", current_colname
        try:
            filtereddistrict.drop(current_colname, axis=1, inplace=True)
        except:
            pass
    else :
        #print "Renaming : ", current_colname, " --> ", new_colname
        try:
            filtereddistrict.rename(columns={current_colname : new_colname    }, inplace=True)
        except:
            pass

filtereddistrict.shape


Out[20]:
(10785, 410)

Create the 2 response columns for good and bad graduation rate. Graduation rate in the top quartile is high. Graduation rate in the bottom quartile is low.


In [21]:
gradhigh = filtereddistrict['afgr'].quantile(q=.75)
gradlow = filtereddistrict['afgr'].quantile(q=.25)

print 'High Graduation Boundary: ' + str(gradhigh)
print 'Low Graduation Boundary: ' + str(gradlow)

filtereddistrict['RESP_High_Graduation'] = np.where(filtereddistrict['afgr']>gradhigh, 1, 0)
filtereddistrict['RESP_Low_Graduation'] = np.where(filtereddistrict['afgr']<=gradlow, 1, 0)


High Graduation Boundary: 93.0
Low Graduation Boundary: 75.0

Check to see if we have any invalid columns.


In [22]:
def df_desc(in_DF):
    INVALID_COLS = []
    for col in in_DF.columns:
        l_NaN = len(in_DF[pd.isnull(filtereddistrict[col])])
        NaN_perc = l_NaN/float(len(in_DF))
        if NaN_perc >= .85: 
            INVALID_COLS.append(col)
        return INVALID_COLS

INVALID_COLUMNS = []
INVALID_COLUMNS = df_desc(filtereddistrict)
 
print INVALID_COLUMNS


[]

We used the two below commands to inspect a column if we had questions about its usefulness.


In [23]:
#joineddistrict['TOTOHC'].value_counts()
#np.unique(joineddistrict['TOTOHC'])

We found that some of the columns had infinity in them. We replaced with NaN.


In [24]:
filtereddistrict=filtereddistrict.replace([np.inf, -np.inf], np.nan)

In [25]:
#This is the fix for replacing NaN with Mean to fix a replace with 0 issue.  There is the potential that variability could be reduced, yet we already checked for invalid columns in a prior step due to NaN, so we do not find this to be of great concern.
#This is much better than the previous mistake of replacing with 0.
#https://www.quora.com/Are-we-doing-justice-with-our-data-set-when-we-replace-NaN-values-with-mean-median-or-mode-most-frequent-value
for col in filtereddistrict.columns:
    if(filtereddistrict[col].dtype == np.float64):
        filtereddistrict[col].fillna(value=np.mean(filtereddistrict[col]), inplace=True)

In [26]:
filtereddistrict.head()


Out[26]:
agency state state_abbr agency_id_nces county county_number num_schools num_charter_schools num_pub_schools report_years no_report_years address city add_state zipcode latitude longitude agency_id_state congressional_code census_id offered_g_lowest num_students pupil_teacher_ratio_dist pupil_teacher_ratio_ps totalrev_pp tlocrev_pp tsrev_pp tfedrev_pp tcurinst_pp tcurssv_pp tcuroth_pp tcursalary_pp tcurbenefits_pp totalexp_pp tcapout_pp tnonelse_pp tcurelsc_pp instexp_pp tcurelsc_percent tcurinst_percent tcuroth_percent tcuresal_percent tcurssvc_percent tfedrev_percent tlocrev_percent tsrev_percent fipst totd912 ebs912 drp912 ... r_lrev_pt r_lrev_gst r_lrev_put r_lrev_it r_lrev_aot r_lrev_pgc r_lrev_cc r_lrev_oss r_lrev_tui r_lrev_trans r_lrev_slr r_lrev_ts r_lrev_sar r_lrev_osalserv r_lrev_sfns r_lrev_ie r_lrev_molr r_lrev_sp r_lrev_rr r_lrev_sale r_lrev_ff r_lrev_pc r_srev_gfa r_srev_sep r_srev_trans r_srev_sip r_srev_cbsp r_srev_vep r_srev_codsp r_srev_bep r_srev_gt r_srev_slp r_srev_aor r_srev_splea r_srev_osp r_srev_ns r_frev_title1 r_frev_dis r_frev_cna r_frev_ems r_frev_dfs r_frev_voc r_frev_ao r_frev_ns r_frev_ia r_frev_be r_frev_na r_frev_aofed RESP_High_Graduation RESP_Low_Graduation
1 21ST CENTURY CHARTER SCH OF GARY Indiana IN 1800046 MARION COUNTY 18097 1 1 1 2004-2013 1986-2003 556 WASHINGTON ST GARY IN 46402 39.771949 -86.155184 9545 1807 Kindergarten 360 15.65 17.14 11111 356 7925 2831 3986 4717 436 4133 1028 11197 1986 72 9139 3986 11.2 43.6 4.8 45.2 51.6 25.5 3.2 71.3 18 203.793066 100 9.676598 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.070312 0.390625 0.000000 0 0.062500 0.000000 0.000000 0.000000 0 0.000000 0.078125 0.007812 0.000000 0.390625 0 0.00000 0.964599 0.000000 0.000000 0.000000 0.003155 0.000000 0 0 0.003155 0.000000 0.029092 0.000000 0.000000 0 0.606477 0.000000 0.156035 0.000000 0.000000 0.000000 0.176644 0.060844 0.000000 0.000000 0.000000 0.000000 0 1
2 21ST CENTURY CYBER CS Pennsylvania PA 4200091 CHESTER COUNTY 42029 1 1 1 2001-2013 1986-2000 805 SPRINGDALE DR EXTON PA 19341 40.005030 -75.678564 124150002 4206 6th Grade 594 28.49 28.49 10557 10285 273 0 5104 3003 0 4199 1456 8732 554 0 8108 5104 18.0 63.0 0.0 51.8 37.0 0.0 97.4 2.6 42 33.000000 483 6.800000 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.000000 0.983958 0.005893 0 0.000000 0.000000 0.000000 0.000000 0 0.000000 0.010149 0.000000 0.000000 0.000000 0 0.00000 0.000000 0.000000 0.000000 0.950617 0.000000 0.000000 0 0 0.000000 0.000000 0.049383 0.000000 0.000000 0 0.208316 0.166727 0.177190 0.024388 0.003395 0.005148 0.309150 0.048223 0.015101 0.003794 0.002556 0.036014 1 0
10 A+ ACADEMY Texas TX 4800203 DALLAS COUNTY 48113 1 1 1 2000-2013 1986-1999 8225 BRUTON RD DALLAS TX 75217 32.767535 -96.660866 057829 4830 Prekindergarten 1033 16.61 16.61 11015 68 9164 1784 4890 3500 441 5788 477 8864 0 1 8832 4890 5.4 55.4 5.0 65.5 39.6 16.2 0.6 83.2 48 203.793066 228 9.676598 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.000000 0.000000 0.000000 0 0.800000 0.000000 0.000000 0.000000 0 0.142857 0.000000 0.000000 0.057143 0.000000 0 0.00000 0.952145 0.000000 0.000000 0.000000 0.000000 0.000000 0 0 0.000000 0.000211 0.006233 0.040038 0.001373 0 0.351058 0.056430 0.238741 0.033641 0.000000 0.000000 0.320130 0.000000 0.000000 0.000000 0.000000 0.000000 0 1
13 A-C CENTRAL CUSD 262 Illinois IL 1700105 CASS COUNTY 17017 3 0 3 1989-2013 1986-1988 501 EAST BUCHANAN ST ASHLAND IL 62612 39.892187 -90.016057 46-009-2620-26 1718 14500920100000 Kindergarten 432 11.86 12.08 11367 4374 5089 1904 4835 4633 498 5741 2112 13200 2204 0 9966 4835 21.2 48.5 5.0 57.6 46.5 16.7 38.5 44.8 17 203.793066 130 9.676598 ... 0.798637 0.000000 0.000000 0.000000 0.000000 0.769211 0.000000 0.001049 0.000000 0 0.060829 0.009963 0.014683 0.000000 0 0.005244 0.052438 0.027792 0.004195 0.000000 0 0.02517 0.642181 0.049572 0.068049 0.000000 0.004056 0.000901 0 0 0.000000 0.000901 0.015773 0.218567 0.000000 0 0.162651 0.306024 0.087952 0.038554 0.002410 0.000000 0.402410 0.000000 0.000000 0.000000 0.000000 0.000000 0 1
14 A-H-S-T COMM SCHOOL DISTRICT Iowa IA 1904080 POTTAWATTAMIE COUNTY 19155 2 0 2 1986-2013 768 SOUTH MAPLE ST AVOCA IA 51521 41.471017 -95.341001 780441 000 1905 16507800100000 Prekindergarten 595 15.64 13.37 10718 5634 4020 1065 4944 2103 336 4703 1438 11586 2769 0 7384 4944 19.5 67.0 4.6 63.7 28.5 9.9 52.6 37.5 19 203.793066 168 9.676598 ... 0.613364 0.101505 0.055343 0.061464 0.000000 0.769211 0.000000 0.043101 0.000000 0 0.034175 0.008926 0.004336 0.002805 0 0.004081 0.066820 0.000000 0.000000 0.004081 0 0.00000 0.859543 0.000000 0.000000 0.118656 0.000000 0.002144 0 0 0.000000 0.001072 0.018585 0.000000 0.000000 0 0.095816 0.107962 0.148448 0.043185 0.008097 0.005398 0.554656 0.000000 0.000000 0.000000 0.000000 0.036437 1 0

5 rows × 412 columns


In [27]:
#This is where I previously made a mistake.  We need to fix this.
#filtereddistrict.fillna(value=0,inplace=True)

In [28]:
filtereddistrict.to_csv("data/finaldata/filtered.csv", index=False)

In [55]:
#Use if loading an old year of data
#filtereddistrict.to_csv("data/finaldata/filtered_0607.csv", index=False)

Feature Engineering


In [29]:
dftouse = filtereddistrict.copy(deep=True)

We do not need the identification data for further analysis. We can also drop several columns that are always true or always false that were found through KDE plots. Lastly, we can drop columns related to graduation rate (we only need to keep the graduation indicator columns).


In [30]:
#Drop identification data
dftouse.drop('agency', axis=1, inplace=True)
dftouse.drop('state', axis=1, inplace=True)
dftouse.drop('state_abbr', axis=1, inplace=True)
dftouse.drop('agency_id_nces', axis=1, inplace=True)
dftouse.drop('county', axis=1, inplace=True)
dftouse.drop('county_number', axis=1, inplace=True)
dftouse.drop('report_years', axis=1, inplace=True)
dftouse.drop('no_report_years', axis=1, inplace=True)
dftouse.drop('address', axis=1, inplace=True)
dftouse.drop('city', axis=1, inplace=True)
dftouse.drop('add_state', axis=1, inplace=True)
dftouse.drop('zipcode', axis=1, inplace=True)
dftouse.drop('latitude', axis=1, inplace=True)
dftouse.drop('longitude', axis=1, inplace=True)
dftouse.drop('agency_id_state', axis=1, inplace=True)
dftouse.drop('congressional_code', axis=1, inplace=True)
dftouse.drop('census_id', axis=1, inplace=True)
dftouse.drop('offered_g_lowest', axis=1, inplace=True)

#Drop indicator columns that are all 1 or 0 - found through KDE
dftouse.drop('i_agency_type_sup_union_admin', axis=1, inplace=True)
dftouse.drop('i_agency_type_state_operated_institution', axis=1, inplace=True)
dftouse.drop('i_agency_type_other_education_agency', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_elem', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_nonop', axis=1, inplace=True)
dftouse.drop('i_fin_sdlc_ed_serv', axis=1, inplace=True)
dftouse.drop('i_lgo_10', axis=1, inplace=True)
dftouse.drop('i_lgo_11', axis=1, inplace=True)
dftouse.drop('i_lgo_12', axis=1, inplace=True)
dftouse.drop('i_lgo_1', axis=1, inplace=True)
dftouse.drop('i_lgo_2', axis=1, inplace=True)
dftouse.drop('i_lgo_3', axis=1, inplace=True)
dftouse.drop('i_lgo_4', axis=1, inplace=True)
dftouse.drop('i_lgo_5', axis=1, inplace=True)
dftouse.drop('i_lgo_6', axis=1, inplace=True)
dftouse.drop('i_lgo_7', axis=1, inplace=True)
dftouse.drop('i_lgo_8', axis=1, inplace=True)
dftouse.drop('i_lgo_9', axis=1, inplace=True)
dftouse.drop('i_lgo_U', axis=1, inplace=True)

#Drop columns related to graduation rate
dftouse.drop('fipst', axis=1, inplace=True)
dftouse.drop('totd912', axis=1, inplace=True)
dftouse.drop('ebs912', axis=1, inplace=True)
dftouse.drop('drp912', axis=1, inplace=True)
dftouse.drop('totdpl', axis=1, inplace=True)
dftouse.drop('afgeb', axis=1, inplace=True)
dftouse.drop('totohc', axis=1, inplace=True)

As we are trying to predict graduation rate, total school race/ethnicity makeup and 12th grade race/ethnicity makeup are likely the most relevant. We need to cut down on the number of columns for our further analysis. Therefore, we choose to drop PreK-11 race/ethnicity makeup. We also drop the rates of students in PK-12th grade.


In [31]:
#Drop gender/race information for non 12th grade for now
dftouse.drop(['r_stud_reg_PK_AIAN_M','r_stud_reg_PK_AIAN_F','r_stud_reg_PK_AAP_M','r_stud_reg_PK_AAP_F','r_stud_reg_PK_H_M','r_stud_reg_PK_H_F','r_stud_reg_PK_B_M','r_stud_reg_PK_B_F','r_stud_reg_PK_W_M','r_stud_reg_PK_W_F','r_stud_reg_PK_HNPI_M','r_stud_reg_PK_HNPI_F','r_stud_reg_PK_Two_M','r_stud_reg_PK_Two_F','r_stud_reg_K_AIAN_M','r_stud_reg_K_AIAN_F','r_stud_reg_K_AAP_M','r_stud_reg_K_AAP_F','r_stud_reg_K_H_M','r_stud_reg_K_H_F','r_stud_reg_K_B_M','r_stud_reg_K_B_F','r_stud_reg_K_W_M','r_stud_reg_K_W_F','r_stud_reg_K_HNPI_M','r_stud_reg_K_HNPI_F','r_stud_reg_K_Two_M','r_stud_reg_K_Two_F','r_stud_reg_1_AIAN_M','r_stud_reg_1_AIAN_F','r_stud_reg_1_AAP_M','r_stud_reg_1_AAP_F','r_stud_reg_1_H_M','r_stud_reg_1_H_F','r_stud_reg_1_B_M','r_stud_reg_1_B_F','r_stud_reg_1_W_M','r_stud_reg_1_W_F','r_stud_reg_1_HNPI_M','r_stud_reg_1_HNPI_F','r_stud_reg_1_Two_M','r_stud_reg_1_Two_F','r_stud_reg_2_AIAN_M','r_stud_reg_2_AIAN_F','r_stud_reg_2_AAP_M','r_stud_reg_2_AAP_F','r_stud_reg_2_H_M','r_stud_reg_2_H_F','r_stud_reg_2_B_M','r_stud_reg_2_B_F','r_stud_reg_2_W_M','r_stud_reg_2_W_F','r_stud_reg_2_HNPI_M','r_stud_reg_2_HNPI_F','r_stud_reg_2_Two_M','r_stud_reg_2_Two_F','r_stud_reg_3_AIAN_M','r_stud_reg_3_AIAN_F','r_stud_reg_3_AAP_M','r_stud_reg_3_AAP_F','r_stud_reg_3_H_M','r_stud_reg_3_H_F','r_stud_reg_3_B_M','r_stud_reg_3_B_F','r_stud_reg_3_W_M','r_stud_reg_3_W_F','r_stud_reg_3_HNPI_M','r_stud_reg_3_HNPI_F','r_stud_reg_3_Two_M','r_stud_reg_3_Two_F','r_stud_reg_4_AIAN_M','r_stud_reg_4_AIAN_F','r_stud_reg_4_AAP_M','r_stud_reg_4_AAP_F','r_stud_reg_4_H_M','r_stud_reg_4_H_F','r_stud_reg_4_B_M','r_stud_reg_4_B_F','r_stud_reg_4_W_M','r_stud_reg_4_W_F','r_stud_reg_4_HNPI_M','r_stud_reg_4_HNPI_F','r_stud_reg_4_Two_M','r_stud_reg_4_Two_F','r_stud_reg_5_AIAN_M','r_stud_reg_5_AIAN_F','r_stud_reg_5_AAP_M','r_stud_reg_5_AAP_F','r_stud_reg_5_H_M','r_stud_reg_5_H_F','r_stud_reg_5_B_M','r_stud_reg_5_B_F','r_stud_reg_5_W_M','r_stud_reg_5_W_F','r_stud_reg_5_HNPI_M','r_stud_reg_5_HNPI_F','r_stud_reg_5_Two_M','r_stud_reg_5_Two_F','r_stud_reg_6_AIAN_M','r_stud_reg_6_AIAN_F','r_stud_reg_6_AAP_M','r_stud_reg_6_AAP_F','r_stud_reg_6_H_M','r_stud_reg_6_H_F','r_stud_reg_6_B_M','r_stud_reg_6_B_F','r_stud_reg_6_W_M','r_stud_reg_6_W_F','r_stud_reg_6_HNPI_M','r_stud_reg_6_HNPI_F','r_stud_reg_6_Two_M','r_stud_reg_6_Two_F','r_stud_reg_7_AIAN_M','r_stud_reg_7_AIAN_F','r_stud_reg_7_AAP_M','r_stud_reg_7_AAP_F','r_stud_reg_7_H_M','r_stud_reg_7_H_F','r_stud_reg_7_B_M','r_stud_reg_7_B_F','r_stud_reg_7_W_M','r_stud_reg_7_W_F','r_stud_reg_7_HNPI_M','r_stud_reg_7_HNPI_F','r_stud_reg_7_Two_M','r_stud_reg_7_Two_F','r_stud_reg_8_AIAN_M','r_stud_reg_8_AIAN_F','r_stud_reg_8_AAP_M','r_stud_reg_8_AAP_F','r_stud_reg_8_H_M','r_stud_reg_8_H_F','r_stud_reg_8_B_M','r_stud_reg_8_B_F','r_stud_reg_8_W_M','r_stud_reg_8_W_F','r_stud_reg_8_HNPI_M','r_stud_reg_8_HNPI_F','r_stud_reg_8_Two_M','r_stud_reg_8_Two_F','r_stud_reg_9_AIAN_M','r_stud_reg_9_AIAN_F','r_stud_reg_9_AAP_M','r_stud_reg_9_AAP_F','r_stud_reg_9_H_M','r_stud_reg_9_H_F','r_stud_reg_9_B_M','r_stud_reg_9_B_F','r_stud_reg_9_W_M','r_stud_reg_9_W_F','r_stud_reg_9_HNPI_M','r_stud_reg_9_HNPI_F','r_stud_reg_9_Two_M','r_stud_reg_9_Two_F','r_stud_reg_10_AIAN_M','r_stud_reg_10_AIAN_F','r_stud_reg_10_AAP_M','r_stud_reg_10_AAP_F','r_stud_reg_10_H_M','r_stud_reg_10_H_F','r_stud_reg_10_B_M','r_stud_reg_10_B_F','r_stud_reg_10_W_M','r_stud_reg_10_W_F','r_stud_reg_10_HNPI_M','r_stud_reg_10_HNPI_F','r_stud_reg_10_Two_M','r_stud_reg_10_Two_F','r_stud_reg_11_AIAN_M','r_stud_reg_11_AIAN_F','r_stud_reg_11_AAP_M','r_stud_reg_11_AAP_F','r_stud_reg_11_H_M','r_stud_reg_11_H_F','r_stud_reg_11_B_M','r_stud_reg_11_B_F','r_stud_reg_11_W_M','r_stud_reg_11_W_F','r_stud_reg_11_HNPI_M','r_stud_reg_11_HNPI_F','r_stud_reg_11_Two_M','r_stud_reg_11_Two_F','r_stud_reg_U_AIAN_M','r_stud_reg_U_AIAN_F','r_stud_reg_U_AAP_M','r_stud_reg_U_AAP_F','r_stud_reg_U_H_M','r_stud_reg_U_H_F','r_stud_reg_U_B_M','r_stud_reg_U_B_F','r_stud_reg_U_W_M','r_stud_reg_U_W_F','r_stud_reg_U_HNPI_M','r_stud_reg_U_HNPI_F','r_stud_reg_U_Two_M','r_stud_reg_U_Two_F'], axis=1, inplace=True)

#Drop rates of students in PK-12th grade
dftouse.drop(['r_stud_PK', 'r_stud_K', 'r_stud_1', 'r_stud_2', 'r_stud_3', 'r_stud_4', 'r_stud_5', 'r_stud_6', 'r_stud_7', 'r_stud_8', 'r_stud_9', 'r_stud_10', 'r_stud_11', 'r_stud_12', 'r_stud_U'], axis=1, inplace=True)

We list all of the columns that are amenable to standardization.


In [32]:
dftouse.to_csv("data/finaldata/dftouse.csv", index=False)

In [33]:
#If you need to save an old year of data, use this.
#dftouse.to_csv("data/finaldata/dftouse_0607.csv", index=False)

In [30]:
#NOTE: This is where you start if you're starting with analysis.  Load dftouse and proceed.
#read dftouse if starting here...
dftouse=pd.read_csv("data/finaldata/dftouse.csv")

In [31]:
dftouse.shape


Out[31]:
(10785, 157)

In [35]:
#STANDARDIZABLE = ['pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp','tlocrev_pp','tsrev_pp','tfedrev_pp','tcurinst_pp','tcurssv_pp','tcuroth_pp','tcursalary_pp','tcurbenefits_pp','totalexp_pp','tcapout_pp','tnonelse_pp','tcurelsc_pp','instexp_pp','tcurelsc_percent','tcurinst_percent','tcuroth_percent','tcurelsc_percent','tcurssvc_percent','tfedrev_percent','tlocrev_percent','tsrev_percent','fipst','totd912','ebs912','drp912','totdpl','afgeb','afgr','totohc','r_ELL','r_IEP','r_lunch_free','r_lunch_reduced','r_stud_PKK','r_stud_18','r_stud_912','r_stud_PK','r_stud_K','r_stud_1','r_stud_2','r_stud_3','r_stud_4','r_stud_5','r_stud_6','r_stud_7','r_stud_8','r_stud_9','r_stud_10','r_stud_11','r_stud_12','r_stud_U','r_stud_re_M','r_stud_re_F','r_stud_re_AIAN','r_stud_re_AAP','r_stud_re_H','r_stud_re_B','r_stud_re_W','r_stud_re_HNPI','r_stud_re_Two','r_stud_re_Total','r_stud_reg_PK_AIAN_M','r_stud_reg_PK_AIAN_F','r_stud_reg_PK_AAP_M','r_stud_reg_PK_AAP_F','r_stud_reg_PK_H_M','r_stud_reg_PK_H_F','r_stud_reg_PK_B_M','r_stud_reg_PK_B_F','r_stud_reg_PK_W_M','r_stud_reg_PK_W_F','r_stud_reg_PK_HNPI_M','r_stud_reg_PK_HNPI_F','r_stud_reg_PK_Two_M','r_stud_reg_PK_Two_F','r_stud_reg_K_AIAN_M','r_stud_reg_K_AIAN_F','r_stud_reg_K_AAP_M','r_stud_reg_K_AAP_F','r_stud_reg_K_H_M','r_stud_reg_K_H_F','r_stud_reg_K_B_M','r_stud_reg_K_B_F','r_stud_reg_K_W_M','r_stud_reg_K_W_F','r_stud_reg_K_HNPI_M','r_stud_reg_K_HNPI_F','r_stud_reg_K_Two_M','r_stud_reg_K_Two_F','r_stud_reg_1_AIAN_M','r_stud_reg_1_AIAN_F','r_stud_reg_1_AAP_M','r_stud_reg_1_AAP_F','r_stud_reg_1_H_M','r_stud_reg_1_H_F','r_stud_reg_1_B_M','r_stud_reg_1_B_F','r_stud_reg_1_W_M','r_stud_reg_1_W_F','r_stud_reg_1_HNPI_M','r_stud_reg_1_HNPI_F','r_stud_reg_1_Two_M','r_stud_reg_1_Two_F','r_stud_reg_2_AIAN_M','r_stud_reg_2_AIAN_F','r_stud_reg_2_AAP_M','r_stud_reg_2_AAP_F','r_stud_reg_2_H_M','r_stud_reg_2_H_F','r_stud_reg_2_B_M','r_stud_reg_2_B_F','r_stud_reg_2_W_M','r_stud_reg_2_W_F','r_stud_reg_2_HNPI_M','r_stud_reg_2_HNPI_F','r_stud_reg_2_Two_M','r_stud_reg_2_Two_F','r_stud_reg_3_AIAN_M','r_stud_reg_3_AIAN_F','r_stud_reg_3_AAP_M','r_stud_reg_3_AAP_F','r_stud_reg_3_H_M','r_stud_reg_3_H_F','r_stud_reg_3_B_M','r_stud_reg_3_B_F','r_stud_reg_3_W_M','r_stud_reg_3_W_F','r_stud_reg_3_HNPI_M','r_stud_reg_3_HNPI_F','r_stud_reg_3_Two_M','r_stud_reg_3_Two_F','r_stud_reg_4_AIAN_M','r_stud_reg_4_AIAN_F','r_stud_reg_4_AAP_M','r_stud_reg_4_AAP_F','r_stud_reg_4_H_M','r_stud_reg_4_H_F','r_stud_reg_4_B_M','r_stud_reg_4_B_F','r_stud_reg_4_W_M','r_stud_reg_4_W_F','r_stud_reg_4_HNPI_M','r_stud_reg_4_HNPI_F','r_stud_reg_4_Two_M','r_stud_reg_4_Two_F','r_stud_reg_5_AIAN_M','r_stud_reg_5_AIAN_F','r_stud_reg_5_AAP_M','r_stud_reg_5_AAP_F','r_stud_reg_5_H_M','r_stud_reg_5_H_F','r_stud_reg_5_B_M','r_stud_reg_5_B_F','r_stud_reg_5_W_M','r_stud_reg_5_W_F','r_stud_reg_5_HNPI_M','r_stud_reg_5_HNPI_F','r_stud_reg_5_Two_M','r_stud_reg_5_Two_F','r_stud_reg_6_AIAN_M','r_stud_reg_6_AIAN_F','r_stud_reg_6_AAP_M','r_stud_reg_6_AAP_F','r_stud_reg_6_H_M','r_stud_reg_6_H_F','r_stud_reg_6_B_M','r_stud_reg_6_B_F','r_stud_reg_6_W_M','r_stud_reg_6_W_F','r_stud_reg_6_HNPI_M','r_stud_reg_6_HNPI_F','r_stud_reg_6_Two_M','r_stud_reg_6_Two_F','r_stud_reg_7_AIAN_M','r_stud_reg_7_AIAN_F','r_stud_reg_7_AAP_M','r_stud_reg_7_AAP_F','r_stud_reg_7_H_M','r_stud_reg_7_H_F','r_stud_reg_7_B_M','r_stud_reg_7_B_F','r_stud_reg_7_W_M','r_stud_reg_7_W_F','r_stud_reg_7_HNPI_M','r_stud_reg_7_HNPI_F','r_stud_reg_7_Two_M','r_stud_reg_7_Two_F','r_stud_reg_8_AIAN_M','r_stud_reg_8_AIAN_F','r_stud_reg_8_AAP_M','r_stud_reg_8_AAP_F','r_stud_reg_8_H_M','r_stud_reg_8_H_F','r_stud_reg_8_B_M','r_stud_reg_8_B_F','r_stud_reg_8_W_M','r_stud_reg_8_W_F','r_stud_reg_8_HNPI_M','r_stud_reg_8_HNPI_F','r_stud_reg_8_Two_M','r_stud_reg_8_Two_F','r_stud_reg_9_AIAN_M','r_stud_reg_9_AIAN_F','r_stud_reg_9_AAP_M','r_stud_reg_9_AAP_F','r_stud_reg_9_H_M','r_stud_reg_9_H_F','r_stud_reg_9_B_M','r_stud_reg_9_B_F','r_stud_reg_9_W_M','r_stud_reg_9_W_F','r_stud_reg_9_HNPI_M','r_stud_reg_9_HNPI_F','r_stud_reg_9_Two_M','r_stud_reg_9_Two_F','r_stud_reg_10_AIAN_M','r_stud_reg_10_AIAN_F','r_stud_reg_10_AAP_M','r_stud_reg_10_AAP_F','r_stud_reg_10_H_M','r_stud_reg_10_H_F','r_stud_reg_10_B_M','r_stud_reg_10_B_F','r_stud_reg_10_W_M','r_stud_reg_10_W_F','r_stud_reg_10_HNPI_M','r_stud_reg_10_HNPI_F','r_stud_reg_10_Two_M','r_stud_reg_10_Two_F','r_stud_reg_11_AIAN_M','r_stud_reg_11_AIAN_F','r_stud_reg_11_AAP_M','r_stud_reg_11_AAP_F','r_stud_reg_11_H_M','r_stud_reg_11_H_F','r_stud_reg_11_B_M','r_stud_reg_11_B_F','r_stud_reg_11_W_M','r_stud_reg_11_W_F','r_stud_reg_11_HNPI_M','r_stud_reg_11_HNPI_F','r_stud_reg_11_Two_M','r_stud_reg_11_Two_F','r_stud_reg_12_AIAN_M','r_stud_reg_12_AIAN_F','r_stud_reg_12_AAP_M','r_stud_reg_12_AAP_F','r_stud_reg_12_H_M','r_stud_reg_12_H_F','r_stud_reg_12_B_M','r_stud_reg_12_B_F','r_stud_reg_12_W_M','r_stud_reg_12_W_F','r_stud_reg_12_HNPI_M','r_stud_reg_12_HNPI_F','r_stud_reg_12_Two_M','r_stud_reg_12_Two_F','r_stud_reg_U_AIAN_M','r_stud_reg_U_AIAN_F','r_stud_reg_U_AAP_M','r_stud_reg_U_AAP_F','r_stud_reg_U_H_M','r_stud_reg_U_H_F','r_stud_reg_U_B_M','r_stud_reg_U_B_F','r_stud_reg_U_W_M','r_stud_reg_U_W_F','r_stud_reg_U_HNPI_M','r_stud_reg_U_HNPI_F','r_stud_reg_U_Two_M','r_stud_reg_U_Two_F','r_st_PKT','r_st_KT','r_st_ET','r_st_ST','r_st_UT','r_st_TS','r_st_IA','r_st_IC','r_st_EGC','r_st_SGC','r_st_OGC','r_st_TGC','r_st_LMS','r_st_LMSS','r_st_LEA','r_st_LEASS','r_st_SA','r_st_SASS','r_st_SSSS','r_st_OSSS','r_lrev_pt','r_lrev_gst','r_lrev_put','r_lrev_it','r_lrev_aot','r_lrev_pgc','r_lrev_cc','r_lrev_oss','r_lrev_tui','r_lrev_trans','r_lrev_slr','r_lrev_ts','r_lrev_sar','r_lrev_osalserv','r_lrev_sfns','r_lrev_ie','r_lrev_molr','r_lrev_sp','r_lrev_rr','r_lrev_sale','r_lrev_ff','r_lrev_pc','r_srev_gfa','r_srev_sep','r_srev_trans','r_srev_sip','r_srev_cbsp','r_srev_vep','r_srev_codsp','r_srev_bep','r_srev_gt','r_srev_slp','r_srev_aor','r_srev_splea','r_srev_osp','r_srev_ns','r_frev_title1','r_frev_dis','r_frev_cna','r_frev_ems','r_frev_dfs','r_frev_voc','r_frev_ao','r_frev_ns','r_frev_ia','r_frev_be','r_frev_na','r_frev_aofed']
STANDARDIZABLE = ['num_students', 'num_schools','num_charter_schools','num_pub_schools','tcuresal_percent','pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp','tlocrev_pp','tsrev_pp','tfedrev_pp','tcurinst_pp','tcurssv_pp','tcuroth_pp','tcursalary_pp','tcurbenefits_pp','totalexp_pp','tcapout_pp','tnonelse_pp','tcurelsc_pp','instexp_pp','tcurinst_percent','tcuroth_percent','tcurelsc_percent','tcurssvc_percent','tfedrev_percent','tlocrev_percent','tsrev_percent','r_ELL','r_IEP','r_lunch_free','r_lunch_reduced','r_stud_PKK','r_stud_18','r_stud_912','r_stud_re_M','r_stud_re_F','r_stud_re_AIAN','r_stud_re_AAP','r_stud_re_H','r_stud_re_B','r_stud_re_W','r_stud_re_HNPI','r_stud_re_Two','r_stud_re_Total','r_stud_reg_12_AIAN_M','r_stud_reg_12_AIAN_F','r_stud_reg_12_AAP_M','r_stud_reg_12_AAP_F','r_stud_reg_12_H_M','r_stud_reg_12_H_F','r_stud_reg_12_B_M','r_stud_reg_12_B_F','r_stud_reg_12_W_M','r_stud_reg_12_W_F','r_stud_reg_12_HNPI_M','r_stud_reg_12_HNPI_F','r_stud_reg_12_Two_M','r_stud_reg_12_Two_F','r_st_PKT','r_st_KT','r_st_ET','r_st_ST','r_st_UT','r_st_TS','r_st_IA','r_st_IC','r_st_EGC','r_st_SGC','r_st_OGC','r_st_TGC','r_st_LMS','r_st_LMSS','r_st_LEA','r_st_LEASS','r_st_SA','r_st_SASS','r_st_SSSS','r_st_OSSS','r_lrev_pt','r_lrev_gst','r_lrev_put','r_lrev_it','r_lrev_aot','r_lrev_pgc','r_lrev_cc','r_lrev_oss','r_lrev_tui','r_lrev_trans','r_lrev_slr','r_lrev_ts','r_lrev_sar','r_lrev_osalserv','r_lrev_sfns','r_lrev_ie','r_lrev_molr','r_lrev_sp','r_lrev_rr','r_lrev_sale','r_lrev_ff','r_lrev_pc','r_srev_gfa','r_srev_sep','r_srev_trans','r_srev_sip','r_srev_cbsp','r_srev_vep','r_srev_codsp','r_srev_bep','r_srev_gt','r_srev_slp','r_srev_aor','r_srev_splea','r_srev_osp','r_srev_ns','r_frev_title1','r_frev_dis','r_frev_cna','r_frev_ems','r_frev_dfs','r_frev_voc','r_frev_ao','r_frev_ns','r_frev_ia','r_frev_be','r_frev_na','r_frev_aofed']
print STANDARDIZABLE


['num_students', 'num_schools', 'num_charter_schools', 'num_pub_schools', 'tcuresal_percent', 'pupil_teacher_ratio_dist', 'pupil_teacher_ratio_ps', 'totalrev_pp', 'tlocrev_pp', 'tsrev_pp', 'tfedrev_pp', 'tcurinst_pp', 'tcurssv_pp', 'tcuroth_pp', 'tcursalary_pp', 'tcurbenefits_pp', 'totalexp_pp', 'tcapout_pp', 'tnonelse_pp', 'tcurelsc_pp', 'instexp_pp', 'tcurinst_percent', 'tcuroth_percent', 'tcurelsc_percent', 'tcurssvc_percent', 'tfedrev_percent', 'tlocrev_percent', 'tsrev_percent', 'r_ELL', 'r_IEP', 'r_lunch_free', 'r_lunch_reduced', 'r_stud_PKK', 'r_stud_18', 'r_stud_912', 'r_stud_re_M', 'r_stud_re_F', 'r_stud_re_AIAN', 'r_stud_re_AAP', 'r_stud_re_H', 'r_stud_re_B', 'r_stud_re_W', 'r_stud_re_HNPI', 'r_stud_re_Two', 'r_stud_re_Total', 'r_stud_reg_12_AIAN_M', 'r_stud_reg_12_AIAN_F', 'r_stud_reg_12_AAP_M', 'r_stud_reg_12_AAP_F', 'r_stud_reg_12_H_M', 'r_stud_reg_12_H_F', 'r_stud_reg_12_B_M', 'r_stud_reg_12_B_F', 'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F', 'r_stud_reg_12_HNPI_M', 'r_stud_reg_12_HNPI_F', 'r_stud_reg_12_Two_M', 'r_stud_reg_12_Two_F', 'r_st_PKT', 'r_st_KT', 'r_st_ET', 'r_st_ST', 'r_st_UT', 'r_st_TS', 'r_st_IA', 'r_st_IC', 'r_st_EGC', 'r_st_SGC', 'r_st_OGC', 'r_st_TGC', 'r_st_LMS', 'r_st_LMSS', 'r_st_LEA', 'r_st_LEASS', 'r_st_SA', 'r_st_SASS', 'r_st_SSSS', 'r_st_OSSS', 'r_lrev_pt', 'r_lrev_gst', 'r_lrev_put', 'r_lrev_it', 'r_lrev_aot', 'r_lrev_pgc', 'r_lrev_cc', 'r_lrev_oss', 'r_lrev_tui', 'r_lrev_trans', 'r_lrev_slr', 'r_lrev_ts', 'r_lrev_sar', 'r_lrev_osalserv', 'r_lrev_sfns', 'r_lrev_ie', 'r_lrev_molr', 'r_lrev_sp', 'r_lrev_rr', 'r_lrev_sale', 'r_lrev_ff', 'r_lrev_pc', 'r_srev_gfa', 'r_srev_sep', 'r_srev_trans', 'r_srev_sip', 'r_srev_cbsp', 'r_srev_vep', 'r_srev_codsp', 'r_srev_bep', 'r_srev_gt', 'r_srev_slp', 'r_srev_aor', 'r_srev_splea', 'r_srev_osp', 'r_srev_ns', 'r_frev_title1', 'r_frev_dis', 'r_frev_cna', 'r_frev_ems', 'r_frev_dfs', 'r_frev_voc', 'r_frev_ao', 'r_frev_ns', 'r_frev_ia', 'r_frev_be', 'r_frev_na', 'r_frev_aofed']

Need to get all of the binary indicator columns


In [36]:
INDICATORS = []
for v in dftouse.columns:
    l=np.unique(dftouse[v])
    if len(l) <= 10:
        INDICATORS.append(v)
        
INDICATORS.remove('RESP_High_Graduation')        
INDICATORS.remove('RESP_Low_Graduation')  
print INDICATORS


['i_agency_type_local_school_district', 'i_agency_type_local_school_district_sup_union', 'i_agency_type_regional_education_services', 'i_agency_type_charter_school_agency', 'i_fin_sdlc_sec', 'i_fin_sdlc_elem_sec', 'i_fin_sdlc_voc', 'i_ucl_city_large', 'i_ucl_city_mid', 'i_ucl_city_small', 'i_ucl_suburb_large', 'i_ucl_suburb_mid', 'i_ucl_suburb_small', 'i_ucl_town_fringe', 'i_ucl_town_distant', 'i_ucl_town_remote', 'i_ucl_rural_fringe', 'i_ucl_rural_distant', 'i_ucl_rural_remote', 'i_cs_all_charter', 'i_cs_charter_noncharter', 'i_cs_all_noncharter', 'i_ma_ne_nr', 'i_ma_metropolitan', 'i_ma_micropolitan', 'i_lgo_K', 'i_lgo_PK']

Test and Training Sets and Standardization

We need to create test and training datasets and standardize the standardizable columns. Much of this code is credited to HW3.


In [32]:
#CITATION: From HW3
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(dftouse.shape[0]), train_size=0.7)

In [33]:
#CITATION: From HW3
mask=np.ones(dftouse.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)

In [34]:
# make sure we didn't get unlucky in our mask selection
print "% High_Graduation in Training:", np.mean(dftouse['RESP_High_Graduation'][mask])
print "% High_Graduation in Testing:", np.mean(dftouse['RESP_High_Graduation'][~mask])
print "% Low_Graduation in Training:", np.mean(dftouse['RESP_Low_Graduation'][mask])
print "% Low_Graduation in Testing:", np.mean(dftouse['RESP_Low_Graduation'][~mask])


% High_Graduation in Training: 0.250496754537
% High_Graduation in Testing: 0.245673671199
% Low_Graduation in Training: 0.252616240562
% Low_Graduation in Testing: 0.247218788628

In [40]:
#CITATION: From HW3
mask


Out[40]:
array([ True,  True, False, ...,  True,  True,  True], dtype=bool)

In [41]:
#CITATION: From HW3
mask.shape, mask.sum()


Out[41]:
((10785,), 7549)

In [42]:
dftouse.head()


Out[42]:
num_schools num_charter_schools num_pub_schools num_students pupil_teacher_ratio_dist pupil_teacher_ratio_ps totalrev_pp tlocrev_pp tsrev_pp tfedrev_pp tcurinst_pp tcurssv_pp tcuroth_pp tcursalary_pp tcurbenefits_pp totalexp_pp tcapout_pp tnonelse_pp tcurelsc_pp instexp_pp tcurelsc_percent tcurinst_percent tcuroth_percent tcuresal_percent tcurssvc_percent tfedrev_percent tlocrev_percent tsrev_percent afgr i_agency_type_local_school_district i_agency_type_local_school_district_sup_union i_agency_type_regional_education_services i_agency_type_charter_school_agency i_fin_sdlc_sec i_fin_sdlc_elem_sec i_fin_sdlc_voc i_ucl_city_large i_ucl_city_mid i_ucl_city_small i_ucl_suburb_large i_ucl_suburb_mid i_ucl_suburb_small i_ucl_town_fringe i_ucl_town_distant i_ucl_town_remote i_ucl_rural_fringe i_ucl_rural_distant i_ucl_rural_remote i_cs_all_charter i_cs_charter_noncharter ... r_lrev_pt r_lrev_gst r_lrev_put r_lrev_it r_lrev_aot r_lrev_pgc r_lrev_cc r_lrev_oss r_lrev_tui r_lrev_trans r_lrev_slr r_lrev_ts r_lrev_sar r_lrev_osalserv r_lrev_sfns r_lrev_ie r_lrev_molr r_lrev_sp r_lrev_rr r_lrev_sale r_lrev_ff r_lrev_pc r_srev_gfa r_srev_sep r_srev_trans r_srev_sip r_srev_cbsp r_srev_vep r_srev_codsp r_srev_bep r_srev_gt r_srev_slp r_srev_aor r_srev_splea r_srev_osp r_srev_ns r_frev_title1 r_frev_dis r_frev_cna r_frev_ems r_frev_dfs r_frev_voc r_frev_ao r_frev_ns r_frev_ia r_frev_be r_frev_na r_frev_aofed RESP_High_Graduation RESP_Low_Graduation
1 1 1 1 360 15.65 17.14 11111 356 7925 2831 3986 4717 436 4133 1028 11197 1986 72 9139 3986 11.2 43.6 4.8 45.2 51.6 25.5 3.2 71.3 30.2 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.070312 0.390625 0.000000 0 0.062500 0.000000 0.000000 0.000000 0 0.000000 0.078125 0.007812 0.000000 0.390625 0 0.00000 0.964599 0.000000 0.000000 0.000000 0.003155 0.000000 0 0 0.003155 0.000000 0.029092 0.000000 0.000000 0 0.606477 0.000000 0.156035 0.000000 0.000000 0.000000 0.176644 0.060844 0.000000 0.000000 0.000000 0.000000 0 1
2 1 1 1 594 28.49 28.49 10557 10285 273 0 5104 3003 0 4199 1456 8732 554 0 8108 5104 18.0 63.0 0.0 51.8 37.0 0.0 97.4 2.6 100.0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.000000 0.983958 0.005893 0 0.000000 0.000000 0.000000 0.000000 0 0.000000 0.010149 0.000000 0.000000 0.000000 0 0.00000 0.000000 0.000000 0.000000 0.950617 0.000000 0.000000 0 0 0.000000 0.000000 0.049383 0.000000 0.000000 0 0.208316 0.166727 0.177190 0.024388 0.003395 0.005148 0.309150 0.048223 0.015101 0.003794 0.002556 0.036014 1 0
10 1 1 1 1033 16.61 16.61 11015 68 9164 1784 4890 3500 441 5788 477 8864 0 1 8832 4890 5.4 55.4 5.0 65.5 39.6 16.2 0.6 83.2 55.7 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 ... 0.716005 0.011878 0.003979 0.011257 0.004972 0.769211 0.000000 0.000000 0.000000 0 0.800000 0.000000 0.000000 0.000000 0 0.142857 0.000000 0.000000 0.057143 0.000000 0 0.00000 0.952145 0.000000 0.000000 0.000000 0.000000 0.000000 0 0 0.000000 0.000211 0.006233 0.040038 0.001373 0 0.351058 0.056430 0.238741 0.033641 0.000000 0.000000 0.320130 0.000000 0.000000 0.000000 0.000000 0.000000 0 1
13 3 0 3 432 11.86 12.08 11367 4374 5089 1904 4835 4633 498 5741 2112 13200 2204 0 9966 4835 21.2 48.5 5.0 57.6 46.5 16.7 38.5 44.8 70.7 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ... 0.798637 0.000000 0.000000 0.000000 0.000000 0.769211 0.000000 0.001049 0.000000 0 0.060829 0.009963 0.014683 0.000000 0 0.005244 0.052438 0.027792 0.004195 0.000000 0 0.02517 0.642181 0.049572 0.068049 0.000000 0.004056 0.000901 0 0 0.000000 0.000901 0.015773 0.218567 0.000000 0 0.162651 0.306024 0.087952 0.038554 0.002410 0.000000 0.402410 0.000000 0.000000 0.000000 0.000000 0.000000 0 1
14 2 0 2 595 15.64 13.37 10718 5634 4020 1065 4944 2103 336 4703 1438 11586 2769 0 7384 4944 19.5 67.0 4.6 63.7 28.5 9.9 52.6 37.5 95.7 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0.613364 0.101505 0.055343 0.061464 0.000000 0.769211 0.000000 0.043101 0.000000 0 0.034175 0.008926 0.004336 0.002805 0 0.004081 0.066820 0.000000 0.000000 0.004081 0 0.00000 0.859543 0.000000 0.000000 0.118656 0.000000 0.002144 0 0 0.000000 0.001072 0.018585 0.000000 0.000000 0 0.095816 0.107962 0.148448 0.043185 0.008097 0.005398 0.554656 0.000000 0.000000 0.000000 0.000000 0.036437 1 0

5 rows × 157 columns


In [43]:
#CITATION: From HW3
from sklearn.preprocessing import StandardScaler

for col in STANDARDIZABLE:
    #print col
    valstrain=dftouse[col].values[mask]
    valstest=dftouse[col].values[~mask]
    scaler=StandardScaler().fit(valstrain)
    outtrain=scaler.transform(valstrain)
    outtest=scaler.fit_transform(valstest)
    out=np.empty(mask.shape[0])
    out[mask]=outtrain
    out[~mask]=outtest
    dftouse[col]=out

In [44]:
dftouse.head()


Out[44]:
num_schools num_charter_schools num_pub_schools num_students pupil_teacher_ratio_dist pupil_teacher_ratio_ps totalrev_pp tlocrev_pp tsrev_pp tfedrev_pp tcurinst_pp tcurssv_pp tcuroth_pp tcursalary_pp tcurbenefits_pp totalexp_pp tcapout_pp tnonelse_pp tcurelsc_pp instexp_pp tcurelsc_percent tcurinst_percent tcuroth_percent tcuresal_percent tcurssvc_percent tfedrev_percent tlocrev_percent tsrev_percent afgr i_agency_type_local_school_district i_agency_type_local_school_district_sup_union i_agency_type_regional_education_services i_agency_type_charter_school_agency i_fin_sdlc_sec i_fin_sdlc_elem_sec i_fin_sdlc_voc i_ucl_city_large i_ucl_city_mid i_ucl_city_small i_ucl_suburb_large i_ucl_suburb_mid i_ucl_suburb_small i_ucl_town_fringe i_ucl_town_distant i_ucl_town_remote i_ucl_rural_fringe i_ucl_rural_distant i_ucl_rural_remote i_cs_all_charter i_cs_charter_noncharter ... r_lrev_pt r_lrev_gst r_lrev_put r_lrev_it r_lrev_aot r_lrev_pgc r_lrev_cc r_lrev_oss r_lrev_tui r_lrev_trans r_lrev_slr r_lrev_ts r_lrev_sar r_lrev_osalserv r_lrev_sfns r_lrev_ie r_lrev_molr r_lrev_sp r_lrev_rr r_lrev_sale r_lrev_ff r_lrev_pc r_srev_gfa r_srev_sep r_srev_trans r_srev_sip r_srev_cbsp r_srev_vep r_srev_codsp r_srev_bep r_srev_gt r_srev_slp r_srev_aor r_srev_splea r_srev_osp r_srev_ns r_frev_title1 r_frev_dis r_frev_cna r_frev_ems r_frev_dfs r_frev_voc r_frev_ao r_frev_ns r_frev_ia r_frev_be r_frev_na r_frev_aofed RESP_High_Graduation RESP_Low_Graduation
1 -0.338670 0.285781 -0.338530 -0.263150 0.370429 0.643250 -0.252703 -1.045369 0.513776 0.499275 -0.982301 0.325687 -0.172634 -0.818227 -0.949990 -0.246089 0.209376 -0.082842 -0.383344 -0.982301 -1.542321 -3.095619 0.135113 -1.990740 3.100188 1.751038 -1.866873 1.392814 30.2 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 ... -0.004331 0.002337 0.007879 -0.001394 -0.004129 0.000123 0.318808 3.363975 -0.225391 -0.061487 0.468186 -0.299231 -0.531039 -0.249591 -0.071873 -0.357817 0.269488 -0.160697 -0.193537 24.416334 -0.275063 -0.214134 1.043488 -0.620333 -0.523161 -0.272429 -0.298833 -0.298501 -0.329174 -0.084847 0.936561 -0.399810 -0.433058 -0.478460 -0.258870 -0.165369 3.286801 -1.051086 -0.252304 -0.846108 -0.270667 -0.293313 -0.620835 0.106330 -0.209575 -0.172193 -0.177657 -0.425784 0 1
2 -0.338670 0.285781 -0.338530 -0.246355 3.802319 3.477488 -0.324872 0.979177 -1.494251 -0.649759 -0.563746 -0.375461 -2.331622 -0.795416 -0.616411 -0.549351 -0.226724 -0.211273 -0.595544 -0.563746 -0.356836 0.481116 -2.499379 -1.061096 0.364361 -1.665712 2.897765 -2.620852 100.0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 ... -0.004331 0.002337 0.007879 -0.001394 -0.004129 0.000123 -0.336610 9.086800 0.101830 -0.061487 -0.956598 -0.299231 -0.531039 -0.249591 -0.071873 -0.357817 -0.393769 -0.240093 -0.193537 -0.121287 -0.275063 -0.214134 -3.414172 -0.620333 -0.523161 15.537340 -0.412178 -0.298501 -0.329174 -0.084847 -0.183232 -0.399810 -0.275099 -0.478460 -0.258870 -0.165369 0.008007 -0.013537 0.009237 0.004294 -0.003753 0.007585 0.001292 -0.000180 0.000213 0.003168 0.011451 -0.002840 1 0
10 -0.422872 0.439154 -0.422156 -0.274692 0.442319 0.466407 -0.288200 -1.098900 0.801543 0.108514 -0.642160 -0.174783 -0.134424 -0.240457 -1.347906 -0.577879 -0.494834 -0.153717 -0.446049 -0.642160 -2.501998 -0.960148 0.219731 0.818049 0.907002 0.520328 -1.988405 2.036793 55.7 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 ... 0.009860 -0.005335 -0.017153 0.003291 0.010331 -0.000279 -0.344831 -0.394031 -0.219113 -0.101592 14.442359 -0.258712 -0.507340 -0.230801 -0.067678 4.881880 -0.478171 -0.248068 2.542225 -0.103029 -0.256766 -0.182310 0.982546 -0.589963 -0.424858 -0.278095 -0.391875 -0.271856 -0.296876 -0.145120 -0.187212 -0.345436 -0.612235 -0.037104 -0.130878 -0.168546 1.124561 -0.675011 0.682654 0.269034 -0.260355 -0.231567 0.048369 -0.408107 -0.219425 -0.172871 -0.177715 -0.454086 0 1
13 -0.306821 -0.171903 -0.307531 -0.325331 -0.422178 -0.489816 -0.240633 -0.234429 -0.241805 0.170106 -0.662217 0.304052 0.081584 -0.255538 -0.095561 -0.013277 0.347247 -0.154817 -0.217071 -0.662217 0.233459 -2.244313 0.219731 -0.226506 2.229938 0.588338 -0.078371 -0.158376 70.7 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ... 0.431616 -0.227301 -0.181783 -0.287101 -0.292690 -0.000279 -0.344831 -0.384811 -0.219113 -0.101592 0.353634 1.966252 -0.289286 -0.230801 -0.067678 -0.251133 0.027511 0.029672 0.018232 -0.103029 -0.256766 0.383792 -0.429599 -0.028338 1.059650 -0.278095 -0.251325 -0.203906 -0.296876 -0.145120 -0.187212 -0.215260 -0.538928 1.910174 -0.287991 -0.168546 -0.383730 0.926164 -1.038448 0.416472 -0.069274 -0.231567 0.433324 -0.408107 -0.219425 -0.172871 -0.177715 -0.454086 0 1
14 -0.288877 -0.107682 -0.289290 -0.246284 0.367757 -0.298166 -0.303899 0.030827 -0.510968 -0.217502 -0.623646 -0.743625 -0.667815 -0.621222 -0.630440 -0.198232 0.447830 -0.211273 -0.744558 -0.623646 -0.095332 1.218587 0.025343 0.615080 -1.228415 -0.339209 0.631780 -0.581886 95.7 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... -0.541109 1.713735 2.285169 1.278384 -0.286657 0.000123 -0.336610 0.012034 -0.225391 -0.061487 -0.177527 2.351518 -0.459021 -0.160781 -0.071873 -0.248798 0.159180 -0.240093 -0.193537 0.135041 -0.275063 -0.214134 0.557996 -0.620333 -0.523161 1.700949 -0.412178 -0.104012 -0.329174 -0.084847 -0.183232 -0.179839 -0.514857 -0.478460 -0.258870 -0.165369 -0.918407 -0.379231 -0.346110 0.659741 0.366013 0.022212 1.153961 -0.407121 -0.209575 -0.172193 -0.177657 0.002136 1 0

5 rows × 157 columns


In [45]:
#CITATION: From HW3
lcols=list(dftouse.columns)
lcols.remove('RESP_High_Graduation')
lcols.remove('RESP_Low_Graduation')
lcols.remove('afgr')
###Check for Index Column if exixts in data and remove.
indexcol='Unnamed: 0'
if indexcol in list(dftouse.columns): lcols.remove(indexcol)
print len(lcols)


154

High Graduation Rate

If the graduation rate is in the top percentile nationwide, it is considered a high graduation rate (1). All other graduation rates are considered not a high graduation rate (0).

Exploratory Data Analysis


In [46]:
#CITATION: From HW3
ccols=[]
for c in lcols:
    if c not in INDICATORS:
        ccols.append(c)
print len(ccols), len(INDICATORS)
ccols


127 27
Out[46]:
['num_schools',
 'num_charter_schools',
 'num_pub_schools',
 'num_students',
 'pupil_teacher_ratio_dist',
 'pupil_teacher_ratio_ps',
 'totalrev_pp',
 'tlocrev_pp',
 'tsrev_pp',
 'tfedrev_pp',
 'tcurinst_pp',
 'tcurssv_pp',
 'tcuroth_pp',
 'tcursalary_pp',
 'tcurbenefits_pp',
 'totalexp_pp',
 'tcapout_pp',
 'tnonelse_pp',
 'tcurelsc_pp',
 'instexp_pp',
 'tcurelsc_percent',
 'tcurinst_percent',
 'tcuroth_percent',
 'tcuresal_percent',
 'tcurssvc_percent',
 'tfedrev_percent',
 'tlocrev_percent',
 'tsrev_percent',
 'r_ELL',
 'r_IEP',
 'r_lunch_free',
 'r_lunch_reduced',
 'r_stud_PKK',
 'r_stud_18',
 'r_stud_912',
 'r_stud_re_M',
 'r_stud_re_F',
 'r_stud_re_AIAN',
 'r_stud_re_AAP',
 'r_stud_re_H',
 'r_stud_re_B',
 'r_stud_re_W',
 'r_stud_re_HNPI',
 'r_stud_re_Two',
 'r_stud_re_Total',
 'r_stud_reg_12_AIAN_M',
 'r_stud_reg_12_AIAN_F',
 'r_stud_reg_12_AAP_M',
 'r_stud_reg_12_AAP_F',
 'r_stud_reg_12_H_M',
 'r_stud_reg_12_H_F',
 'r_stud_reg_12_B_M',
 'r_stud_reg_12_B_F',
 'r_stud_reg_12_W_M',
 'r_stud_reg_12_W_F',
 'r_stud_reg_12_HNPI_M',
 'r_stud_reg_12_HNPI_F',
 'r_stud_reg_12_Two_M',
 'r_stud_reg_12_Two_F',
 'r_st_PKT',
 'r_st_KT',
 'r_st_ET',
 'r_st_ST',
 'r_st_UT',
 'r_st_TS',
 'r_st_IA',
 'r_st_IC',
 'r_st_EGC',
 'r_st_SGC',
 'r_st_OGC',
 'r_st_TGC',
 'r_st_LMS',
 'r_st_LMSS',
 'r_st_LEA',
 'r_st_LEASS',
 'r_st_SA',
 'r_st_SASS',
 'r_st_SSSS',
 'r_st_OSSS',
 'r_lrev_pt',
 'r_lrev_gst',
 'r_lrev_put',
 'r_lrev_it',
 'r_lrev_aot',
 'r_lrev_pgc',
 'r_lrev_cc',
 'r_lrev_oss',
 'r_lrev_tui',
 'r_lrev_trans',
 'r_lrev_slr',
 'r_lrev_ts',
 'r_lrev_sar',
 'r_lrev_osalserv',
 'r_lrev_sfns',
 'r_lrev_ie',
 'r_lrev_molr',
 'r_lrev_sp',
 'r_lrev_rr',
 'r_lrev_sale',
 'r_lrev_ff',
 'r_lrev_pc',
 'r_srev_gfa',
 'r_srev_sep',
 'r_srev_trans',
 'r_srev_sip',
 'r_srev_cbsp',
 'r_srev_vep',
 'r_srev_codsp',
 'r_srev_bep',
 'r_srev_gt',
 'r_srev_slp',
 'r_srev_aor',
 'r_srev_splea',
 'r_srev_osp',
 'r_srev_ns',
 'r_frev_title1',
 'r_frev_dis',
 'r_frev_cna',
 'r_frev_ems',
 'r_frev_dfs',
 'r_frev_voc',
 'r_frev_ao',
 'r_frev_ns',
 'r_frev_ia',
 'r_frev_be',
 'r_frev_na',
 'r_frev_aofed']

We make a kernel-density estimate (KDE) plot for each feature in ccols to look for promising separators. The following separators look promising: r_stud_reg_12_AIAN_M, r_stud_reg_12_AIAN_F, r_stud_reg_12_B_M, r_stud_reg_12_B_F, r_srev_trans, and r_frev_voc, among others in the r_st section.


In [192]:
#Number of ccols from above divided by 3 gives the number of rows needed.  So for instance 127/3 = 43.
fig, axs = plt.subplots(43, 3, figsize=(15,100), tight_layout=True)

for item, ax in zip(dftouse[ccols], axs.flat):
    sns.kdeplot(dftouse[dftouse["RESP_High_Graduation"]==0][item], ax=ax, color='r')
    sns.kdeplot(dftouse[dftouse["RESP_High_Graduation"]==1][item], ax=ax, color='b')


We make histograms for each feature in INDICATORS. Most of the separators have nearly all data points of one class on one value, which should help with the rate of false negatives. The exception is i_ma_metropolitan.


In [193]:
fig, axs = plt.subplots(9, 3, figsize=(15,30), tight_layout=True)

for item, ax in zip(dftouse[INDICATORS], axs.flat):
    dftouse[dftouse["RESP_High_Graduation"]==0][item].hist(ax=ax,color="r",label=item)
    dftouse[dftouse["RESP_High_Graduation"]==1][item].hist(ax=ax,color="b",label=item)
    ax.legend(loc='upper right')


Writing Classifiers

We try out many different types of classifiers to predict high graduation rate, RESP_High_Graduation. We tried the classifiers from HW3 and Lab7.

We iteratively worked in this section and then determined more columns that needed to be removed, went back up to data filtering and exploratory analysis, then came back down to this section.

Linear SVM

In [52]:
#CITATION: From HW3
from sklearn.svm import LinearSVC

In [247]:
#CITATION: Adapted from HW3
clfsvm=LinearSVC(loss="hinge")
Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
Xmatrix=dftouse[lcols].values
Yresp=dftouse['RESP_High_Graduation'].values

In [248]:
#CITATION: From HW3
Xmatrix_train=Xmatrix[mask]
Xmatrix_test=Xmatrix[~mask]
Yresp_train=Yresp[mask]
Yresp_test=Yresp[~mask]

In [61]:
#CITATION: From HW3
from sklearn.grid_search import GridSearchCV

In [47]:
#CITATION: From HW3
def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best

In [48]:
#CITATION: From HW3
from sklearn.metrics import confusion_matrix
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    if mask !=None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split !=None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

In [49]:
#CITATION: From HW3
def nonzero_lasso(clf):
    featuremask=(clf.coef_ !=0.0)[0]
    return pd.DataFrame(dict(feature=lcols, coef=clf.coef_[0], abscoef=np.abs(clf.coef_[0])))[featuremask].sort('abscoef', ascending=False)

In [253]:
%%time
clfsvm, Xtrain, ytrain, Xtest, ytest = do_classify(LinearSVC(loss="hinge"), {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_High_Graduation',1, mask=mask)
#CITATION: Adapted from HW3


using mask
BEST {'C': 1.0} 0.802357928202 [mean: 0.75904, std: 0.00404, params: {'C': 0.001}, mean: 0.78421, std: 0.00563, params: {'C': 0.01}, mean: 0.79653, std: 0.00889, params: {'C': 0.1}, mean: 0.80236, std: 0.01111, params: {'C': 1.0}, mean: 0.78567, std: 0.00910, params: {'C': 10.0}, mean: 0.71347, std: 0.01607, params: {'C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.82
Accuracy on test data:     0.73
[[1992  431]
 [ 432  381]]
########################################################
CPU times: user 48.6 s, sys: 361 ms, total: 48.9 s
Wall time: 49.2 s
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:7: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.

In [254]:
#CITATION: From HW3
reuse_split=dict(Xtrain=Xtrain, Xtest=Xtest, ytrain=ytrain, ytest=ytest)

In [255]:
#CITATION: From HW3
print "OP=", ytest.sum(), ", ON=",ytest.shape[0] - ytest.sum()


OP= 813 , ON= 2423

In [256]:
#CITATION: From HW3
ypred=clfsvm.predict(Xtest)
mcr=round((confusion_matrix(ytest, ypred)[1][0]+confusion_matrix(ytest, ypred)[0][1])/float(confusion_matrix(ytest, ypred).sum()),2)
print "Cycling through the parameter grid of regularization coefficients in the Cs array, we discover that 1.0 has the greatest mean and results in a %0.2f miscalculation rate, which is a very good indicator that the classifier is worth persuing. " % (mcr)


Cycling through the parameter grid of regularization coefficients in the Cs array, we discover that 1.0 has the greatest mean and results in a 0.27 miscalculation rate, which is a very good indicator that the classifier is worth persuing. 
Log Regression

In [53]:
#CITATION: From HW3
from sklearn.linear_model import LogisticRegression

In [258]:
%%time
clflog,_,_,_,_  = do_classify(LogisticRegression(penalty="l1"), {"C": [0.001, 0.01, 0.1, 1, 10, 100]}, dftouse, lcols, 'RESP_High_Graduation', 1, reuse_split=reuse_split)
#CITATION: Adapted from HW3


using reuse split
BEST {'C': 10} 0.806994303881 [mean: 0.75189, std: 0.00023, params: {'C': 0.001}, mean: 0.78063, std: 0.00347, params: {'C': 0.01}, mean: 0.80183, std: 0.01056, params: {'C': 0.1}, mean: 0.80474, std: 0.00588, params: {'C': 1}, mean: 0.80699, std: 0.00524, params: {'C': 10}, mean: 0.80699, std: 0.00606, params: {'C': 100}]
############# based on standard predict ################
Accuracy on training data: 0.82
Accuracy on test data:     0.65
[[1674  749]
 [ 384  429]]
########################################################
CPU times: user 3min 10s, sys: 1.15 s, total: 3min 11s
Wall time: 3min 12s

Logistic regression returns very similar results.


In [25]:
#CITATION: From HW3
from sklearn.metrics import roc_curve, auc
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:#for stuff like logistic regression
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:#for stuff like SVM
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe!=None:
        for k in xrange(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax

We now begin to add our classifier models ROC curves in order to visually identify sets of classfiers.


In [260]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax=make_roc("logistic-with-lasso",clflog, ytest, Xtest, labe=200, skip=50)
    make_roc("svm-all-features",clfsvm, ytest, Xtest, ax, labe=200, proba=False, skip=50);


The logostic with lasso and svm models are both faily good predictors with all of the data provided, but there is question of which features are more correlated with the positive rates.


In [261]:
#CITATION: From HW3
lasso_importances=nonzero_lasso(clflog)
lasso_importances.set_index("feature", inplace=True)
lasso_importances.head(10)


Out[261]:
abscoef coef
feature
r_IEP 5.363398 -5.363398
r_st_IA 5.142588 -5.142588
r_st_SSSS 4.535939 4.535939
i_lgo_K 2.772037 2.772037
i_lgo_PK 2.743221 2.743221
r_st_TS 2.674446 -2.674446
r_st_SGC 2.625193 2.625193
r_st_TGC 2.599000 -2.599000
i_fin_sdlc_sec 2.120275 -2.120275
i_fin_sdlc_voc 2.018974 -2.018974
Feature Selection

In [72]:
from scipy.stats.stats import pearsonr

In [262]:
#CITATION: From HW3

correlations=[]
dftousetrain=dftouse[mask]
for col in lcols:
    r=pearsonr(dftousetrain[col], dftousetrain['RESP_High_Graduation'])[0]
    correlations.append(dict(feature=col,corr=r, abscorr=np.abs(r)))

bpdf=pd.DataFrame(correlations).sort('abscorr', ascending=False)
bpdf.set_index(['feature'], inplace=True)
bpdf.head(25)


Out[262]:
abscorr corr
feature
r_lunch_free 0.311269 -0.311269
r_stud_reg_12_W_F 0.286848 0.286848
r_stud_reg_12_W_M 0.280295 0.280295
r_stud_912 0.217694 0.217694
r_stud_18 0.210958 -0.210958
tlocrev_percent 0.205417 0.205417
tfedrev_percent 0.190213 -0.190213
r_stud_re_W 0.181687 0.181687
tlocrev_pp 0.172095 0.172095
r_stud_re_B 0.163758 -0.163758
r_frev_cna 0.158002 -0.158002
r_frev_title1 0.157105 -0.157105
r_frev_dis 0.156956 0.156956
tsrev_percent 0.153752 -0.153752
tcuroth_percent 0.125794 -0.125794
r_srev_sep 0.124077 0.124077
r_stud_reg_12_AAP_M 0.118350 0.118350
r_stud_re_AAP 0.112981 0.112981
i_ucl_suburb_large 0.109399 0.109399
r_stud_reg_12_AAP_F 0.109254 0.109254
r_stud_re_H 0.105209 -0.105209
num_pub_schools 0.105128 -0.105128
num_schools 0.105097 -0.105097
r_ELL 0.104771 -0.104771
r_srev_gfa 0.101644 -0.101644

The features with the greatest correlation to high graduation appear to be the percentage of students receiving free lunch and then the male/female ratios.


In [54]:
#CITATION: From HW3
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest

In [75]:
#CITATION: From HW3
def pearson_scorer(X,y):
    rs=np.zeros(X.shape[1])
    pvals=np.zeros(X.shape[1])
    i=0
    for v in X.T:
        rs[i], pvals[i]=pearsonr(v, y)
        i=i+1
    return np.abs(rs), pvals

In [265]:
#CITATION: From HW3
selectorlinearsvm = SelectKBest(k=25, score_func=pearson_scorer)
pipelinearsvm = Pipeline([('select', selectorlinearsvm), ('svm', LinearSVC(loss="hinge"))])

In [266]:
%%time
pipelinearsvm, _,_,_,_  = do_classify(pipelinearsvm, {"svm__C": [0.00001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_High_Graduation',1, reuse_split=reuse_split)
#CITATION: From HW3


using reuse split
BEST {'svm__C': 1.0} 0.768976023314 [mean: 0.71201, std: 0.00872, params: {'svm__C': 1e-05}, mean: 0.75758, std: 0.00388, params: {'svm__C': 0.001}, mean: 0.76249, std: 0.00383, params: {'svm__C': 0.01}, mean: 0.76725, std: 0.00431, params: {'svm__C': 0.1}, mean: 0.76898, std: 0.00531, params: {'svm__C': 1.0}, mean: 0.76553, std: 0.00443, params: {'svm__C': 10.0}, mean: 0.75705, std: 0.00949, params: {'svm__C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.77
Accuracy on test data:     0.77
[[2398   25]
 [ 706  107]]
########################################################
CPU times: user 10.2 s, sys: 164 ms, total: 10.3 s
Wall time: 10.7 s

In [267]:
#CITATION: From HW3
np.array(lcols)[pipelinearsvm.get_params()['select'].get_support()]


Out[267]:
array(['num_schools', 'num_pub_schools', 'tlocrev_pp', 'tcuroth_percent',
       'tfedrev_percent', 'tlocrev_percent', 'tsrev_percent',
       'i_ucl_suburb_large', 'r_ELL', 'r_lunch_free', 'r_stud_18',
       'r_stud_912', 'r_stud_re_AAP', 'r_stud_re_H', 'r_stud_re_B',
       'r_stud_re_W', 'r_stud_reg_12_AAP_M', 'r_stud_reg_12_AAP_F',
       'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F', 'r_srev_gfa',
       'r_srev_sep', 'r_frev_title1', 'r_frev_dis', 'r_frev_cna'], 
      dtype='|S45')

In [268]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax=make_roc("svm-all-features",clfsvm, ytest, Xtest, None, labe=250, proba=False, skip=50)
    make_roc("svm-feature-selected",pipelinearsvm, ytest, Xtest, ax, labe=250, proba=False, skip=50);
    make_roc("logistic-with-lasso",clflog, ytest, Xtest, ax, labe=250, proba=True,  skip=50);


As shown, the feature selected is the same, if not slightly worse, than the all features model.


In [269]:
#CITATION: From HW3
jtrain=np.arange(0, ytrain.shape[0])
n_pos=len(jtrain[ytrain==1])
n_neg=len(jtrain[ytrain==0])
print n_pos, n_neg


1873 5676

In [270]:
#CITATION: From HW3
ineg = np.random.choice(jtrain[ytrain==0], n_pos, replace=False)

In [271]:
#CITATION: From HW3
alli=np.concatenate((jtrain[ytrain==1], ineg))
alli.shape


Out[271]:
(3746,)

In [272]:
#CITATION: From HW3
Xtrain_new = Xtrain[alli]
ytrain_new = ytrain[alli]
Xtrain_new.shape, ytrain_new.shape


Out[272]:
((3746, 154), (3746,))

In [273]:
#CITATION: From HW3
reuse_split_new=dict(Xtrain=Xtrain_new, Xtest=Xtest, ytrain=ytrain_new, ytest=ytest)

In [274]:
%%time
clfsvm_b, _,_,_,_  = do_classify(LinearSVC(loss="hinge"), {"C": [0.00001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_High_Graduation',1, reuse_split=reuse_split_new)
#CITATION: From HW3


using reuse split
BEST {'C': 0.1} 0.735451147891 [mean: 0.68740, std: 0.01768, params: {'C': 1e-05}, mean: 0.70849, std: 0.02449, params: {'C': 0.001}, mean: 0.72744, std: 0.02026, params: {'C': 0.01}, mean: 0.73545, std: 0.01870, params: {'C': 0.1}, mean: 0.73518, std: 0.01428, params: {'C': 1.0}, mean: 0.72904, std: 0.01298, params: {'C': 10.0}, mean: 0.64255, std: 0.04169, params: {'C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.76
Accuracy on test data:     0.71
[[1701  722]
 [ 220  593]]
########################################################
CPU times: user 17.3 s, sys: 159 ms, total: 17.5 s
Wall time: 17.6 s

In [275]:
#CITATION: From HW3
ax = make_roc("svm-all-features",clfsvm, ytest, Xtest, None, labe=250, proba=False, skip=50)
make_roc("svm-feature-selected",pipelinearsvm, ytest, Xtest, ax, labe=250, proba=False, skip=50);
make_roc("svm-all-features-balanced",clfsvm_b, ytest, Xtest, ax, labe=250, proba=False, skip=50);


Kernalized SVM

In [55]:
#CITATION: From HW3
from sklearn.svm import SVC

In [277]:
#CITATION: From HW3
selectorsvm2 = SelectKBest(k=25, score_func=pearson_scorer)
pipesvm2 = Pipeline([('select2', selectorsvm2), ('svm2', SVC())])

In [278]:
#CITATION: From HW3
jtrain_new=np.arange(0, ytrain_new.shape[0])
ipos_new = np.random.choice(jtrain_new[ytrain_new==1], 300, replace=False)
ineg_new = np.random.choice(jtrain_new[ytrain_new==0], 300, replace=False)
subsampled_i=np.concatenate((ipos_new,ineg_new))
Xtrain_new2=Xtrain_new[subsampled_i]
ytrain_new2=ytrain_new[subsampled_i]

In [279]:
#CITATION: From HW3
reuse_split_subsampled=dict(Xtrain=Xtrain_new2, Xtest=Xtest, ytrain=ytrain_new2, ytest=ytest)

In [280]:
%%time
pipesvm2, _,_,_,_  = do_classify(pipesvm2, {"svm2__C": [1e8],
                                              "svm2__gamma":[1e-5, 1e-7, 1e-9]}, 
                                 dftouse,lcols, 'RESP_High_Graduation',1, reuse_split=reuse_split_subsampled)
#CITATION: From HW3


using reuse split
BEST {'svm2__C': 100000000.0, 'svm2__gamma': 1e-07} 0.673333333333 [mean: 0.66000, std: 0.03472, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-05}, mean: 0.67333, std: 0.03590, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-07}, mean: 0.61000, std: 0.05385, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-09}]
############# based on standard predict ################
Accuracy on training data: 0.71
Accuracy on test data:     0.70
[[1726  697]
 [ 263  550]]
########################################################
CPU times: user 9.15 s, sys: 62 ms, total: 9.21 s
Wall time: 9.32 s

In [281]:
#CITATION: From HW3
gamma_wanted=pipesvm2.get_params()['svm2__gamma']
C_chosen=pipesvm2.get_params()['svm2__C']
print gamma_wanted, C_chosen
selectorsvm3 = SelectKBest(k=25, score_func=pearson_scorer)
pipesvm3 = Pipeline([('select3', selectorsvm3), ('svm3', SVC(C=C_chosen, gamma=gamma_wanted))])
pipesvm3, _,_,_,_  = do_classify(pipesvm3, None, 
                                 dftouse,lcols, 'RESP_High_Graduation',1, reuse_split=reuse_split_new)


1e-07 100000000.0
using reuse split
############# based on standard predict ################
Accuracy on training data: 0.72
Accuracy on test data:     0.71
[[1710  713]
 [ 231  582]]
########################################################

In [283]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax = make_roc("logistic-with-lasso",clflog, ytest, Xtest, None, labe=300, skip=50)
    make_roc("rbf-svm-feature-selected-balanced",pipesvm3, ytest, Xtest, ax, labe=None, proba=False, skip=50);
    make_roc("svm-all-features-balanced",clfsvm_b, ytest, Xtest, ax, labe=250, proba=False, skip=50);



In [284]:
Xtraina = Xtrain 
ytraina = ytrain 
Xtesta = Xtest 
ytesta = ytest

Decision Trees, Random Forest, ADA & Gradient Boost :

For all these models we use 2 different approach.

  1. Modelling using Funding/Expenditure/Location/School Types and Race/Sex. The goal here is to find the best predicted.
  2. Modelling using Funding/Expenditure/Location/School Types but no Race/Sex. The goal here to give reccomendations to schools on factors they can focus to improve their performance. Additional to compare how much impact does Race/Sex to imporove graduation rate prediction.

In [14]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix

def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best

def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    if mask !=None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split !=None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

Decision Trees 1


In [15]:
# Indicators used : Funding/Expenditure/Location/School Types and Race/Sex
Xnames1 = [
    'pupil_teacher_ratio_dist',
            'totalrev_pp',
            'tcurinst_pp',
            'tcurssv_pp',
            'tcursalary_pp',
            'tcurbenefits_pp',
            'totalexp_pp',
            'tcapout_pp',
            'tnonelse_pp',
            'tcurelsc_pp',
            'instexp_pp',
            'i_agency_type_local_school_district',
            'i_agency_type_local_school_district_sup_union',
            'i_agency_type_regional_education_services',
            'i_agency_type_charter_school_agency',
            'i_fin_sdlc_sec',
            'i_fin_sdlc_elem_sec',
            'i_fin_sdlc_voc',
            'i_ucl_city_large',
            'i_ucl_city_mid',
            'i_ucl_city_small',
            'i_ucl_suburb_large',
            'i_ucl_suburb_mid',
            'i_ucl_suburb_small',
            'i_ucl_town_fringe',
            'i_ucl_town_distant',
            'i_ucl_town_remote',
            'i_ucl_rural_fringe',
            'i_ucl_rural_distant',
            'i_ucl_rural_remote',
            'i_cs_all_charter',
            'i_cs_charter_noncharter',
            'i_cs_all_noncharter',
            'i_ma_ne_nr',
            'i_ma_metropolitan',
            'i_ma_micropolitan',            
            'r_ELL',
            'r_IEP',
            'r_stud_re_M',
            'r_stud_re_F',
            'r_stud_re_AIAN',
            'r_stud_re_AAP',
            'r_stud_re_H',
            'r_stud_re_B',
            'r_stud_re_W',
            'r_stud_re_HNPI',
            'r_stud_re_Two',
            'r_lunch_free',
            'r_lunch_reduced'
]

target1 = 'RESP_Low_Graduation'

In [16]:
from sklearn import tree
# Decision Trees
clfTree1 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
clfTree1, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree1, parameters, dftouse, 
                                                     Xnames1, target, 1, 
                                                     mask=mask, n_jobs = 4, score_func = 'f1')

importance_list = clfTree1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Descision Trees 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'max_depth': 7, 'min_samples_leaf': 3} 0.368356617659 [mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 1}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 2}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 3}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 4}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 5}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 6}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 7}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 8}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 9}, mean: 0.26186, std: 0.21514, params: {'max_depth': 1, 'min_samples_leaf': 10}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 1}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 2}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 3}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 4}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 5}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 6}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 7}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 8}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 9}, mean: 0.25946, std: 0.07400, params: {'max_depth': 2, 'min_samples_leaf': 10}, mean: 0.32760, std: 0.04714, params: {'max_depth': 3, 'min_samples_leaf': 1}, mean: 0.32760, std: 0.04714, params: {'max_depth': 3, 'min_samples_leaf': 2}, mean: 0.32939, std: 0.04802, params: {'max_depth': 3, 'min_samples_leaf': 3}, mean: 0.32760, std: 0.04714, params: {'max_depth': 3, 'min_samples_leaf': 4}, mean: 0.32939, std: 0.04802, params: {'max_depth': 3, 'min_samples_leaf': 5}, mean: 0.32939, std: 0.04802, params: {'max_depth': 3, 'min_samples_leaf': 6}, mean: 0.32760, std: 0.04714, params: {'max_depth': 3, 'min_samples_leaf': 7}, mean: 0.32760, std: 0.04714, params: {'max_depth': 3, 'min_samples_leaf': 8}, mean: 0.32939, std: 0.04802, params: {'max_depth': 3, 'min_samples_leaf': 9}, mean: 0.32939, std: 0.04802, params: {'max_depth': 3, 'min_samples_leaf': 10}, mean: 0.34832, std: 0.03633, params: {'max_depth': 4, 'min_samples_leaf': 1}, mean: 0.34844, std: 0.03616, params: {'max_depth': 4, 'min_samples_leaf': 2}, mean: 0.34414, std: 0.03025, params: {'max_depth': 4, 'min_samples_leaf': 3}, mean: 0.35015, std: 0.03449, params: {'max_depth': 4, 'min_samples_leaf': 4}, mean: 0.34473, std: 0.02669, params: {'max_depth': 4, 'min_samples_leaf': 5}, mean: 0.35302, std: 0.03058, params: {'max_depth': 4, 'min_samples_leaf': 6}, mean: 0.34760, std: 0.02210, params: {'max_depth': 4, 'min_samples_leaf': 7}, mean: 0.35302, std: 0.03058, params: {'max_depth': 4, 'min_samples_leaf': 8}, mean: 0.35302, std: 0.03058, params: {'max_depth': 4, 'min_samples_leaf': 9}, mean: 0.35302, std: 0.03058, params: {'max_depth': 4, 'min_samples_leaf': 10}, mean: 0.35219, std: 0.02523, params: {'max_depth': 5, 'min_samples_leaf': 1}, mean: 0.35005, std: 0.03081, params: {'max_depth': 5, 'min_samples_leaf': 2}, mean: 0.35076, std: 0.03012, params: {'max_depth': 5, 'min_samples_leaf': 3}, mean: 0.35312, std: 0.02523, params: {'max_depth': 5, 'min_samples_leaf': 4}, mean: 0.34822, std: 0.02932, params: {'max_depth': 5, 'min_samples_leaf': 5}, mean: 0.34646, std: 0.02731, params: {'max_depth': 5, 'min_samples_leaf': 6}, mean: 0.34830, std: 0.02673, params: {'max_depth': 5, 'min_samples_leaf': 7}, mean: 0.34627, std: 0.02590, params: {'max_depth': 5, 'min_samples_leaf': 8}, mean: 0.34781, std: 0.02677, params: {'max_depth': 5, 'min_samples_leaf': 9}, mean: 0.34930, std: 0.02586, params: {'max_depth': 5, 'min_samples_leaf': 10}, mean: 0.36798, std: 0.02460, params: {'max_depth': 6, 'min_samples_leaf': 1}, mean: 0.36493, std: 0.02550, params: {'max_depth': 6, 'min_samples_leaf': 2}, mean: 0.36662, std: 0.02532, params: {'max_depth': 6, 'min_samples_leaf': 3}, mean: 0.36549, std: 0.02194, params: {'max_depth': 6, 'min_samples_leaf': 4}, mean: 0.36602, std: 0.02366, params: {'max_depth': 6, 'min_samples_leaf': 5}, mean: 0.36477, std: 0.02411, params: {'max_depth': 6, 'min_samples_leaf': 6}, mean: 0.36310, std: 0.01859, params: {'max_depth': 6, 'min_samples_leaf': 7}, mean: 0.36325, std: 0.02090, params: {'max_depth': 6, 'min_samples_leaf': 8}, mean: 0.35666, std: 0.02050, params: {'max_depth': 6, 'min_samples_leaf': 9}, mean: 0.36040, std: 0.02231, params: {'max_depth': 6, 'min_samples_leaf': 10}, mean: 0.36681, std: 0.02360, params: {'max_depth': 7, 'min_samples_leaf': 1}, mean: 0.36634, std: 0.02744, params: {'max_depth': 7, 'min_samples_leaf': 2}, mean: 0.36836, std: 0.02204, params: {'max_depth': 7, 'min_samples_leaf': 3}, mean: 0.35668, std: 0.02855, params: {'max_depth': 7, 'min_samples_leaf': 4}, mean: 0.36152, std: 0.03450, params: {'max_depth': 7, 'min_samples_leaf': 5}, mean: 0.36490, std: 0.03233, params: {'max_depth': 7, 'min_samples_leaf': 6}, mean: 0.35973, std: 0.02870, params: {'max_depth': 7, 'min_samples_leaf': 7}, mean: 0.35822, std: 0.02311, params: {'max_depth': 7, 'min_samples_leaf': 8}, mean: 0.35677, std: 0.03355, params: {'max_depth': 7, 'min_samples_leaf': 9}, mean: 0.35942, std: 0.02684, params: {'max_depth': 7, 'min_samples_leaf': 10}]
############# based on standard predict ################
Accuracy on training data: 0.81
Accuracy on test data:     0.77
[[2275  163]
 [ 587  211]]
########################################################
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

The decision tree model has allowed us to identifity the relative importance of each indicator provided. As shown, the ethnicty of the students population plays a role in the model along with the lunch program.

Random Forests

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [18]:
# Random Forests


clfForest1 = RandomForestClassifier()

parameters = {"n_estimators": range(1, 10)}
clfForest1, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest1, parameters, 
                                                       dftouse, Xnames1, target, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfForest1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 3} 0.407021120254 [mean: 0.38425, std: 0.01587, params: {'n_estimators': 1}, mean: 0.28604, std: 0.02757, params: {'n_estimators': 2}, mean: 0.40702, std: 0.02765, params: {'n_estimators': 3}, mean: 0.33906, std: 0.01401, params: {'n_estimators': 4}, mean: 0.38797, std: 0.01678, params: {'n_estimators': 5}, mean: 0.34087, std: 0.01803, params: {'n_estimators': 6}, mean: 0.39619, std: 0.02247, params: {'n_estimators': 7}, mean: 0.35074, std: 0.01246, params: {'n_estimators': 8}, mean: 0.40522, std: 0.01135, params: {'n_estimators': 9}]
############# based on standard predict ################
Accuracy on training data: 0.95
Accuracy on test data:     0.73
[[2041  397]
 [ 481  317]]
########################################################
ADA Booster

In [19]:
from sklearn.ensemble import AdaBoostClassifier

In [20]:
clfAda1 = AdaBoostClassifier()

parameters = {"n_estimators": range(10, 60)}
clfAda1, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda1, parameters, 
                                                       dftouse, Xnames1, target, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfAda1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the ADA Boost 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 12} 0.402341294418 [mean: 0.38318, std: 0.02673, params: {'n_estimators': 10}, mean: 0.37775, std: 0.03292, params: {'n_estimators': 11}, mean: 0.40234, std: 0.02428, params: {'n_estimators': 12}, mean: 0.36862, std: 0.03311, params: {'n_estimators': 13}, mean: 0.37309, std: 0.02079, params: {'n_estimators': 14}, mean: 0.35022, std: 0.01284, params: {'n_estimators': 15}, mean: 0.37993, std: 0.02576, params: {'n_estimators': 16}, mean: 0.37402, std: 0.02746, params: {'n_estimators': 17}, mean: 0.38581, std: 0.02061, params: {'n_estimators': 18}, mean: 0.38692, std: 0.02421, params: {'n_estimators': 19}, mean: 0.39861, std: 0.01408, params: {'n_estimators': 20}, mean: 0.39225, std: 0.01116, params: {'n_estimators': 21}, mean: 0.39471, std: 0.00937, params: {'n_estimators': 22}, mean: 0.38257, std: 0.02343, params: {'n_estimators': 23}, mean: 0.38159, std: 0.01771, params: {'n_estimators': 24}, mean: 0.39120, std: 0.01397, params: {'n_estimators': 25}, mean: 0.38870, std: 0.01355, params: {'n_estimators': 26}, mean: 0.38938, std: 0.01362, params: {'n_estimators': 27}, mean: 0.39060, std: 0.01227, params: {'n_estimators': 28}, mean: 0.38877, std: 0.01524, params: {'n_estimators': 29}, mean: 0.38846, std: 0.01297, params: {'n_estimators': 30}, mean: 0.38707, std: 0.00779, params: {'n_estimators': 31}, mean: 0.38502, std: 0.01007, params: {'n_estimators': 32}, mean: 0.38308, std: 0.00828, params: {'n_estimators': 33}, mean: 0.39407, std: 0.00738, params: {'n_estimators': 34}, mean: 0.39099, std: 0.00867, params: {'n_estimators': 35}, mean: 0.40118, std: 0.00968, params: {'n_estimators': 36}, mean: 0.39569, std: 0.00750, params: {'n_estimators': 37}, mean: 0.39681, std: 0.01101, params: {'n_estimators': 38}, mean: 0.39480, std: 0.01135, params: {'n_estimators': 39}, mean: 0.39207, std: 0.01305, params: {'n_estimators': 40}, mean: 0.39542, std: 0.01154, params: {'n_estimators': 41}, mean: 0.39040, std: 0.01349, params: {'n_estimators': 42}, mean: 0.39491, std: 0.01254, params: {'n_estimators': 43}, mean: 0.39172, std: 0.01337, params: {'n_estimators': 44}, mean: 0.39492, std: 0.01512, params: {'n_estimators': 45}, mean: 0.39656, std: 0.01877, params: {'n_estimators': 46}, mean: 0.39377, std: 0.01560, params: {'n_estimators': 47}, mean: 0.39508, std: 0.01037, params: {'n_estimators': 48}, mean: 0.39645, std: 0.01198, params: {'n_estimators': 49}, mean: 0.39488, std: 0.00918, params: {'n_estimators': 50}, mean: 0.39465, std: 0.00888, params: {'n_estimators': 51}, mean: 0.39858, std: 0.01318, params: {'n_estimators': 52}, mean: 0.39818, std: 0.01096, params: {'n_estimators': 53}, mean: 0.39075, std: 0.01628, params: {'n_estimators': 54}, mean: 0.39429, std: 0.01785, params: {'n_estimators': 55}, mean: 0.39389, std: 0.01624, params: {'n_estimators': 56}, mean: 0.39286, std: 0.01646, params: {'n_estimators': 57}, mean: 0.39759, std: 0.02205, params: {'n_estimators': 58}, mean: 0.40109, std: 0.01936, params: {'n_estimators': 59}]
############# based on standard predict ################
Accuracy on training data: 0.77
Accuracy on test data:     0.77
[[2261  177]
 [ 569  229]]
########################################################
Gradient Boosting

In [22]:
from sklearn.ensemble import GradientBoostingClassifier

In [23]:
# Gradient Boosting
clfGB1 = GradientBoostingClassifier()

parameters = {"n_estimators": range(50, 60), "max_depth": [5, 6, 7, 8, 9, 10]}
clfGB1, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB1, parameters, 
                                                       dftouse, Xnames1, target, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfGB1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Gradient Boosting 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 55, 'max_depth': 10} 0.417774174679 [mean: 0.38172, std: 0.01349, params: {'n_estimators': 50, 'max_depth': 5}, mean: 0.39014, std: 0.01600, params: {'n_estimators': 51, 'max_depth': 5}, mean: 0.39457, std: 0.00629, params: {'n_estimators': 52, 'max_depth': 5}, mean: 0.38796, std: 0.01452, params: {'n_estimators': 53, 'max_depth': 5}, mean: 0.39074, std: 0.01402, params: {'n_estimators': 54, 'max_depth': 5}, mean: 0.39349, std: 0.00485, params: {'n_estimators': 55, 'max_depth': 5}, mean: 0.38908, std: 0.00959, params: {'n_estimators': 56, 'max_depth': 5}, mean: 0.39374, std: 0.01186, params: {'n_estimators': 57, 'max_depth': 5}, mean: 0.38591, std: 0.01052, params: {'n_estimators': 58, 'max_depth': 5}, mean: 0.39142, std: 0.00821, params: {'n_estimators': 59, 'max_depth': 5}, mean: 0.39660, std: 0.00838, params: {'n_estimators': 50, 'max_depth': 6}, mean: 0.39938, std: 0.00812, params: {'n_estimators': 51, 'max_depth': 6}, mean: 0.40412, std: 0.01302, params: {'n_estimators': 52, 'max_depth': 6}, mean: 0.39590, std: 0.00959, params: {'n_estimators': 53, 'max_depth': 6}, mean: 0.39900, std: 0.00870, params: {'n_estimators': 54, 'max_depth': 6}, mean: 0.39762, std: 0.01158, params: {'n_estimators': 55, 'max_depth': 6}, mean: 0.39375, std: 0.00857, params: {'n_estimators': 56, 'max_depth': 6}, mean: 0.40240, std: 0.00786, params: {'n_estimators': 57, 'max_depth': 6}, mean: 0.38930, std: 0.00988, params: {'n_estimators': 58, 'max_depth': 6}, mean: 0.40122, std: 0.01927, params: {'n_estimators': 59, 'max_depth': 6}, mean: 0.40269, std: 0.01719, params: {'n_estimators': 50, 'max_depth': 7}, mean: 0.40064, std: 0.02135, params: {'n_estimators': 51, 'max_depth': 7}, mean: 0.40146, std: 0.01893, params: {'n_estimators': 52, 'max_depth': 7}, mean: 0.40320, std: 0.01567, params: {'n_estimators': 53, 'max_depth': 7}, mean: 0.38941, std: 0.02220, params: {'n_estimators': 54, 'max_depth': 7}, mean: 0.39864, std: 0.01140, params: {'n_estimators': 55, 'max_depth': 7}, mean: 0.40578, std: 0.01265, params: {'n_estimators': 56, 'max_depth': 7}, mean: 0.39959, std: 0.00627, params: {'n_estimators': 57, 'max_depth': 7}, mean: 0.40283, std: 0.01226, params: {'n_estimators': 58, 'max_depth': 7}, mean: 0.40039, std: 0.01282, params: {'n_estimators': 59, 'max_depth': 7}, mean: 0.40520, std: 0.01801, params: {'n_estimators': 50, 'max_depth': 8}, mean: 0.39808, std: 0.00862, params: {'n_estimators': 51, 'max_depth': 8}, mean: 0.40630, std: 0.02271, params: {'n_estimators': 52, 'max_depth': 8}, mean: 0.40586, std: 0.01122, params: {'n_estimators': 53, 'max_depth': 8}, mean: 0.40049, std: 0.01650, params: {'n_estimators': 54, 'max_depth': 8}, mean: 0.40031, std: 0.01861, params: {'n_estimators': 55, 'max_depth': 8}, mean: 0.40903, std: 0.01357, params: {'n_estimators': 56, 'max_depth': 8}, mean: 0.39770, std: 0.00764, params: {'n_estimators': 57, 'max_depth': 8}, mean: 0.39564, std: 0.01394, params: {'n_estimators': 58, 'max_depth': 8}, mean: 0.40505, std: 0.00816, params: {'n_estimators': 59, 'max_depth': 8}, mean: 0.40222, std: 0.01437, params: {'n_estimators': 50, 'max_depth': 9}, mean: 0.41006, std: 0.01682, params: {'n_estimators': 51, 'max_depth': 9}, mean: 0.40620, std: 0.01794, params: {'n_estimators': 52, 'max_depth': 9}, mean: 0.41213, std: 0.01142, params: {'n_estimators': 53, 'max_depth': 9}, mean: 0.40361, std: 0.01343, params: {'n_estimators': 54, 'max_depth': 9}, mean: 0.39907, std: 0.00558, params: {'n_estimators': 55, 'max_depth': 9}, mean: 0.41041, std: 0.01767, params: {'n_estimators': 56, 'max_depth': 9}, mean: 0.40751, std: 0.01925, params: {'n_estimators': 57, 'max_depth': 9}, mean: 0.40779, std: 0.01381, params: {'n_estimators': 58, 'max_depth': 9}, mean: 0.39807, std: 0.01271, params: {'n_estimators': 59, 'max_depth': 9}, mean: 0.39505, std: 0.01670, params: {'n_estimators': 50, 'max_depth': 10}, mean: 0.39867, std: 0.00629, params: {'n_estimators': 51, 'max_depth': 10}, mean: 0.40276, std: 0.02488, params: {'n_estimators': 52, 'max_depth': 10}, mean: 0.40581, std: 0.00429, params: {'n_estimators': 53, 'max_depth': 10}, mean: 0.39652, std: 0.01997, params: {'n_estimators': 54, 'max_depth': 10}, mean: 0.41777, std: 0.00820, params: {'n_estimators': 55, 'max_depth': 10}, mean: 0.40286, std: 0.00823, params: {'n_estimators': 56, 'max_depth': 10}, mean: 0.39941, std: 0.01452, params: {'n_estimators': 57, 'max_depth': 10}, mean: 0.39348, std: 0.00890, params: {'n_estimators': 58, 'max_depth': 10}, mean: 0.40005, std: 0.00907, params: {'n_estimators': 59, 'max_depth': 10}]
############# based on standard predict ################
Accuracy on training data: 1.00
Accuracy on test data:     0.78
[[2277  161]
 [ 553  245]]
########################################################

In [26]:
# Plotting ROC Curves

with sns.color_palette("dark"):
    ax=make_roc("Descision Trees 1",clfTree1  , ytest, Xtest, None, labe=250, proba=True)
    make_roc("Random Forest 1"     ,clfForest1, ytest, Xtest, ax  , labe=250, proba=True);
    make_roc("ADA Boost 1"         ,clfAda1   , ytest, Xtest, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 1"    ,clfGB1    , ytest, Xtest, ax  , labe=250, proba=True, skip=50);



In [27]:
Xtrainb = Xtrain 
ytrainb = ytrain 
Xtestb = Xtest 
ytestb = ytest
Decision Tree - No Gender or Ethnicity

In [39]:
# Indicators used : Funding/Expenditure/Location/School Types (no Race)

Xnames2 = [
            'pupil_teacher_ratio_dist',
            'totalrev_pp',
            'tcurinst_pp',
            'tcurssv_pp',
            'tcursalary_pp',
            'tcurbenefits_pp',
            'totalexp_pp',
            'tcapout_pp',
            'tnonelse_pp',
            'tcurelsc_pp',
            'instexp_pp',
            'i_agency_type_local_school_district',
            'i_agency_type_local_school_district_sup_union',
            'i_agency_type_regional_education_services',
            'i_agency_type_charter_school_agency',
            'i_fin_sdlc_sec',
            'i_fin_sdlc_elem_sec',
            'i_fin_sdlc_voc',
            'i_ucl_city_large',
            'i_ucl_city_mid',
            'i_ucl_city_small',
            'i_ucl_suburb_large',
            'i_ucl_suburb_mid',
            'i_ucl_suburb_small',
            'i_ucl_town_fringe',
            'i_ucl_town_distant',
            'i_ucl_town_remote',
            'i_ucl_rural_fringe',
            'i_ucl_rural_distant',
            'i_ucl_rural_remote',
            'i_cs_all_charter',
            'i_cs_charter_noncharter',
            'i_cs_all_noncharter',
            'i_ma_ne_nr',
            'i_ma_metropolitan',
            'i_ma_micropolitan',            
            'r_ELL',
            'r_IEP',
            'r_lunch_free',
            'r_lunch_reduced'
]

target2 = 'RESP_High_Graduation'

In [297]:
# Descision Tree
clfTree2 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
clfTree2, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree2, parameters, dftouse, 
                                                     Xnames2, target2, 1, 
                                                     mask=mask, n_jobs = 4, score_func = 'f1')

importance_list = clfTree2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Descision Trees 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'max_depth': 7, 'min_samples_leaf': 4} 0.404466675783 [mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 1}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 2}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 3}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 4}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 5}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 6}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 7}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 8}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 9}, mean: 0.08859, std: 0.17727, params: {'max_depth': 1, 'min_samples_leaf': 10}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 1}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 2}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 3}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 4}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 5}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 6}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 7}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 8}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 9}, mean: 0.32678, std: 0.01887, params: {'max_depth': 2, 'min_samples_leaf': 10}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 1}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 2}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 3}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 4}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 5}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 6}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 7}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 8}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 9}, mean: 0.35588, std: 0.01961, params: {'max_depth': 3, 'min_samples_leaf': 10}, mean: 0.36258, std: 0.01657, params: {'max_depth': 4, 'min_samples_leaf': 1}, mean: 0.36258, std: 0.01657, params: {'max_depth': 4, 'min_samples_leaf': 2}, mean: 0.36258, std: 0.01657, params: {'max_depth': 4, 'min_samples_leaf': 3}, mean: 0.36258, std: 0.01657, params: {'max_depth': 4, 'min_samples_leaf': 4}, mean: 0.36258, std: 0.01657, params: {'max_depth': 4, 'min_samples_leaf': 5}, mean: 0.36205, std: 0.01761, params: {'max_depth': 4, 'min_samples_leaf': 6}, mean: 0.36173, std: 0.01740, params: {'max_depth': 4, 'min_samples_leaf': 7}, mean: 0.36173, std: 0.01740, params: {'max_depth': 4, 'min_samples_leaf': 8}, mean: 0.36173, std: 0.01740, params: {'max_depth': 4, 'min_samples_leaf': 9}, mean: 0.36173, std: 0.01740, params: {'max_depth': 4, 'min_samples_leaf': 10}, mean: 0.37541, std: 0.03261, params: {'max_depth': 5, 'min_samples_leaf': 1}, mean: 0.37602, std: 0.03325, params: {'max_depth': 5, 'min_samples_leaf': 2}, mean: 0.37638, std: 0.03316, params: {'max_depth': 5, 'min_samples_leaf': 3}, mean: 0.37745, std: 0.03259, params: {'max_depth': 5, 'min_samples_leaf': 4}, mean: 0.37732, std: 0.03231, params: {'max_depth': 5, 'min_samples_leaf': 5}, mean: 0.37696, std: 0.03256, params: {'max_depth': 5, 'min_samples_leaf': 6}, mean: 0.37730, std: 0.03269, params: {'max_depth': 5, 'min_samples_leaf': 7}, mean: 0.37816, std: 0.03343, params: {'max_depth': 5, 'min_samples_leaf': 8}, mean: 0.38252, std: 0.02896, params: {'max_depth': 5, 'min_samples_leaf': 9}, mean: 0.38215, std: 0.02910, params: {'max_depth': 5, 'min_samples_leaf': 10}, mean: 0.36626, std: 0.02095, params: {'max_depth': 6, 'min_samples_leaf': 1}, mean: 0.36740, std: 0.02077, params: {'max_depth': 6, 'min_samples_leaf': 2}, mean: 0.36656, std: 0.02024, params: {'max_depth': 6, 'min_samples_leaf': 3}, mean: 0.36680, std: 0.02071, params: {'max_depth': 6, 'min_samples_leaf': 4}, mean: 0.36909, std: 0.01834, params: {'max_depth': 6, 'min_samples_leaf': 5}, mean: 0.37047, std: 0.01862, params: {'max_depth': 6, 'min_samples_leaf': 6}, mean: 0.36900, std: 0.01829, params: {'max_depth': 6, 'min_samples_leaf': 7}, mean: 0.36803, std: 0.01789, params: {'max_depth': 6, 'min_samples_leaf': 8}, mean: 0.36550, std: 0.02135, params: {'max_depth': 6, 'min_samples_leaf': 9}, mean: 0.36722, std: 0.02480, params: {'max_depth': 6, 'min_samples_leaf': 10}, mean: 0.40340, std: 0.01814, params: {'max_depth': 7, 'min_samples_leaf': 1}, mean: 0.40332, std: 0.01432, params: {'max_depth': 7, 'min_samples_leaf': 2}, mean: 0.40415, std: 0.01875, params: {'max_depth': 7, 'min_samples_leaf': 3}, mean: 0.40447, std: 0.02007, params: {'max_depth': 7, 'min_samples_leaf': 4}, mean: 0.39632, std: 0.02029, params: {'max_depth': 7, 'min_samples_leaf': 5}, mean: 0.39600, std: 0.02059, params: {'max_depth': 7, 'min_samples_leaf': 6}, mean: 0.39967, std: 0.02077, params: {'max_depth': 7, 'min_samples_leaf': 7}, mean: 0.39670, std: 0.01225, params: {'max_depth': 7, 'min_samples_leaf': 8}, mean: 0.39187, std: 0.02176, params: {'max_depth': 7, 'min_samples_leaf': 9}, mean: 0.39057, std: 0.01569, params: {'max_depth': 7, 'min_samples_leaf': 10}]
############# based on standard predict ################
Accuracy on training data: 0.81
Accuracy on test data:     0.76
[[2237  186]
 [ 582  231]]
########################################################
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Random Forests - No Gender/Ethnicity

In [298]:
# Random Forests
clfForest2 = RandomForestClassifier()

parameters = {"n_estimators": range(1, 10)}
clfForest2, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest2, parameters, 
                                                       dftouse, Xnames2, target2, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfForest2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 9} 0.405009158276 [mean: 0.38174, std: 0.02190, params: {'n_estimators': 1}, mean: 0.26841, std: 0.02676, params: {'n_estimators': 2}, mean: 0.39164, std: 0.00880, params: {'n_estimators': 3}, mean: 0.32942, std: 0.03048, params: {'n_estimators': 4}, mean: 0.40304, std: 0.01602, params: {'n_estimators': 5}, mean: 0.34480, std: 0.02156, params: {'n_estimators': 6}, mean: 0.39491, std: 0.02138, params: {'n_estimators': 7}, mean: 0.35334, std: 0.01997, params: {'n_estimators': 8}, mean: 0.40501, std: 0.00968, params: {'n_estimators': 9}]
############# based on standard predict ################
Accuracy on training data: 0.99
Accuracy on test data:     0.75
[[2146  277]
 [ 543  270]]
########################################################

Random Forest in both the models (with and without Race/Sex) appears to be suffering from overfitting since their is large difference in accuracy between the training and test data.

ADA Booster - No Gender/Ethnicity

In [299]:
# ADA Booster
clfAda2 = AdaBoostClassifier()

parameters = {"n_estimators": range(10, 60)}
clfAda2, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda2, parameters, 
                                                       dftouse, Xnames2, target2, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfAda2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the ADA Boost 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 53} 0.398388646449 [mean: 0.33907, std: 0.03363, params: {'n_estimators': 10}, mean: 0.34869, std: 0.04241, params: {'n_estimators': 11}, mean: 0.33673, std: 0.04999, params: {'n_estimators': 12}, mean: 0.37593, std: 0.03186, params: {'n_estimators': 13}, mean: 0.37198, std: 0.02968, params: {'n_estimators': 14}, mean: 0.37129, std: 0.02653, params: {'n_estimators': 15}, mean: 0.37542, std: 0.02156, params: {'n_estimators': 16}, mean: 0.37941, std: 0.02594, params: {'n_estimators': 17}, mean: 0.37472, std: 0.02775, params: {'n_estimators': 18}, mean: 0.37633, std: 0.03438, params: {'n_estimators': 19}, mean: 0.36856, std: 0.02291, params: {'n_estimators': 20}, mean: 0.37031, std: 0.03517, params: {'n_estimators': 21}, mean: 0.37255, std: 0.02711, params: {'n_estimators': 22}, mean: 0.37443, std: 0.02831, params: {'n_estimators': 23}, mean: 0.37711, std: 0.03307, params: {'n_estimators': 24}, mean: 0.37369, std: 0.03537, params: {'n_estimators': 25}, mean: 0.36955, std: 0.03297, params: {'n_estimators': 26}, mean: 0.37352, std: 0.03833, params: {'n_estimators': 27}, mean: 0.36663, std: 0.03174, params: {'n_estimators': 28}, mean: 0.36070, std: 0.03031, params: {'n_estimators': 29}, mean: 0.36575, std: 0.03056, params: {'n_estimators': 30}, mean: 0.36437, std: 0.02877, params: {'n_estimators': 31}, mean: 0.36522, std: 0.03516, params: {'n_estimators': 32}, mean: 0.36927, std: 0.02272, params: {'n_estimators': 33}, mean: 0.37303, std: 0.02792, params: {'n_estimators': 34}, mean: 0.37270, std: 0.02265, params: {'n_estimators': 35}, mean: 0.37502, std: 0.02641, params: {'n_estimators': 36}, mean: 0.36913, std: 0.03563, params: {'n_estimators': 37}, mean: 0.38159, std: 0.02188, params: {'n_estimators': 38}, mean: 0.38421, std: 0.02241, params: {'n_estimators': 39}, mean: 0.38500, std: 0.02041, params: {'n_estimators': 40}, mean: 0.38408, std: 0.02306, params: {'n_estimators': 41}, mean: 0.38143, std: 0.02196, params: {'n_estimators': 42}, mean: 0.38791, std: 0.02062, params: {'n_estimators': 43}, mean: 0.38206, std: 0.02413, params: {'n_estimators': 44}, mean: 0.38646, std: 0.02144, params: {'n_estimators': 45}, mean: 0.38276, std: 0.02182, params: {'n_estimators': 46}, mean: 0.38862, std: 0.02272, params: {'n_estimators': 47}, mean: 0.38848, std: 0.02374, params: {'n_estimators': 48}, mean: 0.39675, std: 0.02267, params: {'n_estimators': 49}, mean: 0.39436, std: 0.02103, params: {'n_estimators': 50}, mean: 0.39765, std: 0.02083, params: {'n_estimators': 51}, mean: 0.39553, std: 0.01943, params: {'n_estimators': 52}, mean: 0.39839, std: 0.02147, params: {'n_estimators': 53}, mean: 0.39547, std: 0.02422, params: {'n_estimators': 54}, mean: 0.39715, std: 0.02346, params: {'n_estimators': 55}, mean: 0.39603, std: 0.02428, params: {'n_estimators': 56}, mean: 0.39252, std: 0.03002, params: {'n_estimators': 57}, mean: 0.39738, std: 0.02334, params: {'n_estimators': 58}, mean: 0.39450, std: 0.02212, params: {'n_estimators': 59}]
############# based on standard predict ################
Accuracy on training data: 0.78
Accuracy on test data:     0.76
[[2205  218]
 [ 556  257]]
########################################################
Gradient Boosting - No Gender/Ethnicity

In [300]:
# Gradient Boosting
clfGB2 = GradientBoostingClassifier()

parameters = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB2, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB2, parameters, 
                                                       dftouse, Xnames2, target2, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfGB2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Gradient Boosting 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 58, 'max_depth': 5} 0.406467905165 [mean: 0.32885, std: 0.01971, params: {'n_estimators': 30, 'max_depth': 1}, mean: 0.32885, std: 0.01971, params: {'n_estimators': 31, 'max_depth': 1}, mean: 0.32755, std: 0.01934, params: {'n_estimators': 32, 'max_depth': 1}, mean: 0.32974, std: 0.02005, params: {'n_estimators': 33, 'max_depth': 1}, mean: 0.33088, std: 0.01752, params: {'n_estimators': 34, 'max_depth': 1}, mean: 0.33203, std: 0.01808, params: {'n_estimators': 35, 'max_depth': 1}, mean: 0.33024, std: 0.01733, params: {'n_estimators': 36, 'max_depth': 1}, mean: 0.33213, std: 0.01477, params: {'n_estimators': 37, 'max_depth': 1}, mean: 0.33203, std: 0.01808, params: {'n_estimators': 38, 'max_depth': 1}, mean: 0.33329, std: 0.01509, params: {'n_estimators': 39, 'max_depth': 1}, mean: 0.33393, std: 0.01517, params: {'n_estimators': 40, 'max_depth': 1}, mean: 0.33393, std: 0.01517, params: {'n_estimators': 41, 'max_depth': 1}, mean: 0.33424, std: 0.01429, params: {'n_estimators': 42, 'max_depth': 1}, mean: 0.33410, std: 0.01418, params: {'n_estimators': 43, 'max_depth': 1}, mean: 0.33460, std: 0.01496, params: {'n_estimators': 44, 'max_depth': 1}, mean: 0.33524, std: 0.01546, params: {'n_estimators': 45, 'max_depth': 1}, mean: 0.33650, std: 0.01600, params: {'n_estimators': 46, 'max_depth': 1}, mean: 0.33638, std: 0.01549, params: {'n_estimators': 47, 'max_depth': 1}, mean: 0.33624, std: 0.01539, params: {'n_estimators': 48, 'max_depth': 1}, mean: 0.33661, std: 0.01570, params: {'n_estimators': 49, 'max_depth': 1}, mean: 0.33648, std: 0.01578, params: {'n_estimators': 50, 'max_depth': 1}, mean: 0.33622, std: 0.01593, params: {'n_estimators': 51, 'max_depth': 1}, mean: 0.33572, std: 0.01587, params: {'n_estimators': 52, 'max_depth': 1}, mean: 0.33572, std: 0.01587, params: {'n_estimators': 53, 'max_depth': 1}, mean: 0.33559, std: 0.01574, params: {'n_estimators': 54, 'max_depth': 1}, mean: 0.33622, std: 0.01583, params: {'n_estimators': 55, 'max_depth': 1}, mean: 0.33532, std: 0.01538, params: {'n_estimators': 56, 'max_depth': 1}, mean: 0.33734, std: 0.01613, params: {'n_estimators': 57, 'max_depth': 1}, mean: 0.33925, std: 0.01691, params: {'n_estimators': 58, 'max_depth': 1}, mean: 0.33785, std: 0.01674, params: {'n_estimators': 59, 'max_depth': 1}, mean: 0.32958, std: 0.02348, params: {'n_estimators': 30, 'max_depth': 2}, mean: 0.32834, std: 0.02256, params: {'n_estimators': 31, 'max_depth': 2}, mean: 0.32995, std: 0.02065, params: {'n_estimators': 32, 'max_depth': 2}, mean: 0.33096, std: 0.02177, params: {'n_estimators': 33, 'max_depth': 2}, mean: 0.33033, std: 0.02112, params: {'n_estimators': 34, 'max_depth': 2}, mean: 0.33230, std: 0.02025, params: {'n_estimators': 35, 'max_depth': 2}, mean: 0.33461, std: 0.02114, params: {'n_estimators': 36, 'max_depth': 2}, mean: 0.33484, std: 0.02218, params: {'n_estimators': 37, 'max_depth': 2}, mean: 0.33634, std: 0.02343, params: {'n_estimators': 38, 'max_depth': 2}, mean: 0.33880, std: 0.02345, params: {'n_estimators': 39, 'max_depth': 2}, mean: 0.33935, std: 0.02368, params: {'n_estimators': 40, 'max_depth': 2}, mean: 0.33822, std: 0.02249, params: {'n_estimators': 41, 'max_depth': 2}, mean: 0.33885, std: 0.02338, params: {'n_estimators': 42, 'max_depth': 2}, mean: 0.34003, std: 0.02276, params: {'n_estimators': 43, 'max_depth': 2}, mean: 0.34009, std: 0.02172, params: {'n_estimators': 44, 'max_depth': 2}, mean: 0.34142, std: 0.01911, params: {'n_estimators': 45, 'max_depth': 2}, mean: 0.34418, std: 0.01974, params: {'n_estimators': 46, 'max_depth': 2}, mean: 0.34354, std: 0.01994, params: {'n_estimators': 47, 'max_depth': 2}, mean: 0.34471, std: 0.01937, params: {'n_estimators': 48, 'max_depth': 2}, mean: 0.34361, std: 0.01800, params: {'n_estimators': 49, 'max_depth': 2}, mean: 0.34481, std: 0.01996, params: {'n_estimators': 50, 'max_depth': 2}, mean: 0.34452, std: 0.02001, params: {'n_estimators': 51, 'max_depth': 2}, mean: 0.34430, std: 0.01933, params: {'n_estimators': 52, 'max_depth': 2}, mean: 0.34413, std: 0.01996, params: {'n_estimators': 53, 'max_depth': 2}, mean: 0.34450, std: 0.02047, params: {'n_estimators': 54, 'max_depth': 2}, mean: 0.34560, std: 0.02134, params: {'n_estimators': 55, 'max_depth': 2}, mean: 0.34412, std: 0.02003, params: {'n_estimators': 56, 'max_depth': 2}, mean: 0.34566, std: 0.02041, params: {'n_estimators': 57, 'max_depth': 2}, mean: 0.34737, std: 0.02195, params: {'n_estimators': 58, 'max_depth': 2}, mean: 0.34899, std: 0.01974, params: {'n_estimators': 59, 'max_depth': 2}, mean: 0.34403, std: 0.03025, params: {'n_estimators': 30, 'max_depth': 3}, mean: 0.34699, std: 0.02843, params: {'n_estimators': 31, 'max_depth': 3}, mean: 0.34902, std: 0.02701, params: {'n_estimators': 32, 'max_depth': 3}, mean: 0.35191, std: 0.02847, params: {'n_estimators': 33, 'max_depth': 3}, mean: 0.35385, std: 0.02542, params: {'n_estimators': 34, 'max_depth': 3}, mean: 0.35414, std: 0.02725, params: {'n_estimators': 35, 'max_depth': 3}, mean: 0.35419, std: 0.02519, params: {'n_estimators': 36, 'max_depth': 3}, mean: 0.35668, std: 0.02569, params: {'n_estimators': 37, 'max_depth': 3}, mean: 0.35895, std: 0.02436, params: {'n_estimators': 38, 'max_depth': 3}, mean: 0.36168, std: 0.02477, params: {'n_estimators': 39, 'max_depth': 3}, mean: 0.36113, std: 0.02719, params: {'n_estimators': 40, 'max_depth': 3}, mean: 0.36470, std: 0.02522, params: {'n_estimators': 41, 'max_depth': 3}, mean: 0.36476, std: 0.02609, params: {'n_estimators': 42, 'max_depth': 3}, mean: 0.36558, std: 0.02421, params: {'n_estimators': 43, 'max_depth': 3}, mean: 0.36638, std: 0.02155, params: {'n_estimators': 44, 'max_depth': 3}, mean: 0.36680, std: 0.02176, params: {'n_estimators': 45, 'max_depth': 3}, mean: 0.36973, std: 0.02187, params: {'n_estimators': 46, 'max_depth': 3}, mean: 0.37027, std: 0.02097, params: {'n_estimators': 47, 'max_depth': 3}, mean: 0.37150, std: 0.02101, params: {'n_estimators': 48, 'max_depth': 3}, mean: 0.37220, std: 0.02330, params: {'n_estimators': 49, 'max_depth': 3}, mean: 0.37123, std: 0.02364, params: {'n_estimators': 50, 'max_depth': 3}, mean: 0.37299, std: 0.02307, params: {'n_estimators': 51, 'max_depth': 3}, mean: 0.37203, std: 0.02203, params: {'n_estimators': 52, 'max_depth': 3}, mean: 0.37458, std: 0.02240, params: {'n_estimators': 53, 'max_depth': 3}, mean: 0.37573, std: 0.02633, params: {'n_estimators': 54, 'max_depth': 3}, mean: 0.37865, std: 0.02636, params: {'n_estimators': 55, 'max_depth': 3}, mean: 0.37836, std: 0.02615, params: {'n_estimators': 56, 'max_depth': 3}, mean: 0.37980, std: 0.02788, params: {'n_estimators': 57, 'max_depth': 3}, mean: 0.37901, std: 0.02855, params: {'n_estimators': 58, 'max_depth': 3}, mean: 0.37797, std: 0.02814, params: {'n_estimators': 59, 'max_depth': 3}, mean: 0.35951, std: 0.02490, params: {'n_estimators': 30, 'max_depth': 4}, mean: 0.35914, std: 0.02540, params: {'n_estimators': 31, 'max_depth': 4}, mean: 0.36169, std: 0.02306, params: {'n_estimators': 32, 'max_depth': 4}, mean: 0.36486, std: 0.02104, params: {'n_estimators': 33, 'max_depth': 4}, mean: 0.36689, std: 0.02312, params: {'n_estimators': 34, 'max_depth': 4}, mean: 0.36957, std: 0.02533, params: {'n_estimators': 35, 'max_depth': 4}, mean: 0.37483, std: 0.02535, params: {'n_estimators': 36, 'max_depth': 4}, mean: 0.37406, std: 0.02429, params: {'n_estimators': 37, 'max_depth': 4}, mean: 0.37281, std: 0.02303, params: {'n_estimators': 38, 'max_depth': 4}, mean: 0.37435, std: 0.02689, params: {'n_estimators': 39, 'max_depth': 4}, mean: 0.37277, std: 0.02781, params: {'n_estimators': 40, 'max_depth': 4}, mean: 0.37669, std: 0.02577, params: {'n_estimators': 41, 'max_depth': 4}, mean: 0.37873, std: 0.02596, params: {'n_estimators': 42, 'max_depth': 4}, mean: 0.37941, std: 0.02399, params: {'n_estimators': 43, 'max_depth': 4}, mean: 0.37739, std: 0.02529, params: {'n_estimators': 44, 'max_depth': 4}, mean: 0.37653, std: 0.02494, params: {'n_estimators': 45, 'max_depth': 4}, mean: 0.37894, std: 0.02661, params: {'n_estimators': 46, 'max_depth': 4}, mean: 0.37967, std: 0.02719, params: {'n_estimators': 47, 'max_depth': 4}, mean: 0.38221, std: 0.02704, params: {'n_estimators': 48, 'max_depth': 4}, mean: 0.38413, std: 0.02841, params: {'n_estimators': 49, 'max_depth': 4}, mean: 0.38305, std: 0.02745, params: {'n_estimators': 50, 'max_depth': 4}, mean: 0.38221, std: 0.02689, params: {'n_estimators': 51, 'max_depth': 4}, mean: 0.37970, std: 0.02658, params: {'n_estimators': 52, 'max_depth': 4}, mean: 0.38311, std: 0.02406, params: {'n_estimators': 53, 'max_depth': 4}, mean: 0.38431, std: 0.02328, params: {'n_estimators': 54, 'max_depth': 4}, mean: 0.38386, std: 0.02351, params: {'n_estimators': 55, 'max_depth': 4}, mean: 0.38637, std: 0.02348, params: {'n_estimators': 56, 'max_depth': 4}, mean: 0.38622, std: 0.02368, params: {'n_estimators': 57, 'max_depth': 4}, mean: 0.38720, std: 0.02171, params: {'n_estimators': 58, 'max_depth': 4}, mean: 0.38702, std: 0.02237, params: {'n_estimators': 59, 'max_depth': 4}, mean: 0.37734, std: 0.01967, params: {'n_estimators': 30, 'max_depth': 5}, mean: 0.37896, std: 0.02001, params: {'n_estimators': 31, 'max_depth': 5}, mean: 0.38078, std: 0.01798, params: {'n_estimators': 32, 'max_depth': 5}, mean: 0.38203, std: 0.01770, params: {'n_estimators': 33, 'max_depth': 5}, mean: 0.38317, std: 0.01973, params: {'n_estimators': 34, 'max_depth': 5}, mean: 0.38302, std: 0.02339, params: {'n_estimators': 35, 'max_depth': 5}, mean: 0.38362, std: 0.02178, params: {'n_estimators': 36, 'max_depth': 5}, mean: 0.38564, std: 0.02138, params: {'n_estimators': 37, 'max_depth': 5}, mean: 0.38274, std: 0.02332, params: {'n_estimators': 38, 'max_depth': 5}, mean: 0.38231, std: 0.02296, params: {'n_estimators': 39, 'max_depth': 5}, mean: 0.38708, std: 0.01644, params: {'n_estimators': 40, 'max_depth': 5}, mean: 0.38958, std: 0.01550, params: {'n_estimators': 41, 'max_depth': 5}, mean: 0.39120, std: 0.01685, params: {'n_estimators': 42, 'max_depth': 5}, mean: 0.39059, std: 0.01807, params: {'n_estimators': 43, 'max_depth': 5}, mean: 0.39291, std: 0.01410, params: {'n_estimators': 44, 'max_depth': 5}, mean: 0.38580, std: 0.01049, params: {'n_estimators': 45, 'max_depth': 5}, mean: 0.39598, std: 0.02026, params: {'n_estimators': 46, 'max_depth': 5}, mean: 0.39751, std: 0.01325, params: {'n_estimators': 47, 'max_depth': 5}, mean: 0.39682, std: 0.01699, params: {'n_estimators': 48, 'max_depth': 5}, mean: 0.39883, std: 0.01847, params: {'n_estimators': 49, 'max_depth': 5}, mean: 0.39657, std: 0.02016, params: {'n_estimators': 50, 'max_depth': 5}, mean: 0.39214, std: 0.01398, params: {'n_estimators': 51, 'max_depth': 5}, mean: 0.39785, std: 0.01528, params: {'n_estimators': 52, 'max_depth': 5}, mean: 0.39748, std: 0.02306, params: {'n_estimators': 53, 'max_depth': 5}, mean: 0.39478, std: 0.02077, params: {'n_estimators': 54, 'max_depth': 5}, mean: 0.40088, std: 0.01797, params: {'n_estimators': 55, 'max_depth': 5}, mean: 0.40325, std: 0.01956, params: {'n_estimators': 56, 'max_depth': 5}, mean: 0.40230, std: 0.01804, params: {'n_estimators': 57, 'max_depth': 5}, mean: 0.40647, std: 0.01581, params: {'n_estimators': 58, 'max_depth': 5}, mean: 0.40561, std: 0.01609, params: {'n_estimators': 59, 'max_depth': 5}]
############# based on standard predict ################
Accuracy on training data: 0.85
Accuracy on test data:     0.76
[[2199  224]
 [ 544  269]]
########################################################

In [301]:
with sns.color_palette("dark"):
    ax=make_roc("Descision Trees 2",clfTree2  , ytest, Xtest, None, labe=250, proba=True)
    make_roc("Random Forest 2"     ,clfForest2, ytest, Xtest, ax  , labe=250, proba=True);
    make_roc("ADA Boost 2"         ,clfAda2   , ytest, Xtest, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 2"    ,clfGB2    , ytest, Xtest, ax  , labe=250, proba=True, skip=50);


When comparing these 4 models, we can see that their perfomance did have a big impact when modeling was done with Gender/Ethnicity and when without. It could quite likely have corelations with which schools are better funded, have better student/pupil ratio and their demographic destributions.


In [28]:
Xtrainc = Xtrain 
ytrainc = ytrain 
Xtestc = Xtest 
ytestc = ytest

Final Comparison of All Models


In [303]:
with sns.color_palette("dark"):
    ax = make_roc("svm-all-features",clfsvm, ytesta, Xtesta, None, labe=250, proba=False, skip=50)
    make_roc("svm-feature-selected",pipelinearsvm, ytesta, Xtesta, ax, labe=250, proba=False, skip=50);
    make_roc("svm-all-features-balanced",clfsvm_b, ytesta, Xtesta, ax, labe=250, proba=False, skip=50);
    make_roc("logistic-with-lasso",clflog, ytesta, Xtesta, ax, labe=250, proba=True,  skip=50);
    make_roc("Descision Trees 1",clfTree1  , ytestb, Xtestb, None, labe=250, proba=True)
    make_roc("Random Forest 1"     ,clfForest1, ytestb, Xtestb, ax  , labe=250, proba=True);
    make_roc("ADA Boost 1"         ,clfAda1   , ytestb, Xtestb, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 1"    ,clfGB1    , ytestb, Xtestb, ax  , labe=250, proba=True, skip=50);
    make_roc("Descision Trees 2",clfTree2  , ytestc, Xtestc, None, labe=250, proba=True)
    make_roc("Random Forest 2"     ,clfForest2, ytestc, Xtestc, ax  , labe=250, proba=True);
    make_roc("ADA Boost 2"         ,clfAda2   , ytestc, Xtestc, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 2"    ,clfGB2    , ytestc, Xtestc, ax  , labe=250, proba=True, skip=50);


Low Graduation

Exploratory Data Analysis

We make a kernel-density estimate (KDE) plot for each feature in ccols to look for promising separators.


In [ ]:
fig, axs = plt.subplots(43, 3, figsize=(15,100), tight_layout=True)

for item, ax in zip(dftouse[ccols], axs.flat):
    sns.kdeplot(dftouse[dftouse["RESP_Low_Graduation"]==0][item], ax=ax, color='r')
    sns.kdeplot(dftouse[dftouse["RESP_Low_Graduation"]==1][item], ax=ax, color='b')

We make histograms for each feature in INDICATORS.


In [281]:
fig, axs = plt.subplots(9, 3, figsize=(15,30), tight_layout=True)

for item, ax in zip(dftouse[INDICATORS], axs.flat):
    dftouse[dftouse["RESP_Low_Graduation"]==0][item].hist(ax=ax,color="r",label=item)
    dftouse[dftouse["RESP_Low_Graduation"]==1][item].hist(ax=ax,color="b",label=item)
    ax.legend(loc='upper right')


Writing Classifiers

We try out many different types of classifiers to predict high graduation rate, RESP_Low_Graduation. We tried the classifiers from HW3 and Lab7.

We iteratively worked in this section and then determined more columns that needed to be removed, went back up to data filtering and exploratory analysis, then came back down to this section.

Linear SVM

In [56]:
#CITATION: Adapted from HW3
clfsvm=LinearSVC(loss="hinge")
Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
Xmatrix=dftouse[lcols].values
Yresp=dftouse['RESP_Low_Graduation'].values

In [57]:
#CITATION: From HW3
Xmatrix_train=Xmatrix[mask]
Xmatrix_test=Xmatrix[~mask]
Yresp_train=Yresp[mask]
Yresp_test=Yresp[~mask]

In [58]:
#CITATION: From HW3
def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best

In [59]:
#CITATION: From HW3
from sklearn.metrics import confusion_matrix
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    if mask !=None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split !=None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

In [62]:
%%time
clfsvm, Xtrain, ytrain, Xtest, ytest = do_classify(LinearSVC(loss="hinge"), {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_Low_Graduation',1, mask=mask)
#CITATION: Adapted from HW3


using mask
BEST {'C': 1.0} 0.830308650152 [mean: 0.81825, std: 0.00638, params: {'C': 0.001}, mean: 0.82369, std: 0.00732, params: {'C': 0.01}, mean: 0.82912, std: 0.00353, params: {'C': 0.1}, mean: 0.83031, std: 0.00408, params: {'C': 1.0}, mean: 0.82223, std: 0.01125, params: {'C': 10.0}, mean: 0.77745, std: 0.01322, params: {'C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.84
Accuracy on test data:     0.75
[[2021  395]
 [ 403  417]]
########################################################
CPU times: user 56.3 s, sys: 573 ms, total: 56.9 s
Wall time: 1min 1s
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:7: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.

In [63]:
#CITATION: From HW3
reuse_split=dict(Xtrain=Xtrain, Xtest=Xtest, ytrain=ytrain, ytest=ytest)

In [64]:
#CITATION: From HW3
ypred=clfsvm.predict(Xtest)
confusion_matrix(ytest, ypred)


Out[64]:
array([[2021,  395],
       [ 403,  417]])

In [65]:
#CITATION: From HW3
print "OP=", ytest.sum(), ", ON=",ytest.shape[0] - ytest.sum()


OP= 820 , ON= 2416
Log Regression

In [67]:
%%time
clflog,_,_,_,_  = do_classify(LogisticRegression(penalty="l1"), {"C": [0.001, 0.01, 0.1, 1, 10, 100]}, dftouse, lcols, 'RESP_Low_Graduation', 1, reuse_split=reuse_split)
#CITATION: Adapted from HW3


using reuse split
BEST {'C': 1} 0.835209961584 [mean: 0.75017, std: 0.00036, params: {'C': 0.001}, mean: 0.81084, std: 0.00416, params: {'C': 0.01}, mean: 0.83124, std: 0.00514, params: {'C': 0.1}, mean: 0.83521, std: 0.00526, params: {'C': 1}, mean: 0.83442, std: 0.00342, params: {'C': 10}, mean: 0.83402, std: 0.00300, params: {'C': 100}]
############# based on standard predict ################
Accuracy on training data: 0.84
Accuracy on test data:     0.79
[[2173  243]
 [ 430  390]]
########################################################
CPU times: user 2min 50s, sys: 1.61 s, total: 2min 51s
Wall time: 3min 1s

In [68]:
#CITATION: From HW3
from sklearn.metrics import roc_curve, auc
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:#for stuff like logistic regression
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:#for stuff like SVM
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe!=None:
        for k in xrange(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax

In [69]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax=make_roc("logistic-with-lasso",clflog, ytest, Xtest, labe=200, skip=50)
    make_roc("svm-all-features",clfsvm, ytest, Xtest, ax, labe=200, proba=False, skip=50);



In [70]:
#CITATION: From HW3
lasso_importances=nonzero_lasso(clflog)
lasso_importances.set_index("feature", inplace=True)
lasso_importances.head(10)


Out[70]:
abscoef coef
feature
i_fin_sdlc_voc 2.322830 2.322830
i_lgo_PK 1.698527 -1.698527
i_lgo_K 1.645265 -1.645265
r_IEP 1.416148 1.416148
r_st_LEA 1.390952 1.390952
r_stud_reg_12_W_M 1.106258 -1.106258
r_stud_reg_12_W_F 0.982725 -0.982725
i_agency_type_local_school_district 0.748206 -0.748206
r_st_OSSS 0.703137 -0.703137
r_st_ET 0.612841 0.612841
Feature Selection

In [73]:
#CITATION: From HW3
correlations=[]
dftousetrain=dftouse[mask]
for col in lcols:
    r=pearsonr(dftousetrain[col], dftousetrain['RESP_Low_Graduation'])[0]
    correlations.append(dict(feature=col,corr=r, abscorr=np.abs(r)))

bpdf=pd.DataFrame(correlations).sort('abscorr', ascending=False)
bpdf.set_index(['feature'], inplace=True)
bpdf.head(25)


Out[73]:
abscorr corr
feature
r_lunch_free 0.433845 0.433845
r_stud_re_W 0.385215 -0.385215
r_stud_re_B 0.371466 0.371466
r_stud_reg_12_W_F 0.357820 -0.357820
r_stud_reg_12_W_M 0.336567 -0.336567
tfedrev_percent 0.306173 0.306173
tlocrev_percent 0.254222 -0.254222
r_stud_reg_12_B_F 0.252644 0.252644
r_frev_title1 0.192257 0.192257
r_stud_reg_12_B_M 0.183210 0.183210
tcurinst_percent 0.169673 -0.169673
r_stud_re_H 0.165755 0.165755
tsrev_percent 0.160114 0.160114
tfedrev_pp 0.155860 0.155860
r_ELL 0.152673 0.152673
r_lrev_gst 0.147223 0.147223
tlocrev_pp 0.142139 -0.142139
tcurssvc_percent 0.136632 0.136632
r_frev_dis 0.135954 -0.135954
i_ucl_city_large 0.131143 0.131143
i_cs_all_noncharter 0.127125 -0.127125
num_pub_schools 0.123553 0.123553
num_schools 0.123337 0.123337
i_cs_all_charter 0.122456 0.122456
i_agency_type_charter_school_agency 0.122456 0.122456

In [76]:
#CITATION: From HW3
selectorlinearsvm = SelectKBest(k=25, score_func=pearson_scorer)
pipelinearsvm = Pipeline([('select', selectorlinearsvm), ('svm', LinearSVC(loss="hinge"))])

In [77]:
%%time
pipelinearsvm, _,_,_,_  = do_classify(pipelinearsvm, {"svm__C": [0.00001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_Low_Graduation',1, reuse_split=reuse_split)
#CITATION: From HW3


using reuse split
BEST {'svm__C': 1.0} 0.812425486819 [mean: 0.78381, std: 0.00910, params: {'svm__C': 1e-05}, mean: 0.80461, std: 0.00506, params: {'svm__C': 0.001}, mean: 0.81150, std: 0.00624, params: {'svm__C': 0.01}, mean: 0.81150, std: 0.00586, params: {'svm__C': 0.1}, mean: 0.81243, std: 0.00560, params: {'svm__C': 1.0}, mean: 0.81004, std: 0.00640, params: {'svm__C': 10.0}, mean: 0.78077, std: 0.01292, params: {'svm__C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.81
Accuracy on test data:     0.81
[[2337   79]
 [ 532  288]]
########################################################
CPU times: user 8.39 s, sys: 116 ms, total: 8.5 s
Wall time: 8.55 s

In [78]:
#CITATION: From HW3
np.array(lcols)[pipelinearsvm.get_params()['select'].get_support()]


Out[78]:
array(['num_schools', 'num_pub_schools', 'tlocrev_pp', 'tfedrev_pp',
       'tcurinst_percent', 'tcurssvc_percent', 'tfedrev_percent',
       'tlocrev_percent', 'tsrev_percent',
       'i_agency_type_charter_school_agency', 'i_ucl_city_large',
       'i_cs_all_charter', 'i_cs_all_noncharter', 'r_ELL', 'r_lunch_free',
       'r_stud_re_H', 'r_stud_re_B', 'r_stud_re_W', 'r_stud_reg_12_B_M',
       'r_stud_reg_12_B_F', 'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F',
       'r_lrev_gst', 'r_frev_title1', 'r_frev_dis'], 
      dtype='|S45')

In [79]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax=make_roc("svm-all-features",clfsvm, ytest, Xtest, None, labe=250, proba=False, skip=50)
    make_roc("svm-feature-selected",pipelinearsvm, ytest, Xtest, ax, labe=250, proba=False, skip=50);
    make_roc("logistic-with-lasso",clflog, ytest, Xtest, ax, labe=250, proba=True,  skip=50);



In [80]:
#CITATION: From HW3
jtrain=np.arange(0, ytrain.shape[0])
n_pos=len(jtrain[ytrain==1])
n_neg=len(jtrain[ytrain==0])
print n_pos, n_neg


1887 5662

In [81]:
#CITATION: From HW3
ineg = np.random.choice(jtrain[ytrain==0], n_pos, replace=False)

In [82]:
#CITATION: From HW3
alli=np.concatenate((jtrain[ytrain==1], ineg))
alli.shape


Out[82]:
(3774,)

In [83]:
#CITATION: From HW3
Xtrain_new = Xtrain[alli]
ytrain_new = ytrain[alli]
Xtrain_new.shape, ytrain_new.shape


Out[83]:
((3774, 154), (3774,))

In [84]:
#CITATION: From HW3
reuse_split_new=dict(Xtrain=Xtrain_new, Xtest=Xtest, ytrain=ytrain_new, ytest=ytest)

In [85]:
%%time
clfsvm_b, _,_,_,_  = do_classify(LinearSVC(loss="hinge"), {"C": [0.00001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse,lcols, 'RESP_Low_Graduation',1, reuse_split=reuse_split_new)
#CITATION: From HW3


using reuse split
BEST {'C': 1.0} 0.782193958665 [mean: 0.74536, std: 0.01241, params: {'C': 1e-05}, mean: 0.74987, std: 0.01654, params: {'C': 0.001}, mean: 0.77954, std: 0.01221, params: {'C': 0.01}, mean: 0.77981, std: 0.01000, params: {'C': 0.1}, mean: 0.78219, std: 0.00819, params: {'C': 1.0}, mean: 0.77557, std: 0.00969, params: {'C': 10.0}, mean: 0.65236, std: 0.05809, params: {'C': 100.0}]
############# based on standard predict ################
Accuracy on training data: 0.81
Accuracy on test data:     0.73
[[1767  649]
 [ 234  586]]
########################################################
CPU times: user 18.6 s, sys: 175 ms, total: 18.8 s
Wall time: 19 s

In [86]:
#CITATION: From HW3
ax = make_roc("svm-all-features",clfsvm, ytest, Xtest, None, labe=250, proba=False, skip=50)
make_roc("svm-feature-selected",pipelinearsvm, ytest, Xtest, ax, labe=250, proba=False, skip=50);
make_roc("svm-all-features-balanced",clfsvm_b, ytest, Xtest, ax, labe=250, proba=False, skip=50);


Kernalized SVM

In [87]:
#CITATION: From HW3
selectorsvm2 = SelectKBest(k=25, score_func=pearson_scorer)
pipesvm2 = Pipeline([('select2', selectorsvm2), ('svm2', SVC())])

In [88]:
#CITATION: From HW3
jtrain_new=np.arange(0, ytrain_new.shape[0])
ipos_new = np.random.choice(jtrain_new[ytrain_new==1], 300, replace=False)
ineg_new = np.random.choice(jtrain_new[ytrain_new==0], 300, replace=False)
subsampled_i=np.concatenate((ipos_new,ineg_new))
Xtrain_new2=Xtrain_new[subsampled_i]
ytrain_new2=ytrain_new[subsampled_i]

In [89]:
#CITATION: From HW3
reuse_split_subsampled=dict(Xtrain=Xtrain_new2, Xtest=Xtest, ytrain=ytrain_new2, ytest=ytest)

In [90]:
%%time
pipesvm2, _,_,_,_  = do_classify(pipesvm2, {"svm2__C": [1e8],
                                              "svm2__gamma":[1e-5, 1e-7, 1e-9]}, 
                                 dftouse,lcols, 'RESP_Low_Graduation',1, reuse_split=reuse_split_subsampled)
#CITATION: From HW3


using reuse split
BEST {'svm2__C': 100000000.0, 'svm2__gamma': 1e-07} 0.758333333333 [mean: 0.73833, std: 0.02819, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-05}, mean: 0.75833, std: 0.03456, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-07}, mean: 0.74167, std: 0.02472, params: {'svm2__C': 100000000.0, 'svm2__gamma': 1e-09}]
############# based on standard predict ################
Accuracy on training data: 0.77
Accuracy on test data:     0.75
[[1825  591]
 [ 227  593]]
########################################################
CPU times: user 8.7 s, sys: 34.8 ms, total: 8.74 s
Wall time: 8.78 s

In [91]:
#CITATION: From HW3
gamma_wanted=pipesvm2.get_params()['svm2__gamma']
C_chosen=pipesvm2.get_params()['svm2__C']
print gamma_wanted, C_chosen
selectorsvm3 = SelectKBest(k=25, score_func=pearson_scorer)
pipesvm3 = Pipeline([('select3', selectorsvm3), ('svm3', SVC(C=C_chosen, gamma=gamma_wanted))])
pipesvm3, _,_,_,_  = do_classify(pipesvm3, None, 
                                 dftouse,lcols, 'RESP_Low_Graduation',1, reuse_split=reuse_split_new)


1e-07 100000000.0
using reuse split
############# based on standard predict ################
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[1845  571]
 [ 211  609]]
########################################################

In [92]:
#CITATION: From HW3
with sns.color_palette("dark"):
    ax = make_roc("logistic-with-lasso",clflog, ytest, Xtest, None, labe=300, skip=50)
    make_roc("rbf-svm-feature-selected-balanced",pipesvm3, ytest, Xtest, ax, labe=None, proba=False, skip=50);
    make_roc("svm-all-features-balanced",clfsvm_b, ytest, Xtest, ax, labe=250, proba=False, skip=50);



In [93]:
Xtraina = Xtrain 
ytraina = ytrain 
Xtesta = Xtest 
ytesta = ytest
Decision Trees

In [35]:
target1 = 'RESP_Low_Graduation'
clfTree1 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
clfTree1, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree1, parameters, dftouse, 
                                                     Xnames1, target1, 1, 
                                                     mask=mask, n_jobs = 4, score_func = 'f1')

importance_list = clfTree1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Descision Trees 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'max_depth': 4, 'min_samples_leaf': 1} 0.525865068351 [mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 1}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 2}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 3}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 4}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 5}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 6}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 7}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 8}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 9}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 10}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 1}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 2}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 3}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 4}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 5}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 6}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 7}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 8}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 9}, mean: 0.38852, std: 0.03162, params: {'max_depth': 2, 'min_samples_leaf': 10}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 1}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 2}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 3}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 4}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 5}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 6}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 7}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 8}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 9}, mean: 0.43957, std: 0.02720, params: {'max_depth': 3, 'min_samples_leaf': 10}, mean: 0.52587, std: 0.03770, params: {'max_depth': 4, 'min_samples_leaf': 1}, mean: 0.52587, std: 0.03770, params: {'max_depth': 4, 'min_samples_leaf': 2}, mean: 0.52537, std: 0.03738, params: {'max_depth': 4, 'min_samples_leaf': 3}, mean: 0.52537, std: 0.03738, params: {'max_depth': 4, 'min_samples_leaf': 4}, mean: 0.52546, std: 0.03876, params: {'max_depth': 4, 'min_samples_leaf': 5}, mean: 0.52546, std: 0.03876, params: {'max_depth': 4, 'min_samples_leaf': 6}, mean: 0.52574, std: 0.03918, params: {'max_depth': 4, 'min_samples_leaf': 7}, mean: 0.52574, std: 0.03918, params: {'max_depth': 4, 'min_samples_leaf': 8}, mean: 0.52530, std: 0.03889, params: {'max_depth': 4, 'min_samples_leaf': 9}, mean: 0.52530, std: 0.03889, params: {'max_depth': 4, 'min_samples_leaf': 10}, mean: 0.48889, std: 0.03389, params: {'max_depth': 5, 'min_samples_leaf': 1}, mean: 0.48976, std: 0.03503, params: {'max_depth': 5, 'min_samples_leaf': 2}, mean: 0.48731, std: 0.03265, params: {'max_depth': 5, 'min_samples_leaf': 3}, mean: 0.48794, std: 0.03340, params: {'max_depth': 5, 'min_samples_leaf': 4}, mean: 0.48613, std: 0.03541, params: {'max_depth': 5, 'min_samples_leaf': 5}, mean: 0.48537, std: 0.03733, params: {'max_depth': 5, 'min_samples_leaf': 6}, mean: 0.48158, std: 0.04130, params: {'max_depth': 5, 'min_samples_leaf': 7}, mean: 0.47939, std: 0.04095, params: {'max_depth': 5, 'min_samples_leaf': 8}, mean: 0.48134, std: 0.03851, params: {'max_depth': 5, 'min_samples_leaf': 9}, mean: 0.47989, std: 0.03643, params: {'max_depth': 5, 'min_samples_leaf': 10}, mean: 0.48005, std: 0.03277, params: {'max_depth': 6, 'min_samples_leaf': 1}, mean: 0.47975, std: 0.03454, params: {'max_depth': 6, 'min_samples_leaf': 2}, mean: 0.48172, std: 0.03635, params: {'max_depth': 6, 'min_samples_leaf': 3}, mean: 0.47768, std: 0.03688, params: {'max_depth': 6, 'min_samples_leaf': 4}, mean: 0.47672, std: 0.03568, params: {'max_depth': 6, 'min_samples_leaf': 5}, mean: 0.47623, std: 0.03390, params: {'max_depth': 6, 'min_samples_leaf': 6}, mean: 0.47642, std: 0.03290, params: {'max_depth': 6, 'min_samples_leaf': 7}, mean: 0.47525, std: 0.03346, params: {'max_depth': 6, 'min_samples_leaf': 8}, mean: 0.47835, std: 0.03377, params: {'max_depth': 6, 'min_samples_leaf': 9}, mean: 0.47407, std: 0.03478, params: {'max_depth': 6, 'min_samples_leaf': 10}, mean: 0.51820, std: 0.00896, params: {'max_depth': 7, 'min_samples_leaf': 1}, mean: 0.51917, std: 0.00726, params: {'max_depth': 7, 'min_samples_leaf': 2}, mean: 0.51653, std: 0.01170, params: {'max_depth': 7, 'min_samples_leaf': 3}, mean: 0.51054, std: 0.01619, params: {'max_depth': 7, 'min_samples_leaf': 4}, mean: 0.51103, std: 0.01047, params: {'max_depth': 7, 'min_samples_leaf': 5}, mean: 0.50750, std: 0.01528, params: {'max_depth': 7, 'min_samples_leaf': 6}, mean: 0.50855, std: 0.01590, params: {'max_depth': 7, 'min_samples_leaf': 7}, mean: 0.50920, std: 0.01731, params: {'max_depth': 7, 'min_samples_leaf': 8}, mean: 0.51001, std: 0.01174, params: {'max_depth': 7, 'min_samples_leaf': 9}, mean: 0.50736, std: 0.01050, params: {'max_depth': 7, 'min_samples_leaf': 10}]
############# based on standard predict ################
Accuracy on training data: 0.80
Accuracy on test data:     0.79
[[2184  252]
 [ 418  382]]
########################################################
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Random Forests

In [36]:
clfForest1 = RandomForestClassifier()

parameters = {"n_estimators": range(1, 10)}
clfForest1, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest1, parameters, 
                                                       dftouse, Xnames1, target1, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfForest1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 7} 0.491799231394 [mean: 0.45014, std: 0.01818, params: {'n_estimators': 1}, mean: 0.36207, std: 0.01399, params: {'n_estimators': 2}, mean: 0.48433, std: 0.00762, params: {'n_estimators': 3}, mean: 0.43038, std: 0.02320, params: {'n_estimators': 4}, mean: 0.48926, std: 0.01878, params: {'n_estimators': 5}, mean: 0.43797, std: 0.02022, params: {'n_estimators': 6}, mean: 0.49180, std: 0.01212, params: {'n_estimators': 7}, mean: 0.46005, std: 0.01628, params: {'n_estimators': 8}, mean: 0.48233, std: 0.00767, params: {'n_estimators': 9}]
############# based on standard predict ################
Accuracy on training data: 0.99
Accuracy on test data:     0.78
[[2199  237]
 [ 461  339]]
########################################################
ADA Booster

In [37]:
clfAda1 = AdaBoostClassifier()

parameters = {"n_estimators": range(30, 60)}
clfAda1, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda1, parameters, 
                                                       dftouse, Xnames1, target1, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfAda1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the ADA Boost 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 49} 0.510457024433 [mean: 0.50633, std: 0.02398, params: {'n_estimators': 30}, mean: 0.50651, std: 0.02481, params: {'n_estimators': 31}, mean: 0.50670, std: 0.02410, params: {'n_estimators': 32}, mean: 0.50859, std: 0.02424, params: {'n_estimators': 33}, mean: 0.50658, std: 0.02437, params: {'n_estimators': 34}, mean: 0.50274, std: 0.02268, params: {'n_estimators': 35}, mean: 0.49792, std: 0.02166, params: {'n_estimators': 36}, mean: 0.49523, std: 0.02374, params: {'n_estimators': 37}, mean: 0.50171, std: 0.02341, params: {'n_estimators': 38}, mean: 0.49956, std: 0.01975, params: {'n_estimators': 39}, mean: 0.50241, std: 0.01935, params: {'n_estimators': 40}, mean: 0.50159, std: 0.01768, params: {'n_estimators': 41}, mean: 0.50239, std: 0.01976, params: {'n_estimators': 42}, mean: 0.50983, std: 0.02694, params: {'n_estimators': 43}, mean: 0.50406, std: 0.02464, params: {'n_estimators': 44}, mean: 0.50600, std: 0.02529, params: {'n_estimators': 45}, mean: 0.50618, std: 0.02880, params: {'n_estimators': 46}, mean: 0.50833, std: 0.02509, params: {'n_estimators': 47}, mean: 0.50739, std: 0.02828, params: {'n_estimators': 48}, mean: 0.51046, std: 0.02356, params: {'n_estimators': 49}, mean: 0.50717, std: 0.02515, params: {'n_estimators': 50}, mean: 0.50378, std: 0.02538, params: {'n_estimators': 51}, mean: 0.50780, std: 0.02172, params: {'n_estimators': 52}, mean: 0.50237, std: 0.02439, params: {'n_estimators': 53}, mean: 0.50201, std: 0.02427, params: {'n_estimators': 54}, mean: 0.50575, std: 0.02755, params: {'n_estimators': 55}, mean: 0.50467, std: 0.02124, params: {'n_estimators': 56}, mean: 0.50442, std: 0.02450, params: {'n_estimators': 57}, mean: 0.50491, std: 0.02059, params: {'n_estimators': 58}, mean: 0.50801, std: 0.02217, params: {'n_estimators': 59}]
############# based on standard predict ################
Accuracy on training data: 0.81
Accuracy on test data:     0.80
[[2275  161]
 [ 489  311]]
########################################################
Gradient Boosting

In [107]:
clfGB1 = GradientBoostingClassifier()

parameters = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB1, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB1, parameters, 
                                                       dftouse, Xnames1, target1, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfGB1.feature_importances_
name_list = dftouse[Xnames1].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Gradient Boosting 1')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 52, 'max_depth': 4} 0.531489186066 [mean: 0.43496, std: 0.02422, params: {'n_estimators': 30, 'max_depth': 1}, mean: 0.43607, std: 0.02167, params: {'n_estimators': 31, 'max_depth': 1}, mean: 0.43797, std: 0.02489, params: {'n_estimators': 32, 'max_depth': 1}, mean: 0.44585, std: 0.01574, params: {'n_estimators': 33, 'max_depth': 1}, mean: 0.44633, std: 0.01689, params: {'n_estimators': 34, 'max_depth': 1}, mean: 0.44858, std: 0.01908, params: {'n_estimators': 35, 'max_depth': 1}, mean: 0.44991, std: 0.01779, params: {'n_estimators': 36, 'max_depth': 1}, mean: 0.45020, std: 0.01767, params: {'n_estimators': 37, 'max_depth': 1}, mean: 0.44928, std: 0.01893, params: {'n_estimators': 38, 'max_depth': 1}, mean: 0.45058, std: 0.01695, params: {'n_estimators': 39, 'max_depth': 1}, mean: 0.45483, std: 0.01717, params: {'n_estimators': 40, 'max_depth': 1}, mean: 0.45484, std: 0.01713, params: {'n_estimators': 41, 'max_depth': 1}, mean: 0.45195, std: 0.01823, params: {'n_estimators': 42, 'max_depth': 1}, mean: 0.45282, std: 0.01802, params: {'n_estimators': 43, 'max_depth': 1}, mean: 0.45461, std: 0.01773, params: {'n_estimators': 44, 'max_depth': 1}, mean: 0.45325, std: 0.01680, params: {'n_estimators': 45, 'max_depth': 1}, mean: 0.45368, std: 0.01562, params: {'n_estimators': 46, 'max_depth': 1}, mean: 0.45597, std: 0.01544, params: {'n_estimators': 47, 'max_depth': 1}, mean: 0.45650, std: 0.01399, params: {'n_estimators': 48, 'max_depth': 1}, mean: 0.45838, std: 0.01348, params: {'n_estimators': 49, 'max_depth': 1}, mean: 0.45812, std: 0.01364, params: {'n_estimators': 50, 'max_depth': 1}, mean: 0.46247, std: 0.01012, params: {'n_estimators': 51, 'max_depth': 1}, mean: 0.46227, std: 0.01014, params: {'n_estimators': 52, 'max_depth': 1}, mean: 0.46122, std: 0.01196, params: {'n_estimators': 53, 'max_depth': 1}, mean: 0.46056, std: 0.01144, params: {'n_estimators': 54, 'max_depth': 1}, mean: 0.46133, std: 0.01111, params: {'n_estimators': 55, 'max_depth': 1}, mean: 0.45953, std: 0.01465, params: {'n_estimators': 56, 'max_depth': 1}, mean: 0.46129, std: 0.01398, params: {'n_estimators': 57, 'max_depth': 1}, mean: 0.46303, std: 0.01102, params: {'n_estimators': 58, 'max_depth': 1}, mean: 0.46394, std: 0.01173, params: {'n_estimators': 59, 'max_depth': 1}, mean: 0.47093, std: 0.02332, params: {'n_estimators': 30, 'max_depth': 2}, mean: 0.47468, std: 0.02207, params: {'n_estimators': 31, 'max_depth': 2}, mean: 0.47564, std: 0.02077, params: {'n_estimators': 32, 'max_depth': 2}, mean: 0.47599, std: 0.02013, params: {'n_estimators': 33, 'max_depth': 2}, mean: 0.47832, std: 0.02110, params: {'n_estimators': 34, 'max_depth': 2}, mean: 0.47896, std: 0.02348, params: {'n_estimators': 35, 'max_depth': 2}, mean: 0.47955, std: 0.02353, params: {'n_estimators': 36, 'max_depth': 2}, mean: 0.48099, std: 0.02174, params: {'n_estimators': 37, 'max_depth': 2}, mean: 0.47994, std: 0.02138, params: {'n_estimators': 38, 'max_depth': 2}, mean: 0.48221, std: 0.01951, params: {'n_estimators': 39, 'max_depth': 2}, mean: 0.48148, std: 0.02133, params: {'n_estimators': 40, 'max_depth': 2}, mean: 0.48238, std: 0.02054, params: {'n_estimators': 41, 'max_depth': 2}, mean: 0.48305, std: 0.02044, params: {'n_estimators': 42, 'max_depth': 2}, mean: 0.48445, std: 0.02232, params: {'n_estimators': 43, 'max_depth': 2}, mean: 0.48423, std: 0.02251, params: {'n_estimators': 44, 'max_depth': 2}, mean: 0.48560, std: 0.02162, params: {'n_estimators': 45, 'max_depth': 2}, mean: 0.48587, std: 0.01961, params: {'n_estimators': 46, 'max_depth': 2}, mean: 0.48763, std: 0.01932, params: {'n_estimators': 47, 'max_depth': 2}, mean: 0.48910, std: 0.01966, params: {'n_estimators': 48, 'max_depth': 2}, mean: 0.48865, std: 0.01826, params: {'n_estimators': 49, 'max_depth': 2}, mean: 0.49123, std: 0.01907, params: {'n_estimators': 50, 'max_depth': 2}, mean: 0.49041, std: 0.01879, params: {'n_estimators': 51, 'max_depth': 2}, mean: 0.49358, std: 0.01853, params: {'n_estimators': 52, 'max_depth': 2}, mean: 0.49586, std: 0.01937, params: {'n_estimators': 53, 'max_depth': 2}, mean: 0.49602, std: 0.01851, params: {'n_estimators': 54, 'max_depth': 2}, mean: 0.49551, std: 0.01892, params: {'n_estimators': 55, 'max_depth': 2}, mean: 0.49600, std: 0.01884, params: {'n_estimators': 56, 'max_depth': 2}, mean: 0.49619, std: 0.01863, params: {'n_estimators': 57, 'max_depth': 2}, mean: 0.49756, std: 0.01788, params: {'n_estimators': 58, 'max_depth': 2}, mean: 0.49917, std: 0.01776, params: {'n_estimators': 59, 'max_depth': 2}, mean: 0.50059, std: 0.01888, params: {'n_estimators': 30, 'max_depth': 3}, mean: 0.50451, std: 0.01870, params: {'n_estimators': 31, 'max_depth': 3}, mean: 0.50500, std: 0.01793, params: {'n_estimators': 32, 'max_depth': 3}, mean: 0.50587, std: 0.01677, params: {'n_estimators': 33, 'max_depth': 3}, mean: 0.50493, std: 0.01580, params: {'n_estimators': 34, 'max_depth': 3}, mean: 0.50795, std: 0.01395, params: {'n_estimators': 35, 'max_depth': 3}, mean: 0.51069, std: 0.01879, params: {'n_estimators': 36, 'max_depth': 3}, mean: 0.50781, std: 0.02011, params: {'n_estimators': 37, 'max_depth': 3}, mean: 0.50837, std: 0.01928, params: {'n_estimators': 38, 'max_depth': 3}, mean: 0.51047, std: 0.01669, params: {'n_estimators': 39, 'max_depth': 3}, mean: 0.51047, std: 0.01845, params: {'n_estimators': 40, 'max_depth': 3}, mean: 0.51117, std: 0.02110, params: {'n_estimators': 41, 'max_depth': 3}, mean: 0.51029, std: 0.02312, params: {'n_estimators': 42, 'max_depth': 3}, mean: 0.51553, std: 0.02011, params: {'n_estimators': 43, 'max_depth': 3}, mean: 0.51520, std: 0.02050, params: {'n_estimators': 44, 'max_depth': 3}, mean: 0.51354, std: 0.02393, params: {'n_estimators': 45, 'max_depth': 3}, mean: 0.51416, std: 0.02409, params: {'n_estimators': 46, 'max_depth': 3}, mean: 0.51554, std: 0.02270, params: {'n_estimators': 47, 'max_depth': 3}, mean: 0.51509, std: 0.01981, params: {'n_estimators': 48, 'max_depth': 3}, mean: 0.51357, std: 0.02044, params: {'n_estimators': 49, 'max_depth': 3}, mean: 0.51457, std: 0.02221, params: {'n_estimators': 50, 'max_depth': 3}, mean: 0.51346, std: 0.02063, params: {'n_estimators': 51, 'max_depth': 3}, mean: 0.51545, std: 0.01788, params: {'n_estimators': 52, 'max_depth': 3}, mean: 0.51771, std: 0.01978, params: {'n_estimators': 53, 'max_depth': 3}, mean: 0.51794, std: 0.01896, params: {'n_estimators': 54, 'max_depth': 3}, mean: 0.51839, std: 0.01957, params: {'n_estimators': 55, 'max_depth': 3}, mean: 0.52006, std: 0.01972, params: {'n_estimators': 56, 'max_depth': 3}, mean: 0.52085, std: 0.02018, params: {'n_estimators': 57, 'max_depth': 3}, mean: 0.52205, std: 0.01828, params: {'n_estimators': 58, 'max_depth': 3}, mean: 0.52161, std: 0.01984, params: {'n_estimators': 59, 'max_depth': 3}, mean: 0.51742, std: 0.02215, params: {'n_estimators': 30, 'max_depth': 4}, mean: 0.51763, std: 0.02065, params: {'n_estimators': 31, 'max_depth': 4}, mean: 0.51645, std: 0.01972, params: {'n_estimators': 32, 'max_depth': 4}, mean: 0.51907, std: 0.02070, params: {'n_estimators': 33, 'max_depth': 4}, mean: 0.51975, std: 0.02341, params: {'n_estimators': 34, 'max_depth': 4}, mean: 0.52198, std: 0.02147, params: {'n_estimators': 35, 'max_depth': 4}, mean: 0.52104, std: 0.02236, params: {'n_estimators': 36, 'max_depth': 4}, mean: 0.52498, std: 0.02097, params: {'n_estimators': 37, 'max_depth': 4}, mean: 0.52442, std: 0.02191, params: {'n_estimators': 38, 'max_depth': 4}, mean: 0.52546, std: 0.02076, params: {'n_estimators': 39, 'max_depth': 4}, mean: 0.52481, std: 0.02116, params: {'n_estimators': 40, 'max_depth': 4}, mean: 0.52800, std: 0.02001, params: {'n_estimators': 41, 'max_depth': 4}, mean: 0.52720, std: 0.01708, params: {'n_estimators': 42, 'max_depth': 4}, mean: 0.52668, std: 0.01948, params: {'n_estimators': 43, 'max_depth': 4}, mean: 0.52821, std: 0.02001, params: {'n_estimators': 44, 'max_depth': 4}, mean: 0.52732, std: 0.01959, params: {'n_estimators': 45, 'max_depth': 4}, mean: 0.52802, std: 0.02061, params: {'n_estimators': 46, 'max_depth': 4}, mean: 0.52663, std: 0.02144, params: {'n_estimators': 47, 'max_depth': 4}, mean: 0.52795, std: 0.02249, params: {'n_estimators': 48, 'max_depth': 4}, mean: 0.52958, std: 0.02173, params: {'n_estimators': 49, 'max_depth': 4}, mean: 0.52731, std: 0.02430, params: {'n_estimators': 50, 'max_depth': 4}, mean: 0.53118, std: 0.01957, params: {'n_estimators': 51, 'max_depth': 4}, mean: 0.53149, std: 0.01921, params: {'n_estimators': 52, 'max_depth': 4}, mean: 0.52914, std: 0.02020, params: {'n_estimators': 53, 'max_depth': 4}, mean: 0.52686, std: 0.01879, params: {'n_estimators': 54, 'max_depth': 4}, mean: 0.52630, std: 0.02077, params: {'n_estimators': 55, 'max_depth': 4}, mean: 0.52677, std: 0.01997, params: {'n_estimators': 56, 'max_depth': 4}, mean: 0.52549, std: 0.01939, params: {'n_estimators': 57, 'max_depth': 4}, mean: 0.52642, std: 0.01997, params: {'n_estimators': 58, 'max_depth': 4}, mean: 0.52693, std: 0.01985, params: {'n_estimators': 59, 'max_depth': 4}, mean: 0.51575, std: 0.02352, params: {'n_estimators': 30, 'max_depth': 5}, mean: 0.52182, std: 0.02163, params: {'n_estimators': 31, 'max_depth': 5}, mean: 0.52097, std: 0.01945, params: {'n_estimators': 32, 'max_depth': 5}, mean: 0.51591, std: 0.02208, params: {'n_estimators': 33, 'max_depth': 5}, mean: 0.52308, std: 0.01817, params: {'n_estimators': 34, 'max_depth': 5}, mean: 0.51924, std: 0.02348, params: {'n_estimators': 35, 'max_depth': 5}, mean: 0.52248, std: 0.01709, params: {'n_estimators': 36, 'max_depth': 5}, mean: 0.52247, std: 0.01714, params: {'n_estimators': 37, 'max_depth': 5}, mean: 0.52422, std: 0.01703, params: {'n_estimators': 38, 'max_depth': 5}, mean: 0.52264, std: 0.02308, params: {'n_estimators': 39, 'max_depth': 5}, mean: 0.52106, std: 0.01721, params: {'n_estimators': 40, 'max_depth': 5}, mean: 0.51984, std: 0.01775, params: {'n_estimators': 41, 'max_depth': 5}, mean: 0.52538, std: 0.01622, params: {'n_estimators': 42, 'max_depth': 5}, mean: 0.52685, std: 0.01578, params: {'n_estimators': 43, 'max_depth': 5}, mean: 0.52441, std: 0.01790, params: {'n_estimators': 44, 'max_depth': 5}, mean: 0.52584, std: 0.01877, params: {'n_estimators': 45, 'max_depth': 5}, mean: 0.52747, std: 0.01480, params: {'n_estimators': 46, 'max_depth': 5}, mean: 0.52758, std: 0.01554, params: {'n_estimators': 47, 'max_depth': 5}, mean: 0.52416, std: 0.01721, params: {'n_estimators': 48, 'max_depth': 5}, mean: 0.53000, std: 0.01138, params: {'n_estimators': 49, 'max_depth': 5}, mean: 0.52609, std: 0.01717, params: {'n_estimators': 50, 'max_depth': 5}, mean: 0.52376, std: 0.02134, params: {'n_estimators': 51, 'max_depth': 5}, mean: 0.52308, std: 0.02154, params: {'n_estimators': 52, 'max_depth': 5}, mean: 0.53053, std: 0.01417, params: {'n_estimators': 53, 'max_depth': 5}, mean: 0.52712, std: 0.01719, params: {'n_estimators': 54, 'max_depth': 5}, mean: 0.52422, std: 0.01282, params: {'n_estimators': 55, 'max_depth': 5}, mean: 0.52623, std: 0.01714, params: {'n_estimators': 56, 'max_depth': 5}, mean: 0.52172, std: 0.02137, params: {'n_estimators': 57, 'max_depth': 5}, mean: 0.52928, std: 0.01693, params: {'n_estimators': 58, 'max_depth': 5}, mean: 0.52827, std: 0.01913, params: {'n_estimators': 59, 'max_depth': 5}]
############# based on standard predict ################
Accuracy on training data: 0.85
Accuracy on test data:     0.80
[[2156  260]
 [ 402  418]]
########################################################

In [108]:
# Plotting ROC Curves

with sns.color_palette("dark"):
    ax=make_roc("Descision Trees 1",clfTree1  , ytest, Xtest, None, labe=250, proba=True)
    make_roc("Random Forest 1"     ,clfForest1, ytest, Xtest, ax  , labe=250, proba=True);
    make_roc("ADA Boost 1"         ,clfAda1   , ytest, Xtest, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 1"    ,clfGB1    , ytest, Xtest, ax  , labe=250, proba=True, skip=50);



In [109]:
Xtrainb = Xtrain 
ytrainb = ytrain 
Xtestb = Xtest 
ytestb = ytest
Decision Tree - No Gender or Ethnicity

In [40]:
# Descision Tree
target2 = 'RESP_Low_Graduation'
clfTree2 = tree.DecisionTreeClassifier()

parameters = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
clfTree2, Xtrain, ytrain, Xtest, ytest = do_classify(clfTree2, parameters, dftouse, 
                                                     Xnames2, target2, 1, 
                                                     mask=mask, n_jobs = 4, score_func = 'f1')

importance_list = clfTree2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Descision Trees 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'max_depth': 6, 'min_samples_leaf': 8} 0.494143271081 [mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 1}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 2}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 3}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 4}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 5}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 6}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 7}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 8}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 9}, mean: 0.42575, std: 0.21319, params: {'max_depth': 1, 'min_samples_leaf': 10}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 1}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 2}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 3}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 4}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 5}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 6}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 7}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 8}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 9}, mean: 0.40379, std: 0.02731, params: {'max_depth': 2, 'min_samples_leaf': 10}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 1}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 2}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 3}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 4}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 5}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 6}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 7}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 8}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 9}, mean: 0.43806, std: 0.08202, params: {'max_depth': 3, 'min_samples_leaf': 10}, mean: 0.49221, std: 0.02357, params: {'max_depth': 4, 'min_samples_leaf': 1}, mean: 0.49132, std: 0.02530, params: {'max_depth': 4, 'min_samples_leaf': 2}, mean: 0.49221, std: 0.02357, params: {'max_depth': 4, 'min_samples_leaf': 3}, mean: 0.49221, std: 0.02357, params: {'max_depth': 4, 'min_samples_leaf': 4}, mean: 0.49266, std: 0.02270, params: {'max_depth': 4, 'min_samples_leaf': 5}, mean: 0.49266, std: 0.02270, params: {'max_depth': 4, 'min_samples_leaf': 6}, mean: 0.49280, std: 0.02281, params: {'max_depth': 4, 'min_samples_leaf': 7}, mean: 0.49191, std: 0.02236, params: {'max_depth': 4, 'min_samples_leaf': 8}, mean: 0.49146, std: 0.02219, params: {'max_depth': 4, 'min_samples_leaf': 9}, mean: 0.49146, std: 0.02219, params: {'max_depth': 4, 'min_samples_leaf': 10}, mean: 0.47686, std: 0.03212, params: {'max_depth': 5, 'min_samples_leaf': 1}, mean: 0.47750, std: 0.03227, params: {'max_depth': 5, 'min_samples_leaf': 2}, mean: 0.47629, std: 0.03130, params: {'max_depth': 5, 'min_samples_leaf': 3}, mean: 0.47731, std: 0.03108, params: {'max_depth': 5, 'min_samples_leaf': 4}, mean: 0.47734, std: 0.03223, params: {'max_depth': 5, 'min_samples_leaf': 5}, mean: 0.47871, std: 0.03264, params: {'max_depth': 5, 'min_samples_leaf': 6}, mean: 0.47918, std: 0.03399, params: {'max_depth': 5, 'min_samples_leaf': 7}, mean: 0.47817, std: 0.03368, params: {'max_depth': 5, 'min_samples_leaf': 8}, mean: 0.47734, std: 0.03386, params: {'max_depth': 5, 'min_samples_leaf': 9}, mean: 0.47599, std: 0.03414, params: {'max_depth': 5, 'min_samples_leaf': 10}, mean: 0.49060, std: 0.00860, params: {'max_depth': 6, 'min_samples_leaf': 1}, mean: 0.49007, std: 0.00613, params: {'max_depth': 6, 'min_samples_leaf': 2}, mean: 0.49153, std: 0.00674, params: {'max_depth': 6, 'min_samples_leaf': 3}, mean: 0.49132, std: 0.00602, params: {'max_depth': 6, 'min_samples_leaf': 4}, mean: 0.49079, std: 0.00744, params: {'max_depth': 6, 'min_samples_leaf': 5}, mean: 0.49228, std: 0.00997, params: {'max_depth': 6, 'min_samples_leaf': 6}, mean: 0.49139, std: 0.01230, params: {'max_depth': 6, 'min_samples_leaf': 7}, mean: 0.49414, std: 0.01201, params: {'max_depth': 6, 'min_samples_leaf': 8}, mean: 0.49197, std: 0.00882, params: {'max_depth': 6, 'min_samples_leaf': 9}, mean: 0.49050, std: 0.01064, params: {'max_depth': 6, 'min_samples_leaf': 10}, mean: 0.48251, std: 0.01241, params: {'max_depth': 7, 'min_samples_leaf': 1}, mean: 0.48179, std: 0.01118, params: {'max_depth': 7, 'min_samples_leaf': 2}, mean: 0.48280, std: 0.00990, params: {'max_depth': 7, 'min_samples_leaf': 3}, mean: 0.48251, std: 0.00974, params: {'max_depth': 7, 'min_samples_leaf': 4}, mean: 0.48314, std: 0.00653, params: {'max_depth': 7, 'min_samples_leaf': 5}, mean: 0.48121, std: 0.01625, params: {'max_depth': 7, 'min_samples_leaf': 6}, mean: 0.47682, std: 0.01973, params: {'max_depth': 7, 'min_samples_leaf': 7}, mean: 0.48126, std: 0.01714, params: {'max_depth': 7, 'min_samples_leaf': 8}, mean: 0.47783, std: 0.01898, params: {'max_depth': 7, 'min_samples_leaf': 9}, mean: 0.47176, std: 0.01742, params: {'max_depth': 7, 'min_samples_leaf': 10}]
############# based on standard predict ################
Accuracy on training data: 0.82
Accuracy on test data:     0.79
[[2250  186]
 [ 493  307]]
########################################################
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Random Forests - No Gender/Ethnicity

In [41]:
# Random Forests
clfForest2 = RandomForestClassifier()

parameters = {"n_estimators": range(1, 10)}
clfForest2, Xtrain, ytrain, Xtest, ytest = do_classify(clfForest2, parameters, 
                                                       dftouse, Xnames2, target2, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfForest2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Random Forests 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 7} 0.462593097704 [mean: 0.44040, std: 0.02275, params: {'n_estimators': 1}, mean: 0.32863, std: 0.02517, params: {'n_estimators': 2}, mean: 0.44604, std: 0.01882, params: {'n_estimators': 3}, mean: 0.36329, std: 0.02093, params: {'n_estimators': 4}, mean: 0.44538, std: 0.03085, params: {'n_estimators': 5}, mean: 0.40595, std: 0.02123, params: {'n_estimators': 6}, mean: 0.46259, std: 0.00976, params: {'n_estimators': 7}, mean: 0.41053, std: 0.02198, params: {'n_estimators': 8}, mean: 0.45159, std: 0.02418, params: {'n_estimators': 9}]
############# based on standard predict ################
Accuracy on training data: 0.98
Accuracy on test data:     0.78
[[2192  244]
 [ 468  332]]
########################################################
ADA Booster - No Gender/Ethnicity

In [42]:
# ADA Booster
clfAda2 = AdaBoostClassifier()

parameters = {"n_estimators": range(10, 60)}
clfAda2, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda2, parameters, 
                                                       dftouse, Xnames2, target1, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfAda2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the ADA Boost 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ashwindeo/Dev/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 59} 0.487530116517 [mean: 0.44959, std: 0.04826, params: {'n_estimators': 10}, mean: 0.48245, std: 0.04502, params: {'n_estimators': 11}, mean: 0.45702, std: 0.04761, params: {'n_estimators': 12}, mean: 0.46490, std: 0.03840, params: {'n_estimators': 13}, mean: 0.46699, std: 0.04261, params: {'n_estimators': 14}, mean: 0.47000, std: 0.04939, params: {'n_estimators': 15}, mean: 0.45780, std: 0.04545, params: {'n_estimators': 16}, mean: 0.46422, std: 0.04405, params: {'n_estimators': 17}, mean: 0.45915, std: 0.04123, params: {'n_estimators': 18}, mean: 0.46156, std: 0.04584, params: {'n_estimators': 19}, mean: 0.45574, std: 0.04432, params: {'n_estimators': 20}, mean: 0.46278, std: 0.04222, params: {'n_estimators': 21}, mean: 0.46741, std: 0.03892, params: {'n_estimators': 22}, mean: 0.47286, std: 0.03897, params: {'n_estimators': 23}, mean: 0.47174, std: 0.03558, params: {'n_estimators': 24}, mean: 0.47286, std: 0.03758, params: {'n_estimators': 25}, mean: 0.46680, std: 0.04269, params: {'n_estimators': 26}, mean: 0.46453, std: 0.03967, params: {'n_estimators': 27}, mean: 0.46576, std: 0.04108, params: {'n_estimators': 28}, mean: 0.46500, std: 0.03639, params: {'n_estimators': 29}, mean: 0.46535, std: 0.04011, params: {'n_estimators': 30}, mean: 0.45865, std: 0.04223, params: {'n_estimators': 31}, mean: 0.46092, std: 0.03984, params: {'n_estimators': 32}, mean: 0.46319, std: 0.04106, params: {'n_estimators': 33}, mean: 0.46285, std: 0.03822, params: {'n_estimators': 34}, mean: 0.46156, std: 0.03726, params: {'n_estimators': 35}, mean: 0.46185, std: 0.03858, params: {'n_estimators': 36}, mean: 0.46195, std: 0.03877, params: {'n_estimators': 37}, mean: 0.46132, std: 0.03834, params: {'n_estimators': 38}, mean: 0.46515, std: 0.03455, params: {'n_estimators': 39}, mean: 0.47118, std: 0.02666, params: {'n_estimators': 40}, mean: 0.47176, std: 0.02667, params: {'n_estimators': 41}, mean: 0.47253, std: 0.03001, params: {'n_estimators': 42}, mean: 0.47375, std: 0.03102, params: {'n_estimators': 43}, mean: 0.47132, std: 0.03002, params: {'n_estimators': 44}, mean: 0.46948, std: 0.03898, params: {'n_estimators': 45}, mean: 0.47246, std: 0.03769, params: {'n_estimators': 46}, mean: 0.46986, std: 0.03520, params: {'n_estimators': 47}, mean: 0.47064, std: 0.03619, params: {'n_estimators': 48}, mean: 0.48139, std: 0.02891, params: {'n_estimators': 49}, mean: 0.48203, std: 0.02349, params: {'n_estimators': 50}, mean: 0.48527, std: 0.02002, params: {'n_estimators': 51}, mean: 0.48744, std: 0.02193, params: {'n_estimators': 52}, mean: 0.48517, std: 0.01777, params: {'n_estimators': 53}, mean: 0.48519, std: 0.02184, params: {'n_estimators': 54}, mean: 0.48440, std: 0.01747, params: {'n_estimators': 55}, mean: 0.48158, std: 0.02283, params: {'n_estimators': 56}, mean: 0.48090, std: 0.01355, params: {'n_estimators': 57}, mean: 0.48637, std: 0.01877, params: {'n_estimators': 58}, mean: 0.48753, std: 0.01661, params: {'n_estimators': 59}]
############# based on standard predict ################
Accuracy on training data: 0.80
Accuracy on test data:     0.79
[[2255  181]
 [ 503  297]]
########################################################
Gradient Boosting - No Gender/Ethnicity

In [113]:
# Gradient Boosting
clfGB2 = GradientBoostingClassifier()

parameters = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB2, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB2, parameters, 
                                                       dftouse, Xnames2, target1, 1, mask=mask, 
                                                       n_jobs = 4, score_func='f1')

importance_list = clfGB2.feature_importances_
name_list = dftouse[Xnames2].columns
importance_list, name_list = zip(*sorted(zip(importance_list, name_list))[-15:])
plt.barh(range(len(name_list)),importance_list,align='center')
plt.yticks(range(len(name_list)),name_list)
plt.xlabel('Relative Importance in the Gradient Boosting 2')
plt.ylabel('Features')
plt.title('Relative importance of Each Feature')
plt.show()


using mask
/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:5: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
BEST {'n_estimators': 58, 'max_depth': 5} 0.513299695638 [mean: 0.39810, std: 0.01898, params: {'n_estimators': 30, 'max_depth': 1}, mean: 0.40623, std: 0.02288, params: {'n_estimators': 31, 'max_depth': 1}, mean: 0.40466, std: 0.02430, params: {'n_estimators': 32, 'max_depth': 1}, mean: 0.40482, std: 0.02482, params: {'n_estimators': 33, 'max_depth': 1}, mean: 0.40909, std: 0.02374, params: {'n_estimators': 34, 'max_depth': 1}, mean: 0.40987, std: 0.02658, params: {'n_estimators': 35, 'max_depth': 1}, mean: 0.41244, std: 0.02551, params: {'n_estimators': 36, 'max_depth': 1}, mean: 0.41302, std: 0.02511, params: {'n_estimators': 37, 'max_depth': 1}, mean: 0.41079, std: 0.02525, params: {'n_estimators': 38, 'max_depth': 1}, mean: 0.41166, std: 0.02464, params: {'n_estimators': 39, 'max_depth': 1}, mean: 0.41407, std: 0.02193, params: {'n_estimators': 40, 'max_depth': 1}, mean: 0.41637, std: 0.02208, params: {'n_estimators': 41, 'max_depth': 1}, mean: 0.42130, std: 0.02046, params: {'n_estimators': 42, 'max_depth': 1}, mean: 0.42355, std: 0.02416, params: {'n_estimators': 43, 'max_depth': 1}, mean: 0.42004, std: 0.02149, params: {'n_estimators': 44, 'max_depth': 1}, mean: 0.42199, std: 0.02002, params: {'n_estimators': 45, 'max_depth': 1}, mean: 0.42266, std: 0.02105, params: {'n_estimators': 46, 'max_depth': 1}, mean: 0.42084, std: 0.02261, params: {'n_estimators': 47, 'max_depth': 1}, mean: 0.42260, std: 0.02249, params: {'n_estimators': 48, 'max_depth': 1}, mean: 0.42235, std: 0.02268, params: {'n_estimators': 49, 'max_depth': 1}, mean: 0.42112, std: 0.02236, params: {'n_estimators': 50, 'max_depth': 1}, mean: 0.42112, std: 0.02236, params: {'n_estimators': 51, 'max_depth': 1}, mean: 0.42389, std: 0.02846, params: {'n_estimators': 52, 'max_depth': 1}, mean: 0.42668, std: 0.02694, params: {'n_estimators': 53, 'max_depth': 1}, mean: 0.42680, std: 0.02436, params: {'n_estimators': 54, 'max_depth': 1}, mean: 0.42868, std: 0.02869, params: {'n_estimators': 55, 'max_depth': 1}, mean: 0.42869, std: 0.02893, params: {'n_estimators': 56, 'max_depth': 1}, mean: 0.42909, std: 0.02892, params: {'n_estimators': 57, 'max_depth': 1}, mean: 0.43281, std: 0.02490, params: {'n_estimators': 58, 'max_depth': 1}, mean: 0.43343, std: 0.02594, params: {'n_estimators': 59, 'max_depth': 1}, mean: 0.45318, std: 0.02979, params: {'n_estimators': 30, 'max_depth': 2}, mean: 0.45146, std: 0.03099, params: {'n_estimators': 31, 'max_depth': 2}, mean: 0.45076, std: 0.02936, params: {'n_estimators': 32, 'max_depth': 2}, mean: 0.45066, std: 0.03126, params: {'n_estimators': 33, 'max_depth': 2}, mean: 0.45448, std: 0.03033, params: {'n_estimators': 34, 'max_depth': 2}, mean: 0.45566, std: 0.03000, params: {'n_estimators': 35, 'max_depth': 2}, mean: 0.45613, std: 0.03055, params: {'n_estimators': 36, 'max_depth': 2}, mean: 0.45858, std: 0.03063, params: {'n_estimators': 37, 'max_depth': 2}, mean: 0.45918, std: 0.02738, params: {'n_estimators': 38, 'max_depth': 2}, mean: 0.46112, std: 0.02913, params: {'n_estimators': 39, 'max_depth': 2}, mean: 0.46145, std: 0.02990, params: {'n_estimators': 40, 'max_depth': 2}, mean: 0.46098, std: 0.02985, params: {'n_estimators': 41, 'max_depth': 2}, mean: 0.46260, std: 0.02953, params: {'n_estimators': 42, 'max_depth': 2}, mean: 0.46160, std: 0.02991, params: {'n_estimators': 43, 'max_depth': 2}, mean: 0.46137, std: 0.02953, params: {'n_estimators': 44, 'max_depth': 2}, mean: 0.45934, std: 0.03145, params: {'n_estimators': 45, 'max_depth': 2}, mean: 0.46402, std: 0.03234, params: {'n_estimators': 46, 'max_depth': 2}, mean: 0.46411, std: 0.03084, params: {'n_estimators': 47, 'max_depth': 2}, mean: 0.46550, std: 0.03145, params: {'n_estimators': 48, 'max_depth': 2}, mean: 0.46522, std: 0.03291, params: {'n_estimators': 49, 'max_depth': 2}, mean: 0.46689, std: 0.03184, params: {'n_estimators': 50, 'max_depth': 2}, mean: 0.46658, std: 0.03418, params: {'n_estimators': 51, 'max_depth': 2}, mean: 0.46776, std: 0.03569, params: {'n_estimators': 52, 'max_depth': 2}, mean: 0.46844, std: 0.03357, params: {'n_estimators': 53, 'max_depth': 2}, mean: 0.46927, std: 0.03156, params: {'n_estimators': 54, 'max_depth': 2}, mean: 0.46962, std: 0.03196, params: {'n_estimators': 55, 'max_depth': 2}, mean: 0.46975, std: 0.03188, params: {'n_estimators': 56, 'max_depth': 2}, mean: 0.47094, std: 0.03097, params: {'n_estimators': 57, 'max_depth': 2}, mean: 0.47196, std: 0.03103, params: {'n_estimators': 58, 'max_depth': 2}, mean: 0.47110, std: 0.03044, params: {'n_estimators': 59, 'max_depth': 2}, mean: 0.47446, std: 0.02607, params: {'n_estimators': 30, 'max_depth': 3}, mean: 0.47459, std: 0.02699, params: {'n_estimators': 31, 'max_depth': 3}, mean: 0.47755, std: 0.02399, params: {'n_estimators': 32, 'max_depth': 3}, mean: 0.47797, std: 0.02428, params: {'n_estimators': 33, 'max_depth': 3}, mean: 0.48113, std: 0.02590, params: {'n_estimators': 34, 'max_depth': 3}, mean: 0.47944, std: 0.02712, params: {'n_estimators': 35, 'max_depth': 3}, mean: 0.47893, std: 0.02783, params: {'n_estimators': 36, 'max_depth': 3}, mean: 0.48044, std: 0.02928, params: {'n_estimators': 37, 'max_depth': 3}, mean: 0.47786, std: 0.02939, params: {'n_estimators': 38, 'max_depth': 3}, mean: 0.47947, std: 0.02560, params: {'n_estimators': 39, 'max_depth': 3}, mean: 0.47716, std: 0.02693, params: {'n_estimators': 40, 'max_depth': 3}, mean: 0.48047, std: 0.02567, params: {'n_estimators': 41, 'max_depth': 3}, mean: 0.48025, std: 0.02326, params: {'n_estimators': 42, 'max_depth': 3}, mean: 0.48934, std: 0.02394, params: {'n_estimators': 43, 'max_depth': 3}, mean: 0.48947, std: 0.02738, params: {'n_estimators': 44, 'max_depth': 3}, mean: 0.48923, std: 0.02843, params: {'n_estimators': 45, 'max_depth': 3}, mean: 0.49022, std: 0.02928, params: {'n_estimators': 46, 'max_depth': 3}, mean: 0.49094, std: 0.02503, params: {'n_estimators': 47, 'max_depth': 3}, mean: 0.49143, std: 0.02580, params: {'n_estimators': 48, 'max_depth': 3}, mean: 0.49057, std: 0.02614, params: {'n_estimators': 49, 'max_depth': 3}, mean: 0.49144, std: 0.02477, params: {'n_estimators': 50, 'max_depth': 3}, mean: 0.49317, std: 0.02641, params: {'n_estimators': 51, 'max_depth': 3}, mean: 0.49287, std: 0.02652, params: {'n_estimators': 52, 'max_depth': 3}, mean: 0.49200, std: 0.02777, params: {'n_estimators': 53, 'max_depth': 3}, mean: 0.49331, std: 0.02453, params: {'n_estimators': 54, 'max_depth': 3}, mean: 0.49341, std: 0.02597, params: {'n_estimators': 55, 'max_depth': 3}, mean: 0.49309, std: 0.02984, params: {'n_estimators': 56, 'max_depth': 3}, mean: 0.49239, std: 0.03025, params: {'n_estimators': 57, 'max_depth': 3}, mean: 0.49259, std: 0.02961, params: {'n_estimators': 58, 'max_depth': 3}, mean: 0.49613, std: 0.03121, params: {'n_estimators': 59, 'max_depth': 3}, mean: 0.48001, std: 0.01917, params: {'n_estimators': 30, 'max_depth': 4}, mean: 0.48350, std: 0.02274, params: {'n_estimators': 31, 'max_depth': 4}, mean: 0.48239, std: 0.02347, params: {'n_estimators': 32, 'max_depth': 4}, mean: 0.48375, std: 0.02373, params: {'n_estimators': 33, 'max_depth': 4}, mean: 0.48799, std: 0.01644, params: {'n_estimators': 34, 'max_depth': 4}, mean: 0.48846, std: 0.02275, params: {'n_estimators': 35, 'max_depth': 4}, mean: 0.48678, std: 0.02299, params: {'n_estimators': 36, 'max_depth': 4}, mean: 0.49100, std: 0.01773, params: {'n_estimators': 37, 'max_depth': 4}, mean: 0.49123, std: 0.01673, params: {'n_estimators': 38, 'max_depth': 4}, mean: 0.49215, std: 0.01834, params: {'n_estimators': 39, 'max_depth': 4}, mean: 0.49505, std: 0.01933, params: {'n_estimators': 40, 'max_depth': 4}, mean: 0.49500, std: 0.01638, params: {'n_estimators': 41, 'max_depth': 4}, mean: 0.49266, std: 0.01781, params: {'n_estimators': 42, 'max_depth': 4}, mean: 0.49511, std: 0.01927, params: {'n_estimators': 43, 'max_depth': 4}, mean: 0.49578, std: 0.01943, params: {'n_estimators': 44, 'max_depth': 4}, mean: 0.49518, std: 0.02098, params: {'n_estimators': 45, 'max_depth': 4}, mean: 0.49730, std: 0.01809, params: {'n_estimators': 46, 'max_depth': 4}, mean: 0.49440, std: 0.01908, params: {'n_estimators': 47, 'max_depth': 4}, mean: 0.49550, std: 0.01748, params: {'n_estimators': 48, 'max_depth': 4}, mean: 0.49560, std: 0.01870, params: {'n_estimators': 49, 'max_depth': 4}, mean: 0.49696, std: 0.01869, params: {'n_estimators': 50, 'max_depth': 4}, mean: 0.49581, std: 0.01955, params: {'n_estimators': 51, 'max_depth': 4}, mean: 0.49568, std: 0.01821, params: {'n_estimators': 52, 'max_depth': 4}, mean: 0.49215, std: 0.02023, params: {'n_estimators': 53, 'max_depth': 4}, mean: 0.49317, std: 0.01618, params: {'n_estimators': 54, 'max_depth': 4}, mean: 0.49140, std: 0.02321, params: {'n_estimators': 55, 'max_depth': 4}, mean: 0.49515, std: 0.01736, params: {'n_estimators': 56, 'max_depth': 4}, mean: 0.49683, std: 0.02007, params: {'n_estimators': 57, 'max_depth': 4}, mean: 0.49713, std: 0.02063, params: {'n_estimators': 58, 'max_depth': 4}, mean: 0.49376, std: 0.02012, params: {'n_estimators': 59, 'max_depth': 4}, mean: 0.49120, std: 0.02566, params: {'n_estimators': 30, 'max_depth': 5}, mean: 0.49849, std: 0.02308, params: {'n_estimators': 31, 'max_depth': 5}, mean: 0.50097, std: 0.01980, params: {'n_estimators': 32, 'max_depth': 5}, mean: 0.49568, std: 0.02377, params: {'n_estimators': 33, 'max_depth': 5}, mean: 0.49673, std: 0.02092, params: {'n_estimators': 34, 'max_depth': 5}, mean: 0.49923, std: 0.02039, params: {'n_estimators': 35, 'max_depth': 5}, mean: 0.49635, std: 0.02345, params: {'n_estimators': 36, 'max_depth': 5}, mean: 0.49762, std: 0.01946, params: {'n_estimators': 37, 'max_depth': 5}, mean: 0.50041, std: 0.02260, params: {'n_estimators': 38, 'max_depth': 5}, mean: 0.50091, std: 0.02622, params: {'n_estimators': 39, 'max_depth': 5}, mean: 0.50116, std: 0.02211, params: {'n_estimators': 40, 'max_depth': 5}, mean: 0.50233, std: 0.02096, params: {'n_estimators': 41, 'max_depth': 5}, mean: 0.50142, std: 0.02066, params: {'n_estimators': 42, 'max_depth': 5}, mean: 0.50092, std: 0.02235, params: {'n_estimators': 43, 'max_depth': 5}, mean: 0.50052, std: 0.02756, params: {'n_estimators': 44, 'max_depth': 5}, mean: 0.50126, std: 0.02396, params: {'n_estimators': 45, 'max_depth': 5}, mean: 0.50397, std: 0.02141, params: {'n_estimators': 46, 'max_depth': 5}, mean: 0.50665, std: 0.01601, params: {'n_estimators': 47, 'max_depth': 5}, mean: 0.50451, std: 0.02626, params: {'n_estimators': 48, 'max_depth': 5}, mean: 0.50139, std: 0.02355, params: {'n_estimators': 49, 'max_depth': 5}, mean: 0.50371, std: 0.02552, params: {'n_estimators': 50, 'max_depth': 5}, mean: 0.51057, std: 0.02116, params: {'n_estimators': 51, 'max_depth': 5}, mean: 0.50626, std: 0.02595, params: {'n_estimators': 52, 'max_depth': 5}, mean: 0.50954, std: 0.02459, params: {'n_estimators': 53, 'max_depth': 5}, mean: 0.50949, std: 0.02521, params: {'n_estimators': 54, 'max_depth': 5}, mean: 0.51283, std: 0.01722, params: {'n_estimators': 55, 'max_depth': 5}, mean: 0.51151, std: 0.02576, params: {'n_estimators': 56, 'max_depth': 5}, mean: 0.51060, std: 0.02248, params: {'n_estimators': 57, 'max_depth': 5}, mean: 0.51330, std: 0.02178, params: {'n_estimators': 58, 'max_depth': 5}, mean: 0.50467, std: 0.01816, params: {'n_estimators': 59, 'max_depth': 5}]
############# based on standard predict ################
Accuracy on training data: 0.87
Accuracy on test data:     0.78
[[2103  313]
 [ 411  409]]
########################################################

In [114]:
with sns.color_palette("dark"):
    ax=make_roc("Descision Trees 2",clfTree2  , ytest, Xtest, None, labe=250, proba=True)
    make_roc("Random Forest 2"     ,clfForest2, ytest, Xtest, ax  , labe=250, proba=True);
    make_roc("ADA Boost 2"         ,clfAda2   , ytest, Xtest, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 2"    ,clfGB2    , ytest, Xtest, ax  , labe=250, proba=True, skip=50);



In [115]:
Xtrainc = Xtrain 
ytrainc = ytrain 
Xtestc = Xtest 
ytestc = ytest

Final Comparison of All Models


In [116]:
with sns.color_palette("dark"):
    ax = make_roc("svm-all-features",clfsvm, ytesta, Xtesta, None, labe=250, proba=False, skip=50)
    make_roc("svm-feature-selected",pipelinearsvm, ytesta, Xtesta, ax, labe=250, proba=False, skip=50);
    make_roc("svm-all-features-balanced",clfsvm_b, ytesta, Xtesta, ax, labe=250, proba=False, skip=50);
    make_roc("logistic-with-lasso",clflog, ytesta, Xtesta, ax, labe=250, proba=True,  skip=50);
    make_roc("Descision Trees 1",clfTree1  , ytestb, Xtestb, None, labe=250, proba=True)
    make_roc("Random Forest 1"     ,clfForest1, ytestb, Xtestb, ax  , labe=250, proba=True);
    make_roc("ADA Boost 1"         ,clfAda1   , ytestb, Xtestb, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 1"    ,clfGB1    , ytestb, Xtestb, ax  , labe=250, proba=True, skip=50);
    make_roc("Descision Trees 2",clfTree2  , ytestc, Xtestc, None, labe=250, proba=True)
    make_roc("Random Forest 2"     ,clfForest2, ytestc, Xtestc, ax  , labe=250, proba=True);
    make_roc("ADA Boost 2"         ,clfAda2   , ytestc, Xtestc, ax  , labe=250, proba=True, skip=50);
    make_roc("Gradient Boost 2"    ,clfGB2    , ytestc, Xtestc, ax  , labe=250, proba=True, skip=50);


Model of Numerical Graduation Rate


In [344]:
#Number of ccols from above divided by 3 gives the number of rows needed.  So for instance 126/3 = 42.
fig, axs = plt.subplots(43, 3, figsize=(15,100), tight_layout=True)

for item, ax in zip(dftouse[ccols], axs.flat):
    dftouse.plot(kind='scatter', ax=ax, x=item, y='afgr')


ERROR! Session/line number was not unique in database. History logging moved to new session 61