We wanted the ability to compare school districts both against similar districts and statewide. This notebook creates those groupings; the resulting output files feed the Tableau visualization.
The grouping columns were chosen based on the New York State Education Department's definition of similar schools: http://www.p12.nysed.gov/repcrd2004/information/similar-schools/guide.shtml
Start with the standard imports we have used for every notebook in this class.
In [119]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
We need to use the cleaned file instead of the filtered file because the filtered file drops some of the columns we need here.
In [120]:
districts = pd.read_csv("data/finaldata/cleaned.csv")
Apply the filtering logic that we used in the main notebook.
In [121]:
import math
print 'Total number of unique school districts: ' + str(len(np.unique(districts['Agency ID - NCES Assigned [District] Latest available year'])))
districts = districts[districts['Highest Grade Offered [District] 2009-10']=='12th Grade']
print 'Total number of school districts that have high schools: ' + str(len(districts))
districts = districts[districts['SURVYEAR']!='–']
print 'Total number of school districts that have a row on raw graduation data: ' + str(len(districts))
districts = districts[districts['AFGR']>=0]
print 'Total number of school districts with valid graduation data: ' + str(len(districts))
In [122]:
districts=districts.replace([np.inf, -np.inf], np.nan)
Compute the min, 25th percentile, mean, 75th percentile, and max graduation rate (AFGR) by state for export to the visualization.
In [123]:
#CITATION: http://stackoverflow.com/questions/17578115/pass-percentiles-to-pandas-agg-function
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_
In [124]:
stateresults = districts.groupby('State Name [District] Latest available year')['AFGR'].aggregate([np.min, percentile(25), np.mean, percentile(75), np.max])
stateresults.to_csv("data/finaldata/stateresults.csv")
stateresults.head(100)
Out[124]:
In [125]:
districts['Urban-centric Locale [District] 2009-10'].value_counts()
Out[125]:
Calculate the ratio of free and reduced-price lunch students to total students.
In [126]:
districts['r_lunch_free_reduced'] = districts['Total Free and Reduced Lunch Students [Public School] 2009-10']/districts['Total Students [Public School] 2009-10']
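Note that a district reporting zero total students would yield `inf` in the new column; since the `replace([np.inf, -np.inf], np.nan)` call above runs before this cell, any such values would need a similar cleanup afterwards. A small sketch of the pattern with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up enrollment numbers; the middle district has zero total students.
df = pd.DataFrame({'free_reduced': [50, 30, 10], 'total': [100, 0, 40]})
df['r_lunch_free_reduced'] = df['free_reduced'] / df['total']

# Division by zero yields inf; map it to NaN so downstream quantiles ignore it.
df = df.replace([np.inf, -np.inf], np.nan)
```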
Determine the top-quartile and bottom-quartile enrollment boundaries, within each locale, for classifying districts as Large or Small.
In [127]:
topquartile = districts.groupby('Urban-centric Locale [District] 2009-10')[['Total Students [Public School] 2009-10']]
topq = topquartile.quantile(q=.75)
topq.head(20)
Out[127]:
In [128]:
bottomquartile = districts.groupby('Urban-centric Locale [District] 2009-10')[['Total Students [Public School] 2009-10']]
bottomq = bottomquartile.quantile(q=.25)
bottomq.head(20)
Out[128]:
In [129]:
groups = []
for name, group in topquartile:
    groups.append(name)
districts['Student_Size'] = 'Medium'
for index in range(len(groups)):
    districts['Student_Size'] = np.where(
        (districts['Urban-centric Locale [District] 2009-10'] == groups[index]) &
        (districts['Total Students [Public School] 2009-10'] > topq.iloc[index]['Total Students [Public School] 2009-10']),
        'Large', districts['Student_Size'])
    districts['Student_Size'] = np.where(
        (districts['Urban-centric Locale [District] 2009-10'] == groups[index]) &
        (districts['Total Students [Public School] 2009-10'] <= bottomq.iloc[index]['Total Students [Public School] 2009-10']),
        'Small', districts['Student_Size'])
print districts['Student_Size'].value_counts()
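The same labeling can be expressed without the explicit loop over groups by broadcasting per-locale quantiles back onto the rows with `groupby(...).transform`. A sketch on a hypothetical toy frame (the column names here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the districts frame: two locales, eight districts each.
df = pd.DataFrame({
    'locale': ['City'] * 8 + ['Rural'] * 8,
    'students': [100, 200, 300, 400, 500, 600, 700, 800,
                 10, 20, 30, 40, 50, 60, 70, 80],
})

# transform broadcasts each locale's quartile boundary back to its rows,
# so the comparison is a single vectorized expression.
grp = df.groupby('locale')['students']
top = grp.transform(lambda s: s.quantile(.75))
bottom = grp.transform(lambda s: s.quantile(.25))

df['size_label'] = np.where(df['students'] > top, 'Large',
                   np.where(df['students'] <= bottom, 'Small', 'Medium'))
```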
Determine the top-quartile and bottom-quartile boundaries for the free and reduced-price lunch ratio, within each locale and size group.
In [130]:
topquartile = districts.groupby(['Urban-centric Locale [District] 2009-10', 'Student_Size'])[['r_lunch_free_reduced']]
topq = topquartile.quantile(q=.75)
topq.head(20)
Out[130]:
In [131]:
bottomquartile = districts.groupby(['Urban-centric Locale [District] 2009-10', 'Student_Size'])[['r_lunch_free_reduced']]
bottomq = bottomquartile.quantile(q=.25)
bottomq.head(20)
Out[131]:
In [132]:
groups = []
for name, group in topquartile:
    groups.append(name)
districts['Lunch_Status'] = 'Average'
for index in range(len(groups)):
    districts['Lunch_Status'] = np.where(
        (districts['Student_Size'] == groups[index][1]) &
        (districts['Urban-centric Locale [District] 2009-10'] == groups[index][0]) &
        (districts['r_lunch_free_reduced'] > topq.iloc[index]['r_lunch_free_reduced']),
        'High', districts['Lunch_Status'])
    districts['Lunch_Status'] = np.where(
        (districts['Student_Size'] == groups[index][1]) &
        (districts['Urban-centric Locale [District] 2009-10'] == groups[index][0]) &
        (districts['r_lunch_free_reduced'] <= bottomq.iloc[index]['r_lunch_free_reduced']),
        'Low', districts['Lunch_Status'])
print districts['Lunch_Status'].value_counts()
Check that row counts are reasonably evenly distributed across the similar-district groupings.
In [133]:
districts.groupby(['Urban-centric Locale [District] 2009-10', 'Student_Size', 'Lunch_Status'])['AFGR'].count()
Out[133]:
In [134]:
similardistrictresults = districts.groupby(['Urban-centric Locale [District] 2009-10', 'Student_Size', 'Lunch_Status'])['AFGR'].aggregate([np.min, percentile(25), np.mean, percentile(75), np.max])
similardistrictresults.to_csv("data/finaldata/similardistrictresults.csv")
similardistrictresults.head(100)
Out[134]:
In [136]:
districts.head()
Out[136]:
In [137]:
highfull = districts[['Agency ID - NCES Assigned [District] Latest available year', 'Latitude [District] 2009-10', 'Longitude [District] 2009-10', 'Agency Name', 'State Name [District] Latest available year', 'Urban-centric Locale [District] 2009-10', 'Student_Size', 'Lunch_Status', 'AFGR', 'i_agency_type_regional_education_services', 'i_lgo_PK', 'i_lgo_K', 'Total Number Operational Schools [Public School] 2009-10', 'i_fin_sdlc_sec', 'r_frev_ao', 'r_frev_dis', 'Total Expenditures (TOTALEXP) per Pupil (V33) [District Finance] 2009-10', 'r_stud_reg_12_W_M', 'r_stud_reg_12_W_F']]
In [138]:
lowfull = districts[['Agency ID - NCES Assigned [District] Latest available year', 'i_fin_sdlc_elem_sec', 'i_fin_sdlc_voc', 'r_stud_reg_12_W_F', 'i_lgo_PK', 'r_stud_reg_12_W_M', 'i_lgo_K', 'i_agency_type_local_school_district', 'r_frev_ao', 'r_stud_re_B', 'r_stud_912']]
In [139]:
highnoge = districts[['Agency ID - NCES Assigned [District] Latest available year', 'i_lgo_K', 'i_lgo_PK', 'i_fin_sdlc_sec', 'r_st_TS', 'Total Number Operational Schools [Public School] 2009-10', 'Total Expenditures (TOTALEXP) per Pupil (V33) [District Finance] 2009-10', 'r_frev_ao', 'r_frev_dis', 'r_stud_912', 'i_fin_sdlc_voc']]
In [140]:
lownoge = districts[['Agency ID - NCES Assigned [District] Latest available year', 'i_lgo_K', 'i_lgo_PK', 'r_st_TS', 'i_fin_sdlc_voc', 'Total Expenditures (TOTALEXP) per Pupil (V33) [District Finance] 2009-10', 'r_stud_912', 'r_frev_ao', 'i_ma_metropolitan', 'i_ucl_city_small', 'i_ma_micropolitan']]
In [141]:
finalvisdistrict = highfull.merge(lowfull, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_LOWFULL'))
finalvisdistrict = finalvisdistrict.merge(highnoge, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_HIGHNOGE'))
finalvisdistrict = finalvisdistrict.merge(lownoge, 'left', 'Agency ID - NCES Assigned [District] Latest available year', suffixes=('', '_LOWNOGE'))
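The empty-string first suffix keeps the original column names from the left frame, while overlapping non-key columns coming from the right-hand frames get the `_LOWFULL`/`_HIGHNOGE`/`_LOWNOGE` suffix. A toy illustration of the mechanics (frames and columns here are hypothetical):

```python
import pandas as pd

# Two frames sharing the join key 'id' and one overlapping column 'score'.
left = pd.DataFrame({'id': [1, 2], 'score': [0.9, 0.8], 'name': ['a', 'b']})
right = pd.DataFrame({'id': [1, 2], 'score': [0.7, 0.6]})

# The empty left suffix preserves the left frame's column names;
# only the overlapping right-hand column is renamed.
merged = left.merge(right, 'left', 'id', suffixes=('', '_RIGHT'))
```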
In [142]:
finalvisdistrict.head()
Out[142]:
In [143]:
finalvisdistrict.to_csv("data/finaldata/tableaudistricts.csv", index=False)