Generate data needed for visualization in Tableau

Start with the standard imports we have used for every notebook in this class.


In [45]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Each of the datasheets downloaded from ELSI had download metadata on the top of them and total and key information on the bottom of them that were not data rows. This metadata, total, and key information was manually deleted before import. Some of the files had ="0" instead of 0 in the cells. This was found/replaced before import using the sed -i '' 's/="0"/0/g' *.csv command from the terminal.


In [46]:
#CITATION: This is the data from National Center for Education Statistics on Schools
#Schools for all 50 states and Washington, D.C.
#http://nces.ed.gov/ccd/elsi/
#ELSI Root Data Source: U.S. Department of Education National Center for Education Statistics Common Core of Data (CCD) "Public Elementary/Secondary School Universe Survey" 2012-13 v.1a  2013-14 v.1a.
#KEY:
#† indicates that the data are not applicable.
#– indicates that the data are missing.
#‡ indicates that the data do not meet NCES data quality standards.

schoolinformation = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL Information Tab.csv", dtype=np.str)
schoolcharacteristicsa = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL CharacteristicsA Tab.csv", dtype=np.str)
schoolcharacteristicsb = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL CharacteristicsB Tab.csv", dtype=np.str)
schoolenrollment = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL Enrollments Tab.csv", dtype=np.str)
schoolenrollmentdetails = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL Enrollment Details Tab.csv", dtype=np.str)
#schoolenrollmentK3 = pd.read_csv("tempdata/2013-2014 SCHOOL Enrollment by Grade, Race-Ethnicity, and Gender Tab PreK-3.csv", dtype=np.str)
#schoolenrollment48 = pd.read_csv("tempdata/2013-2014 SCHOOL Enrollment by Grade, Race-Ethnicity and Gender Tab 4-8.csv", dtype=np.str)
#schoolenrollment912 = pd.read_csv("tempdata/2013-2014 SCHOOL Enrollment by Grade, Race-Ethnicity and Gender Tab 9-Ungraded.csv", dtype=np.str)
schoolteacherstaff = pd.read_csv("data/rawdata/schools/2009-2010 SCHOOL Teacher and Staff Tab.csv", dtype=np.str)

Check the lengths of the datasets to see if we have a row for every school.


In [47]:
print len(schoolinformation)
print len(schoolcharacteristicsa)
print len(schoolcharacteristicsb)
print len(schoolenrollment)
print len(schoolenrollmentdetails)
#print len(schoolenrollmentK3)
#print len(schoolenrollment48)
#print len(schoolenrollment912)
print len(schoolteacherstaff)


106634
99804
99804
99804
99804
99804

Drop all of the duplicate columns.


In [49]:
schoolcharacteristicsa = schoolcharacteristicsa.drop(schoolcharacteristicsa.columns[[0, 1, 20]], 1)
schoolcharacteristicsb = schoolcharacteristicsb.drop(schoolcharacteristicsb.columns[[0, 1]], 1)
schoolenrollment = schoolenrollment.drop(schoolenrollment.columns[[0, 1]], 1)
schoolenrollmentdetails = schoolenrollmentdetails.drop(schoolenrollmentdetails.columns[[0, 1]], 1)
#schoolenrollmentK3 = schoolenrollmentK3.drop(schoolenrollmentK3.columns[[0, 1]], 1)
#schoolenrollment48 = schoolenrollment48.drop(schoolenrollment48.columns[[0, 1]], 1)
#schoolenrollment912 = schoolenrollment912.drop(schoolenrollment912.columns[[0, 1, 72]], 1)

In [50]:
schoolinformation.head()


Out[50]:
School Name State Name [Public School] Latest available year State Name [Public School] 2009-10 State Abbr [Public School] Latest available year School Name [Public School] 2009-10 School ID - NCES Assigned [Public School] Latest available year Agency Name [Public School] 2009-10 Agency ID - NCES Assigned [Public School] Latest available year County Name [Public School] 2009-10 County Number [Public School] 2009-10 Race/Ethnicity Category [Public School] 2009-10 ANSI/FIPS State Code [Public School] Latest available year Years School Reported Data [Public School] Latest available year Years School Did Not Report Data [Public School] Latest available year Location Address [Public School] 2013-14 Location City [Public School] 2013-14 Location State Abbr [Public School] 2013-14 Location ZIP [Public School] 2013-14 Location ZIP4 [Public School] 2013-14 Mailing Address [Public School] 2013-14 Mailing City [Public School] 2013-14 Mailing State Abbr [Public School] 2013-14 Mailing ZIP [Public School] 2013-14 Mailing ZIP4 [Public School] 2013-14 Phone Number [Public School] 2013-14
0 EDWARD "DUKE" ELLINGTON @ BECKHAM Michigan Michigan MI EDWARD (DUKE) ELLINGTON CONSERVATORY OF MUSIC/ART 261200004838 DETROIT CITY SCHOOL DISTRICT 2612000 WAYNE COUNTY 26163 Reported 5 categories 26 1986-2013 NaN 9860 PARK DR DETROIT MI 48213 7102 9860 PARK DRIVE DETROIT MI 48213 7102 3138528500
1 J.M. "CHICK" BUCKBEE JUVENILE CENTER West Virginia West Virginia WV J.M. "CHICK" BUCKBEE JUVENILE CENTER 540051101159 INSTITUTIONAL EDUCATIONAL PROGRAMS 5400511 HAMPSHIRE COUNTY 54027 Reported 7 categories 54 2003-2013 1986-2002 1 JERRY LN AUGUSTA WV 26704 #1 JERRY LANE AUGUSTA WV 26704 3044961369
2 JAMES H. "TIGER" MORTON JUVENILE CENTER West Virginia West Virginia WV JAMES H. "TIGER" MORTON JUVENILE CENTER 540051101146 INSTITUTIONAL EDUCATIONAL PROGRAMS 5400511 KANAWHA COUNTY 54039 Reported 7 categories 54 2003-2013 1986-2002 60 MANFRED HOLLAND WAY DUNBAR WV 25064 60 MANFRED HOLLAND WAY DUNBAR WV 25064 3047663502
3 MARY HARRIS "MOTHER" JONES ELEMENTARY Maryland Maryland MD MARY HARRIS "MOTHER" JONES ELEMENTARY 240051001257 PRINCE GEORGE'S COUNTY PUBLIC SCHOOLS 2400510 PRINCE GEORGE'S COUNTY 24033 Reported 5 categories 24 2002-2013 1986-2001 2405 TECUMSEH ST ADELPHI MD 20783 1658 2405 TECUMSEH ST ADELPHI MD 20783 1658 3014087900
4 PEMAYETV EMAHAKV CHARTER "OUR WAY SCHOOL" Florida Florida FL PEMAYETV EMAHAKV "OUR WAY SCHOOL" 120066004849 GLADES 1200660 GLADES COUNTY 12043 Reported 5 categories 12 2006-2013 1986-2005 100 EAST HARNEY POND RD NE OKEECHOBEE FL 34974 2867 100 E HARNEY POND RD NE OKEECHOBEE FL 34974 2867 8634672501

Join all of the school datasets.


In [51]:
joinedschool = schoolteacherstaff.join([schoolcharacteristicsa, schoolcharacteristicsb, schoolenrollment, schoolenrollmentdetails])

In [52]:
joinedschool = schoolinformation.merge(joinedschool, 'left', 'School ID - NCES Assigned [Public School] Latest available year', suffixes=('', '_DEL'))

In [55]:
#Need to get rid of Excel syntax ="" from some of the columns
for i, col in enumerate(joinedschool.columns):
    joinedschool[col] = joinedschool[col].map(lambda x: str(x).lstrip('="').rstrip('"'))

In [ ]:
#If by chance any rows have NaN, replace with the ELSI standard for missing data '–'
joinedschool = joinedschool.fillna('–')
joinedschool = joinedschool.replace('nan', '–')

In [57]:
# Replacing Missing Data / NA / Bad Quality data with blank, later to be turned into NaN for float columns
# CITATIOIN : http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html

joinedschool = joinedschool.replace('\xe2\x80\x93', '') # Replace "-" (Missing Data) with blank
joinedschool = joinedschool.replace('\xe2\x80\xa0', '') # Replace "†" (Not Applicable) with blank
joinedschool = joinedschool.replace('\xe2\x80\xa1', '') # Replace "‡" (Bad Quality) with blank

In [62]:
joinedschool['i_agency_type_regional_education_services'] = np.where(joinedschool['Agency Type [District] 2009-10']=='4-Regional education services agency', 1, 0)
joinedschool['i_agency_type_local_school_district'] = np.where(joinedschool['Agency Type [District] 2009-10']=='1-Local school district', 1, 0)
joinedschool['i_lgo_PK'] = np.where(joinedschool['Lowest Grade Offered [Public School] 2009-10']=='Prekindergarten', 1, 0)
joinedschool['i_lgo_K'] = np.where(joinedschool['Lowest Grade Offered [Public School] 2009-10']=='Kindergarten', 1, 0)
joinedschool['Black Students [Public School] 2009-10'] = joinedschool['Black Students [Public School] 2009-10'].replace('', np.nan)
joinedschool['Grades 9-12 Students [Public School] 2009-10'] = joinedschool['Grades 9-12 Students [Public School] 2009-10'].replace('', np.nan)
joinedschool['Total Students [Public School] 2009-10'] = joinedschool['Total Students [Public School] 2009-10'].replace('', np.nan)
joinedschool['Black Students [Public School] 2009-10'] = joinedschool['Black Students [Public School] 2009-10'].astype(float)
joinedschool['Grades 9-12 Students [Public School] 2009-10'] = joinedschool['Grades 9-12 Students [Public School] 2009-10'].astype(float)
joinedschool['Total Students [Public School] 2009-10'] = joinedschool['Total Students [Public School] 2009-10'].astype(float)
joinedschool['r_stud_re_B'] = joinedschool['Black Students [Public School] 2009-10']/joinedschool['Total Students [Public School] 2009-10']
joinedschool['r_stud_912'] = joinedschool['Grades 9-12 Students [Public School] 2009-10']/joinedschool['Total Students [Public School] 2009-10']
joinedschool['r_st_TS'] = joinedschool['Pupil/Teacher Ratio [Public School] 2009-10']


/Users/ChaserAcer/anaconda/lib/python2.7/site-packages/pandas/core/common.py:516: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x

In [65]:
joinedschool = joinedschool[['School Name [Public School] 2009-10', 'State Name [Public School] Latest available year', 'School ID - NCES Assigned [Public School] Latest available year', 'Agency ID - NCES Assigned [Public School] Latest available year', 'i_agency_type_regional_education_services', 'i_agency_type_local_school_district', 'i_lgo_PK', 'i_lgo_K', 'r_stud_re_B', 'r_stud_912', 'r_st_TS']]

In [66]:
joinedschool.head()


Out[66]:
School Name [Public School] 2009-10 State Name [Public School] Latest available year School ID - NCES Assigned [Public School] Latest available year Agency ID - NCES Assigned [Public School] Latest available year i_agency_type_regional_education_services i_agency_type_local_school_district i_lgo_PK i_lgo_K r_stud_re_B r_stud_912 r_st_TS
0 EDWARD (DUKE) ELLINGTON CONSERVATORY OF MUSIC/ART Michigan 261200004838 2612000 0 1 0 1 0.990050 NaN 19.45
1 J.M. "CHICK" BUCKBEE JUVENILE CENTER West Virginia 540051101159 5400511 0 0 0 0 0.000000 0.888889 4.50
2 JAMES H. "TIGER" MORTON JUVENILE CENTER West Virginia 540051101146 5400511 0 0 0 0 0.266667 0.800000 7.50
3 MARY HARRIS "MOTHER" JONES ELEMENTARY Maryland 240051001257 2400510 0 1 1 0 0.129076 NaN 13.83
4 PEMAYETV EMAHAKV "OUR WAY SCHOOL Florida 120066004849 1200660 0 1 0 1 0.019900 NaN 9.14

In [67]:
joinedschool=joinedschool.replace([np.inf, -np.inf], np.nan)

In [68]:
joinedschool.fillna(value=0,inplace=True)

In [69]:
joinedschool.head()


Out[69]:
School Name [Public School] 2009-10 State Name [Public School] Latest available year School ID - NCES Assigned [Public School] Latest available year Agency ID - NCES Assigned [Public School] Latest available year i_agency_type_regional_education_services i_agency_type_local_school_district i_lgo_PK i_lgo_K r_stud_re_B r_stud_912 r_st_TS
0 EDWARD (DUKE) ELLINGTON CONSERVATORY OF MUSIC/ART Michigan 261200004838 2612000 0 1 0 1 0.990050 0.000000 19.45
1 J.M. "CHICK" BUCKBEE JUVENILE CENTER West Virginia 540051101159 5400511 0 0 0 0 0.000000 0.888889 4.50
2 JAMES H. "TIGER" MORTON JUVENILE CENTER West Virginia 540051101146 5400511 0 0 0 0 0.266667 0.800000 7.50
3 MARY HARRIS "MOTHER" JONES ELEMENTARY Maryland 240051001257 2400510 0 1 1 0 0.129076 0.000000 13.83
4 PEMAYETV EMAHAKV "OUR WAY SCHOOL Florida 120066004849 1200660 0 1 0 1 0.019900 0.000000 9.14

In [71]:
joinedschool.to_csv("data/finaldata/tableauschools.csv")

This is the full list of all column names in the schools dataframe.


In [72]:
for col in joinedschool.columns:
    print col


School Name [Public School] 2009-10
State Name [Public School] Latest available year
School ID - NCES Assigned [Public School] Latest available year
Agency ID - NCES Assigned [Public School] Latest available year
i_agency_type_regional_education_services
i_agency_type_local_school_district
i_lgo_PK
i_lgo_K
r_stud_re_B
r_stud_912
r_st_TS

In [ ]: