In this notebook we will clean and explore the 2017 High Schools dataset by the NYC Department of Education.
Let's start by opening and examining it.
In [1]:
import pandas as pd
all_high_schools = pd.read_csv('data/DOE_High_School_Directory_2017.csv')
all_high_schools.shape
Out[1]:
In [2]:
pd.set_option('display.max_columns', 453)
all_high_schools.head(3)
Out[2]:
As you can see there are over 400 columns, so let's keep only the columns of interest.
Also notice that the boys column is a flag for boys-only schools. Since we are trying to help solve the problem of women in tech, it wouldn't make sense to keep them - let's filter them out.
In [3]:
boys_only = all_high_schools['boys'] == 1
columns_of_interest = ['dbn', 'school_name', 'boro', 'academicopportunities1',
'academicopportunities2', 'academicopportunities3',
'academicopportunities4', 'academicopportunities5', 'neighborhood',
'location', 'subway', 'bus', 'total_students', 'start_time', 'end_time',
'graduation_rate', 'attendance_rate', 'pct_stu_enough_variety',
'college_career_rate', 'girls', 'specialized', 'earlycollege',
'program1', 'program2', 'program3', 'program4', 'program5', 'program6',
'program7', 'program8', 'program9', 'program10', 'interest1',
'interest2', 'interest3', 'interest4', 'interest5', 'interest6',
'interest7', 'interest8', 'interest9', 'interest10', 'city', 'zip']
# copy so we can add columns later without a SettingWithCopyWarning
df = all_high_schools[~boys_only][columns_of_interest].copy()
df.set_index('dbn', inplace=True)
df.shape
Out[3]:
In [4]:
df.head(3)
Out[4]:
Let's now make a quick comparison of college_career_rate in girls-only schools vs. mixed ones.
In [5]:
df['all'] = ""
df['girls'] = df['girls'].map({1: 'Girls-only'}).fillna('Mixed')
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
ax = sns.violinplot(data=df, x='all', y="college_career_rate", hue="girls", split=True)
sns.despine(left=True)
ax.set_xlabel("")
plt.suptitle('College Career Rate by Type of School (Mixed or Girls-only)')
plt.savefig('figures/girls-only.png', bbox_inches='tight')
Now notice that there are 5 columns on "academic opportunities", 10 columns on "programs", and 10 more columns on "interests", and that in each of these areas some schools might offer something tech-related. For each school, let's check whether any of those columns contains tech-related words, and let's call that "tech inclination".
In [7]:
import numpy as np
def contains_terms(column_name, terms=["tech"]):
    """Checks if at least one of the terms is present in the given column."""
    contains = []
    for term in terms:
        contains.append(df[column_name].str.contains(term, case=False))
    not_null = df[column_name].notnull()
    return not_null & np.any(contains, axis=0)


def contains_terms_columns(column_root, n_columns, terms=["tech"]):
    """Checks if at least one of the terms is present in the columns given by its root name."""
    if n_columns == 1:
        return contains_terms(column_root, terms)
    tech = []
    for i in range(n_columns):
        column_name = column_root + str(i + 1)
        tech.append(contains_terms(column_name, terms))
    return np.any(tech, axis=0)
In [8]:
tech_academicopportunities = contains_terms_columns('academicopportunities', 5,
terms=['technology', 'computer', 'web',
'programming', 'coding'])
len(df[tech_academicopportunities])
Out[8]:
In [9]:
# searching for 'tech' might match the word 'technical'
all_tech_program = contains_terms_columns('program', 10, terms=['programming', 'computer',
'tech'])
technical_program = contains_terms_columns('program', 10, terms=['technical'])
tech_program = (all_tech_program) & ~(technical_program)
len(df[tech_program])
Out[9]:
In [10]:
tech_interest = contains_terms_columns('interest', 10, terms=['computer', 'technology'])
len(df[tech_interest])
Out[10]:
In [11]:
tech_inclined = (tech_academicopportunities) | (tech_program) | (tech_interest)
print(len(df[tech_inclined]))
print("{:.1f}%".format(100 * len(df[tech_inclined]) / len(df)))
Since 46% of the schools are tech-inclined, and our assumption here was that 200 high schools would be enough, let's use only tech-inclined schools going forward. It could also help the canvassing team to be talking to female students from schools that have some tech inclination.
However, let's first see how the schools compare with each other when we take that into consideration.
In [12]:
df['tech_academicopportunities'] = tech_academicopportunities.astype(int)
df['tech_program'] = tech_program.astype(int)
df['tech_interest'] = tech_interest.astype(int)
df.head(3)
Out[12]:
In [13]:
def fill_tech_summary(academicopportunities, program, interest):
    if academicopportunities:
        if program:
            if interest:
                return 'tech_academicopportunities+program+interest'
            else:
                return 'tech_academicopportunities+program'
        elif interest:
            return 'tech_academicopportunities+interest'
        else:
            return 'tech_academicopportunities'
    elif program:
        if interest:
            return 'tech_program+interest'
        else:
            return 'tech_program'
    elif interest:
        return 'tech_interest'
    else:
        return 'no_tech_inclination'
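The same labels could also be built by joining the names of whichever flags are set. A minimal sketch of that alternative (the function name is hypothetical and it is not used below):

def fill_tech_summary_alt(academicopportunities, program, interest):
    """Hypothetical alternative: join the names of the flags that are set."""
    parts = [name for name, flag in [('academicopportunities', academicopportunities),
                                     ('program', program),
                                     ('interest', interest)] if flag]
    # e.g. ['program', 'interest'] -> 'tech_program+interest'
    return 'tech_' + '+'.join(parts) if parts else 'no_tech_inclination'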
In [14]:
df['tech_summary'] = df.apply(lambda x: fill_tech_summary(x.loc['tech_academicopportunities'],
x.loc['tech_program'],
x.loc['tech_interest']),
axis='columns')
df['tech_summary'].head()
Out[14]:
In [15]:
fig, ax = plt.subplots(figsize=(20, 10))
ax = sns.violinplot(data=df, x='all', y="college_career_rate", hue="tech_summary", ax=ax,
hue_order=['no_tech_inclination', 'tech_interest', 'tech_program',
'tech_academicopportunities', 'tech_program+interest',
'tech_academicopportunities+program',
'tech_academicopportunities+interest',
'tech_academicopportunities+program+interest'])
sns.despine(left=True)
ax.set_xlabel("")
plt.suptitle('College Career Rate by Types of Tech Inclination')
plt.savefig('figures/types-tech-inclination.png', bbox_inches='tight')
In [16]:
def fill_tech_summary_compact(academicopportunities, program, interest):
    if academicopportunities or program or interest:
        return 'tech_inclined'
    else:
        return 'not_tech_inclined'
In [17]:
df['tech_summary_compact'] = df.apply(lambda x: fill_tech_summary_compact(
x.loc['tech_academicopportunities'],
x.loc['tech_program'],
x.loc['tech_interest']),
axis='columns')
df['tech_summary_compact'].head()
Out[17]:
In [18]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
ax1 = sns.violinplot(data=df, x='all', y="college_career_rate", hue="tech_summary_compact",
split=True, inner="quartile", ax=ax1)
ax2 = sns.violinplot(data=df, x='all', y="total_students", hue="tech_summary_compact",
split=True, inner="quartile", ax=ax2)
ax1.set_xlabel("")
ax2.set_xlabel("")
sns.despine(left=True)
plt.suptitle('College Career Rate and Total Students by Tech Inclination')
plt.savefig('figures/breakdown-tech-inclination.png', bbox_inches='tight')
We can see from the violin plots above that even though tech-inclined high schools have a slightly higher median college career rate, their 25th and 75th percentiles are slightly lower. On the other hand, most high schools with 1500 or more students seem to have some kind of tech inclination.
In [19]:
new_columns = ['school_name', 'boro', 'tech_academicopportunities', 'neighborhood', 'location',
'subway', 'bus', 'total_students', 'start_time', 'end_time', 'graduation_rate',
'attendance_rate', 'pct_stu_enough_variety', 'college_career_rate', 'girls',
'specialized', 'earlycollege', 'tech_program', 'tech_interest', 'city', 'zip']
tech_schools = df[tech_inclined][new_columns]
tech_schools.head(3)
Out[19]:
Let's now shift our focus to the graduation_rate and college_career_rate columns. In particular, college_career_rate's definition is "at the end of the 2014-15 school year, the percent of students who graduated 'on time' by earning a diploma four years after they entered 9th grade".
We could multiply that by the total number of students in each school to calculate the potential number of college students each school has.
In [20]:
fig, ax = plt.subplots(figsize=(20, 8))
ax.set_xlim(0, 1.02)
ax.set_ylim(0, 1.05)
sns.regplot(x=tech_schools['graduation_rate'], y=tech_schools['college_career_rate'], order=3, ax=ax)
ax.set_xlabel('Graduation Rate')
ax.set_ylabel('College Career Rate')
plt.suptitle('College Career Rate by Graduation Rate')
plt.savefig('figures/college-career-and-graduation-rate.png', bbox_inches='tight')
We can see that graduation_rate and college_career_rate have a strong correlation. That means that if college_career_rate has too many null values we can use graduation_rate as a proxy.
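As a minimal sketch of what that proxy could look like (an assumption, not something applied in this notebook), the missing college_career_rate values could simply be filled with each school's graduation_rate:

# Hypothetical proxy fill; assumes graduation_rate is an acceptable stand-in
college_career_proxy = tech_schools['college_career_rate'].fillna(tech_schools['graduation_rate'])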
In [21]:
potential = tech_schools['college_career_rate'] * tech_schools['total_students']
potential.sort_values(inplace=True, ascending=False)
potential
Out[21]:
In [22]:
null_college_career_rate = tech_schools.college_career_rate.isnull()
print("{:.1f}%".format(100 * len(tech_schools[null_college_career_rate]) / len(tech_schools)))
In [23]:
null_graduation_rate = tech_schools.graduation_rate.isnull()
print("{:.1f}%".format(100 * len(tech_schools[null_graduation_rate]) / len(tech_schools)))
In [24]:
print("{:.1f}%".format(100 * len(tech_schools[(null_college_career_rate) & \
(null_graduation_rate)]) \
/ len(tech_schools)))
In [25]:
fig, ax = plt.subplots(figsize=(20, 8))
sns.distplot(tech_schools['total_students'], bins=range(0, 6000, 250), kde=False, rug=True)
ax.set_xlabel('Total Students')
ax.set_ylabel('Number of Schools with that Many Students')
plt.suptitle('Number of Schools by Total Students')
Out[25]:
In [26]:
tech_schools[(null_college_career_rate) & (null_graduation_rate)]['total_students'].max()
Out[26]:
It seems that 14% of the schools don't have a figure for the college career rate. Let's plot its distribution to help decide whether we should ignore the column or drop the schools without that data.
In [27]:
import numpy as np
fig, ax = plt.subplots(figsize=(20, 8))
ax.set_xlim(0.05, 1.15)
schools_to_plot = tech_schools[~(null_college_career_rate)]
sns.distplot(schools_to_plot['college_career_rate'], bins=np.arange(0, 1, 0.1))
ax.set_xlabel('College Career Rate')
ax.set_ylabel('Number of Schools with that Rate')
plt.suptitle('Number of Schools by College Career Rate')
fig.savefig('figures/college-career-rate.png', bbox_inches='tight')
Since some schools have a really low college career rate, let's keep using that column and filter out the schools that don't have that data point.
Let's do that and also plot the distribution of schools by their number of potential college students.
In [28]:
# Copy to avoid chained indexing and the SettingWithCopy warning (http://bit.ly/2kkXW5B)
tech_col_potential = pd.DataFrame(tech_schools, copy=True)
tech_col_potential.dropna(subset=['college_career_rate'], inplace=True)
tech_col_potential['potential_college_students'] = (tech_col_potential['total_students'] *\
tech_col_potential['college_career_rate'])\
.astype(int)
tech_col_potential.sort_values('potential_college_students', inplace=True, ascending=False)
tech_col_potential.head(3)
Out[28]:
In [29]:
fig, ax = plt.subplots(figsize=(20, 8))
sns.distplot(tech_col_potential['potential_college_students'], bins=range(0, 6000, 250),
kde=False, rug=True)
ax.set_xlabel('Potential College Students')
ax.set_ylabel('Number of Schools')
plt.suptitle('Number of Schools by Potential College Students')
plt.savefig('figures/potential-college-students.png', bbox_inches='tight')
In [30]:
high_potential = tech_col_potential['potential_college_students'] > 1000
high_potential_schools = tech_col_potential[high_potential]
len(high_potential_schools)
Out[30]:
There seems to be a big gap between the number of schools with more than 1000 potential college students and the number of schools with fewer than that.
Since we want to reduce the number of recommended stations by at least 90%, and there are 24 schools with more than 1000 potential college students, let's keep those and ignore the other ones.
Next, let's examine the subway and bus columns, which tell us which subway and bus lines are near each school.
In [31]:
high_potential_schools.loc[:, ('subway', 'bus')]
Out[31]:
In [32]:
high_potential_schools['subway_nearby'] = high_potential_schools.apply(
    lambda x: 'no subway' if pd.isnull(x['subway']) else 'subway nearby',
    axis='columns')
high_potential_schools['subway_nearby']
Out[32]:
In [33]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
high_potential_schools['all'] = ""
ax1 = sns.violinplot(data=high_potential_schools, x='all', y="college_career_rate",
hue="subway_nearby", split=True, inner="quartile", ax=ax1)
ax2 = sns.violinplot(data=high_potential_schools, x='all', y="total_students",
hue="subway_nearby", split=True, inner="quartile", ax=ax2)
ax1.set_xlabel("")
ax2.set_xlabel("")
sns.despine(left=True)
plt.suptitle('College Career Rate and Total Students by Subway Nearby')
fig.savefig('figures/subway-vs-no-subway.png', bbox_inches='tight')
Notice how the 75th percentile of the college career rate is much higher in high schools with a subway nearby. Also notice that the schools with the highest numbers of students all seem to have a subway nearby.
Going forward we will filter out schools without a subway station nearby.
In [34]:
# Copy to avoid chained indexing and the SettingWithCopy warning (http://bit.ly/2kkXW5B)
close_to_subway = pd.DataFrame(high_potential_schools, copy=True)
close_to_subway.dropna(subset=['subway'], inplace=True)
close_to_subway
Out[34]:
Let's turn our attention to the location column. We have to extract latitude and longitude in order to match this dataset with the subway stations' coordinates. Let's use add_coord_columns(), which is defined in coordinates.py.
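The coordinates.py module is not shown here. As a rough sketch, assuming each location value ends with a "(latitude, longitude)" pair, such a function might look like this:

def add_coord_columns(frame, location_column):
    """Hypothetical sketch: pull a trailing '(lat, lon)' pair out of the location string."""
    coords = frame[location_column].str.extract(r'\(([-\d.]+),\s*([-\d.]+)\)\s*$')
    frame['latitude'] = coords[0].astype(float)
    frame['longitude'] = coords[1].astype(float)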
In [35]:
import coordinates as coord
coord.add_coord_columns(close_to_subway, 'location')
close_to_subway.loc[:, ('latitude', 'longitude')]
Out[35]:
Let's plot the schools' coordinates to see their geographical distribution:
In [36]:
!pip install folium
In [37]:
import folium
close_to_subway_map = folium.Map([40.72, -73.92], zoom_start=11, tiles='CartoDB positron',
width='60%')
for i, school in close_to_subway.iterrows():
    marker = folium.RegularPolygonMarker([school['latitude'], school['longitude']],
                                         popup=school['school_name'], color='RoyalBlue',
                                         fill_color='RoyalBlue', radius=5)
    marker.add_to(close_to_subway_map)
close_to_subway_map.save('maps/close_to_subway.html')
close_to_subway_map
Out[37]:
The interactive map is available here.
It seems like we have all the school data we need to perform the recommendations. Let's just clean up the DataFrame columns and save it as a pickle binary file for later use in another Jupyter notebook.
In [38]:
close_to_subway.rename(columns={'subway': 'subway_lines'}, inplace=True)
df_to_pickle = close_to_subway.loc[:, ('school_name', 'potential_college_students', 'latitude',
'longitude', 'start_time', 'end_time', 'subway_lines',
'city')]
df_to_pickle
Out[38]:
In [39]:
df_to_pickle.to_pickle('pickle/high_schools.p')
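In the follow-up notebook the DataFrame can then be loaded back with pandas, for example:

# read the pickled DataFrame back in the next notebook
high_schools = pd.read_pickle('pickle/high_schools.p')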