nyc-schools_A

This notebook performs the following functions:

  • Parses the NYC Public School Guide (2016) to find the physical addresses for each school
  • Uses the openstreetmap API to assign lat / lon values to each school, given addresses
  • Finds the 50 closest census tracts to each school, given the lat / lon coordinates of the centroids for each tract. The tract centroid file was obtained from the American Community Survey
  • Loads a spreadsheet file with statistics on each school, obtained from Open Data NYC. This file is merged with the info on nearby census tracts on the 'DBN' field (a unique identifier for each school)
  • Saves the merged file as a *.csv for later use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import vincenty  # NOTE: removed in geopy 2.x; geodesic is the modern replacement
from bs4 import BeautifulSoup
import os, string, requests, re, pickle
%matplotlib inline

bp_data = '/Users/bryanfry/projects/proj_nyc-schools/data_files' #basepath for input files

Functions to parse ID and physical address information from the NYC High School Directory


In [2]:
# Function to find all occurrences of the string 'Contact Information' in the doc.
# The indices of these lines are returned, and serve as starting point to locate
# addresses.
def locate_CI_lines (content):
    CI_line_indices = []
    for i, line in enumerate (content):
        if 'Contact Information' in line: CI_line_indices.append (i)
    return CI_line_indices


# Function parses the information for a single public high school in the guide.
# Ingests 'content' (all text in the public school section of the guide), and the 
# line number with the contact info ('ci') for a single school.

def parse_school_info (content, CI_line_index):
    # Starting at the 'Contact Information' line, look BACKWARDS to find the
    # nearest preceding line containing 'DBN'.  That line holds both the
    # school name and the DBN code (DBN is the unique ID code for a school).
    DBN_line_index = CI_line_index - 1
    while 'DBN' not in content[DBN_line_index]:
        DBN_line_index -= 1
    line = content[DBN_line_index]
    i = line.find ('DBN')
    name = line[0:i-5]
    DBN = line[i+4:-1]
    
    # Starting at the 'Contact' line, we need to locate the next line containing
    # string 'Address:'  This is not always a fixed # of lines after the
    # DBN line.
    addy_line_index = CI_line_index + 1  # Start scanning just after the 'Contact' line
    while not 'Address' in content[addy_line_index]:
        addy_line_index += 1
    #Get street address and zipcode
    street = content [addy_line_index][9:-1]
    zipcode = content [addy_line_index+1][-6:-1]
    
    return name, DBN, street, zipcode
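The offset-based slicing above depends on the guide's layout. On a hypothetical directory line (the name, DBN, and padding widths here are illustrative, not taken from the actual PDF text), the idiom works like this:

```python
# Hypothetical example of a line holding both a school name and its DBN code.
# The real guide pads the name with layout-specific whitespace, so the exact
# slice offsets there differ; .strip() is used here instead.
line = "Academy for Language and Technology     DBN 09X365\n"
i = line.find('DBN')
name = line[:i].strip()   # everything before the 'DBN' marker
dbn = line[i + 4:-1]      # skip past 'DBN ' and drop the trailing newline
```

The `[i+4:-1]` slice mirrors the parser above: four characters skip the marker and the separator, and `-1` trims the newline that `readlines()` keeps.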

Functions for geolocation


In [3]:
# Function to get the lat / lon and neighborhood, given street address and zipcode
# This uses the free openstreetmap API for geolocation!
# It typically takes ~ 4 min to work on all the schools.

def get_coords_and_hood (street, zipcode):
    zipcode = str (zipcode).zfill(5)
    try:
        street = street.replace (' ','+')
        s = 'http://nominatim.openstreetmap.org/search?format=xml'
        s = s + '&street=' + street
        s = s + '&postalcode=' + zipcode
        s = s + '&addressdetails=1'
        r = requests.get (s)
        soup = BeautifulSoup (r.content, 'lxml')
        lat = float (soup.place.attrs['lat'])
        lon = float (soup.place.attrs['lon'])
        county = soup.place.county.contents[0]
        hood = soup.place.neighbourhood.contents[0]
        display_name = soup.place.attrs['display_name']
    except Exception:
        lat, lon, county, hood, display_name = None, None, None, None, None
    return lat, lon, county, hood, display_name

# Given a specific input geocode and a dictionary of location strings ('tags'), each 
# with lat/lon, this function returns the location in the dictionary closest to the 
# input geocode.  It uses 'vincenty distance' which accounts for global curvature to
# compute distance.  (Probably a simpler and faster 'flat earth' calc. would work just as
# well given that all locations are within only a couple degrees.)
def find_closest_loc (loc_dict, loc_query):
    dist_dict = {}
    for k, loc in loc_dict.items():  # Loop over the dictionary entries and compute distance to each one, 
                                      # populating the new dictionary dist_dict
        dist_dict[k] = vincenty (loc, loc_query).meters    
    min_loc = min (dist_dict, key=dist_dict.get)
    min_dist = dist_dict [min_loc]
    return min_loc, min_dist


# Given dictionary of tags and corresponding lat/lon tuples, find N locations in the dictionary
# closest to the input loc_query
def find_closest_n_loc (loc_dict, loc_query, n):
    dist_dict = {}
    for k, loc in loc_dict.items():
        dist_dict[k] = vincenty (loc, loc_query).meters
    loc_sorted = sorted (dist_dict, key=dist_dict.get)
    dist_sorted = sorted (dist_dict.values())
    return loc_sorted[0:n], dist_sorted[0:n]
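The sort-keys-and-values pattern above can be exercised without geopy. The sketch below substitutes a plain Euclidean distance (a stand-in for vincenty, fine for a toy grid) to show that the two sorted lists stay aligned:

```python
import math

def find_closest_n_euclid(loc_dict, loc_query, n):
    # Same structure as find_closest_n_loc above, with Euclidean distance
    # standing in for vincenty.
    dist_dict = {k: math.dist(loc, loc_query) for k, loc in loc_dict.items()}
    loc_sorted = sorted(dist_dict, key=dist_dict.get)   # keys, nearest first
    dist_sorted = sorted(dist_dict.values())            # distances, ascending
    return loc_sorted[:n], dist_sorted[:n]

toy_tracts = {'A': (0.0, 0.0), 'B': (0.0, 3.0), 'C': (4.0, 0.0), 'D': (3.0, 4.0)}
locs, dists = find_closest_n_euclid(toy_tracts, (0.0, 0.0), 2)
```

Because both lists are sorted by the same distances, `dists[i]` is always the distance to `locs[i]` even though they are sorted separately (ties can permute equally distant keys, but the paired distances are identical).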

Functions to process school outcome data


In [4]:
# Function to remove % from percentage strings, and return a number
def proc_percent (s):
    try:
        return float (s.replace ('%', ''))
    except (ValueError, AttributeError):  # non-numeric string, or NaN
        return np.nan
    
# Function to add jitter -- this addresses problems of equal min/ max for 
# bins in quintile calculation and in plotting histograms.
def add_jitter (x, amp = 0.00001):
    return x + (np.random.random (size = len (x)) * amp)
    
#########################################

# Split data into quintiles
# NOTE: qcut will return error if data has non-unique edges (ex. more than 20% of data is 0%)
# If qcut throws an error, we bypass this issue by adding a trivial amount of positive
# noise to each value. Cheesy but works fine.
def quantile (column, n_quantiles=5):
    try:
        q = pd.qcut (column, n_quantiles)
    except ValueError:  # Non-unique bin edges -- add a little noise
        column = add_jitter (column)
        q = pd.qcut (column, n_quantiles)
    return [i+1 for i in q.codes]

########################################

# This function calculates quintiles for a set of desired columns on each
# school.  It also combines the outcome dataframe with the 50-nearest-census-tract
# set for each school by merging the two dataframes on DBN.

def combine_data (df_outcomes, df_tract, percent_col_list):
    df = df_outcomes[df_outcomes.Cohort == '2006']   # Limit to 2006
    df = pd.merge (df_tract, df, how = 'inner', on = 'DBN')      # Perform join on the DBNs

    for c in percent_col_list:
        df [c] = df [c].apply (proc_percent)    
    #On each of the 'interesting' percent_col_list, compute the quantiles
    for c in percent_col_list:
        c_Q = 'Q_' + c
        df [c_Q] = quantile (df[c].tolist())
    return df
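The jitter workaround above can be seen in action on data where more than a fifth of the values are tied, which makes pd.qcut's quintile edges non-unique (the sample data here is made up for illustration):

```python
import numpy as np
import pandas as pd

def add_jitter(x, amp=0.00001):
    # Same trick as above: a trivial amount of positive noise
    return x + np.random.random(size=len(x)) * amp

# Half the values are tied at 0, so the 20% and 40% quantile edges coincide
data = np.array([0., 0., 0., 0., 0., 1., 2., 3., 4., 5.])

try:
    pd.qcut(data, 5)
    needed_jitter = False
except ValueError:          # "Bin edges must be unique"
    needed_jitter = True

q = pd.qcut(add_jitter(data), 5)   # jittered values are all distinct
```

After jittering, all five quintile codes are populated. (Newer pandas also offers `pd.qcut(..., duplicates='drop')`, which merges the duplicate edges instead, but that yields fewer than five bins.)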

Functions for Visualization


In [5]:
# This wraps the Matplotlib hist function to do NaN-removal, and add
# plot title and axis labels.

def plot_histogram (x, n_bins, title='', x_label='', y_label='', color = None):
    # First, use pandas or numpy to remove NaNs from the data.  The
    # presence of NaN may cause the matplotlib histogram to fail.
    try:
        x = x.dropna()  # Will work if x is a pandas DataFrame or Series
    except AttributeError:
        x = np.array (x)[~np.isnan (x)] # Remove using numpy functions

    plt.figure()
    plt.hist (x, bins = n_bins, color = color)
    plt.title (title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
        
### TEST HISTOGRAM ###
x = np.random.normal(size = 1000)
plot_histogram (x, 20, 'Test - Dummy','X_Label','Y_Label', 'Maroon')


MAIN

First, read the Public School section of the NYC High School guide to find physical addresses for the schools. These will be used later for geolocation, assigning the schools to census tracts.


In [6]:
fp_hs_guide = os.path.join (bp_data, 'NY_Public_High_school_guide_FROM_PDF.txt')
fp_tract_centroids = os.path.join (bp_data, 'tract_centroids.txt')
#fp_hs_addresses = os.path.join (bp_data, 'HS_Addresses.csv')

with open(fp_hs_guide) as f:
    content = f.readlines()
CI_index_list = locate_CI_lines (content) # Find locations of the text 'Contact Information'

# Build list of physical addresses.
# Each list element is a tuple with name, DBN, street address, and zipcode.
school_loc_list = [parse_school_info(content, i) for i in CI_index_list]

Get geolocations for the schools


In [7]:
school_geocode_list = [get_coords_and_hood (i[2], i[3]) for i in school_loc_list] # ~ 4 min

Build a dataframe with the school location info (name, address, DBN, lat/lon, county, neighborhood)


In [8]:
df_loc = pd.DataFrame()
df_loc['NAME'] = [i[0] for i in school_loc_list]
df_loc['DBN']  = [i[1] for i in  school_loc_list]
df_loc['STREET'] = [i[2] for i in school_loc_list]
df_loc['ZIPCODE'] = [i[3] for i in school_loc_list]
df_loc['LAT'] = [i[0] for i in school_geocode_list]
df_loc['LON'] = [i[1] for i in school_geocode_list]
df_loc['COUNTY'] = [i[2] for i in school_geocode_list]
df_loc['HOOD'] = [i[3] for i in school_geocode_list]
df_loc['DISPLAY_NAME'] = [i[4] for i in school_geocode_list]
df_loc = df_loc.dropna()

Assign each school to a census tract

This is done by loading a file from the American Community Survey that contains the centroid (lat/lon) of each census tract. We then calculate the Vincenty distance from each school to the centroid of each tract, and keep the 50 tracts with the shortest distances. This method is not exact (it effectively assumes compact, uniformly shaped tracts), but it is pretty close and should always identify at least a CLOSE census tract. The code takes ~1 min to run (Vincenty distance is slow).
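The "flat earth" shortcut mentioned earlier can be sanity-checked in isolation. The sketch below compares a haversine great-circle distance (a spherical stand-in for Vincenty, not geopy's actual implementation) against an equirectangular approximation for two illustrative points a few kilometers apart in the Bronx; at this scale the two agree to well under a tenth of a percent:

```python
import math

R_EARTH_M = 6371000.0  # mean Earth radius, meters

def haversine_m(a, b):
    # Great-circle distance on a sphere -- a stand-in for Vincenty here.
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R_EARTH_M * math.asin(math.sqrt(h))

def equirect_m(a, b):
    # 'Flat earth' shortcut: scale longitude by cos(mean latitude),
    # then use plain Euclidean distance on the scaled offsets.
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return R_EARTH_M * math.hypot(x, y)

# Two illustrative points at NYC scale (a few km apart in the Bronx)
p1 = (40.8495, -73.9156)
p2 = (40.8882, -73.8527)
d_hav = haversine_m(p1, p2)
d_flat = equirect_m(p1, p2)
```

Since the nearest-tract ranking only needs relative distances, the cheap approximation would give the same orderings here; Vincenty is simply what the notebook happened to use.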


In [9]:
# Load the file from ACS with centroids for each tract
fp_tract_centroids = os.path.join (bp_data, 'tract_centroids.txt')
df_tracts = pd.read_csv (fp_tract_centroids)
df_tracts = df_tracts.drop ([df_tracts.columns[0]], axis = 1) # Drop first column, unused

# Build dictionary of GEOIDS (keys) and Lat / Lon tuples (values)
tract_dict = {df_tracts.GEOID[i]: (df_tracts.LAT[i], df_tracts.LON[i]) \
              for i in range (0,len(df_tracts))}


# Assign the 50 closest tracts to each school, based on centroid distances
tract_list, dist_list = [],[]  # Each element in these lists will be another n-element list
n = 50     # Use 50 closest tract centroids
for lat, lon, name in zip (df_loc.LAT.tolist(), df_loc.LON.tolist(), df_loc.NAME):
    loc_query = [lat, lon]
    tracts, dists = find_closest_n_loc (tract_dict, loc_query, n)
    tract_list.append (tracts)  # Append an n-element list to the list-of-lists
    dist_list.append (dists)    # Append an n-element list to the list-of-lists

# Add the GEOIDs of the n closest tracts to the dataframe, one column per rank
tract_array = np.array (tract_list)
for i in range (n):
    df_loc ['GEOCODE' + str(i).zfill(2)] = tract_array[:,i]
    
df_loc.head()


Out[9]:
NAME DBN STREET ZIPCODE LAT LON COUNTY HOOD DISPLAY_NAME GEOCODE00 ... GEOCODE40 GEOCODE41 GEOCODE42 GEOCODE43 GEOCODE44 GEOCODE45 GEOCODE46 GEOCODE47 GEOCODE48 GEOCODE49
0 Academy for Language and Technology 09X365 1700 Macombs Road 10453 40.849475 -73.915611 Bronx County Morris Heights Intermediate School X303 Leadership and Commun... 36005021502 ... 36005025500 36005023704 36005017902 36005037900 36005016500 36005023900 36005016700 36061027900 36061028300 36005017702
1 Academy for Scholarship and Entrepreneurship: ... 11X270 921 East 228th Street 10466 40.888215 -73.852720 Bronx County Wakefield 921, East 228th Street, Wakefield, Bronx, Bron... 36005040400 ... 36005037600 36005034800 36005042902 36005043100 36005035000 36005034400 36005034200 36005042901 36005034000 36005042300
3 Antonia Pantoja Preparatory Academy: A College... 08X376 1980 Lafayette Avenue 10473 40.820915 -73.855857 Bronx County Clason Point Adlai E Stevenson High School, 1980, Lafayette... 36005001600 ... 36005005400 36005020400 36005005200 36005016600 36005005600 36005024000 36005023800 36005011502 36005020000 36005006000
4 Archimedes Academy for Math, Science and Techn... 08X367 456 White Plains Road 10473 40.814996 -73.856041 Bronx County Clason Point Intermediate School 174, 456, White Plains Roa... 36005001600 ... 36005011502 36005016600 36005009300 36005020200 36005021800 36005005600 36005014400 36005021001 36005015200 36005020400
5 Astor Collegiate Academy 11X299 925 Astor Avenue 10469 40.859900 -73.860322 Bronx County Morris Park Christopher Columbus High School, 925, Astor A... 36005032400 ... 36005037600 36005023800 36005025600 36005038200 36005042500 36005038000 36005042901 36005031000 36005022000 36005039300

5 rows × 59 columns

Now we load and process the *.csv file with info on school outcomes


In [10]:
# Load file with school outcomes
df_sch_outcomes = pd.read_csv (os.path.join (bp_data, 'Graduation_Outcomes.csv'))


# list of columns given as percentages.
# We will compute quintiles for each of these.
percent_col_list = ['Total Grads - % of cohort',
                    'Total Regents - % of cohort',
                    'Total Regents - % of grads',
                    'Advanced Regents - % of cohort',
                    'Advanced Regents - % of grads',
                    'Regents w/o Advanced - % of cohort',
                    'Regents w/o Advanced - % of grads',
                    'Local - % of cohort',
                    'Local - % of grads',
                    'Still Enrolled - % of cohort',
                    'Dropped Out - % of cohort']

# Expand the dataframe to include quintiles on the 'interesting' school stats
df = combine_data (df_sch_outcomes[df_sch_outcomes.Demographic == 'Total Cohort'], \
                   df_loc, percent_col_list)

# There are some schools with no data.  dropna () to get rid of them
df = df.dropna()

In [11]:
df.to_csv (os.path.join (bp_data, 'df_A_school_info.csv'))

Plot the first half of the 'interesting' percentage school stats as histograms


In [12]:
for c in percent_col_list [0:6]:
    plot_histogram (df[c].tolist(), n_bins = 5, title = c, \
                    x_label = 'Percent', y_label = '# Schools', color = 'Maroon' )


Plot the histograms for the remainder of the percentage statistics


In [13]:
for c in percent_col_list [6:]:
    plot_histogram (df[c].tolist(), n_bins = 5, title = c, \
                    x_label = 'Percent', y_label = '# Schools', color = 'Maroon' )



In [14]:
df.head()


Out[14]:
NAME DBN STREET ZIPCODE LAT LON COUNTY HOOD DISPLAY_NAME GEOCODE00 ... Q_Total Regents - % of cohort Q_Total Regents - % of grads Q_Advanced Regents - % of cohort Q_Advanced Regents - % of grads Q_Regents w/o Advanced - % of cohort Q_Regents w/o Advanced - % of grads Q_Local - % of cohort Q_Local - % of grads Q_Still Enrolled - % of cohort Q_Dropped Out - % of cohort
0 Academy for Scholarship and Entrepreneurship: ... 11X270 921 East 228th Street 10466 40.888215 -73.852720 Bronx County Wakefield 921, East 228th Street, Wakefield, Bronx, Bron... 36005040400 ... 2 2 2 2 3 3 4 4 5 2
1 Astor Collegiate Academy 11X299 925 Astor Avenue 10469 40.859900 -73.860322 Bronx County Morris Park Christopher Columbus High School, 925, Astor A... 36005032400 ... 2 1 2 2 3 3 5 5 4 3
2 Banana Kelly High School 08X530 965 Longwood Avenue 10459 40.817601 -73.897985 Bronx County Melrose 965, Longwood Avenue, Melrose, Bronx, Bronx Co... 36005008500 ... 1 1 2 3 1 2 4 5 4 5
3 Belmont Preparatory High School 10X434 500 East Fordham Road 10458 40.859840 -73.888295 Bronx County Belmont 500, East Fordham Road, Belmont, Bronx, Bronx ... 36005038700 ... 4 5 2 2 5 5 1 1 4 4
4 Bronx Academy of Health Careers 11X290 800 East Gun Hill Road 10467 40.875549 -73.861423 Bronx County Williams Bridge 800, East Gun Hill Road, Williams Bridge, Bron... 36005037200 ... 3 4 3 3 4 4 2 2 3 5

5 rows × 92 columns

