This notebook performs the following functions: it parses school names, DBN codes, and street addresses from the NYC public high school guide; geocodes each school (lat/lon, county, neighborhood) with the free OpenStreetMap Nominatim API; matches each school to its closest census tracts by centroid distance; computes quintiles for the graduation-outcome statistics; plots histograms of those statistics; and saves the combined dataframe to CSV.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import vincenty
from bs4 import BeautifulSoup
import os, string, requests, re, pickle
%matplotlib inline
bp_data = '/Users/bryanfry/projects/proj_nyc-schools/data_files' #basepath for input files
In [2]:
# Function to find all occurrences of the string 'Contact Information' in the doc.
# The indices of these lines are returned, and serve as starting points to locate
# addresses.
def locate_CI_lines (content):
    CI_line_indices = []
    for i, line in enumerate (content):
        if 'Contact Information' in line: CI_line_indices.append (i)
    return CI_line_indices
# Function to parse the information for a single public high school in the guide.
# Ingests 'content' (all text in the public school section of the guide) and the
# index of the 'Contact Information' line ('CI_line_index') for a single school.
def parse_school_info (content, CI_line_index):
    # Starting at the 'Contact Information' line, look BACKWARDS to find the first
    # previous line containing 'DBN'. This line holds both the school name and the
    # DBN code (the DBN is a unique ID code for the school).
    DBN_line_index = CI_line_index - 1
    while 'DBN' not in content [DBN_line_index]:
        DBN_line_index -= 1
    line = content [DBN_line_index]
    i = line.find ('DBN')
    name = line [0:i-5]
    DBN = line [i+4:-1]
    # Starting at the 'Contact Information' line, locate the next line containing
    # the string 'Address:'. This is not always a fixed number of lines after the
    # DBN line.
    addy_line_index = CI_line_index + 1
    while 'Address' not in content [addy_line_index]:
        addy_line_index += 1
    # Get street address and zipcode
    street = content [addy_line_index][9:-1]
    zipcode = content [addy_line_index+1][-6:-1]
    return name, DBN, street, zipcode
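# Toy sketch of the scanning logic above (the guide lines here are invented;
# the hard-coded slice offsets assume the real guide's spacing, hence the
# explicit 5-space pad between the school name and 'DBN'):
toy_content = ['SOME SCHOOL HS' + ' '*5 + 'DBN 01M001\n',
               'Contact Information\n',
               'Address: 123 Hypothetical St.\n',
               'New York, NY 10001\n']
ci = locate_CI_lines (toy_content)[0]  # -> 1
parse_school_info (toy_content, ci)
# -> ('SOME SCHOOL HS', '01M001', '123 Hypothetical St.', '10001')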
In [3]:
# Function to get the lat / lon and neighborhood, given street address and zipcode
# This uses the free openstreetmap API for geolocation!
# It typically takes ~ 4 min to work on all the schools.
def get_coords_and_hood (street, zipcode):
    zipcode = str (zipcode).zfill(5)
    try:
        street = street.replace (' ','+')
        s = 'http://nominatim.openstreetmap.org/search?format=xml'
        s = s + '&street=' + street
        s = s + '&postalcode=' + zipcode
        s = s + '&addressdetails=1'
        r = requests.get (s)
        soup = BeautifulSoup (r.content, 'lxml')
        lat = float (soup.place.attrs['lat'])
        lon = float (soup.place.attrs['lon'])
        county = soup.place.county.contents[0]
        hood = soup.place.neighbourhood.contents[0]
        display_name = soup.place.attrs['display_name']
    except:
        lat, lon, county, hood, display_name = None, None, None, None, None
    return lat, lon, county, hood, display_name
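# Example call (hypothetical address; requires network access, and the
# fields returned depend on Nominatim's answer for the query):
# get_coords_and_hood ('123 Hypothetical St', 10001)
# -> (lat, lon, county, hood, display_name), or all Nones on any failure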
# Given a specific input geocode and a dictionary of location strings ('tags'), each
# with lat/lon, this function returns the location in the dictionary closest to the
# input geocode. It uses 'vincenty distance' which accounts for global curvature to
# compute distance. (Probably a simpler and faster 'flat earth' calc. would work just as
# well given that all locations are within only a couple degrees.)
def find_closest_loc (loc_dict, loc_query):
    dist_dict = {}
    # Loop over the dictionary entries and compute the distance to each one,
    # populating the new dictionary dist_dict
    for k, loc in loc_dict.items():
        dist_dict[k] = vincenty (loc, loc_query).meters
    min_loc = min (dist_dict, key=dist_dict.get)
    min_dist = dist_dict [min_loc]
    return min_loc, min_dist
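# A minimal sketch of the simpler 'flat earth' calculation suggested above
# (an equirectangular approximation; the helper name is ours and is not used
# below). Over the couple-degree spans involved here it should rank neighbors
# the same as vincenty while being much faster.
import math
def flat_earth_dist_m (loc_a, loc_b):
    R = 6371000.0  # mean Earth radius in meters
    lat_a, lon_a = math.radians (loc_a[0]), math.radians (loc_a[1])
    lat_b, lon_b = math.radians (loc_b[0]), math.radians (loc_b[1])
    x = (lon_b - lon_a) * math.cos (0.5 * (lat_a + lat_b))
    y = lat_b - lat_a
    return R * math.hypot (x, y)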
# Given dictionary of tags and corresponding lat/lon tuples, find N locations in the dictionary
# closest to the input loc_query
def find_closest_n_loc (loc_dict, loc_query, n):
    dist_dict = {}
    for k, loc in loc_dict.items():
        dist_dict[k] = vincenty (loc, loc_query).meters
    loc_sorted = sorted (dist_dict, key=dist_dict.get)
    dist_sorted = sorted (dist_dict.values())
    return loc_sorted[0:n], dist_sorted[0:n]
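# Toy usage sketch for the two helpers above (the tags and coordinates are
# made up; the meter values are approximate and depend on geopy's vincenty):
toy_locs = {'A': (40.73, -73.99), 'B': (40.80, -73.96), 'C': (40.70, -74.01)}
find_closest_loc (toy_locs, (40.735, -73.99))       # -> ('A', ~550 m)
find_closest_n_loc (toy_locs, (40.735, -73.99), 2)  # -> (['A', 'C'], [~550, ~4200] m)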
In [4]:
# Function to remove % from percentage strings, and return a number
def proc_percent (s):
    try:
        return float (s.replace ('%', ''))
    except:
        return np.nan
# Function to add jitter -- this addresses problems of equal min / max for
# bins in the quintile calculation and in plotting histograms.
def add_jitter (x, amp = 0.00001):
    return x + (np.random.random (size = len (x)) * amp)
#########################################
# Split data into quintiles
# NOTE: qcut will return error if data has non-unique edges (ex. more than 20% of data is 0%)
# If qcut throws an error, we bypass this issue by adding a trivial amount of positive
# noise to each value. Cheesy but works fine.
def quantile (column, n_quantiles=5):
    try:
        q = pd.qcut (column, n_quantiles)
    except: # Error -- add a little noise
        column = add_jitter (column)
        q = pd.qcut (column, n_quantiles)
    return [i+1 for i in q.codes]
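# Quick illustration of the qcut failure mode handled above (made-up data;
# exact behavior varies with pandas version): with 40% of the values tied
# at zero, the quintile bin edges are non-unique, so a direct qcut fails
# while the jittered path succeeds.
tied = [0.0] * 40 + list (np.random.random (60))
# pd.qcut (tied, 5)  # raises ValueError: non-unique bin edges
quantile (tied)      # OK -- add_jitter breaks the ties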
########################################
# This function calculates quintiles for a set of desired columns for each
# school. It also combines the outcome dataframe with the nearest-census-tract
# set for each school by merging the two dataframes on DBN.
def combine_data (df_outcomes, df_tract, percent_col_list):
    df = df_outcomes[df_outcomes.Cohort == '2006'] # Limit to the 2006 cohort
    df = pd.merge (df_tract, df, how = 'inner', on = 'DBN') # Perform join on the DBNs
    for c in percent_col_list:
        df [c] = df [c].apply (proc_percent)
    # On each of the 'interesting' percent_col_list columns, compute the quintiles
    for c in percent_col_list:
        c_Q = 'Q_' + c
        df [c_Q] = quantile (df[c].tolist())
    return df
In [5]:
# This wraps the Matplotlib hist function to do NaN-removal, and add
# plot title and axis labels.
def plot_histogram (x, n_bins, title='', x_label='', y_label='', color = None):
    # First, use pandas or numpy to remove NaNs from the data. The
    # presence of NaN may cause the matplotlib histogram to fail.
    try:
        x = x.dropna() # Will work if x is a pandas DataFrame or Series
    except:
        x = np.array (x)[~np.isnan (x)] # Otherwise remove NaNs with numpy
    plt.figure()
    plt.hist (x, bins = n_bins, color = color)
    plt.title (title)
    plt.xlabel (x_label)
    plt.ylabel (y_label)
### TEST HISTOGRAM ###
x = np.random.normal(size = 1000)
plot_histogram (x, 20, 'Test - Dummy','X_Label','Y_Label', 'Maroon')
In [6]:
fp_hs_guide = os.path.join (bp_data, 'NY_Public_High_school_guide_FROM_PDF.txt')
fp_tract_centroids = os.path.join (bp_data, 'tract_centroids.txt')
#fp_hs_addresses = os.path.join (bp_data, 'HS_Addresses.csv')
with open (fp_hs_guide) as f:
    content = f.readlines()
CI_index_list = locate_CI_lines (content) # Find locations of the text 'Contact Information'
# Build list of physical addresses.
# Each list element is a tuple with name, DBN, street address, and zipcode.
school_loc_list = [parse_school_info(content, i) for i in CI_index_list]
In [7]:
school_geocode_list = [get_coords_and_hood (i[2], i[3]) for i in school_loc_list] # ~ 4 min
In [8]:
df_loc = pd.DataFrame()
df_loc['NAME'] = [i[0] for i in school_loc_list]
df_loc['DBN'] = [i[1] for i in school_loc_list]
df_loc['STREET'] = [i[2] for i in school_loc_list]
df_loc['ZIPCODE'] = [i[3] for i in school_loc_list]
df_loc['LAT'] = [i[0] for i in school_geocode_list]
df_loc['LON'] = [i[1] for i in school_geocode_list]
df_loc['COUNTY'] = [i[2] for i in school_geocode_list]
df_loc['HOOD'] = [i[3] for i in school_geocode_list]
df_loc['DISPLAY_NAME'] = [i[4] for i in school_geocode_list]
df_loc = df_loc.dropna()
Next, schools are matched to nearby census tracts. This is done by loading a file from the American Community Survey that contains the centroid (lat/lon) of each census tract. We then calculate the Vincenty distance from each school to the centroid of each tract, and take the tracts with the shortest distances. This method is not exact (it effectively assumes uniformly shaped tracts), but it is a good approximation and should always identify at least a CLOSE census tract. The code takes ~1 min to run (the Vincenty distance calculation is slow).
In [9]:
# Load the file from ACS with centroids for each tract
fp_tract_centroids = os.path.join (bp_data, 'tract_centroids.txt')
df_tracts = pd.read_csv (fp_tract_centroids)
df_tracts = df_tracts.drop ([df_tracts.columns[0]], axis = 1) # Drop first column, unused
# Build dictionary of GEOIDs (keys) and Lat / Lon tuples (values)
tract_dict = {df_tracts.GEOID[i]: (df_tracts.LAT[i], df_tracts.LON[i])
              for i in range (0, len (df_tracts))}
# Assign the n closest tracts to each school, based on centroid distances
tract_list, dist_list = [], [] # Each element in these lists will be another n-element list
n = 50 # Use the 50 closest tract centroids
for lat, lon, name in zip (df_loc.LAT.tolist(), df_loc.LON.tolist(), df_loc.NAME):
    loc_query = [lat, lon]
    tracts, dists = find_closest_n_loc (tract_dict, loc_query, n)
    tract_list.append (tracts) # Append an n-element list to the list-of-lists
    dist_list.append (dists)   # Append an n-element list to the list-of-lists
# Add the GEOIDs of the n closest tracts to the dataframe
tract_array = np.array (tract_list)
for i in range (n):
    df_loc ['GEOCODE' + str(i).zfill(2)] = tract_array[:,i]
df_loc.head()
Out[9]:
In [10]:
# Load file with school outcomes
df_sch_outcomes = pd.read_csv (os.path.join (bp_data, 'Graduation_Outcomes.csv'))
# List of columns given as percentages.
# We will compute quintiles for each of these.
percent_col_list = ['Total Grads - % of cohort',
                    'Total Regents - % of cohort',
                    'Total Regents - % of grads',
                    'Advanced Regents - % of cohort',
                    'Advanced Regents - % of grads',
                    'Regents w/o Advanced - % of cohort',
                    'Regents w/o Advanced - % of grads',
                    'Local - % of cohort',
                    'Local - % of grads',
                    'Still Enrolled - % of cohort',
                    'Dropped Out - % of cohort']
# Expand the dataframe to include quintiles for the 'interesting' school stats
df = combine_data (df_sch_outcomes[df_sch_outcomes.Demographic == 'Total Cohort'],
                   df_loc, percent_col_list)
# Some schools have no data; dropna() gets rid of them
df = df.dropna()
In [11]:
df.to_csv (os.path.join (bp_data, 'df_A_school_info.csv'))
In [12]:
for c in percent_col_list [0:6]:
    plot_histogram (df[c].tolist(), n_bins = 5, title = c,
                    x_label = 'Percent', y_label = '# Schools', color = 'Maroon')
In [13]:
for c in percent_col_list [6:]:
    plot_histogram (df[c].tolist(), n_bins = 5, title = c,
                    x_label = 'Percent', y_label = '# Schools', color = 'Maroon')
In [14]:
df.head()
Out[14]: