High Schools dataset cleaning and exploration

In this notebook we will clean and explore the 2017 High Schools dataset by the NYC Department of Education.

Let's start by opening and examining it.


In [1]:
import pandas as pd

all_high_schools = pd.read_csv('data/DOE_High_School_Directory_2017.csv')
all_high_schools.shape


Out[1]:
(440, 453)

In [2]:
pd.set_option('display.max_columns', 453)
all_high_schools.head(3)


Out[2]:
dbn school_name boro overview_paragraph school_10th_seats academicopportunities1 academicopportunities2 academicopportunities3 academicopportunities4 academicopportunities5 ell_programs language_classes advancedplacement_courses diplomaendorsements neighborhood shared_space campus_name building_code location phone_number fax_number school_email website subway bus grades2018 finalgrades total_students start_time end_time addtl_info1 extracurricular_activities psal_sports_boys psal_sports_girls psal_sports_coed school_sports graduation_rate attendance_rate pct_stu_enough_variety college_career_rate pct_stu_safe girls boys pbat international specialized transfer ptech earlycollege geoeligibility school_accessibility_description prgdesc1 prgdesc2 prgdesc3 prgdesc4 prgdesc5 prgdesc6 prgdesc7 prgdesc8 prgdesc9 prgdesc10 directions1 directions2 directions3 directions4 directions5 directions6 directions7 directions8 directions9 directions10 requirement1_1 requirement1_2 requirement1_3 requirement1_4 requirement1_5 requirement1_6 requirement1_7 requirement1_8 requirement1_9 requirement1_10 requirement2_1 requirement2_2 requirement2_3 requirement2_4 requirement2_5 requirement2_6 requirement2_7 requirement2_8 requirement2_9 requirement2_10 requirement3_1 requirement3_2 requirement3_3 requirement3_4 requirement3_5 requirement3_6 requirement3_7 requirement3_8 requirement3_9 requirement3_10 requirement4_1 requirement4_2 requirement4_3 requirement4_4 requirement4_5 requirement4_6 requirement4_7 requirement4_8 requirement4_9 requirement4_10 requirement5_1 requirement5_2 requirement5_3 requirement5_4 requirement5_5 requirement5_6 requirement5_7 requirement5_8 requirement5_9 requirement5_10 requirement6_1 requirement6_2 requirement6_3 requirement6_4 requirement6_5 requirement6_6 requirement6_7 requirement6_8 requirement6_9 requirement6_10 requirement7_1 requirement7_2 requirement7_3 requirement7_4 requirement7_5 requirement7_6 requirement7_7 requirement7_8 requirement7_9 
requirement7_10 requirement8_1 requirement8_2 requirement8_3 requirement8_4 requirement8_5 requirement8_6 requirement8_7 requirement8_8 requirement8_9 requirement8_10 requirement9_1 requirement9_2 requirement9_3 requirement9_4 requirement9_5 requirement9_6 requirement9_7 requirement9_8 requirement9_9 requirement9_10 requirement10_1 requirement10_2 requirement10_3 requirement10_4 requirement10_5 requirement10_6 requirement10_7 requirement10_8 requirement10_9 requirement10_10 requirement11_1 requirement11_2 requirement11_3 requirement11_4 requirement11_5 requirement11_6 requirement11_7 requirement11_8 requirement11_9 requirement11_10 requirement12_1 requirement12_2 requirement12_3 requirement12_4 requirement12_5 requirement12_6 requirement12_7 requirement12_8 requirement12_9 requirement12_10 offer_rate1 offer_rate2 offer_rate3 offer_rate4 offer_rate5 offer_rate6 offer_rate7 offer_rate8 offer_rate9 offer_rate10 program1 program2 program3 program4 program5 program6 program7 program8 program9 program10 code1 code2 code3 code4 code5 code6 code7 code8 code9 code10 interest1 interest2 interest3 interest4 interest5 interest6 interest7 interest8 interest9 interest10 method1 method2 method3 method4 method5 method6 method7 method8 method9 method10 seats9ge1 seats9ge2 seats9ge3 seats9ge4 seats9ge5 seats9ge6 seats9ge7 seats9ge8 seats9ge9 seats9ge10 grade9gefilledflag1 grade9gefilledflag2 grade9gefilledflag3 grade9gefilledflag4 grade9gefilledflag5 grade9gefilledflag6 grade9gefilledflag7 grade9gefilledflag8 grade9gefilledflag9 grade9gefilledflag10 grade9geapplicants1 grade9geapplicants2 grade9geapplicants3 grade9geapplicants4 grade9geapplicants5 grade9geapplicants6 grade9geapplicants7 grade9geapplicants8 grade9geapplicants9 grade9geapplicants10 seats9swd1 seats9swd2 seats9swd3 seats9swd4 seats9swd5 seats9swd6 seats9swd7 seats9swd8 seats9swd9 seats9swd10 grade9swdfilledflag1 grade9swdfilledflag2 grade9swdfilledflag3 grade9swdfilledflag4 grade9swdfilledflag5 grade9swdfilledflag6 
grade9swdfilledflag7 grade9swdfilledflag8 grade9swdfilledflag9 grade9swdfilledflag10 grade9swdapplicants1 grade9swdapplicants2 grade9swdapplicants3 grade9swdapplicants4 grade9swdapplicants5 grade9swdapplicants6 grade9swdapplicants7 grade9swdapplicants8 grade9swdapplicants9 grade9swdapplicants10 seats1specialized seats2specialized seats3specialized seats4specialized seats5specialized seats6specialized applicants1specialized applicants2specialized applicants3specialized applicants4specialized applicants5specialized applicants6specialized appperseat1specialized appperseat2specialized appperseat3specialized appperseat4specialized appperseat5specialized appperseat6specialized seats101 seats102 seats103 seats104 seats105 seats106 seats107 seats108 seats109 seats1010 admissionspriority11 admissionspriority12 admissionspriority13 admissionspriority14 admissionspriority15 admissionspriority16 admissionspriority17 admissionspriority18 admissionspriority19 admissionspriority110 admissionspriority21 admissionspriority22 admissionspriority23 admissionspriority24 admissionspriority25 admissionspriority26 admissionspriority27 admissionspriority28 admissionspriority29 admissionspriority210 admissionspriority31 admissionspriority32 admissionspriority33 admissionspriority34 admissionspriority35 admissionspriority36 admissionspriority37 admissionspriority38 admissionspriority39 admissionspriority310 admissionspriority41 admissionspriority42 admissionspriority43 admissionspriority44 admissionspriority45 admissionspriority46 admissionspriority47 admissionspriority48 admissionspriority49 admissionspriority410 admissionspriority51 admissionspriority52 admissionspriority53 admissionspriority54 admissionspriority55 admissionspriority56 admissionspriority57 admissionspriority58 admissionspriority59 admissionspriority510 admissionspriority61 admissionspriority62 admissionspriority63 admissionspriority64 admissionspriority65 admissionspriority66 admissionspriority67 admissionspriority68 
admissionspriority69 admissionspriority610 admissionspriority71 admissionspriority72 admissionspriority73 admissionspriority74 admissionspriority75 admissionspriority76 admissionspriority77 admissionspriority78 admissionspriority79 admissionspriority710 eligibility1 eligibility2 eligibility3 eligibility4 eligibility5 eligibility6 eligibility7 eligibility8 eligibility9 eligibility10 auditioninformation1 auditioninformation2 auditioninformation3 auditioninformation4 auditioninformation5 auditioninformation6 auditioninformation7 auditioninformation8 auditioninformation9 auditioninformation10 common_audition1 common_audition2 common_audition3 common_audition4 common_audition5 common_audition6 common_audition7 common_audition8 common_audition9 common_audition10 grade9geapplicantsperseat1 grade9geapplicantsperseat2 grade9geapplicantsperseat3 grade9geapplicantsperseat4 grade9geapplicantsperseat5 grade9geapplicantsperseat6 grade9geapplicantsperseat7 grade9geapplicantsperseat8 grade9geapplicantsperseat9 grade9geapplicantsperseat10 grade9swdapplicantsperseat1 grade9swdapplicantsperseat2 grade9swdapplicantsperseat3 grade9swdapplicantsperseat4 grade9swdapplicantsperseat5 grade9swdapplicantsperseat6 grade9swdapplicantsperseat7 grade9swdapplicantsperseat8 grade9swdapplicantsperseat9 grade9swdapplicantsperseat10 primary_address_line_1 city zip state_code
0 31R455 Tottenville High School R Tottenville High School has long been recogniz... 1.0 Institute Programs: Science, Classics/Humanities Honors Program for grades 9-12 ROTC Program CTE Programs Visual and Performing Arts Programs English as a New Language Italian, Latin, Spanish AP Biology, AP Calculus, AP Chemistry, AP Engl... CTE, Math, Science Annadale-Huguenot-Princes Bay Yes NaN R455 100 Luten Avenue, Staten Island NY 10312 (40.5... 718-668-8800 718-317-0962 31R455@schools.nyc.gov www.tottenvillehs.net SIR to Huguenot S55, S56, S59, S78, X17, X17J, X19 9-12 9-12 3907 8am 2:20pm College Trips; Community Service Expected; Hea... Art, Asian American Assoc., Auto, Book, Broadw... Baseball, Basketball, Bowling, Cross Country, ... Badminton, Basketball, Bowling, Cross Country,... Golf Cheerleading, Dance Team 0.860 0.91 0.85 0.690 0.80 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 The Marine Corps Junior Reserve Officers’ Trai... This program is designed to challenge those st... This program is designed to challenge students... Academic Comprehensive Program NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Course Grades: English (90-100), Math (90-100)... Course Grades: English (92-100), Math (93-100)... NaN NaN NaN NaN NaN NaN NaN NaN Course Grade: Language Other Than English Course Grade: Language Other Than English NaN NaN NaN NaN NaN NaN NaN NaN Standardized Test Scores: English Language Art... Standardized Test Scores: English Language Art... 
NaN NaN NaN NaN NaN NaN NaN NaN Attendance and Punctuality Attendance and Punctuality NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN —100% of offers went to this group —100% of offers went to this group —100% of offers went to this group NaN NaN NaN NaN NaN NaN NaN Marine Corps Junior Reserve Officers’ Training... Classics Institute Science Institute Zoned NaN NaN NaN NaN NaN NaN R30B R30C R30S R30Z NaN NaN NaN NaN NaN NaN JROTC Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN NaN Ed. Opt. Screened Screened Zoned Guarantee NaN NaN NaN NaN NaN NaN 55.0 55.0 55.0 NaN NaN NaN NaN NaN NaN NaN Y Y Y NaN NaN NaN NaN NaN NaN NaN 190.0 736.0 955.0 NaN NaN NaN NaN NaN NaN NaN 13.0 13.0 13.0 NaN NaN NaN NaN NaN NaN NaN Y N N NaN NaN NaN NaN NaN NaN NaN 66.0 23.0 29.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN No No No Yes NaN NaN NaN NaN NaN NaN Priority to Staten Island students or residents Priority to Staten Island students or residents Priority to Staten Island students or residents Guaranteed offer to students who apply and liv... 
NaN NaN NaN NaN NaN NaN Then to New York City residents Then to New York City residents Then to New York City residents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 13.0 17.0 NaN NaN NaN NaN NaN NaN NaN 5.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN 100 Luten Avenue Staten Island 10312 NY
1 30Q450 Long Island City High School Q Our academically and aesthetically enriched pr... 1.0 CTE program(s) in: Hospitality and Tourism iLearnNYC: Program for expanded online coursew... Eight Small Learning Communities: Culinary; Re... Future Educators; Sports Medicine; Broadway Pr... Art, Music, Computers, Theater, Technology; CU... English as a New Language; Dual Language: Span... Chinese (Mandarin), French, Greek, Italian, Sp... AP Art History, AP Biology, AP Calculus, AP Ch... Arts, CTE, Math Astoria Yes NaN Q452 14-30 Broadway, Astoria NY 11106 (40.765881, -... 718-545-7095 718-545-2980 licinfo@schools.nyc.gov schools.nyc.gov/SchoolPortals/30/Q450 N, Q to Broadway Q100, Q102, Q103, Q104, Q18, Q19, Q66, Q69 9-12 9-12 2077 8:29am 3:50pm Community School; Summer Orientation Junior Reserve Officers' Training Corps (JROTC... Baseball, Basketball, Bowling, Cross Country, ... Basketball, Bowling, Fencing, Golf, Gymnastics... Cricket, Golf NaN 0.648 0.80 0.89 0.481 0.83 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 Offers a four-year sequence in culinary traini... In partnership with the National Academy Found... Students in this program complete sequenced co... Students explore ethical, moral, cultural, and... Students interested in being a part of Orchest... Students develop skills for careers in interna... Dual Language programs are designed to integra... . 
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Standardized Test Scores: Math (1.8-4.5) Standardized Test Scores: Math (1.8-4.5) Demonstrated Interest: School Visit Demonstrated Interest: School Visit Demonstrated Interest: School Visit Demonstrated Interest: School Visit NaN NaN NaN NaN Attendance and Punctuality Attendance and Punctuality NaN NaN NaN NaN NaN NaN NaN NaN Demonstrated Interest: School Visit Demonstrated Interest: School Visit NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN —42% of offers went to this group NaN NaN NaN NaN Culinary Institute Academy of Restaurant and Hotel Management Sports Medicine & Health Sciences Community & Culture Broadway Productions Global Languages Academy Dual Language Spanish Program Zoned NaN NaN Q29J Q29M Q29N Q29P Q29Q Q29R Q29S Q29Z NaN NaN Culinary Arts Hospitality, Travel and Tourism Health Professions Law & Government Performing Arts/Visual Art & Design Hospitality, Travel and Tourism Humanities & Interdisciplinary Zoned NaN NaN Screened Screened Ed. Opt. Ed. Opt. Ed. Opt. Ed. Opt. 
Screened: Language Zoned Guarantee NaN NaN 127.0 86.0 127.0 58.0 64.0 42.0 58.0 NaN NaN NaN N N Y N Y N N NaN NaN NaN 341.0 115.0 538.0 311.0 256.0 131.0 41.0 NaN NaN NaN 23.0 16.0 23.0 10.0 11.0 8.0 10.0 NaN NaN NaN N N N N N Y N NaN NaN NaN 74.0 28.0 105.0 63.0 60.0 34.0 15.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes-15 Yes-20 Yes-20 Yes-20 Yes-20 Yes-20 No Yes NaN NaN Open to New York City residents Open to New York City residents Open to New York City residents Open to New York City residents Open to New York City residents Priority given to New York City residents who ... Priority to students whose have been in a Dual... Guaranteed offer to students who apply and liv... NaN NaN NaN NaN NaN NaN NaN Then to New York City residents Then to students who have been in a Transition... NaN NaN NaN NaN NaN NaN NaN NaN NaN Then to New York City residents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 1.0 4.0 5.0 4.0 3.0 1.0 NaN NaN NaN 3.0 2.0 5.0 6.0 5.0 4.0 2.0 NaN NaN NaN 14-30 Broadway Astoria 11106 NY
2 30Q445 William Cullen Bryant High School Q Our school is dedicated to ensuring that all s... 1.0 CTE program(s) in: Business, Management & Admi... iLearnNYC: Program for expanded online coursew... Accounting, Entrepreneurship and Marketing, Ar... Environmental Science, Executive Internship, G... Intel Math/Science Research, Jazz Band, Law & ... English as a New Language; Transitional Biling... French, Greek, Italian, Spanish AP Biology, AP Calculus, AP Chemistry, AP Phys... Math, Science Astoria NaN NaN Q445 48-10 31st Avenue, Astoria NY 11103 (40.757072... 718-721-5404 718-728-3478 jnarvaez@schools.nyc.gov www.wcbryanths.org M, R to 46th St Q101, Q104, Q18, Q66 9-12 9-12 2437 8:50am 3:29pm Internships; Saturday Programs Arista – National Honor Society, Afro-Cuban Ha... Baseball, Basketball, Bowling, Cross Country, ... Basketball, Bowling, Cross Country, Flag Footb... Cricket, Golf NaN 0.683 0.88 0.83 0.522 0.76 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Students are introduced to forensic and legal ... Students will graduate with the life skills ne... We firmly believe that knowledge of what was, ... Focuses on students’ development of math and s... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Course Grades: English (65-100), Social Studie... NaN NaN Course Grades: English (65-100), Math (65-100)... NaN NaN NaN NaN NaN NaN Standardized Test Scores: English Language Art... NaN NaN Standardized Test Scores: English Language Art... 
NaN NaN NaN NaN NaN NaN Attendance and Punctuality NaN NaN Attendance and Punctuality NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN —99% of offers went to this group —98% of offers went to this group —92% of offers went to this group —96% of offers went to this group NaN NaN NaN NaN NaN NaN Forensic Science and Law Academy Business and Technology Institute Arts and Humanities Institute Math and Science Academy Zoned NaN NaN NaN NaN NaN Q15A Q15B Q15C Q15J Q15Z NaN NaN NaN NaN NaN Law & Government Business Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN Screened Ed. Opt. Ed. Opt. Screened Zoned Guarantee NaN NaN NaN NaN NaN 85.0 42.0 42.0 85.0 NaN NaN NaN NaN NaN NaN N Y N N NaN NaN NaN NaN NaN NaN 272.0 253.0 158.0 213.0 NaN NaN NaN NaN NaN NaN 15.0 8.0 8.0 15.0 NaN NaN NaN NaN NaN NaN N Y N N NaN NaN NaN NaN NaN NaN 40.0 60.0 32.0 22.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN No No No Yes-20 Yes NaN NaN NaN NaN NaN Priority to Queens students or residents Priority to Queens students or residents Priority to Queens students or residents Priority to Queens students or residents Guaranteed offer to students who apply and liv... 
NaN NaN NaN NaN NaN Then to New York City residents Then to New York City residents Then to New York City residents Then to New York City residents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 6.0 4.0 3.0 NaN NaN NaN NaN NaN NaN 3.0 8.0 4.0 1.0 NaN NaN NaN NaN NaN NaN 48-10 31st Avenue Astoria 11103 NY

As you can see, there are over 400 columns, so let's keep only the columns of interest.

Also notice that the boys column is a flag for boys-only schools. Since we are trying to help solve the problem of women's underrepresentation in tech, it wouldn't make sense to keep those schools, so let's filter them out.
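Before filtering on a flag column, it can be worth checking how the flag is encoded. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from the directory) and assumes the flag is 1 for flagged schools and NaN otherwise, which is what the preview above suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the directory: 1.0 marks a boys-only school,
# NaN means the school is not flagged.
toy = pd.DataFrame({
    "school_name": ["A", "B", "C", "D"],
    "boys": [np.nan, 1.0, np.nan, np.nan],
})

# Count every value, including missing ones, to see how the flag is encoded.
flag_counts = toy["boys"].value_counts(dropna=False)
print(flag_counts)

# Comparing NaN with 1 gives False, so unflagged schools survive the filter.
boys_only = toy["boys"] == 1
kept = toy[~boys_only]
print(kept)
```

Because the comparison with NaN is False, negating the mask keeps every school that is not explicitly flagged.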


In [3]:
boys_only = all_high_schools['boys'] == 1

columns_of_interest = ['dbn', 'school_name', 'boro', 'academicopportunities1', 
                       'academicopportunities2', 'academicopportunities3',
                       'academicopportunities4', 'academicopportunities5', 'neighborhood', 
                       'location', 'subway', 'bus', 'total_students', 'start_time', 'end_time',
                       'graduation_rate', 'attendance_rate', 'pct_stu_enough_variety',
                       'college_career_rate', 'girls', 'specialized', 'earlycollege',
                       'program1', 'program2', 'program3', 'program4', 'program5', 'program6',
                       'program7', 'program8', 'program9', 'program10', 'interest1',
                       'interest2', 'interest3', 'interest4', 'interest5', 'interest6',
                       'interest7', 'interest8', 'interest9', 'interest10', 'city', 'zip']

# Filter rows and pick columns in a single .loc call, then index by dbn.
df = all_high_schools.loc[~boys_only, columns_of_interest].set_index('dbn')
df.shape


Out[3]:
(436, 43)

In [4]:
df.head(3)


Out[4]:
school_name boro academicopportunities1 academicopportunities2 academicopportunities3 academicopportunities4 academicopportunities5 neighborhood location subway bus total_students start_time end_time graduation_rate attendance_rate pct_stu_enough_variety college_career_rate girls specialized earlycollege program1 program2 program3 program4 program5 program6 program7 program8 program9 program10 interest1 interest2 interest3 interest4 interest5 interest6 interest7 interest8 interest9 interest10 city zip
dbn
31R455 Tottenville High School R Institute Programs: Science, Classics/Humanities Honors Program for grades 9-12 ROTC Program CTE Programs Visual and Performing Arts Programs Annadale-Huguenot-Princes Bay 100 Luten Avenue, Staten Island NY 10312 (40.5... SIR to Huguenot S55, S56, S59, S78, X17, X17J, X19 3907 8am 2:20pm 0.860 0.91 0.85 0.690 NaN NaN NaN Marine Corps Junior Reserve Officers’ Training... Classics Institute Science Institute Zoned NaN NaN NaN NaN NaN NaN JROTC Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN NaN Staten Island 10312
30Q450 Long Island City High School Q CTE program(s) in: Hospitality and Tourism iLearnNYC: Program for expanded online coursew... Eight Small Learning Communities: Culinary; Re... Future Educators; Sports Medicine; Broadway Pr... Art, Music, Computers, Theater, Technology; CU... Astoria 14-30 Broadway, Astoria NY 11106 (40.765881, -... N, Q to Broadway Q100, Q102, Q103, Q104, Q18, Q19, Q66, Q69 2077 8:29am 3:50pm 0.648 0.80 0.89 0.481 NaN NaN NaN Culinary Institute Academy of Restaurant and Hotel Management Sports Medicine & Health Sciences Community & Culture Broadway Productions Global Languages Academy Dual Language Spanish Program Zoned NaN NaN Culinary Arts Hospitality, Travel and Tourism Health Professions Law & Government Performing Arts/Visual Art & Design Hospitality, Travel and Tourism Humanities & Interdisciplinary Zoned NaN NaN Astoria 11106
30Q445 William Cullen Bryant High School Q CTE program(s) in: Business, Management & Admi... iLearnNYC: Program for expanded online coursew... Accounting, Entrepreneurship and Marketing, Ar... Environmental Science, Executive Internship, G... Intel Math/Science Research, Jazz Band, Law & ... Astoria 48-10 31st Avenue, Astoria NY 11103 (40.757072... M, R to 46th St Q101, Q104, Q18, Q66 2437 8:50am 3:29pm 0.683 0.88 0.83 0.522 NaN NaN NaN Forensic Science and Law Academy Business and Technology Institute Arts and Humanities Institute Math and Science Academy Zoned NaN NaN NaN NaN NaN Law & Government Business Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN Astoria 11103

Let's now make a quick comparison of college_career_rate in girls-only schools vs mixed ones.
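A numeric summary can complement a plot for this kind of comparison. The following is only a sketch on made-up numbers (the `toy` frame and its values are hypothetical); it assumes the same encoding as the girls column, where 1 marks a girls-only school and NaN a mixed one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1.0 marks a girls-only school, NaN a mixed one.
toy = pd.DataFrame({
    "girls": [1.0, np.nan, np.nan, 1.0, np.nan],
    "college_career_rate": [0.80, 0.50, 0.70, 0.60, np.nan],
})

# Turn the flag into a readable label for grouping.
school_type = toy["girls"].map({1: "Girls-only"}).fillna("Mixed")

# Count and median per group; missing rates are ignored by both aggregations.
summary = toy.groupby(school_type)["college_career_rate"].agg(["count", "median"])
print(summary)
```

Looking at the count next to the median helps judge how much weight a small group such as girls-only schools can carry.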


In [5]:
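# Dummy column with a single value, used as the x-axis category in the violin plots.
df['all'] = ""
# Label girls-only schools and treat every other school as mixed.
df['girls'] = df['girls'].map({1: 'Girls-only'}).fillna('Mixed')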

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", color_codes=True)
ax = sns.violinplot(data=df, x='all', y="college_career_rate", hue="girls", split=True)
sns.despine(left=True)

ax.set_xlabel("")
plt.suptitle('College Career Rate by Type of School (Mixed or Girls-only)')

plt.savefig('figures/girls-only.png', bbox_inches='tight')


Notice that there are 5 columns on "academic opportunities", 10 columns on "programs", and 10 more columns on "interests", and a school might mention something tech-related in any of them. For each school, let's check whether tech-related words appear in any of those areas, and call the result "tech inclination".


In [7]:
import numpy as np

def contains_terms(column_name, terms=("tech",)):
    """Checks if at least one of the terms is present in the given column."""
    # One case-insensitive match result per term.
    contains = [df[column_name].str.contains(term, case=False) for term in terms]

    # str.contains returns NaN for missing values, so mask them out explicitly.
    not_null = df[column_name].notnull()
    return not_null & np.any(contains, axis=0)
    

def contains_terms_columns(column_root, n_columns, terms=["tech"]):
    """Checks if at least one of the terms is present in the columns given by its root name."""
    if n_columns == 1:
        return contains_terms(column_root, terms)
    
    tech = []
    for i in range(n_columns):
        column_name = column_root + str(i + 1)
        tech.append(contains_terms(column_name, terms))
    
    return np.any(tech, axis=0)
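The same check can also be written without explicit loops. The sketch below is an alternative, not the version used in this notebook: `any_column_contains` and the `toy` frame are hypothetical names, and unlike the functions above it escapes each term, so terms are matched literally rather than as regular expressions.

```python
import re
import pandas as pd

def any_column_contains(frame, column_root, n_columns, terms):
    """Returns True for rows where any numbered column contains any term."""
    # A single alternation pattern; re.escape makes every term a literal match.
    pattern = "|".join(re.escape(term) for term in terms)
    columns = [f"{column_root}{i + 1}" for i in range(n_columns)]
    # na=False turns missing values into False instead of NaN.
    matches = frame[columns].apply(
        lambda col: col.str.contains(pattern, case=False, na=False)
    )
    return matches.any(axis=1)

# Hypothetical data with the same column naming scheme as the directory.
toy = pd.DataFrame({
    "program1": ["Computer Science Academy", "Culinary Institute", None],
    "program2": [None, "Web Design", None],
})
mask = any_column_contains(toy, "program", 2, ["computer", "web"])
print(mask)
```

Passing the frame as an argument instead of relying on the global df also makes the helper easier to try on small examples like this one.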

In [8]:
tech_academicopportunities = contains_terms_columns('academicopportunities', 5, 
                                                    terms=['technology', 'computer', 'web', 
                                                           'programming', 'coding'])
len(df[tech_academicopportunities])


Out[8]:
177

In [9]:
# 'tech' also matches 'technical', so exclude schools whose programs mention 'technical'
all_tech_program = contains_terms_columns('program', 10, terms=['programming', 'computer', 
                                                                'tech'])
technical_program = contains_terms_columns('program', 10, terms=['technical'])
tech_program = (all_tech_program) & ~(technical_program)
len(df[tech_program])


Out[9]:
71
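Subtracting every school that mentions 'technical' also drops a school whose programs mention both 'technical' and, say, 'computer'. One possible alternative, shown here only as a sketch on made-up program names, is a negative lookahead that matches 'tech' unless it is followed by 'nical'. Applying it to the real data would likely change the count above, and the object dtype is an assumption made so that matching goes through Python's `re` module, which supports lookaheads.

```python
import pandas as pd

# Hypothetical program names; object dtype so Python's re engine does the matching.
programs = pd.Series([
    "Business and Technology Institute",
    "Career and Technical Education",
    "Technical Theater and Computer Science",
    None,
], dtype=object)

# 'tech' not followed by 'nical', or the words 'computer' / 'programming'.
pattern = r"tech(?!nical)|computer|programming"
is_tech = programs.str.contains(pattern, case=False, na=False)
print(is_tech)
```

With this pattern the third entry still counts as tech-related because of 'Computer', while the second one does not.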

In [10]:
tech_interest = contains_terms_columns('interest', 10, terms=['computer', 'technology'])
len(df[tech_interest])


Out[10]:
43

In [11]:
tech_inclined = (tech_academicopportunities) | (tech_program) | (tech_interest)
print(len(df[tech_inclined]))
print("{:.1f}%".format(100 * len(df[tech_inclined]) / len(df)))


200
45.9%

Since 46% of schools are tech-inclined, and our assumption here was that 200 high schools would be enough, let's use only tech-inclined schools going forward. It could help the canvassing team to talk to female students from schools that have some tech inclination.

However, let's first see how schools compare with each other when we take tech inclination into account.


In [12]:
df['tech_academicopportunities'] = tech_academicopportunities.astype(int)
df['tech_program'] = tech_program.astype(int)
df['tech_interest'] = tech_interest.astype(int)
df.head(3)


Out[12]:
school_name boro academicopportunities1 academicopportunities2 academicopportunities3 academicopportunities4 academicopportunities5 neighborhood location subway bus total_students start_time end_time graduation_rate attendance_rate pct_stu_enough_variety college_career_rate girls specialized earlycollege program1 program2 program3 program4 program5 program6 program7 program8 program9 program10 interest1 interest2 interest3 interest4 interest5 interest6 interest7 interest8 interest9 interest10 city zip all tech_academicopportunities tech_program tech_interest
dbn
31R455 Tottenville High School R Institute Programs: Science, Classics/Humanities Honors Program for grades 9-12 ROTC Program CTE Programs Visual and Performing Arts Programs Annadale-Huguenot-Princes Bay 100 Luten Avenue, Staten Island NY 10312 (40.5... SIR to Huguenot S55, S56, S59, S78, X17, X17J, X19 3907 8am 2:20pm 0.860 0.91 0.85 0.690 Mixed NaN NaN Marine Corps Junior Reserve Officers’ Training... Classics Institute Science Institute Zoned NaN NaN NaN NaN NaN NaN JROTC Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN NaN Staten Island 10312 0 0 0
30Q450 Long Island City High School Q CTE program(s) in: Hospitality and Tourism iLearnNYC: Program for expanded online coursew... Eight Small Learning Communities: Culinary; Re... Future Educators; Sports Medicine; Broadway Pr... Art, Music, Computers, Theater, Technology; CU... Astoria 14-30 Broadway, Astoria NY 11106 (40.765881, -... N, Q to Broadway Q100, Q102, Q103, Q104, Q18, Q19, Q66, Q69 2077 8:29am 3:50pm 0.648 0.80 0.89 0.481 Mixed NaN NaN Culinary Institute Academy of Restaurant and Hotel Management Sports Medicine & Health Sciences Community & Culture Broadway Productions Global Languages Academy Dual Language Spanish Program Zoned NaN NaN Culinary Arts Hospitality, Travel and Tourism Health Professions Law & Government Performing Arts/Visual Art & Design Hospitality, Travel and Tourism Humanities & Interdisciplinary Zoned NaN NaN Astoria 11106 1 0 0
30Q445 William Cullen Bryant High School Q CTE program(s) in: Business, Management & Admi... iLearnNYC: Program for expanded online coursew... Accounting, Entrepreneurship and Marketing, Ar... Environmental Science, Executive Internship, G... Intel Math/Science Research, Jazz Band, Law & ... Astoria 48-10 31st Avenue, Astoria NY 11103 (40.757072... M, R to 46th St Q101, Q104, Q18, Q66 2437 8:50am 3:29pm 0.683 0.88 0.83 0.522 Mixed NaN NaN Forensic Science and Law Academy Business and Technology Institute Arts and Humanities Institute Math and Science Academy Zoned NaN NaN NaN NaN NaN Law & Government Business Humanities & Interdisciplinary Science & Math Zoned NaN NaN NaN NaN NaN Astoria 11103 0 1 0

In [13]:
def fill_tech_summary(academicopportunities, program, interest):
    if academicopportunities:
        if program:
            if interest:
                return 'tech_academicopportunities+program+interest'
            else:
                return 'tech_academicopportunities+program'
            
        elif interest:
            return 'tech_academicopportunities+interest'
        else:
            return 'tech_academicopportunities'
    elif program:
        if interest:
            return 'tech_program+interest'
        else:
            return 'tech_program'
    elif interest:
        return 'tech_interest'
    else:
        return 'no_tech_inclination'

In [14]:
df['tech_summary'] = df.apply(lambda x: fill_tech_summary(x.loc['tech_academicopportunities'],
                                                          x.loc['tech_program'],
                                                          x.loc['tech_interest']), 
                              axis='columns')
df['tech_summary'].head()


Out[14]:
dbn
31R455           no_tech_inclination
30Q450    tech_academicopportunities
30Q445                  tech_program
30Q501    tech_academicopportunities
26Q430           no_tech_inclination
Name: tech_summary, dtype: object
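The nested conditions can also be expressed by joining the names of the flags that are set. This is a sketch of an equivalent formulation (`tech_summary_label` is a hypothetical name); it is meant to produce the same eight labels as fill_tech_summary.

```python
def tech_summary_label(academicopportunities, program, interest):
    """Builds the summary label from the three tech flags."""
    names = ["academicopportunities", "program", "interest"]
    flags = [academicopportunities, program, interest]
    # Keep the names of the flags that are set, in a fixed order.
    present = [name for name, flag in zip(names, flags) if flag]
    if not present:
        return "no_tech_inclination"
    return "tech_" + "+".join(present)

print(tech_summary_label(1, 0, 1))
print(tech_summary_label(0, 0, 0))
```

Keeping the order of the names fixed matters, since the labels are later listed explicitly in the plot's hue_order.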

In [15]:
fig, ax = plt.subplots(figsize=(20, 10))

ax = sns.violinplot(data=df, x='all', y="college_career_rate", hue="tech_summary", ax=ax,
                    hue_order=['no_tech_inclination', 'tech_interest', 'tech_program', 
                               'tech_academicopportunities', 'tech_program+interest',
                               'tech_academicopportunities+program',
                               'tech_academicopportunities+interest', 
                               'tech_academicopportunities+program+interest'])
sns.despine(left=True)
ax.set_xlabel("")
plt.suptitle('College Career Rate by Types of Tech Inclination')

plt.savefig('figures/types-tech-inclination.png', bbox_inches='tight')



In [16]:
def fill_tech_summary_compact(academicopportunities, program, interest):
    if academicopportunities or program or interest:
        return 'tech_inclined'
    else:
        return 'not_tech_inclined'

In [17]:
df['tech_summary_compact'] = df.apply(lambda x: fill_tech_summary_compact(
                                                          x.loc['tech_academicopportunities'],
                                                          x.loc['tech_program'],
                                                          x.loc['tech_interest']), 
                              axis='columns')
df['tech_summary_compact'].head()


Out[17]:
dbn
31R455    not_tech_inclined
30Q450        tech_inclined
30Q445        tech_inclined
30Q501        tech_inclined
26Q430    not_tech_inclined
Name: tech_summary_compact, dtype: object

In [18]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

ax1 = sns.violinplot(data=df, x='all', y="college_career_rate", hue="tech_summary_compact",
                     split=True, inner="quartile", ax=ax1)

ax2 = sns.violinplot(data=df, x='all', y="total_students", hue="tech_summary_compact", 
                     split=True, inner="quartile", ax=ax2)

ax1.set_xlabel("")
ax2.set_xlabel("")
sns.despine(left=True)
plt.suptitle('College Career Rate and Total Students by Tech Inclination')

plt.savefig('figures/breakdown-tech-inclination.png', bbox_inches='tight')


We can see from the violin plots above that even though tech inclined high schools have a slightly higher median college career rate, their 25th and 75th percentiles are slightly lower. On the other hand, most high schools with 1500 or more students seem to have some kind of tech inclination.
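
A reading taken from a violin plot can be cross-checked numerically. The following is only a minimal sketch on a small made-up frame: the values are invented for illustration, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy data standing in for df; the numbers are invented for illustration only.
toy = pd.DataFrame({
    'tech_summary_compact': ['tech_inclined'] * 4 + ['not_tech_inclined'] * 4,
    'college_career_rate': [0.40, 0.50, 0.60, 0.70, 0.30, 0.50, 0.70, 0.90],
})

# 25th, 50th and 75th percentiles of the rate for each group,
# reshaped so that each group is a row and each percentile a column.
quartiles = (toy.groupby('tech_summary_compact')['college_career_rate']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```

Running the same groupby on the real df would give the exact quartiles that the inner lines of the violins only approximate visually.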


In [19]:
new_columns = ['school_name', 'boro', 'tech_academicopportunities', 'neighborhood', 'location',
               'subway', 'bus', 'total_students', 'start_time', 'end_time', 'graduation_rate',
               'attendance_rate', 'pct_stu_enough_variety', 'college_career_rate', 'girls',
               'specialized', 'earlycollege', 'tech_program', 'tech_interest', 'city', 'zip']
tech_schools = df[tech_inclined][new_columns]
tech_schools.head(3)


Out[19]:
school_name boro tech_academicopportunities neighborhood location subway bus total_students start_time end_time graduation_rate attendance_rate pct_stu_enough_variety college_career_rate girls specialized earlycollege tech_program tech_interest city zip
dbn
30Q450 Long Island City High School Q 1 Astoria 14-30 Broadway, Astoria NY 11106 (40.765881, -... N, Q to Broadway Q100, Q102, Q103, Q104, Q18, Q19, Q66, Q69 2077 8:29am 3:50pm 0.648 0.80 0.89 0.481 Mixed NaN NaN 0 0 Astoria 11106
30Q445 William Cullen Bryant High School Q 0 Astoria 48-10 31st Avenue, Astoria NY 11103 (40.757072... M, R to 46th St Q101, Q104, Q18, Q66 2437 8:50am 3:29pm 0.683 0.88 0.83 0.522 Mixed NaN NaN 1 0 Astoria 11103
30Q501 Frank Sinatra School of the Arts High School Q 1 Astoria 35-12 35th Avenue, Astoria NY 11106 (40.756099... M, R to Steinway St ; N, Q to 36 Ave-Washingto... Q101, Q102, Q104, Q66 828 7:45am 3:15pm 0.971 0.95 0.80 0.958 Mixed NaN NaN 0 0 Astoria 11106

Let's now shift our focus to the graduation_rate and college_career_rate columns. In particular, college_career_rate's definition is "at the end of the 2014-15 school year, the percent of students who graduated 'on time' by earning a diploma four years after they entered 9th grade".

We could multiply that rate by the total number of students in each school to estimate the number of potential college students each school has.


In [20]:
fig, ax = plt.subplots(figsize=(20, 8))
ax.set_xlim(0, 1.02)
ax.set_ylim(0, 1.05)

sns.regplot(x=tech_schools['graduation_rate'], y=tech_schools['college_career_rate'],
            order=3, ax=ax)

ax.set_xlabel('Graduation Rate')
ax.set_ylabel('College Career Rate')
plt.suptitle('College Career Rate by Graduation Rate')

plt.savefig('figures/college-career-and-graduation-rate.png', bbox_inches='tight')


We can see that graduation_rate and college_career_rate have a strong correlation. That means if we have too many college_career_rate null values we can use graduation_rate as a proxy.
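
The strength of that relationship can also be expressed as a single number. The sketch below computes a Pearson correlation on made-up rates (not the dataset's values); on the real data the same call would be applied to the two tech_schools columns.

```python
import pandas as pd

nan = float('nan')

# Toy rates, invented for illustration; the last row has missing values.
toy = pd.DataFrame({
    'graduation_rate':     [0.60, 0.70, 0.80, 0.90, 0.95, nan],
    'college_career_rate': [0.40, 0.52, 0.61, 0.78, 0.90, nan],
})

# Pearson correlation; rows with a missing value in either column are skipped.
r = toy['graduation_rate'].corr(toy['college_career_rate'])
print(round(r, 3))
```

Note that Pearson's r only measures linear association, whereas the regression above fits a third-order polynomial, so the two views complement each other.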


In [21]:
potential = tech_schools['college_career_rate'] * tech_schools['total_students']
potential.sort_values(inplace=True, ascending=False)
potential


Out[21]:
dbn
13K430    5323.708066
22K405    3176.841964
10X445    2976.890066
21K525    2754.464965
26Q495    2548.926093
31R460    2541.000000
03M485    2295.198043
22K425    2135.250016
28Q505    2098.872096
20K445    2000.339082
25Q425    1886.413943
24Q610    1855.872047
31R440    1610.560006
20K505    1604.384984
28Q620    1468.948021
01M539    1442.719025
21K540    1421.774998
31R450    1400.000005
30Q445    1272.114037
04M435    1263.888021
27Q323    1181.424023
31R605    1163.890033
21K410    1144.643964
25Q525    1133.000000
30Q450     999.037012
20K485     985.844019
17K590     947.719988
05M499     939.802994
24Q600     904.050002
24Q455     876.566994
             ...     
14K632      45.320000
16K393      29.414000
26Q315            NaN
10X524            NaN
10X565            NaN
10X264            NaN
19K422            NaN
11X509            NaN
11X508            NaN
29Q313            NaN
08X561            NaN
02M422            NaN
02M282            NaN
02M280            NaN
02M135            NaN
02M507            NaN
17K489            NaN
17K122            NaN
02M533            NaN
02M546            NaN
06M211            NaN
03M859            NaN
07X522            NaN
08X559            NaN
07X223            NaN
02M432            NaN
17K745            NaN
30Q258            NaN
22K611            NaN
22K630            NaN
dtype: float64

In [22]:
null_college_career_rate = tech_schools.college_career_rate.isnull()
print("{:.1f}%".format(100 * len(tech_schools[null_college_career_rate]) / len(tech_schools)))


14.0%

In [23]:
null_graduation_rate = tech_schools.graduation_rate.isnull()
print("{:.1f}%".format(100 * len(tech_schools[null_graduation_rate]) / len(tech_schools)))


11.0%

In [24]:
print("{:.1f}%".format(100 * len(tech_schools[(null_college_career_rate) & \
                                              (null_graduation_rate)]) \
                       / len(tech_schools)))


11.0%

In [25]:
fig, ax = plt.subplots(figsize=(20, 8))
sns.distplot(tech_schools['total_students'], bins=range(0, 6000, 250), kde=False, rug=True)

ax.set_xlabel('Total Students')
ax.set_ylabel('Number of Schools with that Many Students')
plt.suptitle('Number of Schools by Total Students')


Out[25]:
<matplotlib.text.Text at 0x113e8b320>

In [26]:
tech_schools[(null_college_career_rate) & (null_graduation_rate)]['total_students'].max()


Out[26]:
662

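It seems that 14% of schools don't have figures on the college career rate. Most of them (11% of all schools) are also missing the graduation rate, so graduation_rate could fill only a small part of the gap. Let's plot the college career rate distribution to help decide whether we should ignore the column or drop the schools without that data.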
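
As an aside, the missing-value shares can be computed more compactly, because the mean of a boolean mask is the fraction of True values. This is a sketch on an invented four-row frame, not the notebook's data:

```python
import pandas as pd

nan = float('nan')

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    'graduation_rate':     [0.9, nan, 0.7, 0.8],
    'college_career_rate': [0.8, nan, nan, 0.6],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()

# Fraction of rows where both columns are missing at once.
missing_both = toy.isnull().all(axis='columns').mean()

print(missing_share)
print(missing_both)
```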


In [27]:
import numpy as np

fig, ax = plt.subplots(figsize=(20, 8))
ax.set_xlim(0.05, 1.15)

schools_to_plot = tech_schools[~(null_college_career_rate)]
# Eleven evenly spaced edges give ten bins that cover the full 0 to 1 range
sns.distplot(schools_to_plot['college_career_rate'], bins=np.linspace(0, 1, 11))

ax.set_xlabel('College Career Rate')
ax.set_ylabel('Number of Schools with that Rate')
plt.suptitle('Number of Schools by College Career Rate')

fig.savefig('figures/college-career-rate.png', bbox_inches='tight')


Since some schools have a really low college career rate, let's use that column and filter out the schools that don't have that data point.

Let's do that and also plot the distribution of schools by their number of potential college students.


In [28]:
# Copy to avoid chained indexing and the SettingWithCopy warning (http://bit.ly/2kkXW5B)
tech_col_potential = pd.DataFrame(tech_schools, copy=True)
tech_col_potential.dropna(subset=['college_career_rate'], inplace=True)

tech_col_potential['potential_college_students'] = (
    tech_col_potential['total_students'] * tech_col_potential['college_career_rate']
).astype(int)

tech_col_potential.sort_values('potential_college_students', inplace=True, ascending=False)
tech_col_potential.head(3)


Out[28]:
school_name boro tech_academicopportunities neighborhood location subway bus total_students start_time end_time graduation_rate attendance_rate pct_stu_enough_variety college_career_rate girls specialized earlycollege tech_program tech_interest city zip potential_college_students
dbn
13K430 Brooklyn Technical High School K 1 Fort Greene 29 Ft Greene Place, Brooklyn NY 11217 (40.6888... D, N to Atlantic Ave – Barclays Center; G to F... B103, B25, B26, B37, B38, B41, B45, B52, B54, ... 5534 8:45am 3:15pm 0.974 0.97 0.89 0.962 Mixed 1.0 NaN 0 0 Brooklyn 11217 5323
22K405 Midwood High School K 1 Flatbush 2839 Bedford Avenue, Brooklyn NY 11210 (40.632... 2, 5 to Flatbush Ave – Brooklyn College B103, B11, B41, B44, B44-SBS, B49, B6, B8, BM1... 3986 8:45am 3:30pm 0.910 0.94 0.92 0.797 Mixed NaN NaN 0 0 Brooklyn 11210 3176
10X445 Bronx High School of Science X 1 Van Cortlandt Village 75 West 205th Street, Bronx NY 10468 (40.87995... 4 to Bedford Park Blvd - Lehman College ; B, D... Bx1, Bx10, Bx2, Bx22, Bx26, Bx28, Bx3, Bx38, B... 3010 8am 3:45pm 0.991 0.97 0.92 0.989 Mixed 1.0 NaN 0 0 Bronx 10468 2976

In [29]:
fig, ax = plt.subplots(figsize=(20, 8))
sns.distplot(tech_col_potential['potential_college_students'], bins=range(0, 6000, 250),
             kde=False, rug=True)

ax.set_xlabel('Potential College Students')
ax.set_ylabel('Number of Schools')
plt.suptitle('Number of Schools by Potential College Students')

plt.savefig('figures/potential-college-students.png', bbox_inches='tight')



In [30]:
high_potential = tech_col_potential['potential_college_students'] > 1000
high_potential_schools = tech_col_potential[high_potential]
len(high_potential_schools)


Out[30]:
24

There seems to be a big gap between the number of schools with more than 1000 potential college students and the number of schools with fewer.

Since we want to reduce the number of recommended stations by at least 90%, and there are 24 schools with more than 1000 potential college students, let's keep those and ignore the other ones.

Next, let's examine the subway and bus columns, which tell us which subway and bus lines are near each school.
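
The subway strings appear to list one entry per station, separated by semicolons (for example "F to Bay Parkway-22nd Ave; N to 20th Ave"). As a rough sketch under that assumption, the entries could be split and counted as follows. The separator rule is inferred from the rows shown in this notebook rather than from the dataset documentation, and split_stations is a hypothetical helper.

```python
import pandas as pd

# Three sample values in the format seen in the subway column.
subway = pd.Series({
    '21K525': 'Q to Ave M',
    '20K505': 'F to Bay Parkway-22nd Ave; N to 20th Ave',
    '26Q495': None,
})

def split_stations(value):
    """Hypothetical helper: one list item per ';'-separated station entry."""
    if pd.isnull(value):
        return []
    return [entry.strip() for entry in value.split(';')]

stations = subway.apply(split_stations)
station_count = stations.apply(len)
print(station_count)
```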


In [31]:
high_potential_schools.loc[:, ['subway', 'bus']]


Out[31]:
subway bus
dbn
13K430 D, N to Atlantic Ave – Barclays Center; G to F... B103, B25, B26, B37, B38, B41, B45, B52, B54, ...
22K405 2, 5 to Flatbush Ave – Brooklyn College B103, B11, B41, B44, B44-SBS, B49, B6, B8, BM1...
10X445 4 to Bedford Park Blvd - Lehman College ; B, D... Bx1, Bx10, Bx2, Bx22, Bx26, Bx28, Bx3, Bx38, B...
21K525 Q to Ave M B11, B49, B6, B68, B9, BM1, BM3, BM4
26Q495 NaN Q13, Q28, Q31, Q76, QM20
31R460 NaN S54, S57, S61, S91, X31
03M485 1 to 66th St - Lincoln Center ; 2, A, B, C, D ... BxM2, M10, M104, M11, M12, M20, M31, M5, M57, ...
22K425 NaN B100, B2, B31, B41, B44, B44-SBS, B49, B7, B82...
28Q505 E, J, Z to Jamaica Center - Parsons / Archer ;... Q1, Q110, Q111, Q112, Q113, Q114, Q17, Q2, Q20...
20K445 D to 79th St B1, B4, B64, B8, X28, X38
25Q425 NaN Q17, Q20A, Q20B, Q25, Q34, Q44-SBS, Q88
24Q610 7 to 33rd St B24, Q32, Q39, Q60, Q67
31R440 NaN S57, S74, S76, S78, S79-SBS, S86, X1, X2, X3, ...
20K505 F to Bay Parkway-22nd Ave; N to 20th Ave B11, B6, B8, B9
28Q620 F to 169th St Q1, Q17, Q2, Q25, Q3, Q30, Q31, Q34, Q36, Q43,...
01M539 F, J, M, Z to Delancey St-Essex St B39, M14A, M14D, M21, M22, M8, M9
21K540 D to Bay 50th St; F to Ave X; N to Gravesend ... B1, B4, B64, B82, X28, X38
31R450 SIR to St. George S40, S42, S44, S46, S48, S51, S52, S61, S62, S...
30Q445 M, R to 46th St Q101, Q104, Q18, Q66
04M435 NaN BxM10, BxM6, BxM7, BxM8, BxM9, M101, M102, M10...
27Q323 A, S to Beach 105th St-Seaside Q22, Q52, Q53, QM16
31R605 SIR to New Dorp S57, S74, S76, S78, S79-SBS, S86, X1, X15, X2,...
21K410 B to Brighton Beach; F to Neptune Ave-Van Sicl... B1, B36, B4, B68
25Q525 NaN Q17, Q20A, Q20B, Q25, Q34, Q44-SBS, Q64, Q88, QM4

In [32]:
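# Work on a copy so the assignment does not trigger the SettingWithCopy warning
high_potential_schools = high_potential_schools.copy()
# Flag each of the selected schools directly instead of applying over the whole df
high_potential_schools['subway_nearby'] = high_potential_schools['subway'].isnull().map(
    {True: 'no subway', False: 'subway nearby'})
high_potential_schools['subway_nearby']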


Out[32]:
dbn
13K430    subway nearby
22K405    subway nearby
10X445    subway nearby
21K525    subway nearby
26Q495        no subway
31R460        no subway
03M485    subway nearby
22K425        no subway
28Q505    subway nearby
20K445    subway nearby
25Q425        no subway
24Q610    subway nearby
31R440        no subway
20K505    subway nearby
28Q620    subway nearby
01M539    subway nearby
21K540    subway nearby
31R450    subway nearby
30Q445    subway nearby
04M435        no subway
27Q323    subway nearby
31R605    subway nearby
21K410    subway nearby
25Q525        no subway
Name: subway_nearby, dtype: object

In [33]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
high_potential_schools['all'] = ""

ax1 = sns.violinplot(data=high_potential_schools, x='all', y="college_career_rate", 
                     hue="subway_nearby", split=True, inner="quartile", ax=ax1)

ax2 = sns.violinplot(data=high_potential_schools, x='all', y="total_students",
                     hue="subway_nearby", split=True, inner="quartile", ax=ax2)

ax1.set_xlabel("")
ax2.set_xlabel("")
sns.despine(left=True)
plt.suptitle('College Career Rate and Total Students by Subway Nearby')

fig.savefig('figures/subway-vs-no-subway.png', bbox_inches='tight')


Notice how the 75th percentile of the college career rate is much higher in high schools with a subway nearby. Also notice that the schools with the highest number of students all seem to have a subway nearby.

Going forward we will filter out schools without a subway station nearby.


In [34]:
# Copy to avoid chained indexing and the SettingWithCopy warning (http://bit.ly/2kkXW5B)
close_to_subway = pd.DataFrame(high_potential_schools, copy=True)
close_to_subway.dropna(subset=['subway'], inplace=True)
close_to_subway


Out[34]:
school_name boro tech_academicopportunities neighborhood location subway bus total_students start_time end_time graduation_rate attendance_rate pct_stu_enough_variety college_career_rate girls specialized earlycollege tech_program tech_interest city zip potential_college_students subway_nearby all
dbn
13K430 Brooklyn Technical High School K 1 Fort Greene 29 Ft Greene Place, Brooklyn NY 11217 (40.6888... D, N to Atlantic Ave – Barclays Center; G to F... B103, B25, B26, B37, B38, B41, B45, B52, B54, ... 5534 8:45am 3:15pm 0.974 0.97 0.89 0.962 Mixed 1.0 NaN 0 0 Brooklyn 11217 5323 subway nearby
22K405 Midwood High School K 1 Flatbush 2839 Bedford Avenue, Brooklyn NY 11210 (40.632... 2, 5 to Flatbush Ave – Brooklyn College B103, B11, B41, B44, B44-SBS, B49, B6, B8, BM1... 3986 8:45am 3:30pm 0.910 0.94 0.92 0.797 Mixed NaN NaN 0 0 Brooklyn 11210 3176 subway nearby
10X445 Bronx High School of Science X 1 Van Cortlandt Village 75 West 205th Street, Bronx NY 10468 (40.87995... 4 to Bedford Park Blvd - Lehman College ; B, D... Bx1, Bx10, Bx2, Bx22, Bx26, Bx28, Bx3, Bx38, B... 3010 8am 3:45pm 0.991 0.97 0.92 0.989 Mixed 1.0 NaN 0 0 Bronx 10468 2976 subway nearby
21K525 Edward R. Murrow High School K 1 Midwood 1600 Avenue L, Brooklyn NY 11230 (40.619671, -... Q to Ave M B11, B49, B6, B68, B9, BM1, BM3, BM4 3885 8:05am 2:45pm 0.817 0.92 0.88 0.709 Mixed NaN NaN 0 0 Brooklyn 11230 2754 subway nearby
03M485 Fiorello H. LaGuardia High School of Music & A... M 1 Lincoln Square 100 Amsterdam Avenue, Manhattan NY 10023 (40.7... 1 to 66th St - Lincoln Center ; 2, A, B, C, D ... BxM2, M10, M104, M11, M12, M20, M31, M5, M57, ... 2713 8am 4pm 0.979 0.95 0.86 0.846 Mixed 1.0 NaN 0 0 Manhattan 10023 2295 subway nearby
28Q505 Hillcrest High School Q 1 Briarwood-Jamaica Hills 160-05 Highland Avenue, Jamaica NY 11432 (40.7... E, J, Z to Jamaica Center - Parsons / Archer ;... Q1, Q110, Q111, Q112, Q113, Q114, Q17, Q2, Q20... 3321 8am 3:30pm 0.756 0.90 1.00 0.632 Mixed NaN NaN 1 0 Jamaica 11432 2098 subway nearby
20K445 New Utrecht High School K 1 Bensonhurst West 1601 80th Street, Brooklyn NY 11214 (40.613041... D to 79th St B1, B4, B64, B8, X28, X38 3553 8:20am 3:10pm 0.743 0.88 0.85 0.563 Mixed NaN NaN 1 0 Brooklyn 11214 2000 subway nearby
24Q610 Aviation Career & Technical Education High School Q 0 Hunters Point-Sunnyside 45-30 36th Street, Long Island City NY 11101 (... 7 to 33rd St B24, Q32, Q39, Q60, Q67 2148 8am 4:15pm 0.913 0.95 0.83 0.864 Mixed NaN NaN 1 1 Long Island City 11101 1855 subway nearby
20K505 Franklin Delano Roosevelt High School K 1 Borough Park 5800 20th Avenue, Brooklyn NY 11204 (40.621299... F to Bay Parkway-22nd Ave; N to 20th Ave B11, B6, B8, B9 3177 9:30am 4:15pm 0.629 0.85 0.85 0.505 Mixed NaN NaN 1 1 Brooklyn 11204 1604 subway nearby
28Q620 Thomas A. Edison Career and Technical Educatio... Q 1 Briarwood-Jamaica Hills 165-65 84th Avenue, Jamaica NY 11432 (40.71607... F to 169th St Q1, Q17, Q2, Q25, Q3, Q30, Q31, Q34, Q36, Q43,... 2132 8am 3:30pm 0.888 0.94 0.81 0.689 Mixed NaN NaN 1 1 Jamaica 11432 1468 subway nearby
01M539 New Explorations into Science, Technology and ... M 1 Lower East Side 111 Columbia Street, Manhattan NY 10002 (40.71... F, J, M, Z to Delancey St-Essex St B39, M14A, M14D, M21, M22, M8, M9 1753 8:20am 2:40pm 0.975 0.95 0.82 0.823 Mixed NaN NaN 1 0 Manhattan 10002 1442 subway nearby
21K540 John Dewey High School K 1 Gravesend 50 Avenue X, Brooklyn NY 11223 (40.58807, -73.... D to Bay 50th St; F to Ave X; N to Gravesend ... B1, B4, B64, B82, X28, X38 2225 8:13am 3:05pm 0.699 0.86 0.78 0.639 Mixed NaN NaN 0 0 Brooklyn 11223 1421 subway nearby
31R450 Curtis High School R 0 New Brighton-St. George 105 Hamilton Avenue, Staten Island NY 10301 (4... SIR to St. George S40, S42, S44, S46, S48, S51, S52, S61, S62, S... 2500 9am 3:30pm 0.760 0.86 0.88 0.560 Mixed NaN NaN 1 1 Staten Island 10301 1400 subway nearby
30Q445 William Cullen Bryant High School Q 0 Astoria 48-10 31st Avenue, Astoria NY 11103 (40.757072... M, R to 46th St Q101, Q104, Q18, Q66 2437 8:50am 3:29pm 0.683 0.88 0.83 0.522 Mixed NaN NaN 1 0 Astoria 11103 1272 subway nearby
27Q323 Scholars' Academy Q 1 Breezy Point-Rockaway Park 320 Beach 104th Street, Rockaway Park NY 11694... A, S to Beach 105th St-Seaside Q22, Q52, Q53, QM16 1304 8:45am 3:35pm 0.992 0.95 0.84 0.906 Mixed NaN NaN 0 0 Rockaway Park 11694 1181 subway nearby
31R605 Staten Island Technical High School R 1 New Dorp-Midland Beach 485 Clawson Street, Staten Island NY 10306 (40... SIR to New Dorp S57, S74, S76, S78, S79-SBS, S86, X1, X15, X2,... 1279 7:45am 2:30pm 0.990 0.97 0.94 0.910 Mixed 1.0 NaN 0 0 Staten Island 10306 1163 subway nearby
21K410 Abraham Lincoln High School K 1 West Brighton 2800 Ocean Parkway, Brooklyn NY 11235 (40.5826... B to Brighton Beach; F to Neptune Ave-Van Sicl... B1, B36, B4, B68 2108 7:28am 3:10pm 0.658 0.83 0.80 0.543 Mixed NaN NaN 0 0 Brooklyn 11235 1144 subway nearby

Let's turn our attention to the location column. We have to extract the latitude and longitude in order to match this dataset with the subway stations' coordinates. Let's use add_coord_columns(), which is defined in coordinates.py.
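
The contents of coordinates.py are not shown in this notebook. Purely as an illustration of how such an extraction could work, here is a hypothetical stand-in that assumes the location string ends with a "(latitude, longitude)" pair, as the truncated values above suggest. The function name, the regular expression and the sample address are all assumptions.

```python
import pandas as pd

def add_coord_columns_sketch(frame, column):
    """Hypothetical stand-in for coordinates.add_coord_columns (not the original).

    Adds 'latitude' and 'longitude' columns to frame in place, parsed from a
    trailing '(lat, lon)' pair in frame[column].
    """
    pattern = r'\((-?\d+\.\d+),\s*(-?\d+\.\d+)\)'
    # With two capture groups, str.extract returns a DataFrame with columns 0 and 1.
    coords = frame[column].str.extract(pattern).astype(float)
    frame['latitude'] = coords[0]
    frame['longitude'] = coords[1]

# Made-up rows in the same apparent format as the location column.
toy = pd.DataFrame(
    {'location': ['1 Example Street, Brooklyn NY 11201 (40.5, -73.75)',
                  '2 Example Avenue, Brooklyn NY 11201']},
    index=['00X001', '00X002'])

add_coord_columns_sketch(toy, 'location')
print(toy[['latitude', 'longitude']])
```

Rows whose string does not contain a coordinate pair end up with NaN in both new columns, which makes them easy to spot before matching against station coordinates.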


In [35]:
import coordinates as coord

coord.add_coord_columns(close_to_subway, 'location')
close_to_subway.loc[:, ['latitude', 'longitude']]


Out[35]:
latitude longitude
dbn
13K430 40.688896 -73.976435
22K405 40.632829 -73.952356
10X445 40.879958 -73.889011
21K525 40.619671 -73.959141
03M485 40.774202 -73.985976
28Q505 40.709461 -73.803001
20K445 40.613041 -74.002308
24Q610 40.743309 -73.929577
20K505 40.621299 -73.982583
28Q620 40.716071 -73.799692
01M539 40.719416 -73.979581
21K540 40.588070 -73.981925
31R450 40.645436 -74.082149
30Q445 40.757072 -73.911165
27Q323 40.584543 -73.825162
31R605 40.568299 -74.117086
21K410 40.582627 -73.968733

Let's plot the schools' coordinates to see their geographical distribution:


In [36]:
!pip install folium


Requirement already satisfied: folium in /Users/gabrielcs/anaconda/lib/python3.5/site-packages
Requirement already satisfied: six in /Users/gabrielcs/anaconda/lib/python3.5/site-packages (from folium)
Requirement already satisfied: Jinja2 in /Users/gabrielcs/anaconda/lib/python3.5/site-packages (from folium)
Requirement already satisfied: branca in /Users/gabrielcs/anaconda/lib/python3.5/site-packages (from folium)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/gabrielcs/anaconda/lib/python3.5/site-packages (from Jinja2->folium)

In [37]:
import folium

close_to_subway_map = folium.Map([40.72, -73.92], zoom_start=11, tiles='CartoDB positron',
                                 width='60%')

for i, school in close_to_subway.iterrows():
    marker = folium.RegularPolygonMarker([school['latitude'], school['longitude']],
                                         popup=school['school_name'], color='RoyalBlue', 
                                         fill_color='RoyalBlue', radius=5)
    marker.add_to(close_to_subway_map)

close_to_subway_map.save('maps/close_to_subway.html')
close_to_subway_map


Out[37]:

The interactive map is available here.

It seems like we have all the school data we need to make the recommendations. Let's trim the DataFrame down to the columns we need and save it as a pickle file for later use in another Jupyter notebook.
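
For readers unfamiliar with pickling, the sketch below shows a save-and-load round trip on a made-up one-row frame, written to a temporary directory rather than to this project's pickle/ folder. The school name and dbn are placeholders.

```python
import os
import tempfile

import pandas as pd

# Made-up frame; the school name and dbn are placeholders.
toy = pd.DataFrame({'school_name': ['Example High School'],
                    'potential_college_students': [1234]},
                   index=pd.Index(['00X000'], name='dbn'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'high_schools.p')
    toy.to_pickle(path)               # serialize the frame, index and dtypes included
    restored = pd.read_pickle(path)   # what the other notebook would run

print(restored.equals(toy))
```

Unlike a CSV export, a pickle preserves the index and column dtypes, so the other notebook gets the frame back unchanged. Pickle files should only be loaded from sources you trust.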


In [38]:
close_to_subway.rename(columns={'subway': 'subway_lines'}, inplace=True)

df_to_pickle = close_to_subway.loc[:, ['school_name', 'potential_college_students', 'latitude',
                                       'longitude', 'start_time', 'end_time', 'subway_lines',
                                       'city']]
df_to_pickle


Out[38]:
school_name potential_college_students latitude longitude start_time end_time subway_lines city
dbn
13K430 Brooklyn Technical High School 5323 40.688896 -73.976435 8:45am 3:15pm D, N to Atlantic Ave – Barclays Center; G to F... Brooklyn
22K405 Midwood High School 3176 40.632829 -73.952356 8:45am 3:30pm 2, 5 to Flatbush Ave – Brooklyn College Brooklyn
10X445 Bronx High School of Science 2976 40.879958 -73.889011 8am 3:45pm 4 to Bedford Park Blvd - Lehman College ; B, D... Bronx
21K525 Edward R. Murrow High School 2754 40.619671 -73.959141 8:05am 2:45pm Q to Ave M Brooklyn
03M485 Fiorello H. LaGuardia High School of Music & A... 2295 40.774202 -73.985976 8am 4pm 1 to 66th St - Lincoln Center ; 2, A, B, C, D ... Manhattan
28Q505 Hillcrest High School 2098 40.709461 -73.803001 8am 3:30pm E, J, Z to Jamaica Center - Parsons / Archer ;... Jamaica
20K445 New Utrecht High School 2000 40.613041 -74.002308 8:20am 3:10pm D to 79th St Brooklyn
24Q610 Aviation Career & Technical Education High School 1855 40.743309 -73.929577 8am 4:15pm 7 to 33rd St Long Island City
20K505 Franklin Delano Roosevelt High School 1604 40.621299 -73.982583 9:30am 4:15pm F to Bay Parkway-22nd Ave; N to 20th Ave Brooklyn
28Q620 Thomas A. Edison Career and Technical Educatio... 1468 40.716071 -73.799692 8am 3:30pm F to 169th St Jamaica
01M539 New Explorations into Science, Technology and ... 1442 40.719416 -73.979581 8:20am 2:40pm F, J, M, Z to Delancey St-Essex St Manhattan
21K540 John Dewey High School 1421 40.588070 -73.981925 8:13am 3:05pm D to Bay 50th St; F to Ave X; N to Gravesend ... Brooklyn
31R450 Curtis High School 1400 40.645436 -74.082149 9am 3:30pm SIR to St. George Staten Island
30Q445 William Cullen Bryant High School 1272 40.757072 -73.911165 8:50am 3:29pm M, R to 46th St Astoria
27Q323 Scholars' Academy 1181 40.584543 -73.825162 8:45am 3:35pm A, S to Beach 105th St-Seaside Rockaway Park
31R605 Staten Island Technical High School 1163 40.568299 -74.117086 7:45am 2:30pm SIR to New Dorp Staten Island
21K410 Abraham Lincoln High School 1144 40.582627 -73.968733 7:28am 3:10pm B to Brighton Beach; F to Neptune Ave-Van Sicl... Brooklyn

In [39]:
df_to_pickle.to_pickle('pickle/high_schools.p')