Project: Project 1: Benson Date: 01/23/2017 Name: Prashant Tatineni Team: Connor Stefan, Prashant Tatineni, Rosie Hoyem

Summary of Solution Steps

Use Census tract data to locate high density residential areas specifically for women working in tech.
Join Census data with MTA station locations retrieved from ArcGIS.
Use MTA turnstile data to find highest volume stations.
Create a combined rank accounting for total volume and count of female tech workers.



In [4]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

%matplotlib inline

Step 1

This python script takes in the dataset from census.gov and reformats the Census Tract value to match the format in the table downloaded from ArcGIS. It then returns 2 columns, census_tracts and counts of women in each of those census tracts.



In [5]:

    
df_input = pd.read_csv('data/raw/ACS_15_5YR_C24010_with_ann.csv')

census_tracts = []
counts = []
for index, row in df_input.iterrows():
	if index != 0:
		counts.append(int(row['HD01_VD43']))
		county_code = row['GEO.id2'][3:5]
		tract_code = row['GEO.id2'][5:]
		if county_code == '61':
			census_tracts.append('1' + tract_code)
		if county_code == '05':
			census_tracts.append('2' + tract_code)
		if county_code == '47':
			census_tracts.append('3' + tract_code)
		if county_code == '81':
			census_tracts.append('4' + tract_code)
		if county_code == '85':
			census_tracts.append('5' + tract_code)

df_output = pd.DataFrame({'BoroCT2010':census_tracts, 'stem_women':counts})
df_output.to_csv('data/stem_women.csv', index=False)

Step 2

Merge census data with ArcGIS data.



In [6]:

    
stations = pd.read_csv('data/raw/subway_station_tracts.csv')
stem_women = pd.read_csv('data/stem_women.csv')
df = pd.merge(stations, stem_women, on='BoroCT2010')

dfviz = df[df['stem_women'] > 100]

sns.barplot(x=dfviz['stem_women'], y=dfviz['name'], hue=dfviz['BoroName'], ci=None);
plt.xlabel('Women in STEM occupations living nearby',fontsize=14);
plt.ylabel('Station Name',fontsize=14);
plt.legend(loc=1);
plt.xlim(100,400);



In [7]:

    
dft50 = df[df['stem_women'] > 50].sort_values('stem_women', ascending=False)[['name','BoroName','BoroCT2010','stem_women']]
sorted_stations = dft50.groupby(['name','BoroName','BoroCT2010']).max().sort_values('stem_women', ascending=False).head(50)
sorted_stations.to_csv('data/top50_stations_women.csv')

Step 3

Import and clean MTA turnstile data for highest volume stations.



In [10]:

    
df1 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160507.txt')
df2 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160514.txt')
df3 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160521.txt')
df4 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160528.txt')
frames = [df1, df2, df3, df4]
df = pd.concat(frames)



In [21]:

    
df.columns = (u'C/A', u'UNIT', u'SCP', u'STATION', u'LINENAME', u'DIVISION', u'DATE',
       u'TIME', u'DESC', u'ENTRIES',
       u'EXITS')
df = df.sort(['C/A','UNIT','SCP','DATE','TIME'])
df['diffEntries'] = df.ENTRIES.diff()
df['diffExits'] = df.EXITS.diff()
df['machine_change'] = (df.SCP != df.SCP.shift())









    



/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)



In [22]:

    
# Edit records where turnstile machine changes; set to 0.
df = df.reset_index(drop=True)
df.diffEntries[df.index[df.machine_change==True]] = 0
df.diffExits[df.index[df.machine_change==True]] = 0









    



/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [23]:

    
# Further edit records where turnstile counter resulted in negative counts or counts > 6000; reset to 0.
df['counterReset'] = ((df.diffEntries > 6000) | (df.diffEntries < 0) | (df.diffExits > 6000) | (df.diffExits < 0))
df.diffEntries[df.index[df.counterReset==True]] = 0
df.diffExits[df.index[df.counterReset==True]] = 0
df.head()









    



/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






    Out[23]:






  
    
      
      C/A
      UNIT
      SCP
      STATION
      LINENAME
      DIVISION
      DATE
      TIME
      DESC
      ENTRIES
      EXITS
      diffEntries
      diffExits
      machine_change
      counterReset
    
  
  
    
      0
      A002
      R051
      02-00-00
      59 ST
      NQR456
      BMT
      04/30/2016
      00:00:00
      REGULAR
      5639941
      1909983
      0.0
      0.0
      True
      False
    
    
      1
      A002
      R051
      02-00-00
      59 ST
      NQR456
      BMT
      04/30/2016
      04:00:00
      REGULAR
      5639991
      1909993
      50.0
      10.0
      False
      False
    
    
      2
      A002
      R051
      02-00-00
      59 ST
      NQR456
      BMT
      04/30/2016
      08:00:00
      REGULAR
      5640014
      1910024
      23.0
      31.0
      False
      False
    
    
      3
      A002
      R051
      02-00-00
      59 ST
      NQR456
      BMT
      04/30/2016
      12:00:00
      REGULAR
      5640158
      1910134
      144.0
      110.0
      False
      False
    
    
      4
      A002
      R051
      02-00-00
      59 ST
      NQR456
      BMT
      04/30/2016
      16:00:00
      REGULAR
      5640454
      1910197
      296.0
      63.0
      False
      False



In [26]:

    
df['TotalVol'] = df.diffEntries+df.diffExits
topstations = df.groupby(['STATION']).sum()
topstations = topstations.sort('TotalVol', ascending=False)
topstations.to_csv('data/volumeByStation.csv')









    



/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  app.launch_new_instance()

Step 4

Combined rank using top50_stations_women.csv and volumeByStation.csv



In [28]:

    
census = pd.read_csv('data/top50_stations_women.csv')



In [33]:

    
# Top 5 stations by women residents in tech occupations
census['census_rank'] = census.index + 1
census.head()









    Out[33]:






  
    
      
      name
      BoroName
      BoroCT2010
      stem_women
      census_rank
    
  
  
    
      0
      Roosevelt Island - Main St
      Manhattan
      1023801
      354
      1
    
    
      1
      86th St
      Manhattan
      1017500
      248
      2
    
    
      2
      72nd St
      Manhattan
      1012600
      224
      3
    
    
      3
      28th St
      Manhattan
      1006800
      200
      4
    
    
      4
      23rd St
      Manhattan
      1006800
      200
      5



In [34]:

    
ts = (pd.read_csv('data/volumeByStation.csv'))[:51]
ts['ts_rank']= ts.index + 1
ts.head()









    Out[34]:






  
    
      
      STATION
      ENTRIES
      EXITS
      diffEntries
      diffExits
      machine_change
      counterReset
      TotalVol
      ts_rank
    
  
  
    
      0
      34 ST-PENN STA
      488881015020
      287627882120
      4206635.0
      3584248.0
      100.0
      7.0
      7790883.0
      1
    
    
      1
      GRD CNTRL-42 ST
      744724963097
      146796271108
      3574172.0
      3192641.0
      61.0
      152.0
      6766813.0
      2
    
    
      2
      34 ST-HERALD SQ
      563482244287
      596750042788
      3026299.0
      2691563.0
      54.0
      91.0
      5717862.0
      3
    
    
      3
      14 ST-UNION SQ
      239944636667
      85311655366
      2759008.0
      2377425.0
      38.0
      0.0
      5136433.0
      4
    
    
      4
      23 ST
      1111704579992
      1034681559184
      2922280.0
      2197775.0
      71.0
      339.0
      5120055.0
      5



In [35]:

    
import re

capitalizer = lambda x: x.upper()
census['name'] = census['name'].apply(capitalizer)

clean_ave = lambda x: re.sub('AVE','AV', x)
census['name'] = census['name'].apply(clean_ave)

clean_numbers1 = lambda x: re.sub(r'(\b\d+)(RD\b)',r'\1', x)
census['name'] = census['name'].apply(clean_numbers1)

clean_numbers2 = lambda x: re.sub(r'(\b\d+)(TH\b)',r'\1', x)
census['name'] = census['name'].apply(clean_numbers2)

clean_numbers3 = lambda x: re.sub(r'(\b\d+)(ST\b)',r'\1', x)
census['name'] = census['name'].apply(clean_numbers3)

clean_numbers4 = lambda x: re.sub(r'(\b\d+)(ND\b)',r'\1', x)
census['name'] = census['name'].apply(clean_numbers4)

clean_dashes = lambda x: re.sub(r'(\w*)(\s-\s)(\w*)',r'\1-\3', x)
census['name'] = census['name'].apply(clean_dashes)

remove_parans = lambda x: re.sub(r'\(.*\)',r'', x)
census['name'] = census['name'].apply(remove_parans)

census.replace('CONCOURSE', 'CONC', inplace=True)
census.replace('AVENUE', 'AV', inplace=True)
census.replace('WASHINGTON', 'WASH', inplace=True)
census.replace('JUNCTION', 'JCT', inplace=True)
census.replace('CONCOURSE', 'CONC', inplace=True)
census.replace('WOODHAVN', 'WOODHAVEN', inplace=True)
census.replace('CENTER', 'CTR', inplace=True)
census.replace('QUEENSBRIDGE', 'QNSBRIDGE', inplace=True)
census.replace('W 4 ST-WASHINGTON SQ ', 'W 4 ST-WASH SQ', inplace=True)



In [37]:

    
census_ts = pd.merge(census, ts, how='inner', left_on='name', right_on='STATION')
census_ts['comb_rank'] = census_ts['ts_rank'] + census_ts['census_rank']



In [38]:

    
census_ts.plot.scatter('stem_women', 'TotalVol');



In [39]:

    
final = census_ts.sort_values('comb_rank', ascending=True)
final.head()









    Out[39]:






  
    
      
      name
      BoroName
      BoroCT2010
      stem_women
      census_rank
      STATION
      ENTRIES
      EXITS
      diffEntries
      diffExits
      machine_change
      counterReset
      TotalVol
      ts_rank
      comb_rank
    
  
  
    
      0
      86 ST
      Manhattan
      1017500
      248
      2
      86 ST
      394240269476
      430100768929
      2785046.0
      2316859.0
      52.0
      172.0
      5101905.0
      6
      8
    
    
      5
      23 ST
      Manhattan
      1006800
      200
      5
      23 ST
      1111704579992
      1034681559184
      2922280.0
      2197775.0
      71.0
      339.0
      5120055.0
      5
      10
    
    
      7
      FULTON ST
      Manhattan
      1001502
      184
      8
      FULTON ST
      681257714035
      716552976737
      1975340.0
      1646421.0
      96.0
      5.0
      3621761.0
      9
      17
    
    
      1
      72 ST
      Manhattan
      1012600
      224
      3
      72 ST
      486479372196
      495763311658
      1262418.0
      1120278.0
      25.0
      167.0
      2382696.0
      20
      23
    
    
      3
      28 ST
      Manhattan
      1006800
      200
      4
      28 ST
      270514360108
      370396202879
      1268076.0
      1168521.0
      37.0
      3.0
      2436597.0
      19
      23

	C/A	UNIT	SCP	STATION	LINENAME	DIVISION	DATE	TIME	DESC	ENTRIES	EXITS	diffEntries	diffExits	machine_change	counterReset
0	A002	R051	02-00-00	59 ST	NQR456	BMT	04/30/2016	00:00:00	REGULAR	5639941	1909983	0.0	0.0	True	False
1	A002	R051	02-00-00	59 ST	NQR456	BMT	04/30/2016	04:00:00	REGULAR	5639991	1909993	50.0	10.0	False	False
2	A002	R051	02-00-00	59 ST	NQR456	BMT	04/30/2016	08:00:00	REGULAR	5640014	1910024	23.0	31.0	False	False
3	A002	R051	02-00-00	59 ST	NQR456	BMT	04/30/2016	12:00:00	REGULAR	5640158	1910134	144.0	110.0	False	False
4	A002	R051	02-00-00	59 ST	NQR456	BMT	04/30/2016	16:00:00	REGULAR	5640454	1910197	296.0	63.0	False	False

	name	BoroName	BoroCT2010	stem_women	census_rank
0	Roosevelt Island - Main St	Manhattan	1023801	354	1
1	86th St	Manhattan	1017500	248	2
2	72nd St	Manhattan	1012600	224	3
3	28th St	Manhattan	1006800	200	4
4	23rd St	Manhattan	1006800	200	5

	STATION	ENTRIES	EXITS	diffEntries	diffExits	machine_change	counterReset	TotalVol	ts_rank
0	34 ST-PENN STA	488881015020	287627882120	4206635.0	3584248.0	100.0	7.0	7790883.0	1
1	GRD CNTRL-42 ST	744724963097	146796271108	3574172.0	3192641.0	61.0	152.0	6766813.0	2
2	34 ST-HERALD SQ	563482244287	596750042788	3026299.0	2691563.0	54.0	91.0	5717862.0	3
3	14 ST-UNION SQ	239944636667	85311655366	2759008.0	2377425.0	38.0	0.0	5136433.0	4
4	23 ST	1111704579992	1034681559184	2922280.0	2197775.0	71.0	339.0	5120055.0	5

	name	BoroName	BoroCT2010	stem_women	census_rank	STATION	ENTRIES	EXITS	diffEntries	diffExits	machine_change	counterReset	TotalVol	ts_rank	comb_rank
0	86 ST	Manhattan	1017500	248	2	86 ST	394240269476	430100768929	2785046.0	2316859.0	52.0	172.0	5101905.0	6	8
5	23 ST	Manhattan	1006800	200	5	23 ST	1111704579992	1034681559184	2922280.0	2197775.0	71.0	339.0	5120055.0	5	10
7	FULTON ST	Manhattan	1001502	184	8	FULTON ST	681257714035	716552976737	1975340.0	1646421.0	96.0	5.0	3621761.0	9	17
1	72 ST	Manhattan	1012600	224	3	72 ST	486479372196	495763311658	1262418.0	1120278.0	25.0	167.0	2382696.0	20	23
3	28 ST	Manhattan	1006800	200	4	28 ST	270514360108	370396202879	1268076.0	1168521.0	37.0	3.0	2436597.0	19	23