Analyzing the NYC Subway Dataset

Intro to Data Science: Final Project 1, Part 2

(Short Questions)

Rain Supplement

Austin J. Alexander


Import Directives and Initial DataFrame Creation


In [6]:
import inflect # for string manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

filename = '/Users/excalibur/py/nanodegree/intro_ds/final_project/improved-dataset/turnstile_weather_v2.csv'

# import data
data = pd.read_csv(filename)

Conflicting Rain Reports

In the current data set, certain days are labeled as both 'rain' and 'no rain', which presumably means that rain occurred in some locations but not in others on the same day. [ Thankfully, the data set is not as double-minded at the station level: each individual station reports only 'rain' or only 'no rain' on any single day! ]
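Both claims above can be checked directly. The following is a minimal sketch, not part of the original analysis: the helper name `rain_report_summary` and the toy frame are hypothetical, and the function assumes the 'DATEn', 'station', and 'rain' columns used elsewhere in this notebook.

```python
import pandas as pd

def rain_report_summary(df):
    """Summarize how rain reports vary within each day.

    Assumes columns 'DATEn', 'station', and 'rain' (0 or 1).
    """
    # Number of distinct rain values reported across all stations on each day.
    per_day = df.groupby('DATEn')['rain'].nunique()
    # Number of distinct rain values reported by each station on each day.
    per_station_day = df.groupby(['DATEn', 'station'])['rain'].nunique()
    return {
        # Days on which some stations report rain and others do not.
        'conflicting_days': sorted(per_day[per_day > 1].index),
        # True if no station reports both values on the same day.
        'stations_consistent_within_day': bool((per_station_day == 1).all()),
    }

# Made-up rows for illustration only (not taken from the real data set).
toy = pd.DataFrame({
    'DATEn':   ['05-14-11', '05-14-11', '05-14-11', '05-15-11', '05-15-11'],
    'station': ['A', 'A', 'B', 'A', 'B'],
    'rain':    [1, 1, 0, 0, 0],
})
summary = rain_report_summary(toy)
```

Calling `rain_report_summary(data)` on the full DataFrame would list every day with mixed reports and confirm (or refute) the station-level consistency noted above.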

As an example of differing rain reports in a single day based on station location:


In [7]:
dates_stations_rain = data[['DATEn', 'station', 'latitude', 'longitude', 'rain']]
# keep only the rows for the date plotted below ('05-14-11')
may_14_11 = dates_stations_rain[dates_stations_rain['DATEn'] == '05-14-11']
# one row per station location; reassigning avoids modifying a slice in place
stations_locations = may_14_11[['station', 'latitude', 'longitude', 'rain']].drop_duplicates()

stat_loc_rain = stations_locations[stations_locations['rain'] == 1]
stat_loc_no_rain = stations_locations[stations_locations['rain'] == 0]


plt.title('STATION RAIN REPORTS ON 05-14-11')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.scatter(stat_loc_rain['longitude'], stat_loc_rain['latitude'], color='blue')
plt.scatter(stat_loc_no_rain['longitude'], stat_loc_no_rain['latitude'], color='yellow', edgecolors='black')
plt.show()

print("Number of stations reporting rain: " + str(stat_loc_rain['station'].count()))
print("Number of stations reporting no rain: " + str(stat_loc_no_rain['station'].count()))


Number of stations reporting rain: 59
Number of stations reporting no rain: 172

Using a map layer of New York City as a background image:


In [27]:
plt.figure(figsize = (10,10))
plt.title('STATION RAIN REPORTS ON 05-14-11')
plt.xlabel('longitude')
plt.ylabel('latitude')
img = plt.imread('NYmap.png')

plt.scatter(stat_loc_rain['longitude'], stat_loc_rain['latitude'], color='blue', edgecolors='black', zorder=1)
plt.scatter(stat_loc_no_rain['longitude'], stat_loc_no_rain['latitude'], color='yellow', edgecolors='black', zorder=1)

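# stretch the map image over the bounding box of the station coordinates
# (this assumes NYmap.png covers exactly that area)
plt.imshow(img, zorder=0, extent=[data['longitude'].min(), data['longitude'].max(), data['latitude'].min(), data['latitude'].max()])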
plt.show()


Apparent Conclusions

Based on this visualization, and taking this one date as (hopefully) a sufficient example, it seems reasonable to assume that conflicting rain reports are simply the result of so-called isolated or scattered showers.

[ Some other notes to keep in mind when considering the rain variable: rainy/non-rainy days are technically not random, although they might be treated as random for most non-meteorological purposes. Moreover, rainy/non-rainy days tend to cluster (for meteorological reasons). ]
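One rough way to look at the clustering mentioned above is to measure runs of consecutive rainy days. The sketch below is illustrative only: the helper name `rainy_run_lengths` and the toy data are hypothetical, a day is counted as rainy if any station reports rain (an assumption, given the conflicting reports), 'DATEn' is assumed to be formatted MM-DD-YY, and the data is assumed to contain every calendar day in its range.

```python
import pandas as pd

def rainy_run_lengths(df):
    """Return the lengths of runs of consecutive rainy days, in date order."""
    # Assumption: a day is rainy if at least one station reported rain.
    daily = df.groupby('DATEn')['rain'].max()
    # Parse the dates so that days sort chronologically (assumed MM-DD-YY).
    daily.index = pd.to_datetime(daily.index, format='%m-%d-%y')
    daily = daily.sort_index()
    # Start a new run id each time the daily indicator changes value.
    run_id = (daily != daily.shift()).cumsum()
    runs = daily.groupby(run_id).agg(['first', 'count'])
    # Keep only the runs made up of rainy days.
    return runs.loc[runs['first'] == 1, 'count'].tolist()

# Made-up daily reports for illustration only (not taken from the real data set).
toy = pd.DataFrame({
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11', '05-06-11'],
    'rain':  [1, 1, 0, 1, 0, 0],
})
run_lengths = rainy_run_lengths(toy)
```

Many runs longer than one day in `rainy_run_lengths(data)` would be consistent with clustering, although a single month of data is too short to say much more than that.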