In [6]:
import inflect # for string manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
filename = '/Users/excalibur/py/nanodegree/intro_ds/final_project/improved-dataset/turnstile_weather_v2.csv'
# import data
data = pd.read_csv(filename)
In this data set, certain days are labeled as both 'rain' and 'no rain', which presumably means that rain occurred at some locations but not at others on the same day. [ Thankfully, the data set is less ambivalent at the station level: each individual station reports only 'rain' or only 'no rain' on any single day. ]
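Both claims can be checked directly. The following is a minimal sketch that assumes the column names used in this notebook ('DATEn', 'station', 'rain'); the small `toy` frame and its values are hypothetical stand-ins so that the snippet runs on its own, and `data` can be substituted for it.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the turnstile data.
toy = pd.DataFrame({
    'DATEn':   ['05-14-11', '05-14-11', '05-14-11', '05-15-11', '05-15-11'],
    'station': ['A', 'A', 'B', 'A', 'B'],
    'rain':    [1, 1, 0, 0, 0],
})

# Number of distinct rain values per day: 2 means the day carries both labels.
rain_values_per_day = toy.groupby('DATEn')['rain'].nunique()
mixed_days = rain_values_per_day[rain_values_per_day > 1].index.tolist()

# Number of distinct rain values per (day, station): should never exceed 1
# if each station reports only one label per day.
rain_values_per_station_day = toy.groupby(['DATEn', 'station'])['rain'].nunique()
stations_consistent = bool((rain_values_per_station_day <= 1).all())

print(mixed_days)
print(stations_consistent)
```

If `stations_consistent` came back False on the real data, the station-level claim above would need revisiting.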
As an example of differing rain reports in a single day based on station location:
In [7]:
dates_stations_rain = data[['DATEn', 'station', 'latitude', 'longitude', 'rain']]
may_14_11 = dates_stations_rain[dates_stations_rain['DATEn'] == '05-14-11']
stations_locations = may_14_11[['station', 'latitude', 'longitude', 'rain']]
# reassign instead of dropping in place, which would act on a slice of the original frame
stations_locations = stations_locations.drop_duplicates()
stat_loc_rain = stations_locations[stations_locations['rain'] == 1]
stat_loc_no_rain = stations_locations[stations_locations['rain'] == 0]
plt.title('STATION RAIN REPORTS ON 05-14-11')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.scatter(stat_loc_rain['longitude'], stat_loc_rain['latitude'], color='blue')
plt.scatter(stat_loc_no_rain['longitude'], stat_loc_no_rain['latitude'], color='yellow', edgecolors='black')
plt.show()
print("Number of stations reporting rain: " + str(stat_loc_rain['station'].count()))
print("Number of stations reporting no rain: " + str(stat_loc_no_rain['station'].count()))
Using a map layer of New York City as a background image:
In [27]:
plt.figure(figsize = (10,10))
plt.title('STATION RAIN REPORTS ON 05-14-11')
plt.xlabel('longitude')
plt.ylabel('latitude')
img = plt.imread('NYmap.png')
plt.scatter(stat_loc_rain['longitude'], stat_loc_rain['latitude'], color='blue', edgecolors='black', zorder=1)
plt.scatter(stat_loc_no_rain['longitude'], stat_loc_no_rain['latitude'], color='yellow', edgecolors='black', zorder=1)
plt.imshow(img, zorder=0, extent=[data['longitude'].min(), data['longitude'].max(), data['latitude'].min(), data['latitude'].max()])
plt.show()
Judging from this visualization, and taking this one date as (hopefully) a sufficient example, it seems safe to assume that conflicting rain reports are simply the result of so-called isolated or scattered showers.
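To see whether a single date really is representative, the station counts can be tabulated for every date at once. This is only a sketch under the same column-name assumptions as above; the `toy` frame, its values, and the derived column names ('no_rain', 'rain', 'share_rain') are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows; substitute the real frame to run this on the full data.
toy = pd.DataFrame({
    'DATEn':   ['05-14-11', '05-14-11', '05-14-11', '05-15-11', '05-15-11', '05-15-11'],
    'station': ['A', 'B', 'C', 'A', 'B', 'C'],
    'rain':    [1, 0, 0, 0, 0, 0],
})

# One row per (date, station, rain) combination, so each station counts once per day.
station_days = toy[['DATEn', 'station', 'rain']].drop_duplicates()

# Rows are dates, columns are the rain flag, cells are station counts.
counts = pd.crosstab(station_days['DATEn'], station_days['rain'])
counts = counts.reindex(columns=[0, 1], fill_value=0)
counts.columns = ['no_rain', 'rain']

# Fraction of stations reporting rain on each date.
counts['share_rain'] = counts['rain'] / (counts['rain'] + counts['no_rain'])
print(counts)
```

Dates where 'share_rain' is strictly between 0 and 1 are the ones with conflicting reports; a small share on such dates would be consistent with the scattered-showers reading.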
[ Some other notes to keep in mind when considering the rain variable: rainy/non-rainy days are technically not random, although they might be treated as random for most non-meteorological purposes. Moreover, rainy/non-rainy days tend to cluster (for meteorological reasons). ]
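One rough way to look at clustering is to measure runs of consecutive rainy days. The sketch below uses a made-up daily indicator (`daily_rain`, hypothetical values); on the real data, a comparable series could be derived by taking, for example, the maximum of 'rain' per date and sorting chronologically, which is an assumption rather than something done in this notebook.

```python
import pandas as pd

# Hypothetical daily rain indicator in chronological order (1 = rain somewhere that day).
daily_rain = pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 0, 0], name='rain')

# A new run starts whenever the value differs from the previous day.
run_id = (daily_rain != daily_rain.shift()).cumsum()

# For each run, record its value ('first') and its length in days ('size').
runs = daily_rain.groupby(run_id).agg(['first', 'size'])
rainy_run_lengths = runs.loc[runs['first'] == 1, 'size'].tolist()

# Lag-1 autocorrelation: positive values suggest that rainy days follow rainy days.
lag1 = daily_rain.autocorr(lag=1)

print(rainy_run_lengths)
print(lag1)
```

Long rainy runs or a clearly positive lag-1 autocorrelation would support the clustering remark, although a single month of data is a small sample for this kind of check.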