takin' care of imports


In [1]:
from pandas import Series, DataFrame

In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
import matplotlib.pyplot as plt

In [32]:
import matplotlib

In [5]:
%matplotlib inline

read in data


In [6]:
weather = pd.read_table('data/daily_weather.tsv')

In [7]:
stations = pd.read_table('data/stations.tsv')

In [8]:
usage = pd.read_table('data/usage_2012.tsv')

repeat data fixing from previous exercise


In [9]:
weather['date'] = pd.to_datetime(weather['date'])

In [10]:
weather.loc[weather['season_code'] == 1, 'season_desc'] = 'winter'
weather.loc[weather['season_code'] == 2, 'season_desc'] = 'spring'
weather.loc[weather['season_code'] == 3, 'season_desc'] = 'summer'
weather.loc[weather['season_code'] == 4, 'season_desc'] = 'fall'

In [11]:
weather.loc[weather['season_desc'] == 'winter', 'season_code'] = 4
weather.loc[weather['season_desc'] == 'spring', 'season_code'] = 1
weather.loc[weather['season_desc'] == 'summer', 'season_code'] = 2
weather.loc[weather['season_desc'] == 'fall', 'season_code'] = 3
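
The two cells above first relabel `season_desc` from the original codes and then renumber `season_code` from the corrected labels. As a minimal sketch, the same two-step recode can be written with `Series.map`. It runs on a small made-up frame, not the real `weather` data; `toy`, `code_to_desc` and `desc_to_code` are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Toy stand-in for the weather table (the real data comes from daily_weather.tsv).
toy = pd.DataFrame({"season_code": [1, 2, 3, 4, 1]})

# Step 1: original code -> corrected description (same mapping as In [10]).
code_to_desc = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}
toy["season_desc"] = toy["season_code"].map(code_to_desc)

# Step 2: corrected description -> new code (same mapping as In [11]).
desc_to_code = {"winter": 4, "spring": 1, "summer": 2, "fall": 3}
toy["season_code"] = toy["season_desc"].map(desc_to_code)

print(toy)
```

Mapping from the description back to the code in a second step avoids the trap of renumbering in place, where a row changed from 1 to 4 could be picked up again by a later rule for code 4.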

1a. Plot the daily temperature over the course of the year. (This should probably be a line chart.)


In [12]:
weather.plot(x='date', y='temp')
plt.show()


1b. Create a bar chart that shows the average temperature and humidity by month.


In [13]:
temp_humid = weather[['temp', 'humidity']].groupby(weather['date'].dt.month).mean()

In [14]:
temp_humid.plot(kind='bar', width=0.75, color=['#EE4444','#4444EE'])
plt.show()
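
To see what the `groupby` in In [13] produces before it is plotted, here is a small sketch on invented values. The `toy` frame and its numbers are made up for illustration; only the column names match the notebook.

```python
import pandas as pd

# Four made-up days spread over two months.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-01-15", "2012-02-01", "2012-02-20"]),
    "temp": [0.2, 0.4, 0.3, 0.5],
    "humidity": [0.8, 0.6, 0.5, 0.7],
})

# Group the two measurement columns by the month number of each row's date,
# then average within each month. The result has one row per month.
monthly = toy[["temp", "humidity"]].groupby(toy["date"].dt.month).mean()

print(monthly)
```

The result is indexed by month number, which is why the bar chart gets one pair of bars per month.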


2. Use a scatterplot to show how the daily rental volume varies with temperature. Use a different series (with different colors) for each season.


In [15]:
spring_daily_vol = weather.loc[weather['season_desc'] == 'spring']
summer_daily_vol = weather.loc[weather['season_desc'] == 'summer']
fall_daily_vol = weather.loc[weather['season_desc'] == 'fall']
winter_daily_vol = weather.loc[weather['season_desc'] == 'winter']

In [16]:
spr_ax = spring_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='yellow', s=50, alpha=.4)
sum_ax = summer_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='lightgreen', s=50, alpha=.4, ax=spr_ax)
fal_ax = fall_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='#ee5555', s=50, alpha=.4, ax=sum_ax)
win_ax = winter_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='lightblue', s=50, alpha=.4, ax=fal_ax)
plt.title('Temp vs Daily Rental Volume')
plt.show()


3. Create another scatterplot to show how daily rental volume varies with windspeed. As above, use a different series for each season.


In [17]:
spr_ax = spring_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='yellow', s=50, alpha=.4)
sum_ax = summer_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='lightgreen', s=50, alpha=.4, ax=spr_ax)
fal_ax = fall_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='#ee5555', s=50, alpha=.4, ax=sum_ax)
win_ax = winter_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='lightblue', s=50, alpha=.4, ax=fal_ax)
plt.title('Windspeed vs Daily Rental Volume')
plt.show()
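
The per-season plot calls in questions 2 and 3 differ only in the season and its color, so they could also be written as a loop. The following is a rough sketch on made-up rows, not a replacement for the cells above: `toy` and `season_colors` are hypothetical names, and it calls matplotlib's `ax.scatter` directly so that each season can carry a legend label.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the weather table.
toy = pd.DataFrame({
    "season_desc": ["spring", "summer", "fall", "winter", "spring"],
    "windspeed": [0.10, 0.20, 0.15, 0.30, 0.25],
    "total_riders": [4000, 6000, 5000, 2000, 4500],
})

# Same colors as in the cells above.
season_colors = {"spring": "yellow", "summer": "lightgreen",
                 "fall": "#ee5555", "winter": "lightblue"}

fig, ax = plt.subplots()
for season, color in season_colors.items():
    rows = toy[toy["season_desc"] == season]
    # One scatter call per season, all drawn on the same axes.
    ax.scatter(rows["windspeed"], rows["total_riders"],
               c=color, s=50, alpha=0.4, label=season)

ax.set_xlabel("windspeed")
ax.set_ylabel("total_riders")
ax.set_title("Windspeed vs Daily Rental Volume")
ax.legend()
```

Swapping `"windspeed"` for `"temp"` gives the plot for question 2.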


4. How do the rental volumes vary with geography? Compute the average daily rentals for each station and use this as the radius for a scatterplot of each station's latitude and longitude.

Pull out just the station_start column from the 'usage' data and turn it into a new dataframe, 'usage_stations'.


In [18]:
usage_stations = usage[['station_start']]

In [19]:
usage_stations.head()


Out[19]:
   station_start
0  7th & R St NW / Shaw Library
1  Georgia & New Hampshire Ave NW
2  Georgia & New Hampshire Ave NW
3  14th & V St NW
4  11th & Kenyon St NW

Pull out the lat and long columns from the 'stations' data and turn them into a new dataframe, 'stations_geo'.

Set the index of 'stations_geo' to the values of the 'station' column from the same 'stations' data. This matters because we want both dataframes to be indexed by station name, so that they can be lined up later.


In [20]:
stations_geo = DataFrame({'lat': stations.lat, 'long': stations.long})
stations_geo.index = stations.station.values

In [21]:
stations_geo.head()


Out[21]:
                                            lat        long
20th & Bell St                              38.856100  -77.051200
18th & Eads St.                             38.857250  -77.053320
20th & Crystal Dr                           38.856400  -77.049200
15th & Crystal Dr                           38.860170  -77.049593
Aurora Hills Community Ctr/18th & Hayes St  38.857866  -77.059490

Make a new dataframe 'station_count' that counts the occurrences of each station name in 'usage_stations'.


In [22]:
station_count = DataFrame(usage_stations['station_start'].value_counts())

In [23]:
station_count.head()


Out[23]:
                                      0
Massachusetts Ave & Dupont Circle NW  69850
Columbus Circle / Union Station       55146
15th & P St NW                        49416
17th & Corcoran St NW                 43547
14th & V St NW                        40242

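Create a new column 'rides' in 'stations_geo', populated with the data from 'station_count'. The assignment lines the two up by index (station name), so any station whose name has no match in the usage data gets NaN.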


In [24]:
stations_geo['rides'] = station_count

In [25]:
stations_geo.head()


Out[25]:
                                            lat        long        rides
20th & Bell St                              38.856100  -77.051200  1688
18th & Eads St.                             38.857250  -77.053320  NaN
20th & Crystal Dr                           38.856400  -77.049200  5113
15th & Crystal Dr                           38.860170  -77.049593  3094
Aurora Hills Community Ctr/18th & Hayes St  38.857866  -77.059490  1986
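
The NaN for '18th & Eads St.' comes from index alignment: pandas matches the counts to 'stations_geo' by station name, and a name with no match gets no value. A minimal sketch of that behavior on invented data (`geo`, `starts` and the street names are made up; a plain Series of counts is assigned here, where the notebook wraps the counts in a DataFrame):

```python
import pandas as pd

# Three hypothetical stations, indexed by name like stations_geo.
geo = pd.DataFrame({"lat": [38.85, 38.86, 38.87]},
                   index=["A St", "B St", "C St"])

# Hypothetical ride log: "B St" never appears as a start station.
starts = pd.Series(["A St", "C St", "A St"])
counts = starts.value_counts()

# Assignment aligns on the index labels, not on row position.
geo["rides"] = counts

print(geo)
```

Because the match is on the exact label, small spelling differences between the two files (a trailing period, for example) are enough to leave a station without a count.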

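Get rid of all the NaN rows, then plot the data on a scatterplot where the size of each point reflects the station's average daily rentals: the total number of rides at that location divided by the 366 days of 2012. Note that matplotlib's `s` argument sets the marker area rather than the radius, and the factor of 5 in the plot call is only there to make the points large enough to see.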


In [26]:
cleared = stations_geo.dropna()

In [27]:
cleared.head()


Out[27]:
                                            lat        long        rides
20th & Bell St                              38.856100  -77.051200  1688
20th & Crystal Dr                           38.856400  -77.049200  5113
15th & Crystal Dr                           38.860170  -77.049593  3094
Aurora Hills Community Ctr/18th & Hayes St  38.857866  -77.059490  1986
Pentagon City Metro / 12th & S Hayes St     38.862303  -77.059936  4231

In [77]:
cleared.plot(kind='scatter', x='long', y='lat', s=(cleared['rides'] / 366) * 5, alpha=0.6, figsize=(10, 10))
plt.show()
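
The plot above passes average daily rides (times 5) as `s`, which matplotlib interprets as marker area in points squared, so the area of each point scales with rentals. If the radius itself should scale with rentals, as the question asks, the value has to be squared first. A small sketch of both options on invented totals; `rides`, `radius_scale` and the station names are hypothetical:

```python
import pandas as pd

# Invented yearly totals for two stations.
rides = pd.Series({"A St": 3660, "B St": 36600})

days_in_2012 = 366  # 2012 was a leap year
avg_daily = rides / days_in_2012

# Option used in the notebook: marker AREA proportional to average daily rides.
size_area = avg_daily * 5

# Alternative: marker RADIUS proportional to average daily rides.
# radius_scale is an arbitrary factor chosen only to keep the points readable.
radius_scale = 0.5
size_radius = (avg_daily * radius_scale) ** 2

print(pd.DataFrame({"avg_daily": avg_daily,
                    "s_area": size_area,
                    "s_radius": size_radius}))
```

Either series can be passed as `s=` to the scatter call. The squared version exaggerates the difference between busy and quiet stations, so which one reads better is a judgment call. Both divide by 366 for every station, which assumes each station was in service for the whole year.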