Take care of imports



In [1]:

    
from pandas import Series, DataFrame



In [2]:

    
import pandas as pd



In [3]:

    
import numpy as np



In [4]:

    
import matplotlib.pyplot as plt



In [32]:

    
import matplotlib



In [5]:

    
%matplotlib inline

read in data



In [6]:

    
weather = pd.read_table('data/daily_weather.tsv')



In [7]:

    
stations = pd.read_table('data/stations.tsv')



In [8]:

    
usage = pd.read_table('data/usage_2012.tsv')

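Repeat the data fixes from the previous exercise: parse the dates, fill in season_desc from the original season_code, then renumber season_code so that spring is 1 and winter is 4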



In [9]:

    
weather['date'] = pd.to_datetime(weather['date'])



In [10]:

    
weather.loc[weather['season_code'] == 1, 'season_desc'] = 'winter'
weather.loc[weather['season_code'] == 2, 'season_desc'] = 'spring'
weather.loc[weather['season_code'] == 3, 'season_desc'] = 'summer'
weather.loc[weather['season_code'] == 4, 'season_desc'] = 'fall'



In [11]:

    
weather.loc[weather['season_desc'] == 'winter', 'season_code'] = 4
weather.loc[weather['season_desc'] == 'spring', 'season_code'] = 1
weather.loc[weather['season_desc'] == 'summer', 'season_code'] = 2
weather.loc[weather['season_desc'] == 'fall', 'season_code'] = 3
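
The two cells above amount to a relabeling: the description is derived from the original code, then the code is renumbered from the description. A minimal sketch of the same idea with `Series.map`, using a tiny made-up frame in place of the real `weather` data (the `toy` frame and both dictionaries are illustrative and not part of the exercise files):

```python
import pandas as pd

# Toy stand-in for the weather table; the real file has one row per day.
toy = pd.DataFrame({'season_code': [1, 2, 3, 4, 1]})

# Original code -> description, as in the first cell above.
code_to_desc = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
# Description -> corrected code, as in the second cell above.
desc_to_code = {'spring': 1, 'summer': 2, 'fall': 3, 'winter': 4}

toy['season_desc'] = toy['season_code'].map(code_to_desc)
toy['season_code'] = toy['season_desc'].map(desc_to_code)
```

Keeping each mapping in one dictionary makes it easier to check that every season is covered exactly once.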

1a. Plot the daily temperature over the course of the year. (This should probably be a line chart.)



In [12]:

    
weather.plot(x='date', y='temp')
plt.show()

1b. Create a bar chart that shows the average temperature and humidity by month.



In [13]:

    
temp_humid = weather[['temp', 'humidity']].groupby(weather['date'].dt.month).mean()



In [14]:

    
temp_humid.plot(kind='bar', width=0.75, color=['#EE4444','#4444EE'])
plt.show()
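
To see what the monthly grouping produces before it is plotted, here is a small sketch on made-up readings (the `toy` frame and its values are invented for illustration; only the `groupby` pattern is taken from the cell above):

```python
import pandas as pd

# Made-up readings: two days in January, two in February.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2012-01-01', '2012-01-02', '2012-02-01', '2012-02-02']),
    'temp': [0.2, 0.4, 0.5, 0.7],
    'humidity': [0.6, 0.8, 0.5, 0.3],
})

# Group the two measurement columns by the month number of each date.
monthly = toy[['temp', 'humidity']].groupby(toy['date'].dt.month).mean()
```

The result has one row per month number and one column per measurement, which is the shape `plot(kind='bar')` needs to draw two bars per month.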

2. Use a scatterplot to show how the daily rental volume varies with temperature. Use a different series (with different colors) for each season.



In [15]:

    
spring_daily_vol = weather.loc[weather['season_desc'] == 'spring']
summer_daily_vol = weather.loc[weather['season_desc'] == 'summer']
fall_daily_vol = weather.loc[weather['season_desc'] == 'fall']
winter_daily_vol = weather.loc[weather['season_desc'] == 'winter']



In [16]:

    
spr_ax = spring_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='yellow', s=50, alpha=.4)
sum_ax = summer_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='lightgreen', s=50, alpha=.4, ax=spr_ax)
fal_ax = fall_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='#ee5555', s=50, alpha=.4, ax=sum_ax)
win_ax = winter_daily_vol.plot(kind='scatter', x='temp', y='total_riders', c='lightblue', s=50, alpha=.4, ax=fal_ax)
plt.title('Temp vs Daily Rental Volume')
plt.show()
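
The chained `.plot` calls above draw the four seasons onto one set of axes but attach no labels, so there is no legend to say which color is which. One possible alternative, sketched on made-up rows (the `toy` frame and the `season_colors` dictionary are assumptions for illustration), loops over the seasons and labels each series:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the weather table.
toy = pd.DataFrame({
    'season_desc': ['spring', 'summer', 'fall', 'winter', 'spring', 'winter'],
    'temp': [0.45, 0.75, 0.50, 0.20, 0.55, 0.25],
    'total_riders': [4000, 6000, 5000, 1500, 4500, 1800],
})

# Same colors as the cell above, keyed by season.
season_colors = {'spring': 'yellow', 'summer': 'lightgreen',
                 'fall': '#ee5555', 'winter': 'lightblue'}

fig, ax = plt.subplots()
for season, color in season_colors.items():
    rows = toy[toy['season_desc'] == season]
    # One scatter call per season; the label feeds the legend.
    ax.scatter(rows['temp'], rows['total_riders'],
               c=color, s=50, alpha=0.4, label=season)

ax.set_xlabel('temp')
ax.set_ylabel('total_riders')
ax.set_title('Temp vs Daily Rental Volume')
ax.legend()
```

In the notebook, a final `plt.show()` would display the figure. The same loop works for the windspeed plot by swapping the x column.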

3. Create another scatterplot to show how daily rental volume varies with windspeed. As above, use a different series for each season.



In [17]:

    
spr_ax = spring_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='yellow', s=50, alpha=.4)
sum_ax = summer_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='lightgreen', s=50, alpha=.4, ax=spr_ax)
fal_ax = fall_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='#ee5555', s=50, alpha=.4, ax=sum_ax)
win_ax = winter_daily_vol.plot(kind='scatter', x='windspeed', y='total_riders', c='lightblue', s=50, alpha=.4, ax=fal_ax)
plt.title('Windspeed vs Daily Rental Volume')
plt.show()

4. How do the rental volumes vary with geography? Compute the average daily rentals for each station and use this as the radius for a scatterplot of each station's latitude and longitude.

Pull out just the station_start column from the 'usage' data and turn it into a new dataframe 'usage_stations'



In [18]:

    
usage_stations = usage[['station_start']]



In [19]:

    
usage_stations.head()









    Out[19]:

       station_start
    0  7th & R St NW / Shaw Library
    1  Georgia & New Hampshire Ave NW
    2  Georgia & New Hampshire Ave NW
    3  14th & V St NW
    4  11th & Kenyon St NW

Pull out the lat and long from the 'stations' data and turn them into a new dataframe 'stations_geo'

Set the index of 'stations_geo' to the values of the 'station' column from the same 'stations' data. This matters because pandas aligns on index labels when the ride counts are added later, so both dataframes need to be indexed by station name



In [20]:

    
stations_geo = DataFrame({'lat': stations.lat, 'long': stations.long})
stations_geo.index = stations.station.values



In [21]:

    
stations_geo.head()









    Out[21]:

                                                      lat       long
    20th & Bell St                              38.856100 -77.051200
    18th & Eads St.                             38.857250 -77.053320
    20th & Crystal Dr                           38.856400 -77.049200
    15th & Crystal Dr                           38.860170 -77.049593
    Aurora Hills Community Ctr/18th & Hayes St  38.857866 -77.059490

Make a new dataframe 'station_count' that counts the occurrences of each station name in 'usage_stations'



In [22]:

    
station_count = DataFrame(usage_stations['station_start'].value_counts())



In [23]:

    
station_count.head()









    Out[23]:

                                              0
    Massachusetts Ave & Dupont Circle NW  69850
    Columbus Circle / Union Station       55146
    15th & P St NW                        49416
    17th & Corcoran St NW                 43547
    14th & V St NW                        40242
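
As a quick check of what `value_counts` does here, a sketch on a made-up trip log (the `trips` frame and station names are invented for illustration):

```python
import pandas as pd

# Made-up trip log: one row per rental, keyed by starting station.
trips = pd.DataFrame({
    'station_start': ['A St', 'B St', 'A St', 'C St', 'A St', 'B St'],
})

# value_counts returns a Series indexed by station name, largest count first.
counts = trips['station_start'].value_counts()
```

Because the result is indexed by station name, it can be lined up against any other table that uses station names as its index. The label of the single column after wrapping the result in a `DataFrame` (shown as `0` above) depends on the pandas version.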

Create a new column 'rides' in 'stations_geo', populated with the counts from 'station_count'



In [24]:

    
stations_geo['rides'] = station_count



In [25]:

    
stations_geo.head()









    Out[25]:

                                                      lat       long  rides
    20th & Bell St                              38.856100 -77.051200   1688
    18th & Eads St.                             38.857250 -77.053320    NaN
    20th & Crystal Dr                           38.856400 -77.049200   5113
    15th & Crystal Dr                           38.860170 -77.049593   3094
    Aurora Hills Community Ctr/18th & Hayes St  38.857866 -77.059490   1986
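
The NaN for '18th & Eads St.' comes from index alignment: the assignment matches rows by station name, and a station with no matching count gets a missing value. A minimal sketch with made-up stations (the `geo` and `counts` objects are illustrative stand-ins for `stations_geo` and `station_count`):

```python
import pandas as pd

# Made-up coordinates, indexed by station name as in stations_geo.
geo = pd.DataFrame(
    {'lat': [38.85, 38.86, 38.87], 'long': [-77.05, -77.06, -77.07]},
    index=['A St', 'B St', 'C St'],
)

# Ride counts exist for only two of the three stations, in a different order.
counts = pd.Series({'C St': 10, 'A St': 30})

# Assignment aligns on index labels, not on position.
geo['rides'] = counts

# Rows without a matching count hold NaN and are removed by dropna().
kept = geo.dropna()
```

Here 'B St' ends up with NaN and is dropped, while 'A St' and 'C St' receive their own counts despite the different ordering.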

Get rid of all the 'NaN' rows, then plot the data on a scatterplot where the size of each point scales with the total number of rides at that location divided by the 366 days of the year (2012 was a leap year)



In [26]:

    
cleared = stations_geo.dropna()



In [27]:

    
cleared.head()









    Out[27]:

                                                      lat       long  rides
    20th & Bell St                              38.856100 -77.051200   1688
    20th & Crystal Dr                           38.856400 -77.049200   5113
    15th & Crystal Dr                           38.860170 -77.049593   3094
    Aurora Hills Community Ctr/18th & Hayes St  38.857866 -77.059490   1986
    Pentagon City Metro / 12th & S Hayes St     38.862303 -77.059936   4231



In [77]:

    
cleared.plot(kind='scatter', x='long', y='lat', s=(cleared['rides'] / 366) * 5, alpha=0.6, figsize=(10, 10))
plt.show()
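
One detail worth noting: matplotlib's `s` argument is a marker size in points squared, so the cell above makes each marker's area, not its radius, proportional to average daily rentals. The sketch below compares the two readings on made-up totals (the `rides` values, `SCALE`, and `RADIUS_SCALE` are assumptions for illustration):

```python
import pandas as pd

DAYS_IN_2012 = 366   # 2012 was a leap year
SCALE = 5            # arbitrary visual scaling factor, as in the cell above
RADIUS_SCALE = 1     # hypothetical scaling factor for the radius-based variant

# Made-up yearly ride totals for two stations.
rides = pd.Series({'A St': 3660, 'B St': 732})
avg_daily = rides / DAYS_IN_2012

# As in the cell above: marker area proportional to average daily rentals.
area_sizes = avg_daily * SCALE

# Literal reading of "radius": size the marker so its radius is proportional,
# which means the value passed to `s` grows with the square.
radius_sizes = (avg_daily * RADIUS_SCALE) ** 2
```

Either series can be passed as `s=`. The radius-based version exaggerates the difference between busy and quiet stations, since a station with five times the rentals gets a marker with 25 times the area.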
