Question 1

Compute the average temperature by season ('season_desc'). (The temperatures are numbers between 0 and 1, but don't worry about that. Let's say that's the Shellman temperature scale.)

I get

season_desc
Fall           0.711445
Spring         0.321700
Summer         0.554557
Winter         0.419368

Which clearly looks wrong. Figure out what's wrong with the original data and fix it.


In [22]:
from pandas import DataFrame, Series

In [23]:
import pandas as pd

In [24]:
import numpy as np

In [25]:
weather_data = pd.read_table('data/daily_weather.tsv')

In [26]:
season_mapping = {'Spring': 'Winter', 'Winter': 'Fall', 'Fall': 'Summer', 'Summer': 'Spring'}

In [27]:
def fix_seasons(x):
    return season_mapping[x]

In [28]:
weather_data['season_desc'] = weather_data['season_desc'].apply(fix_seasons)

In [29]:
weather_data.pivot_table(index='season_desc', values='temp', aggfunc=np.mean)


Out[29]:
season_desc
Fall      0.419368
Spring    0.554557
Summer    0.711445
Winter    0.321700
Name: temp, dtype: float64

In this case, a pivot table is not really required, so a simple use of groupby and mean() will do the job.


In [30]:
weather_data.groupby('season_desc')['temp'].mean()


Out[30]:
season_desc
Fall      0.419368
Spring    0.554557
Summer    0.711445
Winter    0.321700
Name: temp, dtype: float64

Question 2

Various of the columns represent dates or datetimes, but out of the box pd.read_table won't treat them correctly. This makes it hard to (for example) compute the number of rentals by month. Fix the dates and compute the number of rentals by month.


In [31]:
weather_data['Month'] = pd.DatetimeIndex(weather_data.date).month

In [32]:
weather_data.groupby('Month')['total_riders'].sum()


Out[32]:
Month
1      96744
2     103137
3     164875
4     174224
5     195865
6     202830
7     203607
8     214503
9     218573
10    198841
11    152664
12    123713
Name: total_riders, dtype: int64

Question 3

Investigate how the number of rentals varies with temperature. Is this trend constant across seasons? Across months?


In [33]:
pd.concat([weather_data['temp'], weather_data['total_riders']], axis=1).corr()


Out[33]:
temp total_riders
temp 1.000000 0.713793
total_riders 0.713793 1.000000

Check how correlation between temp and total riders varies across months.


In [34]:
weather_data[['total_riders', 'temp', 'Month']].groupby('Month').corr()


Out[34]:
temp total_riders
Month
1 temp 1.000000 0.689495
total_riders 0.689495 1.000000
2 temp 1.000000 0.716206
total_riders 0.716206 1.000000
3 temp 1.000000 0.735575
total_riders 0.735575 1.000000
4 temp 1.000000 0.533387
total_riders 0.533387 1.000000
5 temp 1.000000 0.065599
total_riders 0.065599 1.000000
6 temp 1.000000 -0.330884
total_riders -0.330884 1.000000
7 temp 1.000000 -0.184704
total_riders -0.184704 1.000000
8 temp 1.000000 0.288264
total_riders 0.288264 1.000000
9 temp 1.000000 -0.418753
total_riders -0.418753 1.000000
10 temp 1.000000 0.466666
total_riders 0.466666 1.000000
11 temp 1.000000 0.511232
total_riders 0.511232 1.000000
12 temp 1.000000 0.690062
total_riders 0.690062 1.000000

Check how correlation between temp and total riders varies across seasons.


In [35]:
weather_data[['total_riders', 'temp', 'season_desc']].groupby('season_desc').corr()


Out[35]:
temp total_riders
season_desc
Fall temp 1.000000 0.626532
total_riders 0.626532 1.000000
Spring temp 1.000000 0.415687
total_riders 0.415687 1.000000
Summer temp 1.000000 -0.325743
total_riders -0.325743 1.000000
Winter temp 1.000000 0.811410
total_riders 0.811410 1.000000

Investigate total riders by month versus average monthly temp.


In [36]:
month_riders = weather_data.groupby('Month')['total_riders'].sum()

In [37]:
month_avg_temp = weather_data.groupby('Month')['temp'].mean()

In [38]:
pd.concat([month_riders, month_avg_temp], axis=1)


Out[38]:
total_riders temp
Month
1 96744 0.275181
2 103137 0.315337
3 164875 0.449411
4 174224 0.468809
5 195865 0.612366
6 202830 0.675111
7 203607 0.752366
8 214503 0.711801
9 218573 0.620083
10 198841 0.500049
11 152664 0.336101
12 123713 0.322880

Investigate total riders by season versus average seasonal temp.


In [39]:
season_riders = weather_data.groupby('season_desc')['total_riders'].sum()

In [40]:
season_temp = weather_data.groupby('season_desc')['temp'].mean()

In [41]:
pd.concat([season_riders, season_temp], axis=1)


Out[41]:
total_riders temp
season_desc
Fall 515476 0.419368
Spring 571273 0.554557
Summer 641479 0.711445
Winter 321348 0.321700

Question 4

There are various types of users in the usage data sets. What sorts of things can you say about how they use the bikes differently?

Investigate correlations between casual and reg riders on work days and holidays.


In [42]:
weather_data[['no_casual_riders', 'no_reg_riders', 'is_work_day', 'is_holiday']].corr()


Out[42]:
no_casual_riders no_reg_riders is_work_day is_holiday
no_casual_riders 1.000000 0.274984 -0.539919 0.029720
no_reg_riders 0.274984 1.000000 0.437003 -0.164190
is_work_day -0.539919 0.437003 1.000000 -0.258418
is_holiday 0.029720 -0.164190 -0.258418 1.000000

Investigate correlations between casual and reg riders and windspeed.


In [43]:
weather_data[['no_casual_riders', 'no_reg_riders', 'windspeed']].corr()


Out[43]:
no_casual_riders no_reg_riders windspeed
no_casual_riders 1.000000 0.274984 -0.158371
no_reg_riders 0.274984 1.000000 -0.265985
windspeed -0.158371 -0.265985 1.000000

In [44]:
usage = pd.read_table('data/usage_2012.tsv')

Compare average rental duration between customer types.


In [45]:
usage.groupby('cust_type')['duration_mins'].mean()


Out[45]:
cust_type
Casual        41.493881
Registered    12.091809
Subscriber    11.383536
Name: duration_mins, dtype: float64