Question 1

Compute the average temperature by season ('season_desc'). (The temperatures are numbers between 0 and 1, but don't worry about that. Let's say that's the Shellman temperature scale.)

I get

season_desc
Fall           0.711445
Spring         0.321700
Summer         0.554557
Winter         0.419368

Which clearly looks wrong. Figure out what's wrong with the original data and fix it.



In [22]:

    
from pandas import DataFrame, Series



In [23]:

    
import pandas as pd



In [24]:

    
import numpy as np



In [25]:

    
weather_data = pd.read_table('data/daily_weather.tsv')



In [26]:

    
season_mapping = {'Spring': 'Winter', 'Winter': 'Fall', 'Fall': 'Summer', 'Summer': 'Spring'}



In [27]:

    
def fix_seasons(x):
    return season_mapping[x]



In [28]:

    
weather_data['season_desc'] = weather_data['season_desc'].apply(fix_seasons)



In [29]:

    
weather_data.pivot_table(index='season_desc', values='temp', aggfunc=np.mean)









    Out[29]:





season_desc
Fall      0.419368
Spring    0.554557
Summer    0.711445
Winter    0.321700
Name: temp, dtype: float64

In this case, a pivot table is not really required, so a simple use of groupby and mean() will do the job.



In [30]:

    
weather_data.groupby('season_desc')['temp'].mean()









    Out[30]:





season_desc
Fall      0.419368
Spring    0.554557
Summer    0.711445
Winter    0.321700
Name: temp, dtype: float64

Question 2

Various of the columns represent dates or datetimes, but out of the box pd.read_table won't treat them correctly. This makes it hard to (for example) compute the number of rentals by month. Fix the dates and compute the number of rentals by month.



In [31]:

    
weather_data['Month'] = pd.DatetimeIndex(weather_data.date).month



In [32]:

    
weather_data.groupby('Month')['total_riders'].sum()









    Out[32]:





Month
1      96744
2     103137
3     164875
4     174224
5     195865
6     202830
7     203607
8     214503
9     218573
10    198841
11    152664
12    123713
Name: total_riders, dtype: int64

Question 3

Investigate how the number of rentals varies with temperature. Is this trend constant across seasons? Across months?



In [33]:

    
pd.concat([weather_data['temp'], weather_data['total_riders']], axis=1).corr()









    Out[33]:






  
    
      
      temp
      total_riders
    
  
  
    
      temp
      1.000000
      0.713793
    
    
      total_riders
      0.713793
      1.000000

Check how correlation between temp and total riders varies across months.



In [34]:

    
weather_data[['total_riders', 'temp', 'Month']].groupby('Month').corr()









    Out[34]:






  
    
      
      
      temp
      total_riders
    
    
      Month
      
      
      
    
  
  
    
      1
      temp
      1.000000
      0.689495
    
    
      total_riders
      0.689495
      1.000000
    
    
      2
      temp
      1.000000
      0.716206
    
    
      total_riders
      0.716206
      1.000000
    
    
      3
      temp
      1.000000
      0.735575
    
    
      total_riders
      0.735575
      1.000000
    
    
      4
      temp
      1.000000
      0.533387
    
    
      total_riders
      0.533387
      1.000000
    
    
      5
      temp
      1.000000
      0.065599
    
    
      total_riders
      0.065599
      1.000000
    
    
      6
      temp
      1.000000
      -0.330884
    
    
      total_riders
      -0.330884
      1.000000
    
    
      7
      temp
      1.000000
      -0.184704
    
    
      total_riders
      -0.184704
      1.000000
    
    
      8
      temp
      1.000000
      0.288264
    
    
      total_riders
      0.288264
      1.000000
    
    
      9
      temp
      1.000000
      -0.418753
    
    
      total_riders
      -0.418753
      1.000000
    
    
      10
      temp
      1.000000
      0.466666
    
    
      total_riders
      0.466666
      1.000000
    
    
      11
      temp
      1.000000
      0.511232
    
    
      total_riders
      0.511232
      1.000000
    
    
      12
      temp
      1.000000
      0.690062
    
    
      total_riders
      0.690062
      1.000000

Check how correlation between temp and total riders varies across seasons.



In [35]:

    
weather_data[['total_riders', 'temp', 'season_desc']].groupby('season_desc').corr()









    Out[35]:






  
    
      
      
      temp
      total_riders
    
    
      season_desc
      
      
      
    
  
  
    
      Fall
      temp
      1.000000
      0.626532
    
    
      total_riders
      0.626532
      1.000000
    
    
      Spring
      temp
      1.000000
      0.415687
    
    
      total_riders
      0.415687
      1.000000
    
    
      Summer
      temp
      1.000000
      -0.325743
    
    
      total_riders
      -0.325743
      1.000000
    
    
      Winter
      temp
      1.000000
      0.811410
    
    
      total_riders
      0.811410
      1.000000

Investigate total riders by month versus average monthly temp.



In [36]:

    
month_riders = weather_data.groupby('Month')['total_riders'].sum()



In [37]:

    
month_avg_temp = weather_data.groupby('Month')['temp'].mean()



In [38]:

    
pd.concat([month_riders, month_avg_temp], axis=1)









    Out[38]:






  
    
      
      total_riders
      temp
    
    
      Month
      
      
    
  
  
    
      1
      96744
      0.275181
    
    
      2
      103137
      0.315337
    
    
      3
      164875
      0.449411
    
    
      4
      174224
      0.468809
    
    
      5
      195865
      0.612366
    
    
      6
      202830
      0.675111
    
    
      7
      203607
      0.752366
    
    
      8
      214503
      0.711801
    
    
      9
      218573
      0.620083
    
    
      10
      198841
      0.500049
    
    
      11
      152664
      0.336101
    
    
      12
      123713
      0.322880

Investigate total riders by season versus average seasonal temp.



In [39]:

    
season_riders = weather_data.groupby('season_desc')['total_riders'].sum()



In [40]:

    
season_temp = weather_data.groupby('season_desc')['temp'].mean()



In [41]:

    
pd.concat([season_riders, season_temp], axis=1)









    Out[41]:






  
    
      
      total_riders
      temp
    
    
      season_desc
      
      
    
  
  
    
      Fall
      515476
      0.419368
    
    
      Spring
      571273
      0.554557
    
    
      Summer
      641479
      0.711445
    
    
      Winter
      321348
      0.321700

Question 4

There are various types of users in the usage data sets. What sorts of things can you say about how they use the bikes differently?

Investigate correlations between casual and reg riders on work days and holidays.



In [42]:

    
weather_data[['no_casual_riders', 'no_reg_riders', 'is_work_day', 'is_holiday']].corr()









    Out[42]:






  
    
      
      no_casual_riders
      no_reg_riders
      is_work_day
      is_holiday
    
  
  
    
      no_casual_riders
      1.000000
      0.274984
      -0.539919
      0.029720
    
    
      no_reg_riders
      0.274984
      1.000000
      0.437003
      -0.164190
    
    
      is_work_day
      -0.539919
      0.437003
      1.000000
      -0.258418
    
    
      is_holiday
      0.029720
      -0.164190
      -0.258418
      1.000000

Investigate correlations between casual and reg riders and windspeed.



In [43]:

    
weather_data[['no_casual_riders', 'no_reg_riders', 'windspeed']].corr()









    Out[43]:






  
    
      
      no_casual_riders
      no_reg_riders
      windspeed
    
  
  
    
      no_casual_riders
      1.000000
      0.274984
      -0.158371
    
    
      no_reg_riders
      0.274984
      1.000000
      -0.265985
    
    
      windspeed
      -0.158371
      -0.265985
      1.000000



In [44]:

    
usage = pd.read_table('data/usage_2012.tsv')

Compare average rental duration between customer types.



In [45]:

    
usage.groupby('cust_type')['duration_mins'].mean()









    Out[45]:





cust_type
Casual        41.493881
Registered    12.091809
Subscriber    11.383536
Name: duration_mins, dtype: float64

		temp	total_riders
Month
1	temp	1.000000	0.689495
1	total_riders	0.689495	1.000000
2	temp	1.000000	0.716206
2	total_riders	0.716206	1.000000
3	temp	1.000000	0.735575
3	total_riders	0.735575	1.000000
4	temp	1.000000	0.533387
4	total_riders	0.533387	1.000000
5	temp	1.000000	0.065599
5	total_riders	0.065599	1.000000
6	temp	1.000000	-0.330884
6	total_riders	-0.330884	1.000000
7	temp	1.000000	-0.184704
7	total_riders	-0.184704	1.000000
8	temp	1.000000	0.288264
8	total_riders	0.288264	1.000000
9	temp	1.000000	-0.418753
9	total_riders	-0.418753	1.000000
10	temp	1.000000	0.466666
10	total_riders	0.466666	1.000000
11	temp	1.000000	0.511232
11	total_riders	0.511232	1.000000
12	temp	1.000000	0.690062
12	total_riders	0.690062	1.000000

		temp	total_riders
season_desc
Fall	temp	1.000000	0.626532
Fall	total_riders	0.626532	1.000000
Spring	temp	1.000000	0.415687
Spring	total_riders	0.415687	1.000000
Summer	temp	1.000000	-0.325743
Summer	total_riders	-0.325743	1.000000
Winter	temp	1.000000	0.811410
Winter	total_riders	0.811410	1.000000

	total_riders	temp
Month
1	96744	0.275181
2	103137	0.315337
3	164875	0.449411
4	174224	0.468809
5	195865	0.612366
6	202830	0.675111
7	203607	0.752366
8	214503	0.711801
9	218573	0.620083
10	198841	0.500049
11	152664	0.336101
12	123713	0.322880

	total_riders	temp
season_desc
Fall	515476	0.419368
Spring	571273	0.554557
Summer	641479	0.711445
Winter	321348	0.321700

	no_casual_riders	no_reg_riders	is_work_day	is_holiday
no_casual_riders	1.000000	0.274984	-0.539919	0.029720
no_reg_riders	0.274984	1.000000	0.437003	-0.164190
is_work_day	-0.539919	0.437003	1.000000	-0.258418
is_holiday	0.029720	-0.164190	-0.258418	1.000000