Exercise 04

Estimate a regression using the Capital Bikeshare data

Forecast use of a city bikeshare system

We'll be working with a dataset from Capital Bikeshare that was used in a Kaggle competition (data dictionary).


Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.



In [1]:

    
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the data and set the datetime as the index
import zipfile
with zipfile.ZipFile('../datasets/bikeshare.csv.zip', 'r') as z:
    f = z.open('bikeshare.csv')
    bikes = pd.read_csv(f, index_col='datetime', parse_dates=True)

# "count" is a method, so it's best to name that column something else
bikes.rename(columns={'count':'total'}, inplace=True)

bikes.head()









    Out[1]:






  
    
      
                     season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  total
datetime
2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16
2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40
2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32
2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13
2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1

datetime - hourly date + timestamp
season -
- 1 = spring
- 2 = summer
- 3 = fall
- 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather -
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
total - number of total rentals



In [2]:

    
bikes.shape









    Out[2]:





(10886, 11)

Exercise 04.1

What is the relationship between temperature and total rentals?

For a one percent increase in temperature, by how much do bike rentals increase?

Using sklearn, estimate a linear regression and predict the total number of bike rentals when the temperature is 31 degrees Celsius.



In [3]:

    
# Pandas scatter plot
bikes.plot(kind='scatter', x='temp', y='total', alpha=0.2)









    Out[3]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f57b8eb47b8>
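One possible way to approach Exercise 04.1 is sketched below. It is a minimal illustration, not the reference solution: the arrays `temp` and `total` are made-up toy values with an exact linear relation, used only so the block runs on its own. In the notebook, `bikes[['temp']]` and `bikes['total']` would take their place. The last line assumes that "a one percent increase" is measured at the mean temperature, which is one reading of the question among several.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data (made up for illustration). In the notebook, use
# X = bikes[['temp']] and y = bikes['total'] instead.
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
total = 6.0 + 9.0 * temp  # exact linear relation, so the fit is easy to check

X = temp.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
linreg = LinearRegression()
linreg.fit(X, total)

slope = linreg.coef_[0]          # change in rentals per one-degree increase
intercept = linreg.intercept_
pred_31 = linreg.predict(np.array([[31.0]]))[0]

# In a linear model, a 1% increase at temperature t changes the prediction
# by slope * 0.01 * t. Here t is taken to be the mean temperature (assumption).
effect_1pct = slope * 0.01 * temp.mean()

print(slope, intercept, pred_31, effect_1pct)
```

The slope answers the "per degree" version of the question directly; the percentage version depends on the temperature at which the increase is evaluated.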



In [ ]:

Exercise 04.2

Evaluate the model using the MSE
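A minimal sketch of the MSE computation follows, again on made-up toy numbers rather than the bikeshare data. It uses `mean_squared_error` from `sklearn.metrics` and cross-checks it against the definition, the mean of the squared residuals. In the notebook, `y` would be `bikes['total']` and `y_pred` the predictions of the model fitted in Exercise 04.1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy stand-in data (made up for illustration only)
temp = np.array([5.0, 10.0, 15.0, 20.0])
y = np.array([50.0, 90.0, 160.0, 200.0])

X = temp.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)      # sklearn helper
mse_manual = np.mean((y - y_pred) ** 2)  # same quantity from the definition
rmse = np.sqrt(mse)                      # in the same units as the target

print(mse, mse_manual, rmse)
```

The RMSE is often easier to interpret than the MSE because it is expressed in rentals rather than squared rentals.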



In [ ]:

Exercise 04.3

Does the scale of the features matter?

Let's say that temperature was measured in Fahrenheit, rather than Celsius. How would that affect the model?
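The question can be explored with a small experiment. The sketch below fits the same toy data (made-up values, not the bikeshare data) twice, once in Celsius and once in Fahrenheit, using `np.polyfit` with degree 1 as a lightweight stand-in for sklearn's `LinearRegression`. Under these assumptions the coefficients change but the fitted values do not.

```python
import numpy as np

# Toy stand-in data (made up for illustration only)
temp_c = np.array([5.0, 10.0, 15.0, 20.0])
total = np.array([50.0, 90.0, 160.0, 200.0])

# Same temperatures expressed in Fahrenheit
temp_f = temp_c * 1.8 + 32.0

# Degree-1 polynomial fit returns [slope, intercept]
slope_c, intercept_c = np.polyfit(temp_c, total, 1)
slope_f, intercept_f = np.polyfit(temp_f, total, 1)

pred_c = slope_c * temp_c + intercept_c
pred_f = slope_f * temp_f + intercept_f

print(slope_c, intercept_c)
print(slope_f, intercept_f)
print(np.allclose(pred_c, pred_f))
```

For plain least squares, a linear change of units rescales the slope (here by a factor of 1.8) and shifts the intercept, while predictions and the MSE stay the same. That invariance does not carry over automatically to regularized models.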



In [ ]:

Exercise 04.4

Fit a regression model with temperature and temperature$^2$ as features, using the OLS equations
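As a reminder, the OLS estimate solves the normal equations, $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ holds a column of ones, the temperature, and the squared temperature. The sketch below applies this with NumPy to made-up toy data generated from a known quadratic, so the recovered coefficients can be checked by eye. In the notebook, `temp` and `total` would come from `bikes['temp'].values` and `bikes['total'].values`.

```python
import numpy as np

# Toy stand-in data (made up): an exact quadratic with known coefficients
temp = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
total = 2.0 + 3.0 * temp + 0.5 * temp ** 2

# Design matrix with columns [1, temp, temp^2]
X = np.column_stack([np.ones_like(temp), temp, temp ** 2])

# Solve (X^T X) beta = X^T y; solving the linear system is preferable to
# forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ total)

y_pred = X @ beta
print(beta)
```

`np.linalg.solve` is used instead of `np.linalg.inv` because it is numerically more stable for the same result.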



In [ ]:

Exercise 04.5

Estimate a regression using more features ['temp', 'season', 'weather', 'humidity'].

How does the performance compare to using only the temperature?
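One way to make the comparison concrete is to fit both models and compare their MSE. The sketch below does this on synthetic data; the generating coefficients, the noise level, and the helper `fit_and_score` are all hypothetical and exist only so the block is self-contained. In the notebook, the feature matrix would be `bikes[cols]` and the target `bikes['total']`. Note that `season` and `weather` are treated as plain numbers here, which is a simplification since both are categorical codes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (made up for illustration only)
rng = np.random.default_rng(0)
n = 200
data = {
    'temp': rng.uniform(0, 40, n),
    'season': rng.integers(1, 5, n).astype(float),   # codes 1..4
    'weather': rng.integers(1, 5, n).astype(float),  # codes 1..4
    'humidity': rng.uniform(0, 100, n),
}
total = (20 + 8 * data['temp'] + 5 * data['season']
         - 15 * data['weather'] - 1.0 * data['humidity']
         + rng.normal(0, 5, n))

def fit_and_score(cols):
    """Fit a linear regression on the given columns and return in-sample MSE."""
    X = np.column_stack([data[c] for c in cols])
    model = LinearRegression().fit(X, total)
    return mean_squared_error(total, model.predict(X))

mse_temp = fit_and_score(['temp'])
mse_all = fit_and_score(['temp', 'season', 'weather', 'humidity'])
print(mse_temp, mse_all)
```

On the data used for fitting, adding features to an OLS model can never increase the MSE, so a lower in-sample error alone does not show that the larger model generalizes better.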



In [ ]:

Exercise 04.6

Split the data into training and test sets

Which of the following models performs best on the test set?

['temp', 'season', 'weather', 'humidity']
['temp', 'season', 'weather']
['temp', 'season', 'humidity']
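A possible workflow for this comparison is sketched below: split the row indices once with `train_test_split`, fit each candidate model on the training rows, and score it on the held-out rows. The data is synthetic (made-up coefficients and noise), and the 70/30 split with `random_state=0` is an arbitrary choice, so the ranking it prints says nothing about the real bikeshare data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up for illustration only)
rng = np.random.default_rng(0)
n = 300
data = {
    'temp': rng.uniform(0, 40, n),
    'season': rng.integers(1, 5, n).astype(float),
    'weather': rng.integers(1, 5, n).astype(float),
    'humidity': rng.uniform(0, 100, n),
}
total = (20 + 8 * data['temp'] + 5 * data['season']
         - 15 * data['weather'] - 1.0 * data['humidity']
         + rng.normal(0, 5, n))

# Split row indices once so every model sees the same train/test rows
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3,
                                       random_state=0)

feature_sets = [
    ['temp', 'season', 'weather', 'humidity'],
    ['temp', 'season', 'weather'],
    ['temp', 'season', 'humidity'],
]

results = {}
for cols in feature_sets:
    X = np.column_stack([data[c] for c in cols])
    model = LinearRegression().fit(X[idx_train], total[idx_train])
    test_pred = model.predict(X[idx_test])
    results[tuple(cols)] = mean_squared_error(total[idx_test], test_pred)

best = min(results, key=results.get)  # feature set with the lowest test MSE
for cols, mse in results.items():
    print(cols, round(mse, 2))
print('best:', best)
```

Reusing one split for all three models keeps the comparison fair; with a different split per model, differences in test MSE could come from the rows rather than the features.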



In [ ]:
