Linear Regression

Agenda

  1. Introducing the bikeshare dataset
    • Reading in the data
    • Visualizing the data
  2. Linear regression basics
    • Form of linear regression
    • Building a linear regression model
    • Using the model for prediction
    • Does the scale of the features matter?
  3. Working with multiple features
    • Visualizing the data (part 2)
    • Adding more features to the model
  4. Choosing between models
    • Feature selection
    • Evaluation metrics for regression problems
    • Comparing models with train/test split and RMSE
    • Comparing testing RMSE with null RMSE
  5. Creating features
    • Handling categorical features
    • Feature engineering
  6. Comparing linear regression with other models

Read in Bike Share Data


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
bikes = pd.read_csv('../data/2016-Q1-Trips-History-Data.csv')
bikes.head()
bikes['start'] = pd.to_datetime(bikes['Start date'])
%time bikes['end'] = pd.to_datetime(bikes['End date'])


CPU times: user 1min 34s, sys: 252 ms, total: 1min 34s
Wall time: 1min 34s
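
Parsing the dates takes over a minute because pandas has to infer the string format row by row. A sketch of a faster alternative, assuming the raw strings follow the M/D/YYYY H:MM pattern visible in the head() output below:

In [ ]:
# an explicit format lets pandas skip per-row format inference
# (assumes strings like "3/31/2016 23:59")
bikes['start'] = pd.to_datetime(bikes['Start date'], format='%m/%d/%Y %H:%M')
bikes['end'] = pd.to_datetime(bikes['End date'], format='%m/%d/%Y %H:%M')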

In [3]:
bikes.head()


Out[3]:
  | Duration (ms) | Start date | End date | Start station number | Start station | End station number | End station | Bike number | Member Type | start | end
0 | 301295 | 3/31/2016 23:59 | 4/1/2016 0:04 | 31280 | 11th & S St NW | 31506 | 1st & Rhode Island Ave NW | W00022 | Registered | 2016-03-31 23:59:00 | 2016-04-01 00:04:00
1 | 557887 | 3/31/2016 23:59 | 4/1/2016 0:08 | 31275 | New Hampshire Ave & 24th St NW | 31114 | 18th St & Wyoming Ave NW | W01294 | Registered | 2016-03-31 23:59:00 | 2016-04-01 00:08:00
2 | 555944 | 3/31/2016 23:59 | 4/1/2016 0:08 | 31101 | 14th & V St NW | 31221 | 18th & M St NW | W01416 | Registered | 2016-03-31 23:59:00 | 2016-04-01 00:08:00
3 | 766916 | 3/31/2016 23:57 | 4/1/2016 0:09 | 31226 | 34th St & Wisconsin Ave NW | 31214 | 17th & Corcoran St NW | W01090 | Registered | 2016-03-31 23:57:00 | 2016-04-01 00:09:00
4 | 139656 | 3/31/2016 23:57 | 3/31/2016 23:59 | 31011 | 23rd & Crystal Dr | 31009 | 27th & Crystal Dr | W21934 | Registered | 2016-03-31 23:57:00 | 2016-03-31 23:59:00

In [15]:
# fractional hour of day, e.g. 7:30 becomes 7.5
bikes['hour_of_day'] = bikes.start.dt.hour + (bikes.start.dt.minute / 60).round(2)

# count trips starting in each fractional hour
hours = bikes.groupby('hour_of_day').agg('count')
hours['hour'] = hours.index

hours.start.plot()

sns.lmplot(x='hour', y='start', data=hours, aspect=1.5, scatter_kws={'alpha': 0.2})


Out[15]:
<seaborn.axisgrid.FacetGrid at 0x111583be0>

In [18]:
# label-based slice of the float index: hours 5.0 through 8.0
hours[5:8].start.plot()

sns.lmplot(x='hour', y='start', data=hours[5:8], aspect=1.5, scatter_kws={'alpha': 0.5})


Out[18]:
<seaborn.axisgrid.FacetGrid at 0x1148acfd0>

Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

  • $y$ is the response
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for $x_1$ (the first feature)
  • $\beta_n$ is the coefficient for $x_n$ (the nth feature)

The $\beta$ values are called the model coefficients:

  • These values are estimated (or "learned") during the model fitting process using the least squares criterion.
  • Specifically, we find the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors"), written out below.
  • And once we've learned these coefficients, we can use the model to predict the response.
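
Concretely, writing $\hat{y}_i$ for the model's prediction on observation $i$, least squares chooses the $\beta$ values that minimize

$\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$

where $m$ is the number of observations in the training data.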

In the usual diagram of a least squares fit:

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line.
  • The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.

Building a linear regression model

scikit-learn's 4-step modeling pattern


In [19]:
# fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['hour']
X = hours[feature_cols]
y = hours.start
linreg.fit(X, y)


Out[19]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [23]:
# store the fitted line's predictions for plotting
hours['pred'] = linreg.predict(X)

# put the plots together
plt.scatter(hours.hour, hours.start)
plt.plot(hours.hour, hours.pred, color='red')
plt.xlabel('hours')
plt.ylabel('count')


Out[23]:
<matplotlib.text.Text at 0x114b43748>

In [41]:
# fit a linear regression model to the morning ramp-up only
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

# label-based slice of the float index: hours 5.5 through 9.0
partial_hours = hours.loc[5.5:9].copy()

X = partial_hours[['hour']]
y = partial_hours.start
linreg.fit(X, y)

partial_hours['pred'] = linreg.predict(X)

# put the plots together
plt.scatter(hours.hour, hours.start)
plt.plot(partial_hours.hour, partial_hours.pred, color='red')
plt.xlabel('hours')
plt.ylabel('count')


Out[41]:
<matplotlib.text.Text at 0x11635ff60>

Step 1: Import the class you plan to use

Step 2: "Instantiate" the "estimator"

  • "Estimator" is scikit-learn's term for "model"
  • "Instantiate" means "make an instance of"
  • This creates an object that "knows" how to do linear regression, and is just waiting for data
  • Name of the object does not matter
  • Can specify tuning parameters (aka "hyperparameters") during this step, as sketched below
  • All parameters not specified are set to their defaults
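
A minimal sketch of step 2 with a non-default hyperparameter; fit_intercept is a LinearRegression parameter that controls whether $\beta_0$ is estimated:

In [ ]:
from sklearn.linear_model import LinearRegression

# all parameters not specified keep their defaults
linreg = LinearRegression(fit_intercept=False)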

Step 3: Fit the model with data (aka "model training")

  • Model is "learning" the relationship between X and y in our "training data"
  • Process through which learning occurs varies by model
  • Occurs in-place
  • Once a model has been fit with data, it's called a "fitted model"

Step 4: Predict the response for a new observation

  • New observations are called "out-of-sample" data
  • Uses the information it learned during the model training process
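
For example, a sketch of predicting with the linreg fitted on 'hour' above, for a hypothetical out-of-sample observation (8:30 AM, i.e. hour 8.5):

In [ ]:
import pandas as pd

# hypothetical new observation: 8:30 AM expressed as a fractional hour
X_new = pd.DataFrame({'hour': [8.5]})
linreg.predict(X_new)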

Interpreting the intercept ($\beta_0$), using as an example a model of rentals regressed on temperature ("temp"):

  • It is the value of $y$ when $x$=0.
  • Thus, it is the estimated number of rentals when the temperature is 0 degrees Celsius.
  • Note: It does not always make sense to interpret the intercept. (Why?)

Interpreting the "temp" coefficient ($\beta_1$):

  • It is the change in $y$ divided by the change in $x$, or the "slope".
  • Thus, a temperature increase of 1 degree Celsius is associated with a rental increase of 9.17 bikes.
  • This is not a statement of causation.
  • $\beta_1$ would be negative if an increase in temperature was associated with a decrease in rentals.
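
Once fit, scikit-learn exposes these values as attributes of the estimator. A sketch, assuming a linreg fitted on the single "temp" feature described above:

In [ ]:
# beta_0: the fitted intercept
print(linreg.intercept_)

# beta_1: the fitted coefficient for the single feature
print(linreg.coef_[0])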