- Introducing the bikeshare dataset
    - Reading in the data
    - Visualizing the data
- Linear regression basics
    - Form of linear regression
    - Building a linear regression model
    - Using the model for prediction
    - Does the scale of the features matter?
- Working with multiple features
    - Visualizing the data (part 2)
    - Adding more features to the model
- Choosing between models
    - Feature selection
    - Evaluation metrics for regression problems
    - Comparing models with train/test split and RMSE
    - Comparing testing RMSE with null RMSE
- Creating features
    - Handling categorical features
    - Feature engineering
- Comparing linear regression with other models

In [2]:
```
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

# read in the Capital Bikeshare trip data and parse the timestamps
bikes = pd.read_csv('../data/2016-Q1-Trips-History-Data.csv')
bikes['start'] = pd.to_datetime(bikes['Start date'])
%time bikes['end'] = pd.to_datetime(bikes['End date'])
```
In [3]:
```
bikes.head()
```
In [15]:
```
# express each trip's start time as a fractional hour of the day
bikes['hour_of_day'] = bikes.start.dt.hour + (bikes.start.dt.minute / 60).round(2)

# count trips within each fractional hour
hours = bikes.groupby('hour_of_day').agg('count')
hours['hour'] = hours.index

hours.start.plot()
sns.lmplot(x='hour', y='start', data=hours, aspect=1.5, scatter_kws={'alpha': 0.2})
```
In [18]:
```
# zoom in on the morning ramp-up and fit a line to just that window
hours[5:8].start.plot()
sns.lmplot(x='hour', y='start', data=hours[5:8], aspect=1.5, scatter_kws={'alpha': 0.5})
```
Linear regression models the response as a linear combination of the features:
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

The $\beta$ values are called the **model coefficients**:

- These values are estimated (or "learned") during the model fitting process using the **least squares criterion**.
- Specifically, we find the line (mathematically) which minimizes the **sum of squared residuals** (or "sum of squared errors"), as the sketch below illustrates.
- Once we've learned these coefficients, we can use the model to predict the response.
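
To make "sum of squared residuals" concrete, here is a minimal sketch that computes it for one candidate line on toy data (the numbers and coefficients are invented for illustration):

```
import numpy as np

# toy data that is roughly y = 2 + 3x plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 13.9])

# a candidate line: beta0 (intercept) and beta1 (slope)
beta0, beta1 = 2.0, 3.0
y_pred = beta0 + beta1 * x

# residuals are the differences between observed and predicted values;
# least squares picks the beta values that minimize the sum of their squares
residuals = y - y_pred
print((residuals ** 2).sum())
```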

In the standard diagram of a least squares fit (sketched in code below):

- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.
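
Since the diagram itself is not reproduced on this page, here is a small matplotlib sketch that draws the same picture from toy data (the numbers are made up for illustration):

```
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 4.0, 6.5, 7.0])

# least squares fit via numpy (polyfit returns slope, then intercept)
beta1, beta0 = np.polyfit(x, y, 1)
y_hat = beta0 + beta1 * x

plt.scatter(x, y, color='black')      # observed values
plt.plot(x, y_hat, color='blue')      # least squares line
plt.vlines(x, y_hat, y, color='red')  # residuals
plt.xlabel('x')
plt.ylabel('y')
```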

In [19]:
```
# fit a linear regression model
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
feature_cols = ['hour']
X = hours[feature_cols]
y = hours.start
linreg.fit(X, y)
```
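Once fitted, the estimator exposes the learned values as attributes; a quick way to inspect them, using the `linreg` object fitted above:

```
# learned intercept (beta_0) and coefficient for 'hour' (beta_1)
print(linreg.intercept_)
print(linreg.coef_)
```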
In [23]:
```
# overlay the fitted regression line on the scatter plot of counts by hour
hours['pred'] = linreg.predict(X)
plt.scatter(hours.hour, hours.start)
plt.plot(hours.hour, hours.pred, color='red')
plt.xlabel('hours')
plt.ylabel('count')
```
In [41]:
```
# refit the model on just the morning ramp-up (5:30 through 9:00)
linreg = LinearRegression()
partial_hours = hours.loc[5.5:9]
X = partial_hours[['hour']]
y = partial_hours.start
linreg.fit(X, y)

# predict for that window only, and overlay the line on the full scatter
hours.loc[5.5:9, 'pred'] = linreg.predict(X)
plt.scatter(hours.hour, hours.start)
plt.plot(hours.loc[5.5:9].hour, hours.loc[5.5:9].pred, color='red')
plt.xlabel('hours')
plt.ylabel('count')
```

**Step 1:** Import the class you plan to use

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for "model"
- "Instantiate" means "make an instance of"

- Created an object that "knows" how to do Linear Regression, and is just waiting for data
- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults
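
For example, a tuning parameter can be passed at instantiation, and anything left unspecified keeps its default (`fit_intercept` is an actual parameter of `LinearRegression`, used here purely as an illustration):

```
from sklearn.linear_model import LinearRegression

# specify one tuning parameter; all unspecified parameters keep their defaults
linreg_no_intercept = LinearRegression(fit_intercept=False)
```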

**Step 3:** Fit the model with data (aka "model training")

- Model is "learning" the relationship between X and y in our "training data"
- Process through which learning occurs varies by model
- Occurs in-place

- Once a model has been fit with data, it's called a "fitted model"

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process
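
Putting the four steps together on the `hours` data built earlier (the out-of-sample hours 7.5 and 17.0 are arbitrary examples):

```
import pandas as pd
from sklearn.linear_model import LinearRegression  # Step 1: import the class

model = LinearRegression()               # Step 2: instantiate the estimator
model.fit(hours[['hour']], hours.start)  # Step 3: fit the model (in-place)

# Step 4: predict the response for new, out-of-sample observations
model.predict(pd.DataFrame({'hour': [7.5, 17.0]}))
```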

Interpreting the **intercept** ($\beta_0$):

- It is the value of $y$ when $x = 0$.
- Thus, it is the estimated number of rentals when the temperature is 0 degrees Celsius.
- **Note:** It does not always make sense to interpret the intercept. (Why?)

Interpreting the **"temp" coefficient** ($\beta_1$):

- It is the change in $y$ divided by the change in $x$, or the "slope".
- Thus, a temperature increase of 1 degree Celsius is **associated with** a rental increase of 9.17 bikes.
- This is not a statement of causation.
- $\beta_1$ would be **negative** if an increase in temperature was associated with a **decrease** in rentals.
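
As a sanity check on this interpretation, a manual prediction built from the learned coefficients should match `predict()`. This sketch reuses the hour-based `linreg` fitted above, since the temp-based model itself is not shown in this section:

```
import pandas as pd

# beta_0 + beta_1 * x, computed by hand for one new observation
x_new = 7.5
manual = linreg.intercept_ + linreg.coef_[0] * x_new

# the same prediction via the fitted estimator
auto = linreg.predict(pd.DataFrame({'hour': [x_new]}))[0]
print(manual, auto)  # the two values agree
```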