This lab is an introduction to linear regression using Python and Scikit-Learn. It serves as a foundation for the more complex algorithms and machine learning models that you will encounter later in the course. We will train a linear regression model to predict housing prices.
Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [36]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib.
%matplotlib inline
We will use the USA housing prices dataset found on Kaggle. The data contains the following columns: Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms, Avg. Area Number of Bedrooms, Area Population, Price, and Address.
Next, we read the dataset into a Pandas dataframe.
In [60]:
df_USAhousing = pd.read_csv('../USA_Housing.csv')
In [61]:
# Show the first five rows.
df_USAhousing.head()
Out[61]:
Let's check for any null values.
In [62]:
df_USAhousing.isnull().sum()
Out[62]:
In [63]:
df_USAhousing.describe()
Out[63]:
In [64]:
df_USAhousing.info()
Let's take a peek at the first and last five rows of the data for all columns.
In [65]:
print(df_USAhousing) # TODO 1 -- pandas truncates the printout to the first and last five rows
In [66]:
sns.pairplot(df_USAhousing)
Out[66]:
In [67]:
sns.distplot(df_USAhousing['Price'])  # distribution of the target variable, Price
Out[67]:
In [68]:
sns.heatmap(df_USAhousing.corr(numeric_only=True)) # TODO 2 -- numeric_only skips the text Address column (pandas >= 1.5)
Out[68]:
Regression is a supervised machine learning process. It is similar to classification, but rather than predicting a label, we try to predict a continuous value. Linear regression models the relationship between a target variable (y) and a set of predictive features (x). Simply stated: if you need to predict a number, use regression.
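Concretely, linear regression predicts the target as a weighted sum of the features plus an intercept:
$$\hat{y} = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$$
Training chooses the intercept $b_0$ and the coefficients $b_1, \dots, b_n$ that minimize the squared error between the predictions $\hat{y}$ and the observed prices.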
Let's now begin to train our regression model! First we split the data into an X array containing the features to train on and a y array with the target variable, in this case the Price column. We drop the Address column because it contains only text, which the linear regression model can't use.
In [81]:
X = df_USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = df_USAhousing['Price']
Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate it. Note that we are using 40% of the data for testing.
The random_state parameter seeds the random number generator that shuffles the data before splitting. If no integer is specified, a new seed is generated on every run, so the train and test sets contain different rows each time. If a fixed value is assigned (random_state = 0, 1, 101, or any other integer), the split is reproducible: no matter how many times you execute the code, the same rows end up in the train and test sets.
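For example (a quick sketch, not part of the lab's TODOs), two splits made with the same random_state are identical:
In [ ]:
from sklearn.model_selection import train_test_split
import numpy as np

data = np.arange(10)
train_a, test_a = train_test_split(data, test_size=0.4, random_state=101)
train_b, test_b = train_test_split(data, test_size=0.4, random_state=101)
# Same seed -> same shuffle -> identical train/test splits
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))  # True True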
In [82]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
In [83]:
from sklearn.linear_model import LinearRegression
In [84]:
lm = LinearRegression()
In [85]:
lm.fit(X_train, y_train) # TODO 3
Out[85]:
In [86]:
# print the intercept
print(lm.intercept_)
In [87]:
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coeff_df
Out[87]:
Interpreting the coefficients: holding all other features fixed, a one-unit increase in a given feature changes the predicted Price by that feature's coefficient.
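As a sanity check (an extra sketch beyond the lab's cells), a linear model's prediction is exactly the intercept plus the dot product of the features with these coefficients:
In [ ]:
import numpy as np

# Reconstruct the predictions by hand from the fitted parameters
manual_predictions = lm.intercept_ + X_test.values @ lm.coef_
print(np.allclose(manual_predictions, lm.predict(X_test)))  # True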
In [88]:
predictions = lm.predict(X_test)
In [89]:
plt.scatter(y_test, predictions)  # predicted vs. actual prices; points near the diagonal indicate a good fit
Out[89]:
Residual Histogram
In [90]:
sns.distplot((y_test - predictions), bins=50);  # residuals of a good fit look roughly normal, centered at zero
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Comparing these metrics: MAE is the easiest to understand, because it is simply the average error. MSE is often preferred because squaring punishes larger errors, which tends to matter in practice. RMSE is more popular still, because it is interpretable in the units of the target variable (here, dollars).
All of these are loss functions, because we want to minimize them.
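As an optional sanity check (a sketch that assumes the predictions array from the cells above), each formula can be computed directly with NumPy and compared against scikit-learn's results below:
In [ ]:
import numpy as np

errors = y_test - predictions
print('MAE:', np.mean(np.abs(errors)))         # mean absolute error
print('MSE:', np.mean(errors ** 2))            # mean squared error
print('RMSE:', np.sqrt(np.mean(errors ** 2)))  # root mean squared error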
In [91]:
from sklearn import metrics
In [92]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.