We will apply Linear Regression to a fake housing dataset.
The USA housing data contains the following columns:

- 'Avg. Area Income'
- 'Avg. Area House Age'
- 'Avg. Area Number of Rooms'
- 'Avg. Area Number of Bedrooms'
- 'Area Population'
- 'Price' (the target we want to predict)
- 'Address'
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
housing = pd.read_csv('USA_Housing.csv')
In [13]:
housing.head()
#displays the head of the data, i.e. the first few rows
Out[13]:
In [14]:
housing.info()
#displays column names, dtypes, and non-null counts
In [15]:
housing.describe()
#summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
Out[15]:
In [16]:
housing.columns
Out[16]:
In [17]:
sns.pairplot(housing)
#pairwise scatter plots (with histograms on the diagonal) of the numeric columns
Out[17]:
In [18]:
sns.distplot(housing['Price'])
#distribution plot of 'Price'
Out[18]:
In [12]:
sns.heatmap(housing.corr())
#heatmap of the correlation matrix; with seaborn's default colormap, lighter cells indicate higher correlation between two features
Out[12]:
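To read off exact values, the heatmap can be annotated. A minimal sketch; the numeric_only=True argument is needed on newer pandas versions so that the text 'Address' column is skipped:

In [ ]:
# Annotated correlation heatmap: annot=True writes each correlation
# value into its cell; numeric_only=True skips the text 'Address' column.
sns.heatmap(housing.corr(numeric_only=True), annot=True)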
We will first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the 'Price' column. We will also toss out the 'Address' column, because it only contains text that the LinearRegression model can't use.
In [22]:
X = housing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = housing['Price']
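Equivalently, X can be built by dropping the columns we don't want; a quick sketch that keeps every feature except the target 'Price' and the text 'Address' column:

In [ ]:
# Equivalent construction of X: drop the target and the text column,
# keeping all remaining numeric features.
X = housing.drop(['Price', 'Address'], axis=1)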
In [23]:
from sklearn.model_selection import train_test_split
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
In [26]:
from sklearn.linear_model import LinearRegression
In [27]:
lm = LinearRegression()
In [28]:
lm.fit(X_train,y_train)
Out[28]:
In [29]:
# print the intercept
print(lm.intercept_)
In [30]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Out[30]:
Interpreting the coefficients: holding all other features fixed, each coefficient is the expected change in 'Price' for a one-unit increase in that feature.
This obviously doesn't make real-world sense, as it was a fake dataset, but we get an idea of how Linear Regression works!
In [32]:
predictions = lm.predict(X_test)
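As a sanity check, these predictions can be reproduced by hand from the intercept and coefficients; a minimal sketch:

In [ ]:
# A linear model's prediction is just intercept + X @ coefficients;
# this should match lm.predict(X_test) up to floating-point error.
manual = lm.intercept_ + X_test.values @ lm.coef_
print(np.allclose(manual, predictions))  # expect True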
In [37]:
plt.scatter(y_test,predictions)
#points clustering along a straight diagonal line mean the predictions track the actual prices closely
Out[37]:
Residual Histogram
In [34]:
sns.distplot((y_test-predictions),bins=50);
#histogram of the residuals; roughly normally distributed residuals suggest a linear model was an appropriate choice
Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics: MAE is the easiest to interpret, because it is just the average error. MSE punishes larger errors more heavily, which tends to be useful in practice. RMSE also punishes larger errors and has the advantage of being interpretable in the units of the target variable ('Price').
All of these are loss functions, because we want to minimize them.
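The formulas above translate directly into NumPy; a small sketch that computes each metric by hand and should match the sklearn values printed below:

In [ ]:
# Computing the three metrics straight from their formulas.
errors = y_test - predictions
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
print(mae, mse, rmse)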
In [35]:
from sklearn import metrics
In [36]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))