Supervised machine learning algorithms involve a target/outcome variable (or dependent variable) that is to be predicted from a given set of features/predictors (independent variables). Using this set of features, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy/performance on the training data. A supervised learning problem is called regression when the output variable is continuous-valued.
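For intuition, here is a toy sketch of that setup (the numbers and the hand-picked linear function are illustrative assumptions, not part of this notebook's data):
import numpy as np

# Features (inputs) and a continuous target (outputs)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.9, 9.2])

# A trained model is just a function mapping inputs to outputs;
# here the parameters w and b are picked by hand for illustration
def f(x, w=2.0, b=1.0):
    return w * x + b

print(f(X))  # predictions, to be compared against y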
In this notebook you will explore machine learning regression with the Python Scikit-Learn library.
For questions, comments and suggestions, please contact parantapa[dot]goswami[at]viseo[dot]com
First, we import the required libraries:
In [1]:
# Write code to import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# For visualizing plots in this notebook
%matplotlib inline
This small dataset is a collection of housing prices in Portland, Oregon, USA. It contains information on a number of houses sold and their prices.
Each house is described by the following features: "Area" (its living area) and "Bedrooms" (its number of bedrooms).
House prices are in US dollars.
The data is provided in the file housing_price.txt. Use the read_csv() function from pandas to import the data.
In [2]:
# We start by importing the data using pandas
# Hint: use "read_csv" method, Note that comma (",") is the field separator, and we have no "header"
housing = pd.read_csv('housing_price.txt', sep=",", header=None)
# We name the columns based on above features
housing.columns = ["Area", "Bedrooms", "Price"]
# We take a sneak peek at the data
# Hint: use dataframe "head" method with "n" parameter
housing.head(n=5)
Out[2]:
In [3]:
# Write code to get a summary of the data
# Hint: use "DataFrame.describe()" on our dataframe housing
housing.describe()
Out[3]:
In [4]:
# Write code to create Scatter Plot between "Area" and "Price"
# Hint: use "DataFrame.plot.scatter()" on our dataframe housing,
# mention the "x" and "y" axis features
housing.plot.scatter(x="Area", y="Price")
Out[4]:
You will now train a Linear Regression model using the "Area" feature to predict "Price".
Note: All machine learning algorithm implementations work efficiently with numpy matrices and arrays. Our data is currently stored in a pandas dataframe. Fortunately, pandas provides the DataFrame.as_matrix() method to convert dataframes to numpy arrays. It accepts a columns argument to convert only certain columns of the dataframe.
Question: What is your input X here? What is your output y here?
In [5]:
# Write code to convert desired dataframe columns into numpy arrays
# Hint: "columns" atttribute of DataFrame.as_matrix() accepts only list.
# Even if you wish to select only one column, you have to pass it in a list.
X = housing.as_matrix(columns=["Area"])
y = housing.as_matrix(columns=["Price"])
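Note: DataFrame.as_matrix() was deprecated in later pandas releases and removed in pandas 1.0. If you run this notebook with a recent pandas, an equivalent conversion (a small sketch using the same column names) is:
# Equivalent conversion on recent pandas versions
X = housing[["Area"]].to_numpy()   # 2-D array of shape (n_samples, 1)
y = housing[["Price"]].to_numpy()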
Using the following steps, train a Linear Regression model on the housing price dataset:
- Import LinearRegression from sklearn.linear_model
- Create a LinearRegression object and fit it with the X and y arrays you created
In [6]:
# Write code to learn a linear regression model on housing price dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
Out[6]:
Use your trained LinearRegression model to predict on the same dataset (i.e. the "Area" features stored as the numpy array X). Also calculate the Mean Squared Error using the mean_squared_error() function from sklearn.
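For reference, with $n$ samples, true values $y_i$ and predictions $\hat{y}_i$, the Mean Squared Error is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.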
In [7]:
# Write code to predict prices using the trained LinearRegression
y_predicted = lin_reg.predict(X)
# Importing modules to calculate MSE
from sklearn.metrics import mean_squared_error
# Write code to calculate and print the MSE on the predicted values.
# Hint 1: use "mean_squared_error()" method
# Hint 2: you have to pass both the original y and the predicted y to compute the MSE.
mse = mean_squared_error(y, y_predicted)
print("MSE = ", mse)
Question: Why is the MSE so huge? (Hint: MSE is in squared units of the target, and prices are in the hundreds of thousands of dollars.)
Use the LinearRegression.score() method to get the coefficient of determination $R^2$ of the prediction. It is calculated based on MSE. The best possible score is $1.0$ and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of $y$, disregarding the input features, would get an $R^2$ score of $0.0$.
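Concretely, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the observed outputs; note that the numerator is simply $n \cdot \mathrm{MSE}$.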
Warning: You need to pass the input matrix X and the original output y to the score() method. This method first predicts with the trained model and then calculates the $R^2$ score.
In [8]:
# Write code to get coefficient of determination using "score()"
# Hint: you have to pass both X and original y to score()
R2 = lin_reg.score(X, y)
print("R2 = ", R2)
Now, we will visualize the predicted prices along with the actual data.
Note: DataFrame.plot.scatter() returns an axes object. You can use that axes object to add more visualizations to the same plot.
In [9]:
# Write code to create a scatter plot with the data as above.
# Then add the best line to that.
# Hint 1: store the returned "axes" object and then use "axes.plot()"
# to plot the best_line
# Hint 2: "axes.plot()" takes the X and the y_predicted arrays
ax = housing.plot.scatter(x="Area", y="Price")
ax.plot(X, y_predicted, "r")
Out[9]:
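So far we used only one feature. Next, train a multivariate Linear Regression model using both "Area" and "Bedrooms" to predict "Price".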
In [10]:
# Write code to convert desired dataframe columns into numpy arrays
X = housing.as_matrix(columns=["Area","Bedrooms"])
y = housing.as_matrix(columns=["Price"])
# Write code to train a LinearRegression model
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Write code to calculate and print the R2 score
R2 = lin_reg.score(X, y)
print("R2 = ", R2)
In [14]:
# Write code to create a 3D scatter plot for "Area", "Bedrooms" and actual "Price"
# Then add a visualization of "Area" and "Bedrooms" against the predicted price.
from mpl_toolkits.mplot3d import Axes3D
y_pred = lin_reg.predict(X)
fig_scatter = plt.figure()
ax = fig_scatter.add_subplot(111, projection='3d')
ax.scatter(housing["Area"], housing["Bedrooms"], housing["Price"])
ax.scatter(housing["Area"], housing["Bedrooms"], y_pred)
ax.set_xlabel("Area")
ax.set_ylabel("Bedrooms")
ax.set_zlabel("Price")
Out[14]:
This portion of the notebook is for your own practice. Do not hesitate to contact me with any questions.
This dataset poses the regression problem of predicting the heating load of different buildings. More details on this dataset can be found in this link. The buildings are described using the following 8 features: "Compactness", "Surface", "Wall", "Roof", "Height", "Orientation", "Glazing", and "GlazingDist".
The data is provided in the file energy_efficiency.csv. As before, use the read_csv() function from pandas to import the data.
In [15]:
# Write code to import the data using pandas
# Hint: note that comma (",") is the field separator and there is no header row
energy = pd.read_csv("energy_efficiency.csv", sep=",", header=None)
# We name the columns based on above features
energy.columns = ["Compactness","Surface","Wall", "Roof", "Heiht",
"Orientation","Glazing","GlazingDist", "Heating"]
# We take a sneak peek at the data
# Hint: use dataframe "head" method with "n" parameter
energy.head(n=5)
Out[15]:
In this section, we will train various regression models on the Energy Efficiency Dataset. We will measure the performances in terms of $R^2$ and compare their performances.
First, identify the inputs and outputs, and convert them to numpy matrices.
In [16]:
# Write code to convert desired dataframe columns into numpy arrays
X = energy.as_matrix(columns=["Compactness","Surface","Wall", "Roof", "Height",
"Orientation","Glazing","GlazingDist"])
y = energy.as_matrix(columns=["Heating"])
A portion of the dataset is to be set aside and used only for testing. Fortunately, sklearn provides a train_test_split() function to do that. You can specify a ratio with the test_size parameter.
In this exercise, we will retain $20\%$ of the data for testing.
In [17]:
# Importing the module
from sklearn.model_selection import train_test_split
# Write code for splitting the data into train and test sets.
# Hint: use "train_test_split" on X and y, and test size should be 0.2 (20%)
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2)
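Note that train_test_split() shuffles at random, so scores will vary from run to run. A minimal reproducibility sketch (the value 42 is an arbitrary choice):
from sklearn.model_selection import train_test_split  # already imported above

# Fixing random_state makes the split deterministic across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)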
In [21]:
# Write code to train a Linear Regression model and to test its performance on the test set
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg_R2 = lin_reg.score(X_test, y_test)
print("Linear Regression R2 = ", lin_reg_R2)
Use the SVR class from sklearn.svm.
In [22]:
# Write code to import necessary module for SVR
from sklearn.svm import SVR
# Write code to train a SVR model and to test its performance on the test set
svr = SVR()
svr.fit(X_train, y_train.ravel())  # ravel() flattens y to the 1-D array SVR expects
svr_R2 = svr.score(X_test, y_test)
print("Support Vector Regression R2 = ", svr_R2)
Use RandomForestRegressor from the sklearn.ensemble module.
In [23]:
# Write code to import necessary module for RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Write code to train a RandomForestRegressor model and to test its performance on the test set
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train.ravel())  # ravel() flattens y to the 1-D array the regressor expects
rfr_R2 = rfr.score(X_test, y_test)
print("Random Forest Regressor R2 = ", rfr_R2)