In this assignment you are going to learn how linear regression works by using the linear regression and gradient descent code we have been looking at in class. You are also going to use linear regression from the scikit-learn machine learning library. Along the way you will learn how to download data from Kaggle (a website for datasets and machine learning competitions) and upload submissions to Kaggle competitions. And you will be able to compete with the world.
The dataset you are going to use in this assignment is called House Prices, available on Kaggle. To download the dataset, go to the competition's Data tab and download the 'train.csv', 'test.csv', 'data_description.txt' and 'sample_submission.csv.gz' files. 'train.csv' is used for training the model. 'test.csv' is used to test the model, i.e. to measure generalization. 'data_description.txt' contains a description of each feature in the dataset. 'sample_submission.csv.gz' contains a sample of the submission file that you need to generate and submit to Kaggle.
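Alternatively, the files can be fetched from the command line inside the notebook. This is a minimal sketch assuming the official Kaggle CLI is installed (`pip install kaggle`), an API token is configured in ~/.kaggle/kaggle.json, and the competition slug is house-prices-advanced-regression-techniques (check the competition URL to confirm):
# a sketch, assuming the Kaggle CLI is installed and an API token is configured;
# the competition slug and the target folder below are assumptions
!kaggle competitions download -c house-prices-advanced-regression-techniques -p datasets/house_price
# the download arrives as a zip archive that still needs to be extracted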
In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model
import matplotlib.pyplot as plt
import matplotlib as mpl
# read train.csv data into a pandas dataframe df_train using the pandas read_csv function
df_train = pd.read_csv('datasets/house_price/train.csv', encoding='utf-8')
In [5]:
# check data by printing first few rows
df_train.head()
Out[5]:
In [6]:
# check columns in dataset
df_train.columns
Out[6]:
In [7]:
# check correlation matrix, darker means more correlation
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
In [8]:
# SalePrice correlation matrix with top k variables
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
In [9]:
#scatterplot with some important variables
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
sns.set()
sns.pairplot(df_train[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn versions
plt.show();
Use the linear regression code below with X="GrLivArea" as the input variable and y="SalePrice" as the target variable. Try the different values of $\alpha$ given in the table below, comment on why each one is or is not useful, and state which one is a good choice.
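For reference, the code below implements batch gradient descent for linear regression: with hypothesis $h_\theta(x) = \theta^T x$ and cost

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,$$

each iteration simultaneously updates every parameter with the rule

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},$$

so $\alpha$ controls the step size: too small and convergence is very slow, too large and the cost can oscillate or diverge.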
In [7]:
# Load X and y variables from pandas dataframe df_train
cols = ['GrLivArea']
X_train = np.array(df_train[cols])
y_train = np.array(df_train[["SalePrice"]])
# Get m = number of samples and n = number of features
m = X_train.shape[0]
n = X_train.shape[1]
# append a column of 1's to X for theta_0
X_train = np.insert(X_train,0,1,axis=1)
In [11]:
iterations = 1500
alpha = 0.000000001 # change it and see what happens

def h(X, theta): # linear hypothesis function
    hx = np.dot(X, theta)
    return hx

def computeCost(theta, X, y): # cost function
    """
    theta is an (n+1)-dimensional vector, X is a matrix with n+1 columns and m rows,
    y is a matrix with m rows and 1 column.
    """
    # note to self: *.shape is (rows, columns)
    return float((1./(2*m)) * np.dot((h(X, theta) - y).T, (h(X, theta) - y)))

# Actual gradient descent minimizing routine
def gradientDescent(X, y, theta_start=np.zeros((n+1, 1))):
    """
    theta_start is an (n+1)-dimensional vector with the initial guess for theta.
    X is the input variable matrix with n+1 columns and m rows; y is a matrix with m rows and 1 column.
    """
    theta = theta_start
    j_history = []      # used to plot cost as a function of iteration
    theta_history = []  # used to visualize the minimization path later on
    for _ in range(iterations):
        tmptheta = theta.copy()  # copy, so that all theta values are updated simultaneously
        # append for plotting
        j_history.append(computeCost(theta, X, y))
        theta_history.append(list(theta[:, 0]))
        # simultaneously update the theta values
        for j in range(len(tmptheta)):
            tmptheta[j] = theta[j] - (alpha/m)*np.sum((h(X, theta) - y)*np.array(X[:, j]).reshape(m, 1))
        theta = tmptheta
    return theta, theta_history, j_history
In [12]:
#Actually run gradient descent to get the best-fit theta values
initial_theta = np.zeros((n+1,1));
theta, theta_history, j_history = gradientDescent(X_train,y_train,initial_theta)
plt.plot(j_history)
plt.title("Convergence of Cost Function")
plt.xlabel("Iteration number")
plt.ylabel("Cost function")
plt.show()
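To compare learning rates side by side, you can rerun the routine once per candidate $\alpha$ and overlay the cost curves. A minimal sketch (the candidate values below are illustrative placeholders, not necessarily the ones from the table):
# sketch: overlay convergence curves for a few candidate learning rates;
# the candidate values are illustrative, substitute the ones from the table
for candidate in [1e-10, 1e-9, 1e-8]:
    alpha = candidate  # gradientDescent reads the global alpha
    _, _, j_hist = gradientDescent(X_train, y_train, np.zeros((n+1, 1)))
    plt.plot(j_hist, label='alpha = %g' % candidate)
plt.xlabel("Iteration number")
plt.ylabel("Cost function")
plt.legend()
plt.show()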
In [13]:
# predict output for training data
hx_train= h(X_train, theta)
# plot it
plt.scatter(X_train[:,1],y_train)
plt.plot(X_train[:,1],hx_train[:,0], color='red')
plt.show()
In this task we will use the model trained above to predict "SalePrice" on the test data. The test data has all the input variables/features but no target variable. Our aim is to use the trained model to predict the target variable for the test data. This is called generalization, i.e. how well your model works on unseen data. The output, in the form "Id","SalePrice" in a .csv file, should be submitted to Kaggle. Please provide your Kaggle score after this step as an image. It will be compared to the 5-feature linear regression later.
In [14]:
# read data in pandas frame df_test and check first few rows
# write code here
df_test.head()
Out[14]:
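A possible completion of the cell above (a sketch, assuming test.csv sits next to train.csv in datasets/house_price/):
# sketch: the path is an assumption, mirroring the training file
df_test = pd.read_csv('datasets/house_price/test.csv', encoding='utf-8')
df_test.head()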
In [15]:
# check statistics of test data, make sure no data is missing.
print(df_test.shape)
df_test[cols].describe()
Out[15]:
In [16]:
# Get X_test; no target variable (SalePrice) is provided in the test data, since that is what we need to predict.
X_test = np.array(df_test[cols])
#Insert the usual column of 1's into the "X" matrix
X_test = np.insert(X_test,0,1,axis=1)
In [17]:
# predict test data labels i.e. y_test
predict = h(X_test, theta)
In [18]:
# save prediction as .csv file
pd.DataFrame({'Id': df_test.Id, 'SalePrice': predict[:,0]}).to_csv("predict1.csv", index=False)
In [19]:
from IPython.display import Image
Image(filename='images/asgn_01.png', width=500)
Out[19]:
In this task we are going to use the LinearRegression class from the scikit-learn library to train the same model. The aim is to move from understanding the algorithm to using an existing, well-established library. There is a linear regression example available on the scikit-learn website as well.
In [20]:
# import scikit-learn linear model
from sklearn import linear_model
# get X and y
# write code here
# Create linear regression object
# write code here check link above for example
# Train the model using the training sets. Use fit(X,y) command
# write code here
# The coefficients
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% np.mean((regr.predict(X_train) - y_train) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
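One possible completion of the cell above (a sketch; it rebuilds X from 'GrLivArea' without the column of 1's, since scikit-learn's LinearRegression fits the intercept itself):
# sketch: reuse df_train; no column of 1's is needed because fit_intercept=True by default
cols = ['GrLivArea']
X_train = np.array(df_train[cols])
y_train = np.array(df_train[["SalePrice"]])
regr = linear_model.LinearRegression()  # create linear regression object
regr.fit(X_train, y_train)              # train the model on the training set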
In [21]:
# read test X without 1's
# write code here
In [22]:
# predict output for test data. Use predict(X) command.
predict2 = # write code here
In [23]:
# remove negative sales by replacing them with zeros
predict2[predict2<0] = 0
In [24]:
# save prediction as predict2.csv file
# write code here
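A possible completion of the three cells above (a sketch; it assumes df_test, cols, and the trained regr are still in scope):
# sketch: read test X without 1's, predict, clip negatives, and save
X_test = np.array(df_test[cols])
predict2 = regr.predict(X_test)     # returns an (m, 1) array because y was 2-D
predict2[predict2 < 0] = 0          # the clipping step shown in the cell above
pd.DataFrame({'Id': df_test.Id, 'SalePrice': predict2[:, 0]}).to_csv("predict2.csv", index=False)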
Lastly, use the columns ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt'] with scikit-learn or the code given above to predict the output on the test data. Upload it to Kaggle as before and see how much your score improves.
In [25]:
# define columns ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
# write code here
# check feature ranges and statistics. The training dataset looks fine, as all features have the same count.
df_train[cols].describe()
Out[25]:
In [26]:
# Load X and y variables from pandas dataframe df_train
# write code here
# Get m = number of samples and n = number of features
# write code here
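A possible completion of the two cells above (a sketch, mirroring the single-feature version earlier):
# sketch: load the five input features and the target, then record the shapes
cols = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
X_train = np.array(df_train[cols])
y_train = np.array(df_train[["SalePrice"]])
m, n = X_train.shape  # m = number of samples, n = number of features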
In [27]:
# Feature-normalize the columns (subtract mean, divide by standard deviation)
# Store the mean and std for later use
# Note: don't modify the original X matrix, use a copy
stored_feature_means, stored_feature_stds = [], []
Xnorm = np.array(X_train).copy()
for icol in range(Xnorm.shape[1]):
    stored_feature_means.append(np.mean(Xnorm[:, icol]))
    stored_feature_stds.append(np.std(Xnorm[:, icol]))
    # Skip the first column if it holds the 1's
    # if not icol: continue
    # Faster to not recompute the mean and std; just use the stored values
    Xnorm[:, icol] = (Xnorm[:, icol] - stored_feature_means[-1]) / stored_feature_stds[-1]
# check data after normalization
pd.DataFrame(data=Xnorm, columns=cols).describe()
Out[27]:
In [28]:
# Run Linear Regression from scikit-learn or code given above.
# write code here. Repeat from above.
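A possible completion (a sketch; it fits scikit-learn's LinearRegression on the normalized five-feature matrix from the previous cell):
# sketch: train on the normalized features
regr = linear_model.LinearRegression()
regr.fit(Xnorm, y_train)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)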
In [29]:
# To predict the output, use ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt'] as input features.
# Check feature ranges and statistics to see if there is any missing data.
# As you can see from the count, "GarageCars" and "TotalBsmtSF" have one missing value each.
df_test[cols].describe()
Out[29]:
In [30]:
# Replace missing value with the mean of the feature
df_test['GarageCars'] = df_test['GarageCars'].fillna((df_test['GarageCars'].mean()))
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna((df_test['TotalBsmtSF'].mean()))
In [31]:
df_test[cols].describe()
Out[31]:
In [32]:
# read test X without 1's
# write code here
# predict using trained model
predict3 = # write code here
# replace any negative predicted saleprice by zero
predict3[predict3<0] = 0
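A possible completion of the cell above (a sketch; because the model was trained on normalized features, the test matrix must be normalized with the stored training means and stds):
# sketch: build test X without 1's and normalize it with the training statistics
X_test = np.array(df_test[cols])
X_test = (X_test - np.array(stored_feature_means)) / np.array(stored_feature_stds)
predict3 = regr.predict(X_test)
predict3[predict3 < 0] = 0  # replace any negative predicted SalePrice by zero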
In [33]:
# predict target/output variable for test data using the trained model and upload to kaggle.
# write code to save output as predict3.csv here
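A possible completion, following the same pattern as predict1.csv:
# sketch: save the third submission file
pd.DataFrame({'Id': df_test.Id, 'SalePrice': predict3[:, 0]}).to_csv("predict3.csv", index=False)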