In [16]:
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
In [25]:
# HINTS
# http://www.ritchieng.com/machine-learning-project-boston-home-prices/
In [9]:
from sklearn.datasets import load_boston
boston = load_boston()
# features
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
# dependent variable
df['PRICE'] = boston.target
df.head(3)
Out[9]:
In [29]:
# Let's use only one feature (LSTAT)
df1 = df[['LSTAT', 'PRICE']]
X = df1['LSTAT']
y = df1['PRICE']
In [17]:
sns.jointplot(x="LSTAT", y="PRICE", data=df, kind="reg", size=4);
In [58]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
In [60]:
lr = LinearRegression()
lr.fit(X.to_frame(), y)
# check that the coefficients are the expected ones.
b1 = lr.coef_[0]
b0 = lr.intercept_
print("b0:", b0)
print("b1:", b1)
y_pred = lr.predict(X.to_frame())
print("Mean squared error:", mean_squared_error(y, y_pred))
print("R^2 score:", r2_score(y, y_pred))
print("Explained variance score (in simple regression = R^2):", explained_variance_score(y, y_pred))
# correlation
from scipy.stats import pearsonr
print("pearson correlation:", pearsonr(X, y))
# p-values
# scikit-learn's LinearRegression doesn't calculate this information.
# we can take a look at statsmodels for this kind of statistical analysis in Python.
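As a minimal sketch of that idea, assuming statsmodels is installed: an OLS fit reports the coefficient standard errors, t-statistics and p-values that scikit-learn does not (the variable names X_sm and ols are mine):
In [ ]:
import statsmodels.api as sm

X_sm = sm.add_constant(X.to_frame())   # add an intercept column
ols = sm.OLS(y, X_sm).fit()
print(ols.summary())                   # coefficients, std errors, t-stats, p-values, R^2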
In [21]:
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df["LSTAT"], df["PRICE"])
print "slope: ", slope
print "intercept: ", intercept
print "R^2: ", r_value * r_value
print "Standard Error: ", std_err
print "p-value ", p_value
In the simple linear regression setting, we can simply check whether β1 = 0 or not. Here β1 = -0.95 ≠ 0, so there is a relationship between PRICE and LSTAT.
Accordingly, the p-value (associated with the t-statistic) is << 1, so we reject the null hypothesis => there is some relationship between LSTAT and PRICE.
This can be answered by asking: given a certain X, can we predict y with a high level of accuracy? Or is the prediction only slightly better than a random guess?
The quantity that helps us here is the R^2 statistic.
R^2 measures the proportion of the variability in y that can be explained using X.
Also, in simple linear regression, R^2 is equal to r^2, the squared correlation, which is a measure of the linear relationship between X and y.
Here R^2 ≈ 0.5, so it is a relatively good value (R^2 ∈ [0, 1]). It is challenging, though, to determine what a good R^2 is; it depends on the application.
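A quick check of that equivalence, reusing the quantities computed above (the squared Pearson correlation should match the R^2 score):
In [ ]:
# in simple linear regression: R^2 == (Pearson correlation)^2
r, _ = pearsonr(X, y)
print("squared correlation r^2:", r ** 2)
print("R^2 score:", r2_score(y, y_pred))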
=> use a prediction interval: if we want to predict an individual response.
=> use a confidence interval: if we want to predict the average response.
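A minimal sketch of both intervals, assuming a reasonably recent statsmodels (the variable names are mine): the obs_ci_* columns give the (wider) prediction interval, the mean_ci_* columns give the confidence interval for the average response.
In [ ]:
import statsmodels.api as sm

X_sm = sm.add_constant(df[['LSTAT']])
ols = sm.OLS(df['PRICE'], X_sm).fit()
# intervals for the first 3 observations
print(ols.get_prediction(X_sm.iloc[:3]).summary_frame(alpha=0.05))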
This can be answered by plotting the residuals: if the relationship is linear, the residuals should not show any pattern.
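A minimal residuals-vs-fitted sketch for the simple LSTAT model fitted above (the plotting choices are mine):
In [ ]:
residuals = y - y_pred
plt.scatter(y_pred, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show();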
We now use a few variables (here we choose LSTAT, AGE, and INDUS) as features and we want to predict PRICE.
In [24]:
# Let's use three features
X = df[['LSTAT', 'AGE', 'INDUS']]
y = df['PRICE']
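A minimal sketch fitting the three-feature model, mirroring the simple case above (the variable name lr_multi is mine):
In [ ]:
lr_multi = LinearRegression()
lr_multi.fit(X, y)
print("intercept:", lr_multi.intercept_)
print("coefficients:", dict(zip(X.columns, lr_multi.coef_)))
print("R^2:", lr_multi.score(X, y))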
In the multiple regression setting, we check whether β1 = β2 = β3 = ... = 0.
In multiple regression the F-statistic is used to determine whether or not we reject H0.
So we look at the p-value corresponding to the F-statistic.
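A hedged sketch, assuming statsmodels is available, of the overall F-test for H0: β_LSTAT = β_AGE = β_INDUS = 0 (the variable name ols_multi is mine):
In [ ]:
import statsmodels.api as sm

ols_multi = sm.OLS(y, sm.add_constant(X)).fit()
print("F-statistic:", ols_multi.fvalue)
print("p-value of the F-statistic:", ols_multi.f_pvalue)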
=> One way is to examine the p-values associated with each predictor's t-statistic: if a p-value is low, we keep the corresponding predictor. BUT this approach can lead to false discoveries (e.g. when the number of predictors p is large).
=> Another way is to fit models with different subsets of the variables, use metrics for model quality, and keep the subset with the best score (see the sketch after this list).
=> We check the fraction of variance explained: R^2.
=> Plotting the data will also give us good insights.
=> use a prediction interval: if we want to predict an individual response.
=> use a confidence interval: if we want to predict the average response.
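A hedged sketch of the subset-comparison idea from the list above; the choice of 5-fold cross-validated R^2 as the quality metric is mine, not from the notebook:
In [ ]:
from itertools import combinations
from sklearn.model_selection import cross_val_score

# compare every non-empty subset of the three predictors
for k in range(1, 4):
    for subset in combinations(["LSTAT", "AGE", "INDUS"], k):
        scores = cross_val_score(LinearRegression(), df[list(subset)], df["PRICE"],
                                 cv=5, scoring="r2")
        print(subset, "mean CV R^2:", scores.mean())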
In [ ]: