Multivariate Regression

Let's grab a small data set of Blue Book car values:


In [1]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')

In [2]:
df.head()


Out[2]:
Price Mileage Make Model Trim Type Cylinder Liter Doors Cruise Sound Leather
0 17314.103129 8221 Buick Century Sedan 4D Sedan 6 3.1 4 1 1 1
1 17542.036083 9135 Buick Century Sedan 4D Sedan 6 3.1 4 1 1 0
2 16218.847862 13196 Buick Century Sedan 4D Sedan 6 3.1 4 1 1 0
3 16336.913140 16342 Buick Century Sedan 4D Sedan 6 3.1 4 1 0 0
4 16339.170324 19832 Buick Century Sedan 4D Sedan 6 3.1 4 1 0 1

We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values unless you can convert them into some numerical order that makes sense.
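As a rough sketch of that kind of conversion, the snippet below uses a tiny made-up table (not the real cars.xls data) and shows two options: mapping categories to integer codes with a hand-picked order, and one-hot encoding with pd.get_dummies, a different technique that avoids imposing any order at all. The toy rows and the type_order ranking are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; not taken from cars.xls
toy = pd.DataFrame({
    'Make': ['Buick', 'Saab', 'Buick', 'Chevrolet'],
    'Type': ['Sedan', 'Convertible', 'Coupe', 'Sedan'],
})

# Option 1: impose an explicit order (assumed here, purely for illustration)
type_order = {'Sedan': 0, 'Coupe': 1, 'Convertible': 2}
toy['Type_ord'] = toy['Type'].map(type_order)

# Option 2: one-hot encode a column that has no natural order
make_dummies = pd.get_dummies(toy['Make'], prefix='Make')

# Combine both encodings into one numeric feature table
encoded = pd.concat([toy[['Type_ord']], make_dummies], axis=1)
print(encoded)
```

Whether an integer order is meaningful depends entirely on the column; if no sensible order exists, the one-hot form is usually the safer choice.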

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.
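To make the scaling step concrete, here is a minimal sketch of what standardization does, using plain NumPy on the first five Mileage values from df.head(): subtract the column mean from each value, then divide by the column's standard deviation. StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default. Because the statistics here come from only five rows, the results will not match the scaled values printed for the full data set.

```python
import numpy as np

# First five Mileage values from df.head()
mileage = np.array([8221, 9135, 13196, 16342, 19832], dtype=float)

# Standardize: zero mean, unit (population) standard deviation
scaled = (mileage - mileage.mean()) / mileage.std()

print(scaled)
print(scaled.mean(), scaled.std())
```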


In [4]:
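import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

# Take the features as floats; astype() returns a new DataFrame, so the
# assignment below does not write into a slice of df
features = ['Mileage', 'Cylinder', 'Doors']
X = df[features].astype(float)
y = df['Price']

# Standardize each feature to zero mean and unit variance
X[features] = scale.fit_transform(X[features].values)

print(X)

# Note: sm.OLS fits no intercept unless a constant column is added to X
est = sm.OLS(y, X).fit()

est.summary()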


      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102  0.527410  0.556279
29   0.660837  0.527410  0.556279
..        ...       ...       ...
774 -0.161262 -0.914896  0.556279
775 -0.089234 -0.914896  0.556279
776 -0.040523 -0.914896  0.556279
777  0.002572 -0.914896  0.556279
778  0.236603 -0.914896  0.556279
779  0.249666 -0.914896  0.556279
780  0.357220 -0.914896  0.556279
781  0.365521 -0.914896  0.556279
782  0.434131 -0.914896  0.556279
783  0.517269 -0.914896  0.556279
784  0.589908 -0.914896  0.556279
785  0.599186 -0.914896  0.556279
786  0.793052 -0.914896  0.556279
787  1.033554 -0.914896  0.556279
788  1.045762 -0.914896  0.556279
789  1.205567 -0.914896  0.556279
790  1.541414 -0.914896  0.556279
791  1.561070 -0.914896  0.556279
792  1.725026 -0.914896  0.556279
793  1.851502 -0.914896  0.556279
794 -1.709871  0.527410  0.556279
795 -1.474375  0.527410  0.556279
796 -1.187849  0.527410  0.556279
797 -1.079929  0.527410  0.556279
798 -0.682430  0.527410  0.556279
799 -0.439853  0.527410  0.556279
800 -0.089966  0.527410  0.556279
801  0.079605  0.527410  0.556279
802  0.750446  0.527410  0.556279
803  1.932565  0.527410  0.556279

[804 rows x 3 columns]
Out[4]:
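                            OLS Regression Results
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.064
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     18.11
Date:                Mon, 15 May 2017   Prob (F-statistic):           2.23e-11
Time:                        10:31:39   Log-Likelihood:                -9207.1
No. Observations:                 804   AIC:                         1.842e+04
Df Residuals:                     801   BIC:                         1.843e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Mileage    -1272.3412    804.623     -1.581      0.114   -2851.759     307.077
Cylinder    5587.4472    804.509      6.945      0.000    4008.252    7166.642
Doors      -1404.5513    804.275     -1.746      0.081   -2983.288     174.185
==============================================================================
Omnibus:                      157.913   Durbin-Watson:                   0.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529
Skew:                           1.278   Prob(JB):                     1.20e-56
Kurtosis:                       4.074   Cond. No.                         1.03
==============================================================================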
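
The table of coefficients above gives us the values to plug into an equation of the form: B1 * Mileage + B2 * Cylinder + B3 * Doors. (There is no intercept term B0 here, because sm.OLS only fits one if you add a constant column to X.)

Because the features were scaled to the same range, the coefficients are directly comparable. In this example, it's pretty clear that the number of cylinders matters more than anything else: its coefficient is by far the largest, and it is the only one with a p-value below 0.05. Could we have figured that out earlier?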
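As a hedged illustration of plugging values into that equation, the sketch below copies the three coefficients from the summary table and the scaled feature values for row 0 from the printed X, then computes the fitted price by hand. It uses only NumPy; the variable names are made up for this example. Since this fit was made without a constant column, the equation has no B0 term.

```python
import numpy as np

# Coefficients copied from the summary table (Mileage, Cylinder, Doors)
coefs = np.array([-1272.3412, 5587.4472, -1404.5513])

# Scaled feature values for row 0, copied from the printed X
row0 = np.array([-1.417485, 0.527410, 0.556279])

# Fitted value: B1 * Mileage + B2 * Cylinder + B3 * Doors
predicted = float(coefs @ row0)
actual = 17314.103129  # Price of row 0 from df.head()

print(f"predicted: {predicted:.2f}")
print(f"actual:    {actual:.2f}")
```

The hand-computed value lands far below the actual price. With standardized features and no intercept, the model has no term that can represent the average price, which may account for much of the low R-squared; passing sm.add_constant(X) to sm.OLS is the usual statsmodels way to include B0.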

In [5]:
y.groupby(df.Doors).mean()


Out[5]:
Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! Two-door cars average about $3,200 more than four-door cars here. (Maybe two doors implies a sports car in some cases?) So it's no wonder that the door count is a weak predictor in our model. This is a very small data set, however, so we can't read too much meaning into it.

Activity

Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?
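One possible way to start, sketched with entirely synthetic data rather than cars.xls: generate door counts from 2 to 8, make the price rise by an assumed amount per door plus some noise, and fit an ordinary least-squares line. np.linalg.lstsq is used here in place of statsmodels to keep the example dependency-light. All of the numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs: door counts anywhere from 2 to 8 (made up)
doors = rng.integers(2, 9, size=500).astype(float)

# Assumed relationship: $2,000 per door on a $10,000 base, plus noise
price = 10000 + 2000 * doors + rng.normal(0, 1500, size=500)

# Design matrix with an explicit intercept column
A = np.column_stack([np.ones_like(doors), doors])
(intercept, per_door), *_ = np.linalg.lstsq(A, price, rcond=None)

print(f"intercept: {intercept:.0f}, per-door effect: {per_door:.0f}")
```

Try shrinking the per-door amount or raising the noise level to see how quickly the effect becomes hard to measure.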


In [ ]: