Multivariate Regression

Let's grab a small data set of Blue Book car values:



In [1]:

    
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')



In [2]:

    
df.head()









    Out[2]:






  
    
      
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1

We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note that we are leaving out the make and model: these are categorical values, and regressions don't work well with them unless you can encode them as numbers in some way that makes sense.
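
One common way to turn a categorical column such as Make into numbers is one-hot encoding: one 0/1 column per category. pandas provides pd.get_dummies for this; the minimal sketch below spells out the idea in plain Python so it runs without the data set. The makes list and the one_hot helper are hypothetical, made up for illustration only.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator per category for each value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample of makes, NOT taken from cars.xls
makes = ['Buick', 'Saab', 'Buick', 'Chevrolet']
categories, rows = one_hot(makes)

print(categories)
for make, row in zip(makes, rows):
    print(make, row)
```

Each indicator column could then be used as a feature in place of the original text column.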

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.
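
As a rough sketch of what StandardScaler does to each column: it subtracts the column mean and divides by the population standard deviation, so every feature ends up with zero mean and unit variance. The standardize helper is hypothetical, and the example uses only the five mileage values shown in df.head() above, so its numbers will not match the scaled values computed from the full data set.

```python
from statistics import mean, pstdev

def standardize(column):
    """Apply z = (x - mean) / population standard deviation to each value."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Mileage values from the first five rows of df.head()
mileage = [8221, 9135, 13196, 16342, 19832]
scaled = standardize(mileage)

print([round(z, 3) for z in scaled])
```

Because every feature is now measured in standard deviations, a coefficient tells you how much the predicted price moves per standard deviation of that feature, which is what makes the coefficients comparable.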



In [4]:

    
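import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

# Features used to predict price; .copy() avoids pandas' SettingWithCopyWarning
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# Standardize each feature to zero mean and unit variance
# (.to_numpy() replaces .as_matrix(), which newer pandas versions no longer provide)
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].to_numpy())

print(X)

# Note: no constant column is added, so this model has no intercept term
est = sm.OLS(y, X).fit()

est.summary()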









    



      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102  0.527410  0.556279
29   0.660837  0.527410  0.556279
..        ...       ...       ...
774 -0.161262 -0.914896  0.556279
775 -0.089234 -0.914896  0.556279
776 -0.040523 -0.914896  0.556279
777  0.002572 -0.914896  0.556279
778  0.236603 -0.914896  0.556279
779  0.249666 -0.914896  0.556279
780  0.357220 -0.914896  0.556279
781  0.365521 -0.914896  0.556279
782  0.434131 -0.914896  0.556279
783  0.517269 -0.914896  0.556279
784  0.589908 -0.914896  0.556279
785  0.599186 -0.914896  0.556279
786  0.793052 -0.914896  0.556279
787  1.033554 -0.914896  0.556279
788  1.045762 -0.914896  0.556279
789  1.205567 -0.914896  0.556279
790  1.541414 -0.914896  0.556279
791  1.561070 -0.914896  0.556279
792  1.725026 -0.914896  0.556279
793  1.851502 -0.914896  0.556279
794 -1.709871  0.527410  0.556279
795 -1.474375  0.527410  0.556279
796 -1.187849  0.527410  0.556279
797 -1.079929  0.527410  0.556279
798 -0.682430  0.527410  0.556279
799 -0.439853  0.527410  0.556279
800 -0.089966  0.527410  0.556279
801  0.079605  0.527410  0.556279
802  0.750446  0.527410  0.556279
803  1.932565  0.527410  0.556279

[804 rows x 3 columns]






    









    Out[4]:





OLS Regression Results

Dep. Variable:    Price             R-squared:           0.064
Model:            OLS               Adj. R-squared:      0.060
Method:           Least Squares     F-statistic:         18.11
Date:             Mon, 15 May 2017  Prob (F-statistic):  2.23e-11
Time:             10:31:39          Log-Likelihood:      -9207.1
No. Observations: 804               AIC:                 1.842e+04
Df Residuals:     801               BIC:                 1.843e+04
Df Model:         3
Covariance Type:  nonrobust

                coef  std err       t   P>|t|     [0.025     0.975]
Mileage   -1272.3412  804.623  -1.581   0.114  -2851.759    307.077
Cylinder   5587.4472  804.509   6.945   0.000   4008.252   7166.642
Doors     -1404.5513  804.275  -1.746   0.081  -2983.288    174.185

Omnibus:        157.913   Durbin-Watson:     0.008
Prob(Omnibus):  0.000     Jarque-Bera (JB):  257.529
Skew:           1.278     Prob(JB):          1.20e-56
Kurtosis:       4.074     Cond. No.          1.03

The table of coefficients above gives us the values to plug into an equation of the form: B1 * Mileage + B2 * Cylinder + B3 * Doors, where each feature is in standardized units. (There is no intercept B0 in this model, because we passed X to sm.OLS without adding a constant column.)

In this example, the coefficients make it pretty clear that the number of cylinders matters more than anything else. Could we have figured that out earlier?
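
To make the equation concrete, here is a minimal sketch that plugs the fitted coefficients into it for the first row of the scaled data. The predict helper is hypothetical; the numbers are copied from the summary table and from row 0 of the printed X. Note that sm.OLS(y, X) was fit without a constant column, so there is no intercept to add.

```python
# Coefficients copied from the OLS summary table above
coef = {'Mileage': -1272.3412, 'Cylinder': 5587.4472, 'Doors': -1404.5513}

# Row 0 of the standardized feature matrix printed above
row0 = {'Mileage': -1.417485, 'Cylinder': 0.527410, 'Doors': 0.556279}

def predict(coef, row):
    """Sum of coefficient * standardized feature value (no intercept term)."""
    return sum(coef[name] * row[name] for name in coef)

predicted = predict(coef, row0)
print(round(predicted, 2))
```

The result comes out at only a few thousand dollars, far below that car's actual price of about 17,314. That gap, together with the low R-squared, suggests the missing intercept is hurting the fit; adding a constant column (for example with sm.add_constant) would be worth trying.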



In [5]:

    
y.groupby(df.Doors).mean()









    Out[5]:





Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! (Maybe two doors implies a sports car in some cases?) So it makes sense that the number of doors is a weak predictor here. This is a very small data set, however, so we can't read much meaning into it.

Activity

Mess around with the input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?



In [ ]:
