Multivariate Regression

Let's grab a small data set of Blue Book car values:



In [1]:

    
import pandas as pd

# Reading the legacy .xls format requires the xlrd package to be installed
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')



In [2]:

    
df.head()









    Out[2]:






  
    
      
   Price         Mileage  Make   Model    Trim      Type   Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129  8221     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      1
1  17542.036083  9135     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
2  16218.847862  13196    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
3  16336.913140  16342    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      0
4  16339.170324  19832    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      1

We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we use pandas.Categorical to convert textual category data (model name) into an ordinal number that we can work with.

This is actually a questionable thing to do in the real world - doing a regression on categorical data only works well if there is some inherent order to the categories!
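One common alternative to ordinal codes is one-hot encoding, where each category gets its own 0/1 column and no ordering is implied. The sketch below is only an illustration on a made-up toy frame (the `toy` rows and the `X_onehot` name are assumptions, not part of the car data set); it is not what the regression in this notebook does.

```python
import pandas as pd

# Hypothetical toy rows standing in for the car data (values are made up).
toy = pd.DataFrame({
    "Mileage": [8221, 9135, 13196, 16342],
    "Model": ["Century", "Century", "Lacrosse", "Cavalier"],
})

# One 0/1 indicator column per model name. drop_first=True leaves out the
# first category (alphabetically) as the baseline, so the indicator columns
# are not collinear with a constant term added later.
dummies = pd.get_dummies(toy["Model"], prefix="Model", drop_first=True, dtype=int)

# Combine the numeric feature with the indicator columns.
X_onehot = pd.concat([toy[["Mileage"]], dummies], axis=1)
print(X_onehot)
```

With many distinct model names this produces many columns, which is one reason a quick demonstration might prefer a single ordinal column despite its drawbacks.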



In [3]:

    
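import statsmodels.api as sm

# Encode each model name as an integer code (codes follow the sorted names)
df['Model_ord'] = pd.Categorical(df.Model).codes
X = df[['Mileage', 'Model_ord', 'Doors']]  # features
y = df[['Price']]                          # value we want to predict

# OLS does not include an intercept by default, so add a constant column
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()

est.summary()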









    Out[3]:





OLS Regression Results

  Dep. Variable:           Price         R-squared:             0.042
  Model:                    OLS          Adj. R-squared:        0.038
  Method:              Least Squares     F-statistic:           11.57
  Date:              Mon, 25 Jul 2016    Prob (F-statistic):  1.98e-07
  Time:                  12:16:16        Log-Likelihood:      -8519.1
  No. Observations:          804         AIC:                1.705e+04
  Df Residuals:              800         BIC:                1.706e+04
  Df Model:                    3
  Covariance Type:       nonrobust

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  const       3.125e+04   1809.549     17.272   0.000   2.77e+04  3.48e+04
  Mileage       -0.1765      0.042     -4.227   0.000     -0.259    -0.095
  Model_ord    -39.0387     39.326     -0.993   0.321   -116.234    38.157
  Doors      -1652.9303    402.649     -4.105   0.000  -2443.303  -862.558

  Omnibus:        206.410    Durbin-Watson:         0.080
  Prob(Omnibus):   0.000     Jarque-Bera (JB):    470.872
  Skew:            1.379     Prob(JB):           5.64e-103
  Kurtosis:        5.541     Cond. No.           1.15e+05

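The table of coefficients above gives us the values to plug into an equation of the form: B0 + B1 * Mileage + B2 * Model_ord + B3 * Doors.

The raw coefficients are not directly comparable, because the features are on different scales, so the t and P>|t| columns are the better guide. In this example, Mileage has the strongest t statistic (-4.227), with Doors close behind (-4.105), while Model_ord is not significant (P>|t| = 0.321). Could we have figured some of that out earlier?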
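To make the equation concrete, here is a minimal sketch that plugs the coefficients from the summary table into it by hand. The `predict_price` helper and the example car (20,000 miles, model code 10, four doors) are hypothetical, and the constant is only as precise as the rounded `3.125e+04` printed in the table.

```python
# Coefficients copied from the OLS summary table. The constant is printed
# there as 3.125e+04, so it is only known to four significant figures.
COEF = {
    "const": 3.125e4,
    "Mileage": -0.1765,
    "Model_ord": -39.0387,
    "Doors": -1652.9303,
}

def predict_price(mileage, model_ord, doors):
    """Evaluate B0 + B1 * Mileage + B2 * Model_ord + B3 * Doors."""
    return (COEF["const"]
            + COEF["Mileage"] * mileage
            + COEF["Model_ord"] * model_ord
            + COEF["Doors"] * doors)

# Hypothetical car: 20,000 miles, model code 10, four doors.
estimate = predict_price(mileage=20000, model_ord=10, doors=4)
print(round(estimate, 2))
```

With the fitted model still in memory, `est.params` holds the unrounded coefficients and `est.predict(X1)` should perform the same calculation for every row.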



In [4]:

    
y.groupby(df.Doors).mean()









    Out[4]:






  
    
      
       Price
Doors
2      23807.135520
4      20580.670749

Surprisingly, more doors does not mean a higher price: two-door cars average about $3,200 more than four-door cars. (Maybe two doors implies a sports car in some cases?) That is consistent with the negative Doors coefficient in the regression. This is a small data set, however, and the model's R-squared is only 0.042, so we can't really read much meaning into it.
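Before reading much into a difference in group means, it can help to check how many rows sit in each group. The sketch below does that on made-up numbers; the `toy` frame and its prices are hypothetical and are not drawn from the car data.

```python
import pandas as pd

# Hypothetical prices for a handful of two-door and four-door cars.
toy = pd.DataFrame({
    "Doors": [2, 2, 4, 4, 4, 4],
    "Price": [24000.0, 23000.0, 21000.0, 20000.0, 22000.0, 19000.0],
})

# Mean price and number of rows per door count, side by side.
summary = toy.groupby("Doors")["Price"].agg(["mean", "count"])
print(summary)
```

On the real data, the same pattern would be `df.groupby('Doors')['Price'].agg(['mean', 'count'])`.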

Activity

Mess around with the input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?



In [ ]:
