Let's grab a small data set of Blue Book car values:
In [1]:
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
In [2]:
df.head()
Out[2]:
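We can use pandas to split this table into the features we're interested in and the value we're trying to predict.
Note how we use pandas.Categorical to convert textual category data (model name) into integer codes that we can work with. The codes follow the sorted order of the category labels, so they identify a model but don't rank it in any meaningful way; treating them as a numeric feature is a simplification.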
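As a quick illustration of that conversion, here is a minimal sketch on a made-up column. The model names below are hypothetical and are not taken from cars.xls; the point is only to show how the codes are assigned.

```python
import pandas as pd

# Hypothetical toy column; these names are for illustration only.
toy = pd.DataFrame({'Model': ['Sedan', 'Coupe', 'Wagon', 'Coupe']})

cat = pd.Categorical(toy.Model)
toy['Model_ord'] = cat.codes

# Categories are sorted alphabetically; each code is an index into that list.
print(list(cat.categories))        # ['Coupe', 'Sedan', 'Wagon']
print(toy['Model_ord'].tolist())   # [1, 0, 2, 0]
```

Because the codes come from alphabetical order, 'Wagon' gets a larger number than 'Coupe' for no reason related to price.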
In [3]:
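import statsmodels.api as sm
# Encode each model name as an integer category code
df['Model_ord'] = pd.Categorical(df.Model).codes
# Features (X) and the value we want to predict (y)
X = df[['Mileage', 'Model_ord', 'Doors']]
y = df[['Price']]
# OLS has no intercept unless we add a constant column ourselves
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()
est.summary()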
Out[3]:
In [4]:
y.groupby(df.Doors).mean()
Out[4]:
Surprisingly, more doors does not mean a higher price! So it's no wonder that the number of doors is pretty useless as a predictor here. This is a small data set, however, so we can't read too much into it.
Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?
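One possible starting point for this exercise is sketched below. It does not touch cars.xls at all: the data is entirely synthetic, and the column values, the $1,500-per-door effect, and the noise level are all assumptions chosen for illustration. To keep the sketch dependency-light it solves the least-squares problem with NumPy rather than statsmodels; passing the same columns to sm.OLS should give matching coefficients.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Entirely synthetic data; every number here is made up for illustration.
fake = pd.DataFrame({
    'Mileage': rng.uniform(5_000, 50_000, n),
    'Doors': rng.choice([2, 4, 6], n),
})

# Assumed relationship: each door adds $1,500, each mile subtracts $0.20.
noise = rng.normal(0, 500, n)
fake['Price'] = 20_000 - 0.2 * fake['Mileage'] + 1_500 * fake['Doors'] + noise

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), fake['Mileage'], fake['Doors']])
coef, *_ = np.linalg.lstsq(A, fake['Price'].to_numpy(), rcond=None)
intercept, mileage_coef, doors_coef = coef

# Mean price per door count, then the fitted effect of one extra door.
means = fake.groupby('Doors')['Price'].mean()
print(means)
print(f"Doors coefficient: {doors_coef:.0f}")
```

With an effect this large relative to the noise, the fitted Doors coefficient should land close to the assumed 1,500, and the group means should rise with door count. Try shrinking the effect or raising the noise to see when it stops being measurable.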
In [ ]: