4-Model the Data

"All models are wrong, Some of them are useful"

  • The power and limits of models
  • Tradeoff between Prediction Accuracy and Model Interpretability
  • Assessing Model Accuracy
  • Regression models (Simple, Multiple)
  • Classification model

In [91]:
# Load the libraries
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import linear_model

In [30]:
# Load the data again!
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
df.sort(columns=['State','date'], inplace=True)
df1 = df[df.State=="California"].copy()
df1.set_index("date", inplace=True)
print df1.shape
idx = pd.date_range(df1.index.min(), df1.index.max())
df1 = df1.reindex(idx)
df1.fillna(method = "ffill", inplace=True)
print df1.shape


(449, 7)
(532, 7)

In [10]:
df1.head()


Out[10]:
State HighQ HighQN MedQ MedQN LowQ LowQN
2013-12-27 California 248.77 12021 193.44 12724 193.88 770
2013-12-28 California 248.74 12025 193.44 12728 193.88 770
2013-12-29 California 248.76 12047 193.55 12760 193.60 772
2013-12-30 California 248.82 12065 193.54 12779 193.80 773
2013-12-31 California 248.76 12082 193.54 12792 193.80 773

In [12]:
#Reading demographics data
demographics = pd.DataFrame.from_csv("data/Demographics_State.csv",header=0,index_col=False,sep=',')

In [17]:
demographics.rename(columns={'region':'State'}, inplace=True)
demographics.head()


Out[17]:
State total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 4799277 67 26 1 4 23680 501 38.1
1 alaska 720316 63 3 5 6 32651 978 33.6
2 arizona 6479703 57 4 3 30 25358 747 36.3
3 arkansas 2933369 74 15 1 7 22170 480 37.5
4 california 37659181 40 6 13 38 29527 1119 35.4

In [18]:
df['State'] = df['State'].str.lower()
df.head()


Out[18]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
20094 alabama 339.65 1033 198.04 926 147.15 122 2013-12-27
20859 alabama 339.65 1033 198.04 926 147.15 122 2013-12-28
21573 alabama 339.75 1036 198.26 929 149.49 123 2013-12-29
22287 alabama 339.75 1036 198.81 930 149.49 123 2013-12-30
22797 alabama 339.42 1040 198.68 932 149.49 123 2013-12-31

In [20]:
df_demo = pd.merge(df, demographics, how="inner", on="State")
df_demo.head()


Out[20]:
State HighQ HighQN MedQ MedQN LowQ LowQN date total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 339.65 1033 198.04 926 147.15 122 2013-12-27 4799277 67 26 1 4 23680 501 38.1
1 alabama 339.65 1033 198.04 926 147.15 122 2013-12-28 4799277 67 26 1 4 23680 501 38.1
2 alabama 339.75 1036 198.26 929 149.49 123 2013-12-29 4799277 67 26 1 4 23680 501 38.1
3 alabama 339.75 1036 198.81 930 149.49 123 2013-12-30 4799277 67 26 1 4 23680 501 38.1
4 alabama 339.42 1040 198.68 932 149.49 123 2013-12-31 4799277 67 26 1 4 23680 501 38.1

Correlation


In [25]:
corr_bw_percapita_highq = stats.pearsonr(df_demo.per_capita_income, df_demo.HighQ)[0]
print corr_bw_percapita_highq


0.120703416389

Exercise Find correlation between percent_white and highQ


In [ ]:

Impact of de-regulation


In [27]:
state_location = pd.read_csv("data/State_Location.csv")
state_location.head()


Out[27]:
region state latitude longitude status
0 alaska AK 61.3850 -152.2683 legal
1 alabama AL 32.7990 -86.8073 medical-limited
2 arkansas AR 34.9513 -92.3809 illegal
3 arizona AZ 33.7712 -111.3877 medical
4 california CA 36.1700 -119.7462 decriminalized+medical

In [28]:
pd.unique(state_location.status)


Out[28]:
array(['legal', 'medical-limited', 'illegal', 'medical',
       'decriminalized+medical', 'decriminalized'], dtype=object)

Exercise Find mean prices of HighQ weed for states that are legal and for states that are illegal


In [ ]:

Finding good time of the week to buy weed in California


In [45]:
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['week'] = pd.DatetimeIndex(df['date']).week
df['weekday'] = pd.DatetimeIndex(df['date']).weekday

In [42]:
df_demo_ca = df_demo[df_demo.State=="california"].copy()
df_demo_ca['year'] = pd.DatetimeIndex(df_demo_ca['date']).year
df_demo_ca['month'] = pd.DatetimeIndex(df_demo_ca['date']).month
df_demo_ca['week'] = pd.DatetimeIndex(df_demo_ca['date']).week
df_demo_ca['weekday'] = pd.DatetimeIndex(df_demo_ca['date']).weekday

df_demo_ca.head()


Out[42]:
State HighQ HighQN MedQ MedQN LowQ LowQN date total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age year month week weekday
1796 california 248.77 12021 193.44 12724 193.88 770 2013-12-27 37659181 40 6 13 38 29527 1119 35.4 2013 12 52 4
1797 california 248.74 12025 193.44 12728 193.88 770 2013-12-28 37659181 40 6 13 38 29527 1119 35.4 2013 12 52 5
1798 california 248.76 12047 193.55 12760 193.60 772 2013-12-29 37659181 40 6 13 38 29527 1119 35.4 2013 12 52 6
1799 california 248.82 12065 193.54 12779 193.80 773 2013-12-30 37659181 40 6 13 38 29527 1119 35.4 2013 12 1 0
1800 california 248.76 12082 193.54 12792 193.80 773 2013-12-31 37659181 40 6 13 38 29527 1119 35.4 2013 12 1 1

In [43]:
df_demo_ca.groupby("weekday").HighQ.mean()


Out[43]:
weekday
0    245.348923
1    245.364375
2    245.359219
3    245.342813
4    245.434219
5    245.425556
6    245.359231
Name: HighQ, dtype: float64

Exercise If I need to buy weed on a wednesday, which state should I be in?


In [46]:
df.groupby(["State", "weekday"]).HighQ.mean()


Out[46]:
State          weekday
Alabama        0          339.556308
               1          339.577656
               2          339.559375
               3          339.552500
               4          339.577969
               5          339.568254
               6          339.541385
Alaska         0          291.463231
               1          291.384687
               2          291.390000
               3          291.597500
               4          291.506406
               5          291.596032
               6          291.438923
Arizona        0          300.612000
               1          300.653906
               2          300.643750
               3          300.655000
               4          300.751562
               5          300.715873
               6          300.642308
Arkansas       0          347.930462
               1          348.011250
               2          348.017656
               3          347.934844
               4          348.278281
               5          348.235238
               6          347.991077
California     0          245.348923
               1          245.364375
                             ...    
Virginia       5          368.245238
               6          368.116462
Washington     0          233.381231
               1          233.377969
               2          233.361875
               3          233.363594
               4          233.422500
               5          233.426032
               6          233.382308
West Virginia  0          357.408000
               1          357.428906
               2          357.368125
               3          357.330156
               4          357.395938
               5          357.424444
               6          357.406462
Wisconsin      0          352.981077
               1          353.050313
               2          353.040469
               3          353.000938
               4          353.160000
               5          353.141905
               6          352.996308
Wyoming        0          338.837538
               1          339.090625
               2          339.045781
               3          338.932969
               4          339.260469
               5          339.072698
               6          338.910769
Name: HighQ, dtype: float64

In [47]:
df_st_wk = df.groupby(["State", "weekday"]).HighQ.mean()

In [53]:
df_st_wk.reset_index()


Out[53]:
State weekday HighQ
0 Alabama 0 339.556308
1 Alabama 1 339.577656
2 Alabama 2 339.559375
3 Alabama 3 339.552500
4 Alabama 4 339.577969
5 Alabama 5 339.568254
6 Alabama 6 339.541385
7 Alaska 0 291.463231
8 Alaska 1 291.384687
9 Alaska 2 291.390000
10 Alaska 3 291.597500
11 Alaska 4 291.506406
12 Alaska 5 291.596032
13 Alaska 6 291.438923
14 Arizona 0 300.612000
15 Arizona 1 300.653906
16 Arizona 2 300.643750
17 Arizona 3 300.655000
18 Arizona 4 300.751562
19 Arizona 5 300.715873
20 Arizona 6 300.642308
21 Arkansas 0 347.930462
22 Arkansas 1 348.011250
23 Arkansas 2 348.017656
24 Arkansas 3 347.934844
25 Arkansas 4 348.278281
26 Arkansas 5 348.235238
27 Arkansas 6 347.991077
28 California 0 245.348923
29 California 1 245.364375
... ... ... ...
327 Virginia 5 368.245238
328 Virginia 6 368.116462
329 Washington 0 233.381231
330 Washington 1 233.377969
331 Washington 2 233.361875
332 Washington 3 233.363594
333 Washington 4 233.422500
334 Washington 5 233.426032
335 Washington 6 233.382308
336 West Virginia 0 357.408000
337 West Virginia 1 357.428906
338 West Virginia 2 357.368125
339 West Virginia 3 357.330156
340 West Virginia 4 357.395938
341 West Virginia 5 357.424444
342 West Virginia 6 357.406462
343 Wisconsin 0 352.981077
344 Wisconsin 1 353.050313
345 Wisconsin 2 353.040469
346 Wisconsin 3 353.000938
347 Wisconsin 4 353.160000
348 Wisconsin 5 353.141905
349 Wisconsin 6 352.996308
350 Wyoming 0 338.837538
351 Wyoming 1 339.090625
352 Wyoming 2 339.045781
353 Wyoming 3 338.932969
354 Wyoming 4 339.260469
355 Wyoming 5 339.072698
356 Wyoming 6 338.910769

357 rows × 3 columns


In [54]:
#Answer:

Regression

Predicting price of HighQ weed in CA


In [127]:
model_data = df1.loc[:,['HighQ']].copy()
idx = pd.date_range(model_data.index.min(), model_data.index.max()+ 30)
model_data.reset_index(inplace=True)
model_data.set_index("index", inplace=True)
model_data = model_data.reindex(idx)
model_data.tail(35)


Out[127]:
HighQ
2015-06-07 241.91
2015-06-08 241.93
2015-06-09 241.93
2015-06-10 241.93
2015-06-11 241.93
2015-06-12 NaN
2015-06-13 NaN
2015-06-14 NaN
2015-06-15 NaN
2015-06-16 NaN
2015-06-17 NaN
2015-06-18 NaN
2015-06-19 NaN
2015-06-20 NaN
2015-06-21 NaN
2015-06-22 NaN
2015-06-23 NaN
2015-06-24 NaN
2015-06-25 NaN
2015-06-26 NaN
2015-06-27 NaN
2015-06-28 NaN
2015-06-29 NaN
2015-06-30 NaN
2015-07-01 NaN
2015-07-02 NaN
2015-07-03 NaN
2015-07-04 NaN
2015-07-05 NaN
2015-07-06 NaN
2015-07-07 NaN
2015-07-08 NaN
2015-07-09 NaN
2015-07-10 NaN
2015-07-11 NaN

In [128]:
model_data['IND'] = np.arange(model_data.shape[0])
model_data.tail(35)


Out[128]:
HighQ IND
2015-06-07 241.91 527
2015-06-08 241.93 528
2015-06-09 241.93 529
2015-06-10 241.93 530
2015-06-11 241.93 531
2015-06-12 NaN 532
2015-06-13 NaN 533
2015-06-14 NaN 534
2015-06-15 NaN 535
2015-06-16 NaN 536
2015-06-17 NaN 537
2015-06-18 NaN 538
2015-06-19 NaN 539
2015-06-20 NaN 540
2015-06-21 NaN 541
2015-06-22 NaN 542
2015-06-23 NaN 543
2015-06-24 NaN 544
2015-06-25 NaN 545
2015-06-26 NaN 546
2015-06-27 NaN 547
2015-06-28 NaN 548
2015-06-29 NaN 549
2015-06-30 NaN 550
2015-07-01 NaN 551
2015-07-02 NaN 552
2015-07-03 NaN 553
2015-07-04 NaN 554
2015-07-05 NaN 555
2015-07-06 NaN 556
2015-07-07 NaN 557
2015-07-08 NaN 558
2015-07-09 NaN 559
2015-07-10 NaN 560
2015-07-11 NaN 561

In [129]:
model_data['IND_SQ'] = model_data['IND']**2
x = model_data.ix[0:532, ["IND","IND_SQ"]]
y = model_data.ix[0:532, "HighQ"]

In [132]:
x_test = model_data.ix[532:, ["IND","IND_SQ"]]

In [130]:
print x.shape, y.shape


(532, 2) (532,)

In [131]:
ols = linear_model.LinearRegression(fit_intercept=True)
ols.fit(x, y)


Out[131]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [133]:
ols_predict = ols.predict(x_test)

In [136]:
ols_predict


Out[136]:
array([ 242.22975061,  242.22165677,  242.21357583,  242.2055078 ,
        242.19745268,  242.18941047,  242.18138116,  242.17336477,
        242.16536128,  242.15737069,  242.14939302,  242.14142825,
        242.13347639,  242.12553744,  242.11761139,  242.10969826,
        242.10179803,  242.09391071,  242.08603629,  242.07817479,
        242.07032619,  242.0624905 ,  242.05466772,  242.04685784,
        242.03906087,  242.03127681,  242.02350566,  242.01574741,
        242.00800208,  242.00026965])

In [143]:
ols.coef_


Out[143]:
array([ -1.49671672e-02,   6.45382476e-06])

Exercise Predict prices for MedQ for CA


In [ ]: