Getting familiar with the IPython notebook


In [1]:
print("hello world")


hello world

In [2]:
########################################
#                   1                  #
########################################

# Load the boston dataset included with sklearn

##### Start solution code #####
from sklearn import datasets
dataset = datasets.load_boston()
##### End solution code #####

In [3]:
########################################
#                   2                  #
########################################

# Run this cell to see what it does.
# Then modify the code to print both the description and the data point.

dataset.DESCR
dataset.data[0]

##### Start solution code #####
print(dataset.DESCR)
print(dataset.data[0])
##### End solution code #####


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
   5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
   1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
   4.98000000e+00]
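
The Bunch object returned by load_boston also carries the labels: per the description above, the median home value (MEDV) is the target. As a minimal sketch, feature_names and target are standard attributes of the Bunch, so the first data point can be paired with its label like this:

print(dataset.feature_names)  # the 13 predictor names from the description
print(dataset.target[0])      # MEDV for the first data point, in $1000's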

Loading the data


In [4]:
# Make plots appear inline rather than in a separate window
# no-import-all prevents importing * from numpy and matplotlib
%pylab inline --no-import-all

# Import some useful libraries
import scipy
import numpy
import pandas
import seaborn # Importing seaborn automatically makes our plots look better
import matplotlib.pyplot as plt # Use the same plt alias that %pylab already sets up


Populating the interactive namespace from numpy and matplotlib

In [5]:
df = pandas.read_csv("candy_choices.csv")
df.count()


Out[5]:
gender         173
candy          174
flavor          52
age            169
ethnicity      174
shirt color    174
dtype: int64
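
Note that df.count() counts only the non-null entries, so the uneven numbers above (52 flavors against 174 candy choices, for instance) indicate missing values. A quick sketch for viewing the missing counts directly, using standard pandas calls:

df.isnull().sum()  # number of missing entries in each column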

In [6]:
# Each event will contain a tuple (selection index, selection, time since previous
# selection, measured as the number of intervening selections)
event_list = []

i = 0
time_since_last = {}  # candy -> selections since it was last chosen

for item in df["candy"].values:
    # Only record an event once this candy has been seen before
    if item in time_since_last:
        event_list.append((i, item, time_since_last[item]))

    # Advance every candy's counter by one selection...
    for e in time_since_last.keys():
        time_since_last[e] += 1

    # ...then reset the counter for the candy just chosen
    time_since_last[item] = 0
    i += 1
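
As a sanity check on the loop's semantics, here is a sketch on a hypothetical toy sequence; the second "a" is recorded with interselection time 2, the number of other selections made in between:

toy_events = []
tsl = {}
for i, item in enumerate(["a", "b", "c", "a"]):  # hypothetical toy data
    if item in tsl:
        toy_events.append((i, item, tsl[item]))
    for e in tsl.keys():
        tsl[e] += 1
    tsl[item] = 0
print(toy_events)  # [(3, 'a', 2)]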

In [7]:
event_list[:10]


Out[7]:
[(4, 'reeses', 3),
 (5, 'starburst', 1),
 (7, 'airhead', 4),
 (8, 'starburst', 2),
 (9, 'reeses', 4),
 (11, 'kitkat', 9),
 (12, 'airhead', 4),
 (13, 'kitkat', 1),
 (14, 'kitkat', 0),
 (15, 'kitkat', 0)]

Plots of interselection times


In [8]:
def plot_interselection_time(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            
    # Plot the interselection times
    plt.plot(range(len(candy)), candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)

In [9]:
plot_interselection_time(event_list, "orange", "airhead")



In [10]:
plot_interselection_time(event_list, "red", "starburst")
plot_interselection_time(event_list, "orange", "airhead")



In [11]:
########################################
#                   3                  #
########################################

# Modify this function so that a 5 on the x-axis corresponds to
# the 5th time any candy was chosen

def plot_interselection_time_scaled(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            
    # Plot the interselection times
    plt.plot(range(len(candy)), candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)
    
    
##### Start solution code #####
def plot_interselection_time_scaled(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    selection_numbers = []
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            selection_numbers.append(i)
            
    # Plot the interselection times
    plt.plot(selection_numbers, candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)
##### End solution code #####

In [12]:
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [13]:
plot_interselection_time_scaled(event_list, "red", "starburst")
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [14]:
plot_interselection_time_scaled(event_list, "blue", "reeses")
plot_interselection_time_scaled(event_list, "green", "rolo")
plot_interselection_time_scaled(event_list, "yellow", "kitkat")
plot_interselection_time_scaled(event_list, "purple", "hersheys")
plot_interselection_time_scaled(event_list, "red", "starburst")
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [15]:
plot_interselection_time_scaled(event_list, "blue", "reeses")
plot_interselection_time_scaled(event_list, "green", "rolo")


Building the training points


In [16]:
# Each shared-state event maps every candy type to the time since that candy was last
# selected. The list is seeded with an all-zero state, which is why events_frame below
# has 175 rows for 174 selections.
shared_state_events = [{"airhead":0, "starburst":0, "hersheys":0, "reeses":0, "kitkat":0, "rolo":0}]

import copy

i = 0
time_since_last = {}
for item in df["candy"].values:
    if item not in time_since_last:
        time_since_last[item] = 0

    # Copy the previous state and update only the entry for the candy just chosen
    curr_shared_event = copy.deepcopy(shared_state_events[-1])
    curr_shared_event[item] = time_since_last[item]
    shared_state_events.append(curr_shared_event)

    time_since_last[item] = 0

    for e in time_since_last.keys():
        if e != item:
            time_since_last[e] += 1

    i = i + 1

In [17]:
events_frame = pandas.DataFrame(shared_state_events)

In [18]:
events_frame


Out[18]:
airhead hersheys kitkat reeses rolo starburst
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 3 0 0
6 0 0 0 3 0 1
7 0 0 0 3 0 1
8 4 0 0 3 0 1
9 4 0 0 3 0 2
10 4 0 0 4 0 2
11 4 0 0 4 0 2
12 4 0 9 4 0 2
13 4 0 9 4 0 2
14 4 0 1 4 0 2
15 4 0 0 4 0 2
16 4 0 0 4 0 2
17 4 0 0 4 0 2
18 4 0 0 4 6 2
19 4 0 0 4 0 2
20 6 0 0 4 0 2
21 6 13 0 4 0 2
22 6 13 0 11 0 2
23 6 13 0 0 0 2
24 6 13 6 0 0 2
25 6 13 0 0 0 2
26 5 13 0 0 0 2
27 0 13 0 0 0 2
28 0 6 0 0 0 2
29 0 0 0 0 0 2
... ... ... ... ... ... ...
145 0 1 0 5 0 10
146 0 9 0 5 0 10
147 0 0 0 5 0 10
148 0 0 8 5 0 10
149 0 1 8 5 0 10
150 0 0 8 5 0 10
151 0 0 8 5 7 10
152 0 0 8 5 0 10
153 0 2 8 5 0 10
154 0 2 8 5 1 10
155 0 2 8 5 1 31
156 0 2 8 5 1 0
157 0 2 8 5 1 0
158 0 2 8 5 3 0
159 0 5 8 5 3 0
160 0 5 8 22 3 0
161 0 1 8 22 3 0
162 0 1 8 22 3 0
163 0 1 8 22 0 0
164 0 1 8 22 0 0
165 0 1 8 4 0 0
166 0 1 8 0 0 0
167 0 1 8 0 0 0
168 0 1 8 0 0 0
169 0 7 8 0 0 0
170 0 7 8 1 0 0
171 0 7 8 0 0 0
172 0 7 8 0 0 0
173 0 7 8 0 0 0
174 0 7 8 0 0 16

175 rows × 6 columns


In [19]:
# Set a random seed so we will get the same results each time
import random
random.seed(5656)

# Randomly select 30 events for our test set
test_indices = set(random.sample(range(events_frame.shape[0]), 30))

# Split our data into training and test data
train_features = []
train_labels = []
test_features = []
test_labels = []

i = 0
# Column order is alphabetical, matching events_frame.columns
for airhead, hersheys, kitkat, reeses, rolo, starburst in events_frame.values:
    if i in test_indices:
        # Use starburst as our label, and all others as our features
        test_features.append([airhead, hersheys, kitkat, reeses, rolo])
        test_labels.append(starburst)
    else:
        train_features.append([airhead, hersheys, kitkat, reeses, rolo])
        train_labels.append(starburst)
    
    i += 1
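
For reference, scikit-learn ships a utility that performs this kind of split in one call. A sketch, assuming a version where it lives in sklearn.model_selection (older releases used sklearn.cross_validation); it draws its own random sample, so it will not reproduce the exact indices chosen above:

from sklearn.model_selection import train_test_split

X = events_frame[["airhead", "hersheys", "kitkat", "reeses", "rolo"]].values
y = events_frame["starburst"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=30, random_state=5656)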

Linear regression model with all features


In [20]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(train_features, train_labels)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [21]:
# See which features had the most influence on our model 
list(zip(events_frame.columns, model.coef_))  # list() so the pairs display under Python 3 too


Out[21]:
[('airhead', 0.30620078846315663),
 ('hersheys', 0.068253337998380445),
 ('kitkat', 0.14889057862735286),
 ('reeses', 0.063000309132943078),
 ('rolo', 0.13841710463674176)]
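
To rank the features by influence, the coefficient pairs can be sorted by absolute magnitude; with the values above this puts airhead and kitkat on top, which motivates the restricted model below. A sketch (the label column is dropped so the names line up with coef_):

feature_names = [c for c in events_frame.columns if c != "starburst"]  # exclude the label
sorted(zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True)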

In [22]:
# Print mean squared error and R^2 on the training set
print(numpy.mean((model.predict(train_features) - train_labels) ** 2))
print(model.score(train_features, train_labels))


35.331685859
0.0724177303131
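
These two numbers can be cross-checked against scikit-learn's own metric helpers; a sketch using sklearn.metrics:

from sklearn.metrics import mean_squared_error, r2_score

predictions = model.predict(train_features)
print(mean_squared_error(train_labels, predictions))  # same as the hand-computed MSE
print(r2_score(train_labels, predictions))            # same as model.score(...)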

In [23]:
# Plot predicted and true interarrival times on the training set

plt.plot(train_labels, color="green", label="True value")
plt.plot(model.predict(train_features), label="Predicted value")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[23]:
<matplotlib.legend.Legend at 0x1a09f438>

In [24]:
# Print mean squared error and R^2 on the test set
print(numpy.mean((model.predict(test_features) - test_labels) ** 2))
print(model.score(test_features, test_labels))


50.564517194
0.0260237677732

In [25]:
# Plot predicted and true time since selection on the test set 

plt.plot(test_labels, color="green", label="True value")
plt.plot(model.predict(test_features), label="Predicted value")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Time since selection", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[25]:
<matplotlib.legend.Legend at 0x198ac828>

Model performance with restricted features


In [26]:
# Restrict the features to just airhead and kitkat, the two most influential features
# (largest coefficients above)

train_features_res = [[e[0], e[2]] for e in train_features]
train_labels_res = train_labels
test_features_res = [[e[0], e[2]] for e in test_features]
test_labels_res = test_labels
model_res = linear_model.LinearRegression()
model_res.fit(train_features_res, train_labels_res)


Out[26]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
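
Selecting features by position (e[0] and e[2]) relies on remembering that the feature order is [airhead, hersheys, kitkat, reeses, rolo]. A name-based sketch of the same restriction is less fragile:

cols = ["airhead", "hersheys", "kitkat", "reeses", "rolo"]
keep = [cols.index("airhead"), cols.index("kitkat")]
train_features_res = [[row[j] for j in keep] for row in train_features]
test_features_res = [[row[j] for j in keep] for row in test_features]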

In [27]:
# Plot predicted and true interarrival times on the training set

plt.plot(train_labels_res, color="green", label="True interselection time")
plt.plot(model_res.predict(train_features_res), label="Predicted interselection time")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[27]:
<matplotlib.legend.Legend at 0x194bd3c8>

In [28]:
# Print the mean squared error and R^2 of the restricted model on the training set

print(numpy.mean((model_res.predict(train_features_res) - train_labels_res) ** 2))
print(model_res.score(train_features_res, train_labels_res))


36.6121458529
0.0388011066363

In [29]:
# Plot predicted and true interarrival times on the test set

plt.plot(test_labels_res, color="green", label="True interselection time")
plt.plot(model_res.predict(test_features_res), label="Predicted interselection time")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[29]:
<matplotlib.legend.Legend at 0x1755b978>

In [30]:
# Print the mean squared error and R^2 of the restricted model on the test set

print(numpy.mean((model_res.predict(test_features_res) - test_labels_res) ** 2))
print(model_res.score(test_features_res, test_labels_res))


51.0043482151
0.0175517208797
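
To summarize, a sketch that reuses the objects above to print the four R^2 scores side by side; both models explain little of the variance, and the restricted model fares slightly worse:

for name, m, Xtr, Xte in [("all features", model, train_features, test_features),
                          ("airhead+kitkat", model_res, train_features_res, test_features_res)]:
    print(name, m.score(Xtr, train_labels), m.score(Xte, test_labels))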