Getting familiar with the IPython notebook


In [1]:
print("hello world")


hello world

In [2]:
########################################
#                   1                  #
########################################

# Load the boston dataset included with sklearn

##### Start solution code #####
from sklearn import datasets
dataset = datasets.load_boston()
##### End solution code #####

In [3]:
########################################
#                   2                  #
########################################

# Run this cell to see what it does.
# Then modify the code to print both the description and the data point.

dataset.DESCR
dataset.data[0]

##### Start solution code #####
print(dataset.DESCR)
print(dataset.data[0])
##### End solution code #####


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
   5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
   1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
   4.98000000e+00]
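
The Bunch object returned by load_boston also carries the labels: per the description above, the median home value (MEDV) is the target. As a minimal sketch, feature_names and target are standard attributes of the Bunch, so the first data point can be paired with its label like this:

print(dataset.feature_names)  # the 13 predictor names from the description
print(dataset.target[0])      # MEDV for the first data point, in $1000's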

Loading the data


In [4]:
# Make plots appear inline rather than in a separate window
# no-import-all prevents importing * from numpy and matplotlib
%pylab inline --no-import-all

# Import some useful libraries
import scipy
import numpy
import pandas
import seaborn # Importing seaborn automatically makes our plots look better
import matplotlib.pyplot as plt # Use the same plt alias that %pylab already sets up


Populating the interactive namespace from numpy and matplotlib

In [5]:
df = pandas.read_csv("candy_choices.csv")
df.count()


Out[5]:
gender         173
candy          174
flavor          52
age            169
ethnicity      174
shirt color    174
dtype: int64
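
Note that df.count() counts only the non-null entries, so the uneven numbers above (52 flavors against 174 candy choices, for instance) indicate missing values. A quick sketch for viewing the missing counts directly, using standard pandas calls:

df.isnull().sum()  # number of missing entries in each column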

In [6]:
# Each event will contain a tuple (selection index, selection, time since previous
# selection, measured as the number of intervening selections)
event_list = []

i = 0
time_since_last = {}  # candy -> selections since it was last chosen

for item in df["candy"].values:
    # Only record an event once this candy has been seen before
    if item in time_since_last:
        event_list.append((i, item, time_since_last[item]))

    # Advance every candy's counter by one selection...
    for e in time_since_last.keys():
        time_since_last[e] += 1

    # ...then reset the counter for the candy just chosen
    time_since_last[item] = 0
    i += 1
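
As a sanity check on the loop's semantics, here is a sketch on a hypothetical toy sequence; the second "a" is recorded with interselection time 2, the number of other selections made in between:

toy_events = []
tsl = {}
for i, item in enumerate(["a", "b", "c", "a"]):  # hypothetical toy data
    if item in tsl:
        toy_events.append((i, item, tsl[item]))
    for e in tsl.keys():
        tsl[e] += 1
    tsl[item] = 0
print(toy_events)  # [(3, 'a', 2)]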

In [7]:
event_list[:10]


Out[7]:
[(4, 'reeses', 3),
 (5, 'starburst', 1),
 (7, 'airhead', 4),
 (8, 'starburst', 2),
 (9, 'reeses', 4),
 (11, 'kitkat', 9),
 (12, 'airhead', 4),
 (13, 'kitkat', 1),
 (14, 'kitkat', 0),
 (15, 'kitkat', 0)]

Plots of interselection times


In [8]:
def plot_interselection_time(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            
    # Plot the interselection times
    plt.plot(range(len(candy)), candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)

In [9]:
plot_interselection_time(event_list, "orange", "airhead")



In [10]:
plot_interselection_time(event_list, "red", "starburst")
plot_interselection_time(event_list, "orange", "airhead")



In [11]:
########################################
#                   3                  #
########################################

# Modify this function so that a 5 on the x-axis corresponds to
# the 5th time any candy was chosen

def plot_interselection_time_scaled(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            
    # Plot the interselection times
    plt.plot(range(len(candy)), candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)
    
    
##### Start solution code #####
def plot_interselection_time_scaled(events, color, candy_name):
    # Pull out the interselection times for the appropriate candy
    candy = [] 
    selection_numbers = []
    for (i, choice, time) in events:
        if choice == candy_name:
            candy.append(time)
            selection_numbers.append(i)
            
    # Plot the interselection times
    plt.plot(selection_numbers, candy, color=color, label=candy_name)
    
    # Add a legend and label the axes
    plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={'size':14})
    plt.xlabel("Selection number", fontsize=14)
    plt.ylabel("Interselection time", fontsize=14)
##### End solution code #####

In [12]:
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [13]:
plot_interselection_time_scaled(event_list, "red", "starburst")
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [14]:
plot_interselection_time_scaled(event_list, "blue", "reeses")
plot_interselection_time_scaled(event_list, "green", "rolo")
plot_interselection_time_scaled(event_list, "yellow", "kitkat")
plot_interselection_time_scaled(event_list, "purple", "hersheys")
plot_interselection_time_scaled(event_list, "red", "starburst")
plot_interselection_time_scaled(event_list, "orange", "airhead")



In [15]:
plot_interselection_time_scaled(event_list, "blue", "reeses")
plot_interselection_time_scaled(event_list, "green", "rolo")


Building the training points


In [16]:
# Each shared-state event maps every candy type to the time since that candy was last
# selected. The list is seeded with an all-zero state, which is why events_frame below
# has 175 rows for 174 selections.
shared_state_events = [{"airhead":0, "starburst":0, "hersheys":0, "reeses":0, "kitkat":0, "rolo":0}]

import copy

i = 0
time_since_last = {}
for item in df["candy"].values:
    if item not in time_since_last:
        time_since_last[item] = 0

    # Copy the previous state and update only the entry for the candy just chosen
    curr_shared_event = copy.deepcopy(shared_state_events[-1])
    curr_shared_event[item] = time_since_last[item]
    shared_state_events.append(curr_shared_event)

    time_since_last[item] = 0

    for e in time_since_last.keys():
        if e != item:
            time_since_last[e] += 1

    i = i + 1

In [17]:
events_frame = pandas.DataFrame(shared_state_events)

In [18]:
events_frame


Out[18]:
airhead hersheys kitkat reeses rolo starburst
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 3 0 0
6 0 0 0 3 0 1
7 0 0 0 3 0 1
8 4 0 0 3 0 1
9 4 0 0 3 0 2
10 4 0 0 4 0 2
11 4 0 0 4 0 2
12 4 0 9 4 0 2
13 4 0 9 4 0 2
14 4 0 1 4 0 2
15 4 0 0 4 0 2
16 4 0 0 4 0 2
17 4 0 0 4 0 2
18 4 0 0 4 6 2
19 4 0 0 4 0 2
20 6 0 0 4 0 2
21 6 13 0 4 0 2
22 6 13 0 11 0 2
23 6 13 0 0 0 2
24 6 13 6 0 0 2
25 6 13 0 0 0 2
26 5 13 0 0 0 2
27 0 13 0 0 0 2
28 0 6 0 0 0 2
29 0 0 0 0 0 2
... ... ... ... ... ... ...
145 0 1 0 5 0 10
146 0 9 0 5 0 10
147 0 0 0 5 0 10
148 0 0 8 5 0 10
149 0 1 8 5 0 10
150 0 0 8 5 0 10
151 0 0 8 5 7 10
152 0 0 8 5 0 10
153 0 2 8 5 0 10
154 0 2 8 5 1 10
155 0 2 8 5 1 31
156 0 2 8 5 1 0
157 0 2 8 5 1 0
158 0 2 8 5 3 0
159 0 5 8 5 3 0
160 0 5 8 22 3 0
161 0 1 8 22 3 0
162 0 1 8 22 3 0
163 0 1 8 22 0 0
164 0 1 8 22 0 0
165 0 1 8 4 0 0
166 0 1 8 0 0 0
167 0 1 8 0 0 0
168 0 1 8 0 0 0
169 0 7 8 0 0 0
170 0 7 8 1 0 0
171 0 7 8 0 0 0
172 0 7 8 0 0 0
173 0 7 8 0 0 0
174 0 7 8 0 0 16

175 rows × 6 columns


In [19]:
# Set a random seed so we will get the same results each time
import random
random.seed(5656)

# Randomly select 30 events for our test set
test_indices = set(random.sample(range(events_frame.shape[0]), 30))

# Split our data into training and test data
train_features = []
train_labels = []
test_features = []
test_labels = []

i = 0
# Column order is alphabetical, matching events_frame.columns
for airhead, hersheys, kitkat, reeses, rolo, starburst in events_frame.values:
    if i in test_indices:
        # Use starburst as our label, and all others as our features
        test_features.append([airhead, hersheys, kitkat, reeses, rolo])
        test_labels.append(starburst)
    else:
        train_features.append([airhead, hersheys, kitkat, reeses, rolo])
        train_labels.append(starburst)
    
    i += 1
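
For reference, scikit-learn ships a utility that performs this kind of split in one call. A sketch, assuming a version where it lives in sklearn.model_selection (older releases used sklearn.cross_validation); it draws its own random sample, so it will not reproduce the exact indices chosen above:

from sklearn.model_selection import train_test_split

X = events_frame[["airhead", "hersheys", "kitkat", "reeses", "rolo"]].values
y = events_frame["starburst"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=30, random_state=5656)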

Linear regression model with all features


In [20]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(train_features, train_labels)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [21]:
# See which features had the most influence on our model 
list(zip(events_frame.columns, model.coef_))  # list() so the pairs display under Python 3 too


Out[21]:
[('airhead', 0.30620078846315663),
 ('hersheys', 0.068253337998380445),
 ('kitkat', 0.14889057862735286),
 ('reeses', 0.063000309132943078),
 ('rolo', 0.13841710463674176)]
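
To rank the features by influence, the coefficient pairs can be sorted by absolute magnitude; with the values above this puts airhead and kitkat on top, which motivates the restricted model below. A sketch (the label column is dropped so the names line up with coef_):

feature_names = [c for c in events_frame.columns if c != "starburst"]  # exclude the label
sorted(zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True)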

In [22]:
# Print mean squared error and R^2 on the training set
print(numpy.mean((model.predict(train_features) - train_labels) ** 2))
print(model.score(train_features, train_labels))


35.331685859
0.0724177303131
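
These two numbers can be cross-checked against scikit-learn's own metric helpers; a sketch using sklearn.metrics:

from sklearn.metrics import mean_squared_error, r2_score

predictions = model.predict(train_features)
print(mean_squared_error(train_labels, predictions))  # same as the hand-computed MSE
print(r2_score(train_labels, predictions))            # same as model.score(...)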

In [23]:
# Plot predicted and true interarrival times on the training set

plt.plot(train_labels, color="green", label="True value")
plt.plot(model.predict(train_features), label="Predicted value")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[23]:
<matplotlib.legend.Legend at 0x1a09f438>

In [24]:
# Print mean squared error and R^2 on the test set
print(numpy.mean((model.predict(test_features) - test_labels) ** 2))
print(model.score(test_features, test_labels))


50.564517194
0.0260237677732

In [25]:
# Plot predicted and true time since selection on the test set 

plt.plot(test_labels, color="green", label="True value")
plt.plot(model.predict(test_features), label="Predicted value")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Time since selection", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[25]:
<matplotlib.legend.Legend at 0x198ac828>

Model performance with restricted features


In [26]:
# Restrict the features to just airhead and kitkat, the two most influential features
# (largest coefficients above)

train_features_res = [[e[0], e[2]] for e in train_features]
train_labels_res = train_labels
test_features_res = [[e[0], e[2]] for e in test_features]
test_labels_res = test_labels
model_res = linear_model.LinearRegression()
model_res.fit(train_features_res, train_labels_res)


Out[26]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
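
Selecting features by position (e[0] and e[2]) relies on remembering that the feature order is [airhead, hersheys, kitkat, reeses, rolo]. A name-based sketch of the same restriction is less fragile:

cols = ["airhead", "hersheys", "kitkat", "reeses", "rolo"]
keep = [cols.index("airhead"), cols.index("kitkat")]
train_features_res = [[row[j] for j in keep] for row in train_features]
test_features_res = [[row[j] for j in keep] for row in test_features]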

In [27]:
# Plot predicted and true interarrival times on the training set

plt.plot(train_labels_res, color="green", label="True interselection time")
plt.plot(model_res.predict(train_features_res), label="Predicted interselection time")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[27]:
<matplotlib.legend.Legend at 0x194bd3c8>

In [28]:
# Print the mean squared error and R^2 of the restricted model on the training set

print(numpy.mean((model_res.predict(train_features_res) - train_labels_res) ** 2))
print(model_res.score(train_features_res, train_labels_res))


36.6121458529
0.0388011066363

In [29]:
# Plot predicted and true interarrival times on the test set

plt.plot(test_labels_res, color="green", label="True interselection time")
plt.plot(model_res.predict(test_features_res), label="Predicted interselection time")

plt.xlabel("Selection number", fontsize=14)
plt.ylabel("Interselection time", fontsize=14)
plt.legend(frameon=True, shadow=True, framealpha=0.7, loc=0, prop={"size": 14})


Out[29]:
<matplotlib.legend.Legend at 0x1755b978>

In [30]:
# Print the mean squared error and R^2 of the restricted model on the test set

print(numpy.mean((model_res.predict(test_features_res) - test_labels_res) ** 2))
print(model_res.score(test_features_res, test_labels_res))


51.0043482151
0.0175517208797
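
To summarize, a sketch that reuses the objects above to print the four R^2 scores side by side; both models explain little of the variance, and the restricted model fares slightly worse:

for name, m, Xtr, Xte in [("all features", model, train_features, test_features),
                          ("airhead+kitkat", model_res, train_features_res, test_features_res)]:
    print(name, m.score(Xtr, train_labels), m.score(Xte, test_labels))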