Game time 2 / ML 4 Prelab!

We introduced you to this game in ML-III. Today we're going to go all in! Before class on Friday, make sure that you have, for each disease (D1 and D2), the classifier that performs best of all the classifiers you've constructed, and that you can estimate its accuracy.

To help you along, we've leveled you up to 500 examples of each! They're formatted just like the samples that you've already worked with, but each disease's examples are now collected into a single file (D1.csv and D2.csv).

You should start with the code from last time. In the interests of recording your research steps, remember that whatever you try should be recorded and noted in the IPython notebook.

Let's go ahead and load up the data.


In [6]:
# numpy provides python tools to easily load comma separated files.
import numpy as np

# use numpy to load disease #1 data
d1 = np.loadtxt("../31_Data_ML-IV/D1.csv", delimiter=",")

# features are the first 200 columns (indices 0-199) of every row.
# The canonical way to name this is that X is our matrix of
# examples by features.
X1 = d1[:,:200]

# labels are in column index 200 (the last column) of every row.
# The canonical way to name this is that y is our vector of
# labels.
y1 = d1[:,200]

# use numpy to load disease #2 data
d2 = np.loadtxt("../31_Data_ML-IV/D2.csv", delimiter=",")

# features are the first 200 columns (indices 0-199) of every row
X2 = d2[:,:200]
# labels are in column index 200 (the last column) of every row
y2 = d2[:,200]
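
Before moving on, it can't hurt to sanity-check what we loaded. Here's a minimal sketch that just prints the shapes of the feature matrices and label vectors; the expected counts are assumptions based on the description above.

In [ ]:
# Sanity check: with 500 examples and 200 features per disease, we
# expect each X to be (500, 200) and each y to be (500,). These
# expected sizes are assumptions from the prelab text - adjust if
# your files differ.
print("D1 features:", X1.shape, "labels:", y1.shape)
print("D2 features:", X2.shape, "labels:", y2.shape)

# Peek at the label values to confirm they look like class labels
print("D1 label values:", np.unique(y1))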

Random Seeds

Sometimes we want to do things randomly... the same way over and over, Groundhog Day style (after all, we're in Pennsylvania). Setting a random state lets us do this. The code below does not set a random state. Notice how the performance changes from run to run (run it a few times to see what happens).


In [7]:
# Import the function to split our data:
# (in older scikit-learn versions this lived in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# Split things into training and testing - let's have 33% of our data end up as testing
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.33)

# First, we need to import the classifier
from sklearn.tree import DecisionTreeClassifier

# Now we're going to get a decision tree classifier with the default parameters
classifier = DecisionTreeClassifier()

# The 'fit' syntax is the same
classifier.fit(X1_train, y1_train)

# As is the 'score' syntax
train_score = classifier.score(X1_train, y1_train)
test_score = classifier.score(X1_test, y1_test)


print("Training Accuracy: " + str(train_score))
print("Testing Accuracy: " + str(test_score))


Training Accuracy: 1.0
Testing Accuracy: 0.521212121212
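
To watch the run-to-run variation without manually re-running the cell, here's a small sketch that repeats the unseeded split-fit-score sequence a few times and prints each testing accuracy. The exact numbers you see will differ from run to run.

In [ ]:
# Repeat the unseeded split/fit/score a few times to watch the
# testing accuracy bounce around. Your numbers will differ each run.
for trial in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X1, y1, test_size=.33)
    clf = DecisionTreeClassifier()
    clf.fit(Xtr, ytr)
    print("Trial " + str(trial) + " testing accuracy: " + str(clf.score(Xte, yte)))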

As written, we can't do the same thing each time. This is because the computer, at various points, needs to make up random numbers, and by default that process differs from run to run. To make our work reproducible, we can tell the computer to make up pseudorandom numbers predictably by defining an initial state for our random number generator. scikit-learn usually does this through a function parameter called random_state. We could re-write the code above to use 42 as the random state. Once we do that (below), running the code will return the same result each time.


In [8]:
# Import the function to split our data:
# (in older scikit-learn versions this lived in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# Split things into training and testing - let's have 33% of our data end up as testing
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.33, random_state=42)

# First, we need to import the classifier
from sklearn.tree import DecisionTreeClassifier

# Now we're going to get a decision tree classifier with the default parameters
classifier = DecisionTreeClassifier(random_state=42)

# The 'fit' syntax is the same
classifier.fit(X1_train, y1_train)

# As is the 'score' syntax
train_score = classifier.score(X1_train, y1_train)
test_score = classifier.score(X1_test, y1_test)


print("Training Accuracy: " + str(train_score))
print("Testing Accuracy: " + str(test_score))


Training Accuracy: 1.0
Testing Accuracy: 0.50303030303
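
If you want to convince yourself that the seed really does pin things down, here's a quick sketch: run the seeded split and fit twice and compare the scores. They should match exactly.

In [ ]:
# With random_state fixed, two runs should produce identical results.
scores = []
for trial in range(2):
    Xtr, Xte, ytr, yte = train_test_split(X1, y1, test_size=.33, random_state=42)
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(Xtr, ytr)
    scores.append(clf.score(Xte, yte))
print("Run 1:", scores[0], "Run 2:", scores[1], "Identical:", scores[0] == scores[1])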

Cross Validation

Instead of using a separate training and testing partition, let's try out cross validation! A bit of light googling leads us to an example in scikit-learn. This seems pretty handy. Let's try it out!


In [9]:
# Import the cross validation scoring function:
# (in older scikit-learn versions this lived in sklearn.cross_validation)
from sklearn.model_selection import cross_val_score

# First, we need to import the classifier
from sklearn.tree import DecisionTreeClassifier

# Now we're going to get a decision tree classifier with the default parameters
classifier = DecisionTreeClassifier(random_state=42)

# Now we get the scores using this cross_val_score function
# Note: We don't have to split the data with this approach.
scores = cross_val_score(classifier, X1, y1)

# Report the mean accuracy plus or minus two standard deviations
# (a rough 95% interval across the folds)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# NOTE HOWEVER THAT WE DO NOT GET A CLASSIFIER BACK.
# IF WE TRY TO RUN ANY OF THE CODE BELOW, WE WOULD
# GET AN ERROR. IF WE WANT A CLASSIFIER INSTEAD OF
# A PERFORMANCE ASSESSMENT, WE FIRST NEED TO FIT A
# NEW ONE. WE COULD DO THIS OVER ALL THE DATA WITH
# THE CODE:
# classifier.fit(X1, y1)
# YOU MAY USE THIS APPROACH FOR YOUR HOMEWORK
# IF YOU USE CROSS VALIDATION TO ASSESS PERFORMANCE.

# The 'score' lines from the earlier cells would look like this,
# but they would error here without a fitted classifier:
#train_score = classifier.score(X1_train, y1_train)
#test_score = classifier.score(X1_test, y1_test)

#print("Training Accuracy: " + str(train_score))
#print("Testing Accuracy: " + str(test_score))


Accuracy: 0.53 (+/- 0.04)
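
Since the homework asks you to pick your best-performing classifier, one handy pattern is to cross-validate several candidates side by side and then refit the winner on all of the data. Here's a sketch; the candidate list is just an illustration, so swap in whichever classifiers you've actually built.

In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Candidate classifiers - an illustrative list, not a prescription.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "k nearest neighbors": KNeighborsClassifier(),
}

# Cross-validate each candidate on the disease 1 data and report
# mean accuracy +/- two standard deviations across the folds.
for name, clf in candidates.items():
    scores = cross_val_score(clf, X1, y1)
    print("%s: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))

# Remember: cross_val_score does not hand back a fitted classifier.
# Once you've picked a winner, refit it on all of the data, e.g.:
# best = candidates["decision tree"]
# best.fit(X1, y1)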

Be ready to use these handy new tools in class!

Q1 This prelab question should be easy. What's been your favorite portion of the class thus far?

Q2 What's your least favorite portion of the class thus far?