Discuss your thoughts about the pre-lab reading material with your table. As a group, come up with any specific concerns you have about the approaches used or about the criticisms of those approaches.
We have machine learning at our fingertips, and we've seen some of the dangers. Now we're going to spend this week on a game. In this game, we have two goals: 1) we want to build the best predictor that we can, but 2) at all times we want to have an accurate idea of how well the predictor works.
For this game, we've managed to get our hands on some data about two diseases (D1 and D2). Each dataset has features in columns and examples in rows. Each feature represents a clinical measurement, and each row represents a person. We want to predict whether or not a person has the disease (the last column).
We'll supply you with four datasets for each disease throughout the week. For the first day, we've given you two of them. We also provide example code to read the data. From there, the path that you take is up to you. We do not know the best predictor or even the maximum achievable accuracy for these data! This is a chance to experiment and find out what best captures disease status.
We can use anything in the scikit-learn toolkit. It's a powerful set of tools that we use regularly in our own lab, so this exercise is hands-on with the real thing.
First, let's get loading both datasets out of the way:
In [0]:
# numpy provides python tools to easily load comma-separated files.
import numpy as np
# use numpy to load disease #1 data
d1 = np.loadtxt(open("../30_Data_ML-III/D1.csv", "rb"), delimiter=",")
# features are all rows for columns before 200
# The canonical way to name this is that X is our matrix of
# examples by features.
X1 = d1[:,:200]
# labels are in all rows at the 200th column
# The canonical way to name this is that y is our vector of
# labels.
y1 = d1[:,200]
# use numpy to load disease #2 data
d2 = np.loadtxt(open("../30_Data_ML-III/D2.csv", "rb"), delimiter=",")
# features are all rows for columns before 200
X2 = d2[:,:200]
# labels are in all rows at the 200th column
y2 = d2[:,200]
We need to find out how to use this thing! We ran some code in the previous notebook that did this for us, but now we need to make things work on our own. Googling for "svm sklearn classifier" gets us to this page, which has documentation for the package. Partway down the page, we see: "SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset." As we keep reading, we see that SVC provides an implementation. Let's try that!
We get to the documentation for SVC and it says many things. At the top, there's a box that says:
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)
How should we interpret all of this?
The first part tells us where something lives: the SVC class lives in sklearn.svm. It seems we're going to need to import it from there.
In [0]:
# First we need to import svms from sklearn
from sklearn.svm import SVC
The parts inside the parentheses give us the ability to set or change parameters. Anything with an equals sign after it has a default value; in this case, the default C is set to 1.0. There's also a box that gives some description of what each parameter is (only a few of them may make sense to us right now). If we scroll to the bottom of the box, we'll find some examples provided by the helpful sklearn team, though they don't know the names of our datasets. They'll often use the standard name X for features and y for labels.
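If we'd rather not dig through the documentation every time, every sklearn estimator has a get_params method we can call to see its current settings (a quick sketch; the exact list printed depends on your sklearn version):
from sklearn.svm import SVC
# With no arguments, SVC uses all the defaults, including C=1.0
print(SVC().get_params())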
Let's go ahead and run an SVM using all the defaults on our data.
In [0]:
# Get an SVC with default parameters as our algorithm
classifier = SVC()
# Fit the classifier to our disease 1 data
classifier.fit(X1, y1)
# Apply the classifier back to our data and get an accuracy measure
train_score = classifier.score(X1, y1)
# Print the accuracy
print(train_score)
Ouch! Only about 50% accuracy. That's painful! We learned that we can modify C to make the algorithm try harder to fit the data we show it. Let's ramp up C and see what happens!
In [0]:
# Get an SVC with a high C
classifier = SVC(C=100)
# Fit the classifier to our disease 1 data
classifier.fit(X1, y1)
# Apply the classifier back to our data and get an accuracy measure
train_score = classifier.score(X1, y1)
# Print the accuracy
print(train_score)
Nice! 100% accuracy. It seems like we're on the right track. What we'd really like to do, though, is figure out how we do on held-out testing data. Fortunately, sklearn provides a helper function that makes holding out some of the data easy. This function is called train_test_split, and we can find its documentation. If we weren't sure where to go, the sklearn documentation has a full section on cross validation.
Note: Software changes over time. The current release of sklearn on CoCalc is 0.17. There's a new version, 0.18, also available. There are also minor version numbers (e.g. the final 1 in 0.17.1); those are bug-fix releases that don't change functionality. Between versions 0.17 and 0.18, the location of the train_test_split function changed: it moved from sklearn.cross_validation to sklearn.model_selection. If you ever want to know what version of sklearn you're working with, you can create a code block and run this code:
import sklearn
print(sklearn.__version__)
Make sure that when you look at the documentation, you choose the version that matches what you're working with.
Let's go ahead and split our data into training and testing portions.
In [0]:
# Import the function to split our data:
from sklearn.cross_validation import train_test_split
# Split things into training and testing - let's have a third (33%) of our data end up as testing
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.33)
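As a quick sanity check (just a sketch, not part of the exercise), we can look at the shapes of the resulting arrays to confirm that about a third of the rows landed in the test set:
# Training rows plus testing rows should add up to the original row count
print(X1_train.shape, X1_test.shape)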
Now let's go ahead and train our classifier on the training data and test it on the held-out test data.
In [0]:
# Get an SVC again using C = 100
classifier = SVC(C=100)
# Fit the classifier to the training data:
classifier.fit(X1_train, y1_train)
# Now we're going to apply it to the training data first:
train_score = classifier.score(X1_train, y1_train)
# We're also going to apply it to the testing data:
test_score = classifier.score(X1_test, y1_test)
print("Training Accuracy: " + str(train_score))
print("Testing Accuracy: " + str(test_score))
Nice! Now we can see that while our training accuracy is very high, our testing accuracy is much lower. We would say that our model has "overfit" to the data. We learned about overfitting before. You'll get a chance to play with this SVM a bit more below. Before we move on to that, though, we want to show you how easy it is to use a different classifier. You might imagine a classifier composed of a cascading series of rules: if this is true, then consider that; otherwise, consider this other thing. This type of algorithm is called a decision tree, and we're going to train one now.
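To make that concrete, here's a hypothetical hand-written rule cascade; the feature indices and cutoffs below are invented for illustration, while a real decision tree learns them from the data:
# A toy rule cascade -- the indices and thresholds here are made up
def toy_tree(example):
    if example[5] > 0.2:        # if this is true...
        if example[17] > 1.1:   # ...consider that
            return 1            # predict "has disease"
        else:
            return 0            # predict "no disease"
    else:                       # otherwise, consider this other thing
        return 0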
sklearn has a handy decision tree classifier that we can use. By using the SVM classifier, we've already learned most of what we need to know to use it.
In [0]:
# First, we need to import the classifier
from sklearn.tree import DecisionTreeClassifier
# Now we're going to get a decision tree classifier with the default parameters
classifier = DecisionTreeClassifier()
# The 'fit' syntax is the same
classifier.fit(X1_train, y1_train)
# As is the 'score' syntax
train_score = classifier.score(X1_train, y1_train)
test_score = classifier.score(X1_test, y1_test)
print("Training Accuracy: " + str(train_score))
print("Testing Accuracy: " + str(test_score))
Oof! That's pretty overfit! We're perfect on the training data but basically flipping a coin on the held-out data. A DecisionTreeClassifier has two parameters, max_features and max_depth, that can really help us prevent overfitting. Let's train a very small tree that considers no more than 8 features and is very short (no more than 3 levels deep).
In [0]:
# Now we're going to get a decision tree classifier with selected parameters
classifier = DecisionTreeClassifier(max_features=8, max_depth=3)
# The 'fit' syntax is the same
classifier.fit(X1_train, y1_train)
# As is the 'score' syntax
train_score = classifier.score(X1_train, y1_train)
test_score = classifier.score(X1_test, y1_test)
print("Training Accuracy: " + str(train_score))
print("Testing Accuracy: " + str(test_score))
Things are less overfit, but it's still not clear that this approach is working very well.
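Part of the difficulty is that a single train/test split gives a noisy estimate. As a quick sketch (the cross validation section of the sklearn documentation covers more principled versions of this idea), we can repeat the split a few times and watch how much the testing accuracy moves around:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier

test_scores = []
for i in range(5):
    # Re-split the data and re-fit a small tree each time
    Xtr, Xte, ytr, yte = train_test_split(X1, y1, test_size=.33)
    clf = DecisionTreeClassifier(max_features=8, max_depth=3)
    clf.fit(Xtr, ytr)
    test_scores.append(clf.score(Xte, yte))
# The spread across runs shows how sensitive our estimate is to the split
print(test_scores)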
Q1: Set up and fit a classifier and report the training and testing accuracies (3pts).
In [0]:
Q2: Set up and fit a classifier and report the training and testing accuracies (3pts).
In [0]:
Q3: Set up and fit a classifier and report the training and testing accuracies (3pts).
In [0]:
Q4: Which of your classifiers do you think is best and why?