Scikit-Learn singalong: EEG Eye State Classification

Author: Kevin Yang

Contact: kyang@h2o.ai

This tutorial replicates Erin LeDell's oncology demo using Scikit Learn and Pandas, and is intended to provide a comparison of the syntactical and performance differences between sklearn and H2O implementations of Gradient Boosting Machines.

We'll be using Pandas, Numpy and the collections package for most of the data exploration.


In [1]:
import pandas as pd
import numpy as np
from collections import Counter

Download EEG Data

The following code downloads a copy of the EEG Eye State dataset. All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

Let's import the same dataset directly with pandas


In [2]:
csv_url = "http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv"
data = pd.read_csv(csv_url)

Explore Data

Once we have loaded the data, let's take a quick look. First the dimension of the frame:


In [3]:
data.shape


Out[3]:
(14980, 16)

Now let's take a look at the top of the frame:


In [4]:
data.head()


Out[4]:
AF3 F7 F3 FC5 T7 P7 O1 O2 P8 T8 FC6 F4 F8 AF4 eyeDetection split
0 4329.23 4009.23 4289.23 4148.21 4350.26 4586.15 4096.92 4641.03 4222.05 4238.46 4211.28 4280.51 4635.90 4393.85 0 valid
1 4324.62 4004.62 4293.85 4148.72 4342.05 4586.67 4097.44 4638.97 4210.77 4226.67 4207.69 4279.49 4632.82 4384.10 0 test
2 4327.69 4006.67 4295.38 4156.41 4336.92 4583.59 4096.92 4630.26 4207.69 4222.05 4206.67 4282.05 4628.72 4389.23 0 train
3 4328.72 4011.79 4296.41 4155.90 4343.59 4582.56 4097.44 4630.77 4217.44 4235.38 4210.77 4287.69 4632.31 4396.41 0 train
4 4326.15 4011.79 4292.31 4151.28 4347.69 4586.67 4095.90 4627.69 4210.77 4244.10 4212.82 4288.21 4632.82 4398.46 0 train

The first two columns contain an ID and the response. The "diagnosis" column is the response. Let's take a look at the column names. The data contains derived features from the medical images of the tumors.


In [5]:
data.columns.tolist()


Out[5]:
['AF3',
 'F7',
 'F3',
 'FC5',
 'T7',
 'P7',
 'O1',
 'O2',
 'P8',
 'T8',
 'FC6',
 'F4',
 'F8',
 'AF4',
 'eyeDetection',
 'split']

To select a subset of the columns to look at, typical Pandas indexing applies:


In [6]:
columns = ['AF3', 'eyeDetection', 'split']
data[columns].head(10)


Out[6]:
AF3 eyeDetection split
0 4329.23 0 valid
1 4324.62 0 test
2 4327.69 0 train
3 4328.72 0 train
4 4326.15 0 train
5 4321.03 0 train
6 4319.49 0 test
7 4325.64 0 test
8 4326.15 0 test
9 4326.15 0 train

Now let's select a single column, for example -- the response column, and look at the data more closely:


In [7]:
data['eyeDetection'].head()


Out[7]:
0    0
1    0
2    0
3    0
4    0
Name: eyeDetection, dtype: int64

It looks like a binary response, but let's validate that assumption:


In [8]:
data['eyeDetection'].unique()


Out[8]:
array([0, 1])

We can query the categorical "levels" as well ('B' and 'M' stand for "Benign" and "Malignant" diagnosis):


In [9]:
data['eyeDetection'].nunique()


Out[9]:
2

Since "diagnosis" column is the response we would like to predict, we may want to check if there are any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the isna method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.


In [10]:
data.isnull()


Out[10]:
AF3 F7 F3 FC5 T7 P7 O1 O2 P8 T8 FC6 F4 F8 AF4 eyeDetection split
0 False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False
5 False False False False False False False False False False False False False False False False
6 False False False False False False False False False False False False False False False False
7 False False False False False False False False False False False False False False False False
8 False False False False False False False False False False False False False False False False
9 False False False False False False False False False False False False False False False False
10 False False False False False False False False False False False False False False False False
11 False False False False False False False False False False False False False False False False
12 False False False False False False False False False False False False False False False False
13 False False False False False False False False False False False False False False False False
14 False False False False False False False False False False False False False False False False
15 False False False False False False False False False False False False False False False False
16 False False False False False False False False False False False False False False False False
17 False False False False False False False False False False False False False False False False
18 False False False False False False False False False False False False False False False False
19 False False False False False False False False False False False False False False False False
20 False False False False False False False False False False False False False False False False
21 False False False False False False False False False False False False False False False False
22 False False False False False False False False False False False False False False False False
23 False False False False False False False False False False False False False False False False
24 False False False False False False False False False False False False False False False False
25 False False False False False False False False False False False False False False False False
26 False False False False False False False False False False False False False False False False
27 False False False False False False False False False False False False False False False False
28 False False False False False False False False False False False False False False False False
29 False False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14950 False False False False False False False False False False False False False False False False
14951 False False False False False False False False False False False False False False False False
14952 False False False False False False False False False False False False False False False False
14953 False False False False False False False False False False False False False False False False
14954 False False False False False False False False False False False False False False False False
14955 False False False False False False False False False False False False False False False False
14956 False False False False False False False False False False False False False False False False
14957 False False False False False False False False False False False False False False False False
14958 False False False False False False False False False False False False False False False False
14959 False False False False False False False False False False False False False False False False
14960 False False False False False False False False False False False False False False False False
14961 False False False False False False False False False False False False False False False False
14962 False False False False False False False False False False False False False False False False
14963 False False False False False False False False False False False False False False False False
14964 False False False False False False False False False False False False False False False False
14965 False False False False False False False False False False False False False False False False
14966 False False False False False False False False False False False False False False False False
14967 False False False False False False False False False False False False False False False False
14968 False False False False False False False False False False False False False False False False
14969 False False False False False False False False False False False False False False False False
14970 False False False False False False False False False False False False False False False False
14971 False False False False False False False False False False False False False False False False
14972 False False False False False False False False False False False False False False False False
14973 False False False False False False False False False False False False False False False False
14974 False False False False False False False False False False False False False False False False
14975 False False False False False False False False False False False False False False False False
14976 False False False False False False False False False False False False False False False False
14977 False False False False False False False False False False False False False False False False
14978 False False False False False False False False False False False False False False False False
14979 False False False False False False False False False False False False False False False False

14980 rows × 16 columns


In [11]:
data['eyeDetection'].isnull()


Out[11]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
14950    False
14951    False
14952    False
14953    False
14954    False
14955    False
14956    False
14957    False
14958    False
14959    False
14960    False
14961    False
14962    False
14963    False
14964    False
14965    False
14966    False
14967    False
14968    False
14969    False
14970    False
14971    False
14972    False
14973    False
14974    False
14975    False
14976    False
14977    False
14978    False
14979    False
Name: eyeDetection, dtype: bool

The isna method doesn't directly answer the question, "Does the diagnosis column contain any NAs?", rather it returns a 0 if that cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, then summing over the whole column should produce a summand equal to 0.0. Let's take a look:


In [12]:
data['eyeDetection'].isnull().sum()


Out[12]:
0

Great, no missing labels.

Out of curiosity, let's see if there is any missing data in this frame:


In [13]:
data.isnull().sum()


Out[13]:
AF3             0
F7              0
F3              0
FC5             0
T7              0
P7              0
O1              0
O2              0
P8              0
T8              0
FC6             0
F4              0
F8              0
AF4             0
eyeDetection    0
split           0
dtype: int64

The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalanace" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution, both visually and numerically.


In [14]:
Counter(data['eyeDetection'])


Out[14]:
Counter({0: 8257, 1: 6723})

Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).

Let's calculate the percentage that each class represents:


In [15]:
n = data.shape[0]  # Total number of training samples
np.array(Counter(data['eyeDetection']).values())/float(n)


Out[15]:
array([ 0.5512016,  0.4487984])

Split H2O Frame into a train and test set

So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.

If you want H2O to do the splitting for you, you can use the split_frame method. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the Frame to get the partitions we want.


In [16]:
train = data[data['split']=="train"]
train.shape


Out[16]:
(8988, 16)

In [17]:
valid = data[data['split']=="valid"]
valid.shape


Out[17]:
(2996, 16)

In [18]:
test = data[data['split']=="test"]
test.shape


Out[18]:
(2996, 16)

Machine Learning in H2O

We will do a quick demo of the H2O software -- trying to predict eye state (open/closed) from EEG data.

Specify the predictor set and response

The response, y, is the 'diagnosis' column, and the predictors, x, are all the columns aside from the first two columns ('id' and 'diagnosis').


In [46]:
y = 'eyeDetection'
x = data.columns.drop(['eyeDetection','split'])

Split H2O Frame into a train and test set


In [20]:
from sklearn.ensemble import GradientBoostingClassifier

In [21]:
import sklearn

In [22]:
test.shape


Out[22]:
(2996, 16)

Train and Test a GBM model


In [31]:
model = GradientBoostingClassifier(n_estimators=100,
                                    max_depth=4,
                                    learning_rate=0.1)

In [48]:
X=train[x].reset_index(drop=True)
y=train[y].reset_index(drop=True)

model.fit(X, y)


Out[48]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=4, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [49]:
print(model)


GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=4, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

Inspect Model


In [53]:
model.get_params()


Out[53]:
{'init': None,
 'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 4,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'presort': 'auto',
 'random_state': None,
 'subsample': 1.0,
 'verbose': 0,
 'warm_start': False}

Model Performance on a Test Set


In [61]:
from sklearn.metrics import r2_score, roc_auc_score, mean_squared_error
y_pred = model.predict(X)

r2_score(y_pred, y)


Out[61]:
0.54512915254897387

In [62]:
roc_auc_score(y_pred, y)


Out[62]:
0.89097094432760837

In [63]:
mean_squared_error(y_pred, y)


Out[63]:
0.11103693813974187

Cross-validated Performance


In [75]:
from sklearn import cross_validation

cross_validation.cross_val_score(model, X, y, scoring='roc_auc', cv=5)


Out[75]:
array([ 0.54945509,  0.55455629,  0.32538286,  0.38222385,  0.42590001])

In [76]:
cross_validation.cross_val_score(model, valid[x].reset_index(drop=True), valid['eyeDetection'].reset_index(drop=True), scoring='roc_auc', cv=5)


Out[76]:
array([ 0.64409495,  0.55143686,  0.30297715,  0.36688253,  0.40355729])

In [ ]: