Author: Kevin Yang
Contact: kyang@h2o.ai
This tutorial replicates Erin LeDell's oncology demo using scikit-learn and pandas, and is intended to compare the syntax and performance of the scikit-learn and H2O implementations of Gradient Boosting Machines.
We'll be using pandas, NumPy, and the collections package for most of the data exploration.
In [1]:
import pandas as pd
import numpy as np
from collections import Counter
The following code downloads a copy of the EEG Eye State dataset. All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.
Let's import the same dataset directly with pandas:
In [2]:
csv_url = "http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv"
data = pd.read_csv(csv_url)
In [3]:
data.shape
Out[3]:
Now let's take a look at the top of the frame:
In [4]:
data.head()
Out[4]:
The "eyeDetection" column is the response, and the "split" column marks each row as belonging to the train, valid, or test partition. The remaining columns, such as AF3, hold the EEG readings. Let's take a look at the column names.
In [5]:
data.columns.tolist()
Out[5]:
To select a subset of the columns to look at, typical Pandas indexing applies:
In [6]:
columns = ['AF3', 'eyeDetection', 'split']
data[columns].head(10)
Out[6]:
Now let's select a single column, for example the response column, and look at the data more closely:
In [7]:
data['eyeDetection'].head()
Out[7]:
It looks like a binary response, but let's validate that assumption:
In [8]:
data['eyeDetection'].unique()
Out[8]:
We can count the distinct values as well (recall that '1' indicates the eye-closed state and '0' the eye-open state):
In [9]:
data['eyeDetection'].nunique()
Out[9]:
Since the "eyeDetection" column is the response we would like to predict, we may want to check whether it has any missing values, so let's look for NAs. To find out which values, if any, are missing, we can use the isnull method. It works on a whole DataFrame and on a single column (a pandas Series) alike.
In [10]:
data.isnull()
Out[10]:
In [11]:
data['eyeDetection'].isnull()
Out[11]:
The isnull method doesn't directly answer the question, "Does the eyeDetection column contain any NAs?" Rather, it returns False for each cell that is not missing and True for each cell that is. Summing treats True as 1 and False as 0, so if there are no missing values, the sum over the whole column should be 0. Let's take a look:
In [12]:
data['eyeDetection'].isnull().sum()
Out[12]:
Great, no missing labels.
Out of curiosity, let's see if there is any missing data in this frame:
In [13]:
data.isnull().sum()
Out[13]:
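To make the counting behaviour of isnull concrete, here is a minimal sketch on a small made-up frame. The values below are invented for illustration and are not taken from the EEG data.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only (not the EEG data)
toy = pd.DataFrame({
    "AF3": [4329.2, np.nan, 4324.6, 4328.7],
    "eyeDetection": [0, 1, np.nan, np.nan],
})

# isnull() returns booleans; sum() treats True as 1 and False as 0,
# so this gives the number of missing cells per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single yes/no answer for one column
has_missing_labels = bool(toy["eyeDetection"].isnull().any())
print(has_missing_labels)
```

If only a yes/no answer is needed, any() reads more directly than comparing the sum to zero.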
The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalance" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution by counting the examples in each class.
In [14]:
Counter(data['eyeDetection'])
Out[14]:
Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).
Let's calculate the percentage that each class represents:
In [15]:
n = data.shape[0] # Total number of training samples
# list() is needed on Python 3, where dict.values() returns a view rather than a list
np.array(list(Counter(data['eyeDetection']).values())) / float(n)
Out[15]:
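The same class proportions can also be obtained with pandas alone. The sketch below uses a short made-up label Series (an assumption for illustration, not the EEG labels) and keeps each fraction attached to its class label.

```python
import pandas as pd

# Toy labels, invented for illustration only (not the EEG data)
labels = pd.Series([0, 0, 0, 1, 1], name="eyeDetection")

# normalize=True returns fractions of the total instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```

Unlike the array built from Counter, the result is indexed by class, so there is no ambiguity about which fraction belongs to which label.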
So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.
If you want scikit-learn to do the splitting for you, you can use the train_test_split function. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the DataFrame to get the partitions we want.
In [16]:
train = data[data['split']=="train"]
train.shape
Out[16]:
In [17]:
valid = data[data['split']=="valid"]
valid.shape
Out[17]:
In [18]:
test = data[data['split']=="test"]
test.shape
Out[18]:
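For comparison, here is a rough sketch of how a three-way split could be produced with train_test_split when no "split" column is available. The frame is synthetic and the 60/20/20 ratio is an assumption chosen for illustration; it is not the ratio used in the provided splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real frame (values invented for illustration)
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "AF3": rng.normal(size=100),
    "eyeDetection": rng.randint(0, 2, size=100),
})

# First carve off 20% for test, then 25% of the remainder for validation,
# giving a 60/20/20 split; random_state fixes the shuffle for reproducibility
rest, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_train, toy_valid = train_test_split(rest, test_size=0.25, random_state=42)

print(toy_train.shape, toy_valid.shape, toy_test.shape)
```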
We will do a quick demo of scikit-learn's gradient boosting classifier, trying to predict eye state (open/closed) from EEG data.
The response, y, is the 'eyeDetection' column, and the predictors, x, are all the columns aside from 'eyeDetection' and 'split'.
In [46]:
y = 'eyeDetection'
x = data.columns.drop(['eyeDetection','split'])
In [20]:
from sklearn.ensemble import GradientBoostingClassifier
In [21]:
import sklearn
In [22]:
test.shape
Out[22]:
In [31]:
model = GradientBoostingClassifier(n_estimators=100,
                                   max_depth=4,
                                   learning_rate=0.1)
In [48]:
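# Training predictors, with the row index reset to 0..n-1
X = train[x].reset_index(drop=True)
# Note: this rebinds y from the column name to the training response Series
y = train[y].reset_index(drop=True)
model.fit(X, y)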
Out[48]:
In [49]:
print(model)
In [53]:
model.get_params()
Out[53]:
In [61]:
from sklearn.metrics import r2_score, roc_auc_score, mean_squared_error
y_pred = model.predict(X)
# scikit-learn metrics take the true values first, then the predictions
r2_score(y, y_pred)
Out[61]:
In [62]:
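# AUC expects true labels first and a score for the positive class, not hard 0/1 labels
roc_auc_score(y, model.predict_proba(X)[:, 1])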
Out[62]:
In [63]:
mean_squared_error(y, y_pred)
Out[63]:
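The argument order and the choice between hard labels and probabilities both matter for these metrics. The following is a minimal sketch on synthetic data (the arrays and the small model settings are assumptions for illustration, not the EEG model), showing hard predictions used for accuracy and class-1 probabilities used for AUC. Both scores are computed on the training rows, so they are optimistic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic binary problem, invented for illustration only
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
clf.fit(X_toy, y_toy)

# Hard 0/1 predictions suit accuracy; AUC is better served by probabilities
hard_labels = clf.predict(X_toy)
prob_class1 = clf.predict_proba(X_toy)[:, 1]

# True labels go first in scikit-learn metric functions
acc = accuracy_score(y_toy, hard_labels)
auc = roc_auc_score(y_toy, prob_class1)
print(acc, auc)
```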
In [75]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, scoring='roc_auc', cv=5)
Out[75]:
In [76]:
cross_val_score(model, valid[x].reset_index(drop=True), valid['eyeDetection'].reset_index(drop=True), scoring='roc_auc', cv=5)
Out[76]:
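Cross-validating on the validation rows refits fresh copies of the model on folds of that set. Another common use of a validation set is to score the model that was already fitted on the training set. Here is a sketch of that pattern on synthetic data; the arrays, the 200/100 split, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic train/validation sets, invented for illustration only
rng = np.random.RandomState(1)
X_all = rng.normal(size=(300, 4))
y_all = (X_all[:, 0] - X_all[:, 1] > 0).astype(int)
X_tr, y_tr = X_all[:200], y_all[:200]
X_va, y_va = X_all[200:], y_all[200:]

# Fit on the training rows only
gbm = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                 learning_rate=0.1, random_state=0)
gbm.fit(X_tr, y_tr)

# Score the already-fitted model on rows it has never seen
valid_auc = roc_auc_score(y_va, gbm.predict_proba(X_va)[:, 1])
print(valid_auc)
```

With the notebook's own objects, the analogous call would pass valid[x] and valid['eyeDetection'] to the fitted model.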