Today's class will be an adaptation of the first two parts of Olivier Grisel's splendid tutorial on Parallel Machine Learning with scikit-learn and IPython that he gave at Strata this year.
Please start by cloning the repository for this tutorial and starting IPython Notebook in the notebooks
directory:
git clone https://github.com/ogrisel/parallel_ml_tutorial.git
cd parallel_ml_tutorial/notebooks
ipython notebook
Click on "OO - Tutorial Setup
"
In [4]:
import numpy
In [5]:
import scipy
In [6]:
import pylab
In [7]:
import sklearn
In [8]:
import psutil
In [9]:
import pandas
In [10]:
import IPython.parallel
Finding the location of an installed package and its version:
In [11]:
numpy.__path__
Out[11]:
In [12]:
numpy.__version__
Out[12]:
In [13]:
%run ../fetch_data.py
In [14]:
!ls -lh ../datasets/
In [15]:
import numpy as np
In [16]:
a = np.array([1, 2, 3])
In [17]:
a
Out[17]:
In [18]:
b = np.array([[0, 2, 4], [1, 3, 5]])
In [19]:
b
Out[19]:
In [20]:
b.shape
Out[20]:
In [21]:
b.dtype
Out[21]:
In [22]:
a.shape
Out[22]:
In [23]:
a.dtype
Out[23]:
In [24]:
np.zeros(5)
Out[24]:
In [25]:
np.ones(shape=(3, 4), dtype=np.int32)
Out[25]:
In [26]:
c = b * 0.5
In [27]:
c
Out[27]:
In [28]:
c.shape
Out[28]:
In [29]:
c.dtype
Out[29]:
In [30]:
a
Out[30]:
In [31]:
d = a + c
In [32]:
d
Out[32]:
In [33]:
d[0]
Out[33]:
In [34]:
d[0, 0]
Out[34]:
In [35]:
d[:, 0]
Out[35]:
In [36]:
d.sum()
Out[36]:
In [37]:
d.mean()
Out[37]:
In [38]:
d.sum(axis=0)
Out[38]:
In [39]:
d.mean(axis=1)
Out[39]:
In [40]:
e = np.arange(12)
In [41]:
e
Out[41]:
In [42]:
f = e.reshape(3, 4)
In [43]:
f
Out[43]:
In [44]:
e
Out[44]:
In [45]:
e[5:] = 0
In [46]:
e
Out[46]:
In [47]:
f
Out[47]:
In [48]:
a
Out[48]:
In [49]:
b
Out[49]:
In [50]:
d
Out[50]:
In [51]:
np.concatenate([a, a, a])
Out[51]:
In [52]:
np.vstack([a, b, d])
Out[52]:
In [53]:
np.hstack([b, d])
Out[53]:
In [54]:
%matplotlib inline
In [55]:
import matplotlib.pyplot as plt
In [56]:
x = np.linspace(0, 2, 10)
In [57]:
x
Out[57]:
In [126]:
_ = plt.plot(x, 'o-')
In [125]:
plt.plot(x, x, 'o-', label='linear')
plt.plot(x, x ** 2, 'x-', label='quadratic')
plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output')
Out[125]:
In [60]:
samples = np.random.normal(loc=1.0, scale=0.5, size=1000)
In [61]:
samples.shape
Out[61]:
In [62]:
samples.dtype
Out[62]:
In [63]:
samples[:30]
Out[63]:
In [122]:
_ = plt.hist(samples, bins=50)
In [65]:
samples_1 = np.random.normal(loc=1, scale=.5, size=10000)
samples_2 = np.random.standard_t(df=10, size=10000)
In [123]:
bins = np.linspace(-3, 3, 50)
_ = plt.hist(samples_1, bins=bins, alpha=0.5, label='samples 1')
_ = plt.hist(samples_2, bins=bins, alpha=0.5, label='samples 2')
plt.legend(loc='upper left')
Out[123]:
In [124]:
plt.scatter(samples_1, samples_2, alpha=0.1)
Out[124]:
01 - An Introduction to Predictive Modeling in Python.ipynb
In [68]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Let us have a look at the Titanic dataset from the Kaggle Getting Started challenge at:
https://www.kaggle.com/c/titanic-gettingStarted
We can load the CSV file as a pandas data frame in one line:
In [69]:
data = pd.read_csv('https://dl.dropboxusercontent.com/u/5743203/data/titanic/titanic_train.csv')
VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
pandas data frames have an HTML table representation in the IPython notebook. Let's have a look at the first 5 rows:
In [70]:
data.head(5)
Out[70]:
In [71]:
data.count()
Out[71]:
The data frame has 891 rows. Some passengers have missing information though: in particular Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website:
https://www.kaggle.com/c/titanic-gettingStarted/data
A data frame can be converted into a numpy array by calling the values
attribute:
In [72]:
data.values
Out[72]:
However this cannot be directly fed to a scikit-learn model:
the target variable (survival) is mixed in with the input data
some attributes, such as unique ids, have no predictive value for the task
the values are heterogeneous (string labels for categories, integers and floating point numbers)
some attribute values are missing (nan: "not a number")
The goal of the challenge is to predict whether a passenger survived based on the other known attributes. Let us have a look at the Survived
column:
In [73]:
data.Survived.dtype
Out[73]:
data.Survived
is an instance of the pandas Series
class with an integer dtype:
In [74]:
data.Survived.__class__.__module__, data.Survived.__class__.__name__
Out[74]:
The data
object is an instance of the pandas DataFrame
class:
In [75]:
data.__class__.__module__, data.__class__.__name__
Out[75]:
Series
can be seen as homogeneous, 1D columns. DataFrame
instances are heterogeneous collections of columns with the same length.
The original data frame can be aggregated by counting rows for each possible value of the Survived
column:
In [76]:
data.groupby('Survived').count()['Survived']
Out[76]:
From this subset of the full passenger list, about 2/3 perished in the event. So if we are to build a predictive model from this data, a baseline model to compare against would be one that always predicts death. Such a constant model would reach around 62% predictive accuracy (which is higher than predicting at random):
In [77]:
np.mean(data.Survived == 0)
Out[77]:
In other words, this is our no-information rate (NIR) (recall last week).
pandas Series
instances can be converted to regular 1D numpy arrays by using the values
attribute:
In [78]:
target = data.Survived.values
In [79]:
type(target)
Out[79]:
In [80]:
target.dtype
Out[80]:
In [81]:
target[:5]
Out[81]:
sklearn
estimators all work with homogeneous numerical feature descriptors passed as a numpy array. Therefore passing the raw data frame will not work out of the box.
Let us start simple and build a first model that only uses readily available numerical features as input, namely data.Fare
, data.Pclass
and data.Age
.
In [82]:
numerical_features = data.get(['Fare', 'Pclass', 'Age'])
numerical_features.head(5)
Out[82]:
Unfortunately some passengers do not have age information:
In [83]:
numerical_features.count()
Out[83]:
Let's use pandas fillna
method to impute the median age for those passengers:
In [84]:
median_features = numerical_features.dropna().median()
median_features
Out[84]:
In [85]:
imputed_features = numerical_features.fillna(median_features)
imputed_features.count()
Out[85]:
In [86]:
imputed_features.head(5)
Out[86]:
Now that the data frame is clean, we can convert it into a homogeneous numpy array of floating point values:
In [87]:
features_array = imputed_features.values
features_array
Out[87]:
Let's use 80% of the data for training a first model and keep 20% for computing its generalization score:
In [88]:
from sklearn.cross_validation import train_test_split
features_train, features_test, target_train, target_test = train_test_split(
features_array, target, test_size=0.20, random_state=0)
In [89]:
features_train.shape
Out[89]:
In [90]:
features_test.shape
Out[90]:
In [91]:
target_train.shape
Out[91]:
In [92]:
target_test.shape
Out[92]:
Let's start with a simple model from sklearn, namely LogisticRegression
:
In [93]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(features_train, target_train)
Out[93]:
In [94]:
target_predicted = lr.predict(features_test)
In [95]:
from sklearn.metrics import accuracy_score
accuracy_score(target_test, target_predicted)
Out[95]:
This first model has around 73% accuracy: this is better than our baseline that always predicts death.
(interlude adapted from Diagnostic tests: ROC curves, sensitivity, and specificity by Michael Walker)
| | Null hypothesis (H0) is true | Null hypothesis (H0) is false |
|---|---|---|
| Reject null hypothesis | Type I error (false positive) | Correct outcome (true positive) |
| Fail to reject null hypothesis | Correct outcome (true negative) | Type II error (false negative) |
The function roc_curve
computes the receiver operating characteristic curve, or ROC curve (quoting Wikipedia):
A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.
The coef_
attribute of a fitted linear model such as LogisticRegression
holds the weights of each feature:
In [96]:
feature_names = numerical_features.columns.values
feature_names
Out[96]:
In [97]:
lr.coef_
Out[97]:
In [98]:
x = np.arange(len(feature_names))
plt.bar(x, lr.coef_.ravel())
_ = plt.xticks(x + 0.5, feature_names, rotation=30)
In this case, survival is slightly positively linked with Fare (the higher the fare, the higher the likelihood the model will predict survival), while passengers from first class and younger passengers are predicted to survive more often than older passengers from the 3rd class.
First-class cabins were closer to the lifeboats, and women and children reportedly had priority. Our model seems to capture these historical facts. We will see later if the sex of the passenger can be used as an informative predictor to increase the predictive accuracy of the model.
Logistic Regression is a probabilistic model: instead of just predicting a binary outcome (survived or not) given the input features, it can also estimate the posterior probability of the outcome given the input features using the predict_proba
method:
In [99]:
target_predicted_proba = lr.predict_proba(features_test)
target_predicted_proba[:5]
Out[99]:
By default the decision threshold is 0.5: if we vary the decision threshold from 0 to 1 we can generate a family of binary classifiers that covers all the possible trade-offs between false positive and false negative prediction errors.
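For instance, a stricter threshold can be applied directly to the probability estimates. This is an illustrative sketch only, not part of the original tutorial; the 0.7 threshold is an arbitrary example value:
from sklearn.metrics import accuracy_score
# Hypothetical example: predict "survived" only when the estimated
# probability of survival exceeds a custom threshold (0.7 is arbitrary).
custom_threshold = 0.7
target_predicted_strict = (target_predicted_proba[:, 1] > custom_threshold).astype(int)
accuracy_score(target_test, target_predicted_strict)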
We can summarize the performance of a binary classifier for all the possible thresholds by plotting the ROC curve and quantifying the Area under the ROC curve:
In [100]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
def plot_roc_curve(target_test, target_predicted_proba):
fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--') # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
In [101]:
plot_roc_curve(target_test, target_predicted_proba)
Here the area under the ROC curve is 0.756, which is similar to the accuracy (0.732). However, the ROC-AUC of a random model is expected to be 0.5 on average, while the accuracy of a random model depends on the class imbalance of the data. ROC-AUC can be seen as a way to calibrate the predictive accuracy of a model against class imbalance.
It is possible to see the details of the false positive and false negative errors by computing the confusion matrix:
In [102]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(target_test, target_predicted))
Another way to quantify the quality of a binary classifier on imbalanced data is to compute the precision, recall and f1-score of a model (at the default fixed decision threshold of 0.5).
In [103]:
from sklearn.metrics import classification_report
print(classification_report(target_test, target_predicted,
target_names=['not survived', 'survived']))
We previously decided to randomly split the data to evaluate the model on 20% of held-out data. However, the randomness of the split might have a significant impact on the estimated accuracy:
In [104]:
features_train, features_test, target_train, target_test = train_test_split(
features_array, target, test_size=0.20, random_state=0)
lr.fit(features_train, target_train).score(features_test, target_test)
Out[104]:
In [105]:
features_train, features_test, target_train, target_test = train_test_split(
features_array, target, test_size=0.20, random_state=1)
lr.fit(features_train, target_train).score(features_test, target_test)
Out[105]:
In [106]:
features_train, features_test, target_train, target_test = train_test_split(
features_array, target, test_size=0.20, random_state=2)
lr.fit(features_train, target_train).score(features_test, target_test)
Out[106]:
So instead of using a single train / test split, we can use several and compute the min, max and mean scores as an estimate of the real test score without underestimating its variability:
In [107]:
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(lr, features_array, target, cv=5)
scores
Out[107]:
In [108]:
scores.min(), scores.mean(), scores.max()
Out[108]:
cross_val_score
reports accuracy by default but it can also be used to report other performance metrics such as ROC-AUC or f1-score:
In [109]:
scores = cross_val_score(lr, features_array, target, cv=5,
scoring='roc_auc')
scores.min(), scores.mean(), scores.max()
Out[109]:
Exercise:
Compute cross-validated scores for other classification metrics.
Change the number of cross-validation folds between 3 and 10: what is the impact on the mean score? on the processing time?
Hints:
The list of classification metrics is available in the online documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
You can use the %%time
cell magic on the first line of an IPython cell to measure the time of the execution of the cell.
In [109]:
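One possible starting point for the exercise above (a sketch, not the official solution; the 'f1' scorer and cv=10 are just example choices from the documentation linked in the hints):
%%time
from sklearn.cross_validation import cross_val_score
scores_f1 = cross_val_score(lr, features_array, target, cv=10, scoring='f1')
print(scores_f1.min(), scores_f1.mean(), scores_f1.max())
Comparing this run with cv=3 and cv=5 gives a feel for how the mean score and the total fitting time change with the number of folds.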
Let us now try to build richer models by including more features as potential predictors for our model.
Categorical variables such as data.Embarked
or data.Sex
can be converted into boolean indicator features, also known as dummy variables or one-hot encoded features:
In [110]:
pd.get_dummies(data.Sex, prefix='Sex').head(5)
Out[110]:
In [111]:
pd.get_dummies(data.Embarked, prefix='Embarked').head(5)
Out[111]:
We can combine those new numerical features with the previous features using pandas.concat
along axis=1
:
In [112]:
rich_features = pd.concat([data.get(['Fare', 'Pclass', 'Age']),
pd.get_dummies(data.Sex, prefix='Sex'),
pd.get_dummies(data.Embarked, prefix='Embarked')],
axis=1)
rich_features.head(5)
Out[112]:
By construction the new Sex_male
feature is redundant with Sex_female
. Let us drop it:
In [113]:
rich_features_no_male = rich_features.drop('Sex_male', 1)
rich_features_no_male.head(5)
Out[113]:
Let us not forget to impute the median age for passengers without age information:
In [114]:
rich_features_final = rich_features_no_male.fillna(rich_features_no_male.dropna().median())
rich_features_final.head(5)
Out[114]:
We can finally cross-validate a logistic regression model on this new data and observe that the mean score has significantly increased:
In [115]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
lr = LogisticRegression(C=1)
scores = cross_val_score(lr, rich_features_final, target, cv=5, scoring='accuracy')
print(scores.min(), scores.mean(), scores.max())
Exercise:
change the value of the parameter C
. Does it have an impact on the score?
fit a new instance of the logistic regression model on the full dataset.
plot the weights for the features of this newly fitted logistic regression model.
In [115]:
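A sketch of one way to approach this exercise (C=0.01 is an arbitrary example value, and the plotting code simply mirrors the bar chart used earlier in this notebook):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
# Try a smaller C (stronger regularization) and check the cross-validated score.
lr_reg = LogisticRegression(C=0.01)
scores = cross_val_score(lr_reg, rich_features_final, target, cv=5, scoring='accuracy')
print(scores.min(), scores.mean(), scores.max())
# Fit on the full dataset and plot the feature weights.
lr_reg.fit(rich_features_final, target)
x = np.arange(len(rich_features_final.columns))
plt.bar(x, lr_reg.coef_.ravel())
_ = plt.xticks(x + 0.5, rich_features_final.columns, rotation=30)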
A non-parametric hierarchical classification technique
non-parametric: no fixed set of parameters, no distributional assumptions
hierarchical: consists of a sequence of questions which yield a class label when applied to any record
Using a configuration of nodes and edges.
More concretely, as a multiway tree, which is a type of (directed acyclic) graph.
In a decision tree, the nodes represent questions (test conditions) and the edges are the answers to these questions.
The top node of the tree is called the root node. This node has 0 incoming edges, and 2+ outgoing edges.
An internal node has 1 incoming edge, and 2+ outgoing edges. Internal nodes represent test conditions.
A leaf node has 1 incoming edge and 0 outgoing edges. Leaf nodes correspond to class labels.
The nodes in our tree are connected by directed edges.
These directed edges lead from parent nodes to child nodes.
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
But this is generally too complex to be practical $\rightarrow O(2^n)$.
That is, a fast solution that is good enough to solve the problem:
greedy – the algorithm makes a locally optimal decision at each step
recursive – it splits the task into subtasks and solves each the same way
local optimum – a solution that is optimal within a given neighborhood of points
1) If all records in $D_t$ belong to class $X$, then $t$ is a leaf node corresponding to class $X$.
2) If $D_t$ contains records from both classes, then a test condition is created to partition the records further. In this case, $t$ is an internal node whose outgoing edges correspond to the possible outcomes of this test condition.
These outgoing edges terminate in child nodes. A record $d$ in $D_t$ is assigned to one of these child nodes based on the outcome of the test condition applied to $d$.
3) These steps are then recursively applied to each child node.
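A rough, purely illustrative Python sketch of this greedy recursive procedure (my own toy version, not taken from the tutorial or any library; it assumes a numeric 2-D numpy array X and 0/1 integer labels y, uses binary threshold splits, and scores splits with a simple misclassification rate since better purity measures are only introduced below):
import numpy as np

def misclassification_rate(labels):
    # Fraction of records not in the majority class (a simple purity measure).
    counts = np.bincount(labels)
    return 1.0 - counts.max() / float(len(labels))

def grow_tree(X, y, depth=0, max_depth=3):
    # Step 1: if all records share one class (or a depth limit is reached),
    # return a leaf labeled with the majority class.
    if len(np.unique(y)) == 1 or depth == max_depth:
        return {'leaf': int(np.bincount(y).argmax())}
    # Step 2: greedily create the binary test condition (feature, threshold)
    # with the lowest weighted impurity over the two child partitions.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue  # degenerate split: one empty child
            score = (left.sum() * misclassification_rate(y[left]) +
                     (~left).sum() * misclassification_rate(y[~left])) / float(len(y))
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        return {'leaf': int(np.bincount(y).argmax())}
    _, j, t = best
    mask = X[:, j] <= t
    # Step 3: recursively apply the same steps to each child node.
    return {'feature': j, 'threshold': t,
            'left': grow_tree(X[mask], y[mask], depth + 1, max_depth),
            'right': grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}
For instance, grow_tree(rich_features_final.values, target) would build a small toy tree on the Titanic features prepared earlier.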
Decision trees are easy to interpret, but the algorithms to create them are a bit complicated.
Test conditions can create binary splits
Alternatively, we can create multiway splits
Multiway splits can produce purer subsets, but may lead to overfitting!
For continuous features, we can use either method
Therefore we want each step to create the partition with the highest possible purity.
We need an objective function to optimize!
We want our objective function to measure the gain in purity from a particular split.
Therefore we want it to depend on the class distribution over the nodes (before and after the split).
For example, let $p(i|t)$ be the probability of class $i$ at node $t$ (eg, the fraction of records labeled $i$ at node $t$).
We are using the frequentist definition of probability here!
Then for a binary $(0/1)$ classification problem,
The minimum purity partition is given by the distribution:
$$ p(0|t) = p(1|t) = 0.5 $$
The maximum purity partition is given (e.g.) by the distribution: $$ p(0|t) = 1 - p(1|t) = 1 $$
Some measures of impurity include:
$$ \text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)
\\ \text{Gini}(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2
\\ \text{Classification error}(t) = 1 - \max_i [p(i|t)] $$
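A small helper sketch of these three measures for a single node (my own illustration, not from the tutorial; it assumes p is an array of class probabilities at the node):
import numpy as np

def impurity_measures(p):
    # p: array of class probabilities at a node, e.g. np.array([0.5, 0.5])
    p_nonzero = p[p > 0]                        # avoid log(0) in the entropy term
    entropy = -np.sum(p_nonzero * np.log2(p_nonzero))
    gini = 1.0 - np.sum(p ** 2)
    classification_error = 1.0 - np.max(p)
    return entropy, gini, classification_error

print(impurity_measures(np.array([0.5, 0.5])))  # minimum purity: entropy 1.0, Gini 0.5, error 0.5
print(impurity_measures(np.array([1.0, 0.0])))  # maximum purity: all three measures are 0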
Despite being broadly consistent with one another, different measures may create different splits.
Generally speaking, a test condition with a high number of outcomes can lead to overfitting (ex: a split with one outcome per record).
One way of dealing with this is to restrict the algorithm to binary splits only (CART).
Another way is to use a splitting criterion which explicitly penalizes the number of outcomes (C4.5)
We can use a function of the information gain called the gain ratio to explicitly penalize high numbers of outcomes:
$$ \text{gain ratio} = \frac{\Delta_{info}}{-\sum_i p(v_i) \log_2 p(v_i)} $$
(where $p(v_i)$ is the fraction of records assigned to child node $v_i$)
This is a form of regularization!
Impurity measures put us on the right track, but on their own they are not enough to tell us how our split will do.
Why is this true?
We still need to look at impurity before & after the split.
We can make this comparison using the gain:
$$ \Delta = I(parent) - \sum_{children} \frac{N_j}{N}I(child_j) $$
(Here $I$ is the impurity measure, $N_j$ denotes the number of records at child node $j$, and $N$ denotes the number of records at the parent node.)
N.B. When $I$ is the entropy, this quantity is called the information gain.
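A hedged sketch of this before/after comparison (my own illustration with made-up class counts; the helper names are not from any library):
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain(parent_counts, children_counts, measure=entropy):
    # Weighted decrease in impurity from parent to children
    # (this is the information gain when the measure is the entropy).
    parent_counts = np.asarray(parent_counts, dtype=float)
    N = parent_counts.sum()
    delta = measure(parent_counts / N)
    for child in children_counts:
        child = np.asarray(child, dtype=float)
        delta -= (child.sum() / N) * measure(child / child.sum())
    return delta

# Made-up example: a parent node with class counts [10, 10] split into two
# children with counts [8, 2] and [2, 8].
children = [[8, 2], [2, 8]]
info_gain = gain([10, 10], children)
print(info_gain)

# C4.5-style gain ratio: divide by the split information, which penalizes
# splits with many small outcomes.
sizes = np.array([sum(c) for c in children], dtype=float)
sizes /= sizes.sum()
split_info = -np.sum(sizes * np.log2(sizes))
print(info_gain / split_info)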
In addition to determining splits, we also need a stopping criterion to tell us when we’re done.
For example, we can stop when all records belong to the same class, or when all records have the same attributes.
This is correct in principle, but would likely lead to overfitting.
One possibility is pre-pruning, which involves setting a minimum threshold on the gain, and stopping when no split achieves a gain above this threshold.
This prevents overfitting, but is difficult to calibrate in practice (may preserve bias!)
Alternatively we could build the full tree, and then perform pruning as a post-processing step.
To prune a tree, we examine the nodes from the bottom-up and simplify pieces of the tree (according to some criteria).
Complicated subtrees can be replaced either with a single node, or with a simpler (child) subtree.
The first approach is called subtree replacement, and the second is subtree raising.
sklearn
also implements non-linear models that are known to perform very well for data-science projects where datasets do not have too many features (e.g. fewer than 5000).
In particular let us have a look at Decision Trees:
In [117]:
from sklearn import tree
from sklearn.cross_validation import cross_val_score
clf = tree.DecisionTreeClassifier(criterion='gini')
scores = cross_val_score(clf, rich_features_final, target, cv=5, scoring='accuracy')
print(scores.min(), scores.mean(), scores.max())
Decision Trees seem to do slightly better than the logistic regression model on this data.
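The pre-pruning idea discussed earlier roughly corresponds to limiting how far the tree is allowed to grow. As an illustrative variation (max_depth=4 and min_samples_leaf=5 are arbitrary example values, not recommendations from the tutorial):
from sklearn import tree
from sklearn.cross_validation import cross_val_score
# Restrict tree depth and leaf size as a pre-pruning-style control on overfitting.
pruned_clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=4,
                                         min_samples_leaf=5)
scores = cross_val_score(pruned_clf, rich_features_final, target, cv=5,
                         scoring='accuracy')
print(scores.min(), scores.mean(), scores.max())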
Exercise:
Change the value of the criterion
to 'entropy'. Do you get a different mean score?
On your own, go through the tutorials available on scikit-learn's website (15min)
http://scikit-learn.org/dev/modules/tree.html
(Do not worry about the pydot segment if you do not have it installed)
Try Decision Trees on your own data!
In [ ]: