Introduction to Machine Learning with Python

Module 2

Learning Activity 1: Load the required libraries


In [1]:
import scipy
import numpy as np
import pandas as pd
import plotly.plotly as py

import visplots

from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from sklearn import preprocessing, metrics
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats.distributions import randint

init_notebook_mode()

print("libraries all imported, ready to go")


/Users/santiaago/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning:

Matplotlib is building the font cache using fc-list. This may take a moment.

libraries all imported, ready to go

Learning Activity 2: Importing the data

The dataset we will be using throughout this workshop is an adapted version of the Wine Quality case study, available from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The goal of this case study is to model the wine quality (into "low" or "high" quality) based on physicochemical tests (such as fixed and volatile acidity, citric acid, etc.).

The first thing you will need to do in order to work with the wine quality dataset is to read the contents from the provided wine_quality.csv data file using the read_csv command. You should also try to explore the first few rows of the imported wine DataFrame using the head function from the pandas package (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html):


In [8]:
# Import the data and explore the first few rows
df = pd.read_csv('data/wine_quality.csv')
#print df
df.head()
#df.tail()


Out[8]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates quality
0 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 low
1 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 low
2 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 low
3 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 low
4 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 low

In order to feed the data into our classification models and sklearn, the imported wine quality DataFrame needs to be converted into a numpy array. For more information on numpy arrays, see http://scipy-lectures.github.io/intro/numpy/array_object.html.

In addition, it is always a good practice to check the dimensionality of the imported data using the shape command prior to constructing any classification model to make sure you have really imported all the data, and imported it in the correct way (e.g. one common mistake is to get the separator wrong and end up with only one column).


In [21]:
# Convert to numpy array and check the dimensionality
array = df.as_matrix()
print len(array)
print len(array[0])
print np.shape(array)
print type(array)


1487
11
(1487, 11)
<type 'numpy.ndarray'>

Test Activity 3: Inspect your data by indexing and index slicing

To select elements in an array, you specify their indices with square bracket notation. For a two-dimensional array, the first index indicates the row number and the second index indicates the column number. Try selecting the values of the first and second columns of the first sample in the npArray:


In [28]:
# Print the 1st row and 1st column of npArray
print array[0]
print array[:,0]
print np.shape(array[0])
print np.shape(array[:,0])


[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
[7.4 7.8 7.8 ..., 7.9 7.9 7.1]
(11,)
(1487,)

In [29]:
# Print the 1st row and 2nd column of npArray
print array[0]
print array[:,1]
print np.shape(array[0])
print np.shape(array[:,1])


[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
[0.7 0.88 0.76 ..., 0.41 0.44 0.44]
(11,)
(1487,)

To select ranges of elements, we use "index slicing". Index slicing is the technical name for the syntax A[lower:upper], where lower refers to the lower bound index that is included, and upper refers to the upper bound index that is not included. Try selecting the first three samples (rows):


In [51]:
# Print the first 3 rows of npArray
print array[0:3]
print np.shape(array[0:3])


[[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
 [7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 'low']
 [7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 'low']]
(3, 11)

and also the first three samples (rows) of the last column:


In [32]:
# Print the first 3 rows from the last column of npArray
print array[0:3,-1]


['low' 'low' 'low']

Learning Activity 4: Split the data into input features, X, and outputs, y

Subsequently, we need to split our initial dataset into the data matrix X (independent variable) and the associated class vector y (dependent or target variable). The input features, X, are the variables that you use to predict the outcome. In this data set, there are ten input features stored in columns 1-10 (index 0-9, although the upper bound is not included so the range for indexing is 0:10), all of which have continuous values. The output label, y, holds the information of whether the wine has been rated as high or low quality, and is stored in the final (eleventh) column (index 10). To split the data, we need to assign the columns of the input features and the columns of the output labels to different arrays:


In [65]:
# Split to input matrix X and class vector y
y = array[:,-1]
x = array[:,0:10]

Try printing the size of the input matrix X and class vector y using the "shape" command:


In [66]:
# Print the dimensions of X and y
print np.shape(y)
print np.shape(x)


(1487,)
(1487, 10)

Exploratory Data Analysis

Visualisation is an integral part of Data Science. Exploratory data analysis (EDA) is the field dealing with the analysis of data sets as a means of summarising their main characteristics, most often using visual methods.

Plotly is an online collaborative data analysis and graphing tool that we will use in order to construct fully interactive graphs. The Plotly API allows you to access all of the library's interactive functionality directly from Python (or other programming languages such as R, JavaScript and MATLAB, among others). Crucially, Plotly has recently been made open-source, which now enables plotting offline without requiring access to their API. Plotly Offline brings interactive Plotly graphs to the offline Jupyter (IPython) Notebook environment.

Learning Activity 5: Investigate the y frequencies

An important aspect to understand before applying any classification algorithms is how the output labels are distributed. Are they evenly distributed? Imbalances in distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good.


In [67]:
# Print the y frequencies
lows = 0
for e in y:
    if e == 'low':
        lows = lows + 1

highs = len(y) - lows

yFreq = scipy.stats.itemfreq(y)
print 'lows ' + str(lows)
print 'hights ' + str(highs)
print yFreq


lows 987
hights 500
[['high' 500]
 ['low' 987]]

In our current dataset, you can see that the y values are categorical (i.e. they can only take one of a discrete set of values) and have a non-numeric representation, "high" vs. "low". This can be problematic for scikit-learn and plotting functions in Python, since they assume numerical values, so we need to map the text categories to numerical representations using LabelEncoder and the fit_transform function from the preprocessing module:


In [68]:
# Convert the categorical to numeric values, and print the y frequencies
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
print y


[1 1 1 ..., 0 0 0]

Visualising the data in some way is a good way to get a feel for how the data is distributed. As a simple example, try plotting the frequencies of the class labels (held in yFreq), 1 and 0, and see how they are distributed using a barplot from Plotly:


In [69]:
# Display the y frequencies in a barplot with Plotly
py.sign_in('santiaago', 'dlekv8an96')
data = [
    graph_objs.Bar(
        x=['low', 'high'],
        y=[lows, highs]
    )
]

plot_url = py.plot(data, filename='basic-bar')


High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~santiaago/0 or inside your plot.ly account where it is named 'basic-bar'

More examples on Plotly barplots can be found at https://plot.ly/python/bar-charts/. In addition, a full list of arguments on barplots can be found at https://plot.ly/python/reference/#bar/.

Learning Activity 6: Data scaling

It is usually advisable to scale your data prior to fitting a classification model to avoid attributes with greater numeric ranges dominating those with smaller numeric ranges. In order to investigate the range and descriptive statistics of our features, we can apply the describe() function from pandas to the original wineQ DataFrame (not the numpy array!). For instance:


In [70]:
# Print the descriptive statistics of the wineQ DataFrame
df.describe()


Out[70]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates
count 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000 1487.000000
mean 8.272966 0.465716 0.303423 2.898588 0.079183 18.746469 61.755884 0.996225 3.293692 0.639388
std 1.784884 0.192836 0.180989 2.198372 0.047974 12.936183 45.799977 0.002646 0.154856 0.180488
min 4.600000 0.080000 0.000000 0.900000 0.012000 1.000000 6.000000 0.989000 2.740000 0.250000
25% 7.000000 0.317500 0.170000 1.900000 0.059000 9.000000 25.000000 0.994965 3.190000 0.530000
50% 7.900000 0.440000 0.320000 2.200000 0.076000 15.000000 47.000000 0.996720 3.300000 0.610000
75% 9.200000 0.600000 0.430000 2.800000 0.088000 27.000000 93.000000 0.997905 3.390000 0.720000
max 15.900000 1.330000 1.000000 15.500000 0.611000 82.000000 289.000000 1.003200 3.900000 2.000000

Boxplots are a powerful visual aid, commonly used in order to investigate simultaneously the range differences of the input features. Boxplots are a standardised way of displaying the distribution of the data based on the "five number summary" (minimum, first quartile, median, third quartile, and maximum). For example, try and plot the features of the raw matrix X using the script for the boxplots:


In [83]:
# Create a boxplot of the raw data
import plotly.graph_objs as go

traces = []
t0 = go.Box(y=x[:,0])
t1 = go.Box(y=x[:,1])
t2 = go.Box(y=x[:,2])
t3 = go.Box(y=x[:,3])
t4 = go.Box(y=x[:,4])
t5 = go.Box(y=x[:,5])
t6 = go.Box(y=x[:,6])
t7 = go.Box(y=x[:,7])
t8 = go.Box(y=x[:,8])
t9 = go.Box(y=x[:,9])

traces = [t1,t2,t3,t4,t5,t6,t7,t8,t9]
data = traces
plot_url = py.plot(data, filename='basic-box-plot')

There are many ways of scaling but one common scaling mechanism is auto-scaling, where for each column, the values are centred around the mean and divided by their standard deviation. This scaling mechanism can be applied by calling the scale() function in scikit-learn’s preprocessing module.


In [84]:
# Auto-scale the data
x_scale = preprocessing.scale(x)


/Users/santiaago/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:420: DataConversionWarning:

Data with input dtype object was converted to float64 by the scale function.

Try to re-run the previous plotting script and have a look at the outcome of the boxplot after scaling. Alternatively, if you feel more adventurous, you create a more enhanced version of the boxplot. You can find more online examples at https://plot.ly/python/box-plots/, and also a full list of boxplot arguments at https://plot.ly/python/reference/#box.


In [85]:
# Create a boxplot of the scaled data (simple or enhanced)
traces1 = []
t0 = go.Box(y=x_scale[:,0])
t1 = go.Box(y=x_scale[:,1])
t2 = go.Box(y=x_scale[:,2])
t3 = go.Box(y=x_scale[:,3])
t4 = go.Box(y=x_scale[:,4])
t5 = go.Box(y=x_scale[:,5])
t6 = go.Box(y=x_scale[:,6])
t7 = go.Box(y=x_scale[:,7])
t8 = go.Box(y=x_scale[:,8])
t9 = go.Box(y=x_scale[:,9])

traces1 = [t1,t2,t3,t4,t5,t6,t7,t8,t9]
data = traces1
plot_url = py.plot(data, filename='basic-box-plot')

Learning Activity 7: Investigate the relationship between input features

You can visualise the relationship between two variables (features) using a simple scatter plot. This step can give you a good first indication of the ML model model to apply and its complexity (linear vs. non-linear). At this stage, let’s plot the first two variables against each other:


In [87]:
# Create a scatter plot of the first two features
trace1 = go.Scatter(
    x = x[:,0],
    y = x[:,1],
    mode = 'markers'
)

data = [trace1]

# Plot and embed in ipython notebook!
py.iplot(data, filename='basic-scatter')


Out[87]:

We can also relate associations between features to their y classifications by making the colour of the points dependent on the corresponding y value:


In [ ]:
# Create an enhanced scatter plot of the first two features

Examples of Plotly scatterplots can be found at https://plot.ly/python/line-and-scatter/ (or for a list of arguments refer to https://plot.ly/python/reference/#scatter/).

Bonus Activity 8: Try plotting di€fferent combinations of three features (f1, f2, f3) in the same plot.

The scatterplots we have seen so far investigated the relationship between two variables (features). A three-dimensional graph lets you introduce a third axis, typically called the z axis, and can help you understand the relationship between three variables. Plotly's fully interactive functionality allows you to plot, hover, zoom and rotate 3-dimensional scatterplots. For a full list of arguments on 3d plots in Plotly visit https://plot.ly/python/reference/#scatter3d. Other examples on 3D scatterplots using Plotly can be found at https://plot.ly/python/3d-scatter-plots/.

Hint: Investigate the Scatter3d object from Plotly

Axes in 3D Plotly plots work a bit differently than in 2D (axes are bound to a Scene object -- use help(Scene)).


In [ ]:
# Create a 3D scatterplot using the first three features
trace1 = go.Scatter(
    x = x[:,0],
    y = x[:,1],
    mode = 'markers'
)

data = [trace1]

# Plot and embed in ipython notebook!
py.iplot(data, filename='basic-scatter')

Bonus Activity 9: Try di€fferent combinations of f1 and f2 (in a grid/scatterplot matrix if you can).

A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all other attributes. For example, try to create a scatterplot matrix of the first four features. You can find further information on how to create and customise subplots with Plotly at https://plot.ly/python/subplots/.

Hints: You may want to use nested loops that iterate through the rows and columns of the grid, and also import and make use of the make_subplots() function from Plotly


In [ ]:
# Create a grid plot of scatterplots using a combination of features

Bonus Activity 10: Create a correlation matrix and plot a heatmap of correlations between the input features in X

Often, the different features (variables) in X are not completely independent from each other. For example, fixed acidity is related to volatile acidity. To quickly identify which features are related and to what extent, it is useful to see how they are correlated. You can do this by creating a correlation matrix from X using corrcoef() in the numpy module:


In [ ]:
# Calculate the correlation coefficient

To search for linear relationships between features across all pairs of features, you can use a heatmap of correlations (directly from X), which is simply a matrix of subplots whose colours represent the sizes of the correlations:


In [ ]:
# Create a heatmap of the correlation coefficients

Module 3

Learning Activity 1: Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the wine dataset into two disjoint sets: train and test (Holdout method) using the train_test_split function.


In [178]:
# Split into training and test sets
X = df.as_matrix()[:,0:10]
y = df.as_matrix()[:,-1]
xTrain, xTest, yTrain, yTest = train_test_split(X, y)
#xTrain, xTest = train_test_split(df)
#xTrain = xTrain.as_matrix()
#xTest = xTest.as_matrix()
#yTrain = xTrain[:,-1]
#yTest = xTest[:,-1]
#xTrain = xTrain[:,0:10]
#xTest = xTest[:,0:10]

XTrain and yTrain are the two arrays you use to train your model. XTest and yTest are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also specify the proportion of data you want to use for training and testing.


You can check the sizes of the different training and test sets by using the shape attribute:


In [179]:
# Print the dimensionality of the individual splits
print np.shape(xTrain)
print np.shape(xTest)
print np.shape(yTrain)
print np.shape(yTest)


(1115, 10)
(372, 10)
(1115,)
(372,)

You can also investigate how the class labels are distributed within the yTest vector by using the itemfreq function as previously


In [180]:
# Calculate the frequency of classes in yTest
scipy.stats.itemfreq(yTest)


Out[180]:
array([['high', 135],
       ['low', 237]], dtype=object)

We can see that 129 random samples of class 0 (high quality) and 243 random samples of class 1 (low quality) are included in the yTest set.

Learning Activity 2: Apply KNN classification algorithm with scikit-learn

To build KNN models using scikit-learn, you will be using the KNeighborsClassifier object, which allows you to set the value of K using the n_neighbors parameter (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). The optimal choice for the value K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct.

For every classification model built with scikit-learn, we will follow four main steps: 1) Building the classification model (using either default, pre-defined or optimised parameters), 2) Training the model with data, 3) Testing the model, and 4) Performance evaluation using various metrics.

We are going to start by trying two pre-defined random values of K and compare their performance. Let us start with a small number of K such as K=3.


In [181]:
# Build a KNN classifier with 3 nearest neighbors
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(xTrain, yTrain)
yPredK3 = knn3.predict(xTest)

print("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredK3), 2))


('Overall Accuracy:', 0.77)

Let us try a larger value of K, for instance K = 99 or another number of your own choice; remember, it is good practice to select an odd number for K in a binary classification problem to avoid ties. Can you generate the KNN model and print the overall performance for a larger K (such as K=99) using as guidance the previous example?


In [182]:
# Build a KNN classifier with 99 nearest neighbors
knn99 = KNeighborsClassifier(n_neighbors=99)
knn99.fit(xTrain, yTrain)
yPredK99 = knn99.predict(xTest)

print("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredK99), 2))


('Overall Accuracy:', 0.76)

Learning Activity 3: Calculate validation metrics for your classifier

In a classification task, once you have created your predictive model, you will need to evaluate it. Evaluation functions help you to do this by reporting the performance of the model through four main performance metrics: precision, recall and specificity for the different classes, and overall accuracy. To understand these metrics, it is useful to create a confusion matrix, which records all the true positive, true negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the confusion_matrix function in the metrics module.


In [183]:
# Get the confusion matrix for your classifier using metrics.confusion_matrix
mat = metrics.confusion_matrix(yTest, yPredK3)
print(mat)


[[ 87  48]
 [ 37 200]]

Because performance metrics are such an important step of model evaluation, scikit-learn offers a wrapper around these functions, metrics.classification_report, to facilitate their computation. It also offers the function metrics.accuracy_score that we tried before to compute the overall accuracy.


In [184]:
# Report the metrics using metrics.classification_report

Learning Activity 4: Plot the decision boundaries for di€fferent models

We can visualise the classification boundary created by the KNN classifier using the built-in function visplots.knnDecisionPlot. For easier visualisation, you can (interactively) select to view only the test samples from the plot. Remember though that the decision boundary has been built using the training data!


In [185]:
# Check the arguments of the function
help(visplots.knnDecisionPlot)

# Visualise the boundaries
yTrain = le.fit_transform(yTrain)
yTest = le.fit_transform(yTest)
header =  df.columns
visplots.knnDecisionPlot(xTrain, yTrain, xTest, yTest, header, n_neighbors = 3)
visplots.knnDecisionPlot(xTrain, yTrain, xTest, yTest, header, n_neighbors = 99)


Help on function knnDecisionPlot in module visplots:

knnDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_neighbors, weights='uniform')

For smaller values of K the decision boundaries present many "creases". In this case the models may suffer from instances of overfitting. For larger values of K, we can see that the decision boundaries are less distinct and tend towards linearity. In these cases the boundaries may be too simple and unable to learn thus leading to cases of underfitting.

Test Activity 5: Try di€fferent weight configurations

Under some circumstances, it is better to give more importance ("weight" in computing terms) to nearer neighbors. This can be accomplished through the weights parameter. When weights = 'distance', weights are assigned to the training data points in a way that is proportional to the inverse of the distance from the query point. In other words, nearer neighbors contribute more to the fit.

What if we use weights based on distance? Does it improve the overall performance?


In [186]:
# Build the classifier with two pre-defined parameters (n_neighbors and weights)
knn = KNeighborsClassifier(n_neighbors=3, weights= 'distance')
knn.fit(xTrain, yTrain)
yPredK = knn.predict(xTest)

print(metrics.classification_report(yTest, yPredK))
print("accuracy: ", round(metrics.accuracy_score(yTest, yPredK), 2))

# Visualise the boundaries of a KNN model with weights equal to "distance"
visplots.knnDecisionPlot(xTrain, yTrain, xTest, yTest, header, n_neighbors = 3, weights = 'distance')


             precision    recall  f1-score   support

          0       0.72      0.69      0.70       135
          1       0.83      0.85      0.84       237

avg / total       0.79      0.79      0.79       372

('accuracy: ', 0.79)

Module 4

Learning Activity 1: Implement k-fold cross-validation

Let us estimate the accuracy of the classifier on the wine quality dataset by splitting the data 5 consecutive times (the parameter cv gives the number of samples the data is split into) using the cross_val_score function. For example, try to implement cross-validation for knn3, your KNN model with k=3:


In [187]:
# Implement cross-validation for knn3
knn3scores = cross_val_score(knn3, X, y, cv = 5)
print(knn3scores)
print("Mean of scores KNN3", knn3scores.mean())

knn10 = KNeighborsClassifier(n_neighbors=10)
knn10.fit(xTrain, yTrain)
knn10scores = cross_val_score(knn10, X, y, cv = 5)
print(knn10scores)
print("Mean of scores KNN10", knn10scores.mean())


[ 0.54697987  0.60067114  0.68013468  0.77441077  0.60942761]
('Mean of scores KNN3', 0.64232481413689468)
[ 0.52348993  0.59395973  0.6969697   0.84175084  0.65319865]
('Mean of scores KNN10', 0.66187377126974445)

What happens if we change the number of K?

Parameter Tuning

Learning Activity 2: Grid search on hyperparameters

Rather than trying one-by-one predefined values of K, we can automate this process. The scikit-learn library provides the grid search function GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html), which allows us to exhaustively search for the optimum combination of parameters by evaluating models trained with a particular algorithm with all provided parameter combinations. Further details and examples on grid search with scikit-learn can be found at http://scikit-learn.org/stable/modules/grid_search.html

You can use the GridSearchCV function with the validation technique of your choice (in this example, 10-fold cross-validation has been applied) to search for a parametisation of the KNN algorithm that gives a more optimal model:


In [188]:
# Grid search with 10-fold cross-validation using a dictionary of parameters
n_neighbors = np.arange(1, 51, 2)   
weights     = ['uniform','distance'] 
params  = [{'n_neighbors': n_neighbors, 'weights': weights}] 

gridCV = GridSearchCV(KNeighborsClassifier(), params, cv=10)
gridCV.fit(xTrain, yTrain)

bestN = gridCV.best_params_['n_neighbors']
bestW = gridCV.best_params_['weights']

print("Best parameters: n_neighbors=", bestN, "and weight=", bestW)


('Best parameters: n_neighbors=', 39, 'and weight=', 'distance')

We can also graphically represent the results of the grid search using a heatmap:


In [189]:
# Visualise the grid search results using a heatmap
scores = np.zeros((len(n_neighbors), len(weights)))

# grid_scores_ contains parameter settings and scores
for score in gridCV.grid_scores_: 
    ne = score[0]['n_neighbors'] 
    i = np.argmax(n_neighbors == ne) 
    j = 0 if (score[0]['weights'] == 'uniform') else 1
    scores[i,j] = score[1]

# Make a heatmap with the performance
data = [
    Heatmap(
        x = n_neighbors,
        y = weights,
        z = scores.T,
        colorscale='Jet',
        reversescale=True,
        colorbar = dict(
            title = "Classification Accuracy",
            len = 4,
            nticks=10
        )
    )
]

layout = Layout(
    xaxis = dict(title = "Number of K nearest neighbors", tickvals = n_neighbors),
    yaxis = dict(title = "Weights"),
    height= 250
)

fig = dict(data = data, layout = layout)

iplot(fig)


When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (XTest).
So, we are testing our independent XTest dataset using the optimised model:


In [191]:
# Build the classifier using the optimal parameters detected by grid search 
knnBest = KNeighborsClassifier(n_neighbors=bestN, weights=bestW)
knnBest.fit(xTrain, yTrain)
yPredKnnBest = knnBest.predict(xTest)

# Report the test accuracy and performance
print(metrics.classification_report(yTest, yPredKnnBest))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredKnnBest), 2))


             precision    recall  f1-score   support

          0       0.79      0.61      0.69       135
          1       0.81      0.91      0.85       237

avg / total       0.80      0.80      0.79       372

('Overall Accuracy:', 0.8)

Learning Activity 3: Systematic variation of the K neighbors, and the bias-variance trade-off

We can graphically represent and investigate how the systematic increase of the number of K neighbors influences the validation, train and test accuracy (and attempt to detect cases of over- or under-fitting).


In [192]:
# Explore the benefit of cross-validated results vs. simple training and test data separation
train_scores = []
test_scores  = []
cv_scores    = [x[0] for x in scores]

for n in n_neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(xTrain, yTrain)
    train_scores.append(metrics.accuracy_score(yTrain, knn.predict(xTrain)))
    test_scores.append(metrics.accuracy_score(yTest, knn.predict(xTest)))

In [193]:
# Plot the train, test and validation accuracies
trace0 = Scatter(
    x = n_neighbors,
    y = train_scores,
    mode = "lines+markers",
    name = "Training Scores"
)

trace1 = Scatter(
    x = n_neighbors,
    y = test_scores,
    mode = "lines+markers",
    name = "Test Scores"
)

trace2 = Scatter(
    x = n_neighbors,
    y = cv_scores,
    mode = "lines+markers",
    name = "CV Scores"
)

layout = Layout(
    xaxis = dict(title = 'number of neighbors'),
    yaxis = dict(title = 'prediction accuracy')
)

fig = Figure(data=[trace0, trace1, trace2], layout=layout)

iplot(fig)


Test Activity 4: Randomized search on hyperparameters

Unlike GridSearchCV, RandomizedSearchCV does not exhaustively try all the parameter settings. Instead, it samples a fixed number of parameter settings based on the distributions you specify (e.g. you might specify that one parameter should be sampled uniformly while another is sampled following a Gaussian distribution). The number of parameter settings that are tried is given by n_iter. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. You should use continuous distributions for continuous parameters. Further details can be found at http://scikit-learn.org/stable/modules/grid_search.html


In [194]:
# Conduct a randomised search on hyperparameters
parameters = {'n_neighbors': randint(1,200)}

randomCV = RandomizedSearchCV(KNeighborsClassifier(),
                              param_distributions=parameters, n_iter=20)

randomCV.fit(xTrain, yTrain)


Out[194]:
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'n_neighbors': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11a183d50>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)

In [195]:
# Print the optimal n_neighbors detected by randomised search
yRandomCV = randomCV.predict(xTest)

# Report the test accuracy and performance
print(metrics.classification_report(yTest, yRandomCV))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, yRandomCV), 2))


             precision    recall  f1-score   support

          0       0.70      0.50      0.58       135
          1       0.75      0.88      0.81       237

avg / total       0.73      0.74      0.73       372

('Overall Accuracy:', 0.74)

We can also graphically represent the results of the randomised search using a scatterplot:


In [202]:
# Visualise the randomised search results using a scatterplot
neighbor = [score_tuple[0]['n_neighbors'] for score_tuple in randomCV.grid_scores_]
result   = [score_tuple[1] for score_tuple in randomCV.grid_scores_]

print neighbor
print result

trace = go.Scatter(
    x = neighbor,
    y = result,
    mode = 'markers'
)
data = [trace]

# Plot and embed in ipython notebook!
py.iplot(data, filename='basic-scatter')


[52, 64, 2, 70, 133, 59, 33, 61, 25, 35, 79, 198, 44, 117, 4, 19, 173, 20, 63, 1]
[0.7488789237668162, 0.74708520179372195, 0.69686098654708517, 0.74708520179372195, 0.74260089686098651, 0.74977578475336326, 0.75784753363228696, 0.74977578475336326, 0.7542600896860987, 0.7569506726457399, 0.74349775784753358, 0.74708520179372195, 0.74977578475336326, 0.74170403587443945, 0.70313901345291485, 0.74170403587443945, 0.73991031390134532, 0.7488789237668162, 0.7488789237668162, 0.7542600896860987]
Out[202]:

In [205]:
# Build the classifier using the optimal parameters detected by randomised search
#print neighbor[33]
knn33 = KNeighborsClassifier(n_neighbors=33)
knn33.fit(xTrain, yTrain)
y33 = knn33.predict(xTest)

print(metrics.classification_report(yTest, y33))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, y33), 2))


             precision    recall  f1-score   support

          0       0.70      0.50      0.58       135
          1       0.75      0.88      0.81       237

avg / total       0.73      0.74      0.73       372

('Overall Accuracy:', 0.74)

Module 5

Learning Activity 1: Decision Tree

Decision Tree classifiers construct classification models in the form of a tree structure. A decision tree progressively splits the training set into smaller subsets. Each node of the tree represents a subset of the data. Once a new sample is presented to the data, it is classified according to the test condition generated for each node of the tree.

Let us build a simple decision tree with 3 layers. We will first evaluate the accuracy, then plot the decision boundaries just as we did for knn. (See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html for the documentation of the classifier.)


In [208]:
# Build a Decision Tree classifier with 3 layers
dtc = DecisionTreeClassifier(max_depth=3)
dtc.fit(xTrain, yTrain)
predDT = dtc.predict(xTest)

print(metrics.classification_report(yTest, predDT))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predDT),2))

visplots.dtDecisionPlot(xTrain, yTrain, xTest, yTest, header, max_depth=3)


             precision    recall  f1-score   support

          0       0.88      0.70      0.78       135
          1       0.85      0.95      0.89       237

avg / total       0.86      0.85      0.85       372

('Overall Accuracy:', 0.85)

Learning Activity 2: Random Forests

The random forests model is an ensemble method since it aggregates a group of decision trees into an ensemble (http://scikit-learn.org/stable/modules/ensemble.html). Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good or better than the prediction made by any one classifer. Unlike single decision trees which are likely to suffer from high Variance or high Bias (depending on how they are tuned) Random Forests use averaging to find a natural balance between the two extremes.

Let us start by building a simple Random Forest model which consists of 100 independently trained decision trees. For further details and examples on how to construct a Random Forest, see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [228]:
# Build a Random Forest classifier with 100 decision trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(xTrain, yTrain)
predRF = rf.predict(xTest)

print(metrics.classification_report(yTest, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2))


             precision    recall  f1-score   support

          0       0.90      0.82      0.86       135
          1       0.90      0.95      0.93       237

avg / total       0.90      0.90      0.90       372

('Overall Accuracy:', 0.9)

Learning Activity 3: Visualising the RF accuracy

We can also investigate how the overall test accuracy gets influenced with the increase of n_estimators (decision trees) in our model. In order to do so, we can use the provided rfAvgAcc function from visplots:


In [229]:
# Visualise the average accuracy 
visplots.rfAvgAcc(rfModel=rf, XTest=xTest, yTest=yTest)


Learning Activity 4: Feature Importance

Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. This heuristic measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the datapoints in the training set. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is.

We can use the feature_importances_ attribute of the RF classifier to obtain the relative importance of each feature, which we can then visualise using a simple bar plot.


In [215]:
# Display the importance of the features in a barplot
data = [
    go.Bar(
        x=rf.feature_importances_,
        y=header[0:10],
        orientation = 'h',
    )
]
plot_url = py.plot(data, filename='horizontal-bar')

Learning activity 5: Boundary visualisation

We can visualise the classification boundary created by the Random Forest using the visplots.rfDecisionPlot function. You can check the arguments passed in this function by using the help command. For easier visualisation, only the test samples have been included in the plot. And remember that the decision boundary has been built using the training data!


In [216]:
# Check the arguments of the function
help(visplots.rfDecisionPlot)


# Visualise the boundaries
visplots.rfDecisionPlot(XTrain, yTrain, XTest, yTest, header)


Help on function rfDecisionPlot in module visplots:

rfDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_estimators=10)

Random forests offer several parameters that can be tuned. In this case, parameters such as n_estimators, max_features, max_depth and min_samples_leaf can be some of the parameters to be optimised.


In [217]:
# View the list of arguments to be optimised
help(RandomForestClassifier())


Help on RandomForestClassifier in module sklearn.ensemble.forest object:

class RandomForestClassifier(ForestClassifier)
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and use averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is always the same as the original
 |  input sample size but the samples are drawn with replacement if
 |  `bootstrap=True` (default).
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimators : integer, optional (default=10)
 |      The number of trees in the forest.
 |  
 |  criterion : string, optional (default="gini")
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |      Note: this parameter is tree-specific.
 |  
 |  max_features : int, float, string or None, optional (default="auto")
 |      The number of features to consider when looking for the best split:
 |  
 |      - If int, then consider `max_features` features at each split.
 |      - If float, then `max_features` is a percentage and
 |        `int(max_features * n_features)` features are considered at each
 |        split.
 |      - If "auto", then `max_features=sqrt(n_features)`.
 |      - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
 |      - If "log2", then `max_features=log2(n_features)`.
 |      - If None, then `max_features=n_features`.
 |  
 |      Note: the search for a split does not stop until at least one
 |      valid partition of the node samples is found, even if it requires to
 |      effectively inspect more than ``max_features`` features.
 |      Note: this parameter is tree-specific.
 |  
 |  max_depth : integer or None, optional (default=None)
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.
 |      Ignored if ``max_leaf_nodes`` is not None.
 |      Note: this parameter is tree-specific.
 |  
 |  min_samples_split : integer, optional (default=2)
 |      The minimum number of samples required to split an internal node.
 |      Note: this parameter is tree-specific.
 |  
 |  min_samples_leaf : integer, optional (default=1)
 |      The minimum number of samples in newly created leaves.  A split is
 |      discarded if after the split, one of the leaves would contain less then
 |      ``min_samples_leaf`` samples.
 |      Note: this parameter is tree-specific.
 |  
 |  min_weight_fraction_leaf : float, optional (default=0.)
 |      The minimum weighted fraction of the input samples required to be at a
 |      leaf node.
 |      Note: this parameter is tree-specific.
 |  
 |  max_leaf_nodes : int or None, optional (default=None)
 |      Grow trees with ``max_leaf_nodes`` in best-first fashion.
 |      Best nodes are defined as relative reduction in impurity.
 |      If None then unlimited number of leaf nodes.
 |      If not None then ``max_depth`` will be ignored.
 |      Note: this parameter is tree-specific.
 |  
 |  bootstrap : boolean, optional (default=True)
 |      Whether bootstrap samples are used when building trees.
 |  
 |  oob_score : bool
 |      Whether to use out-of-bag samples to estimate
 |      the generalization error.
 |  
 |  n_jobs : integer, optional (default=1)
 |      The number of jobs to run in parallel for both `fit` and `predict`.
 |      If -1, then the number of jobs is set to the number of cores.
 |  
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      If int, random_state is the seed used by the random number generator;
 |      If RandomState instance, random_state is the random number generator;
 |      If None, the random number generator is the RandomState instance used
 |      by `np.random`.
 |  
 |  verbose : int, optional (default=0)
 |      Controls the verbosity of the tree building process.
 |  
 |  warm_start : bool, optional (default=False)
 |      When set to ``True``, reuse the solution of the previous call to fit
 |      and add more estimators to the ensemble, otherwise, just fit a whole
 |      new forest.
 |  
 |  class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional
 |  
 |      Weights associated with classes in the form ``{class_label: weight}``.
 |      If not given, all classes are supposed to have weight one. For
 |      multi-output problems, a list of dicts can be provided in the same
 |      order as the columns of y.
 |  
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 |  
 |      The "balanced_subsample" mode is the same as "balanced" except that weights are
 |      computed based on the bootstrap sample for every tree grown.
 |  
 |      For multi-output, the weights of each column of y will be multiplied.
 |  
 |      Note that these weights will be multiplied with sample_weight (passed
 |      through the fit method) if sample_weight is specified.
 |  
 |  Attributes
 |  ----------
 |  estimators_ : list of DecisionTreeClassifier
 |      The collection of fitted sub-estimators.
 |  
 |  classes_ : array of shape = [n_classes] or a list of such arrays
 |      The classes labels (single output problem), or a list of arrays of
 |      class labels (multi-output problem).
 |  
 |  n_classes_ : int or list
 |      The number of classes (single output problem), or a list containing the
 |      number of classes for each output (multi-output problem).
 |  
 |  n_features_ : int
 |      The number of features when ``fit`` is performed.
 |  
 |  n_outputs_ : int
 |      The number of outputs when ``fit`` is performed.
 |  
 |  feature_importances_ : array of shape = [n_features]
 |      The feature importances (the higher, the more important the feature).
 |  
 |  oob_score_ : float
 |      Score of the training dataset obtained using an out-of-bag estimate.
 |  
 |  oob_decision_function_ : array of shape = [n_samples, n_classes]
 |      Decision function computed with out-of-bag estimate on the training
 |      set. If n_estimators is small it might be possible that a data point
 |      was never left out during the bootstrap. In this case,
 |      `oob_decision_function_` might contain NaN.
 |  
 |  References
 |  ----------
 |  
 |  .. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
 |  
 |  See also
 |  --------
 |  DecisionTreeClassifier, ExtraTreesClassifier
 |  
 |  Method resolution order:
 |      RandomForestClassifier
 |      ForestClassifier
 |      abc.NewBase
 |      BaseForest
 |      abc.NewBase
 |      sklearn.ensemble.base.BaseEnsemble
 |      sklearn.base.BaseEstimator
 |      sklearn.base.MetaEstimatorMixin
 |      sklearn.feature_selection.from_model._LearntSelectorMixin
 |      sklearn.base.TransformerMixin
 |      sklearn.base.ClassifierMixin
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset([])
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from ForestClassifier:
 |  
 |  predict(self, X)
 |      Predict class for X.
 |      
 |      The predicted class of an input sample is a vote by the trees in
 |      the forest, weighted by their probability estimates. That is,
 |      the predicted class is the one with highest mean probability
 |      estimate across the trees.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      y : array of shape = [n_samples] or [n_samples, n_outputs]
 |          The predicted classes.
 |  
 |  predict_log_proba(self, X)
 |      Predict class log-probabilities for X.
 |      
 |      The predicted class log-probabilities of an input sample is computed as
 |      the log of the mean predicted class probabilities of the trees in the
 |      forest.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          such arrays if n_outputs > 1.
 |          The class probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute `classes_`.
 |  
 |  predict_proba(self, X)
 |      Predict class probabilities for X.
 |      
 |      The predicted class probabilities of an input sample is computed as
 |      the mean predicted class probabilities of the trees in the forest. The
 |      class probability of a single tree is the fraction of samples of the same
 |      class in a leaf.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          such arrays if n_outputs > 1.
 |          The class probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute `classes_`.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseForest:
 |  
 |  apply(self, X)
 |      Apply trees in the forest to X, return leaf indices.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix, shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      X_leaves : array_like, shape = [n_samples, n_estimators]
 |          For each datapoint x in X and for each tree in the forest,
 |          return the index of the leaf x ends up in.
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Build a forest of trees from the training set (X, y).
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like, shape = [n_samples] or [n_samples, n_outputs]
 |          The target values (class labels in classification, real numbers in
 |          regression).
 |      
 |      sample_weight : array-like, shape = [n_samples] or None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node. In the case of
 |          classification, splits are also ignored if they would result in any
 |          single class carrying a negative weight in either child node.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Returns self.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from BaseForest:
 |  
 |  feature_importances_
 |      Return the feature importances (the higher, the more important the
 |         feature).
 |      
 |      Returns
 |      -------
 |      feature_importances_ : array, shape = [n_features]
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.ensemble.base.BaseEnsemble:
 |  
 |  __getitem__(self, index)
 |      Returns the index'th estimator in the ensemble.
 |  
 |  __iter__(self)
 |      Returns iterator over estimators in the ensemble.
 |  
 |  __len__(self)
 |      Returns the number of estimators in the ensemble.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __repr__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep: boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The former have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.feature_selection.from_model._LearntSelectorMixin:
 |  
 |  transform(*args, **kwargs)
 |      DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
 |      
 |      Reduce X to its most important features.
 |      
 |              Uses ``coef_`` or ``feature_importances_`` to determine the most
 |              important features.  For models with a ``coef_`` for each class, the
 |              absolute sum over the classes is used.
 |      
 |              Parameters
 |              ----------
 |              X : array or scipy sparse matrix of shape [n_samples, n_features]
 |                  The input samples.
 |      
 |              threshold : string, float or None, optional (default=None)
 |                  The threshold value to use for feature selection. Features whose
 |                  importance is greater or equal are kept while the others are
 |                  discarded. If "median" (resp. "mean"), then the threshold value is
 |                  the median (resp. the mean) of the feature importances. A scaling
 |                  factor (e.g., "1.25*mean") may also be used. If None and if
 |                  available, the object attribute ``threshold`` is used. Otherwise,
 |                  "mean" is used by default.
 |      
 |              Returns
 |              -------
 |              X_r : array of shape [n_samples, n_selected_features]
 |                  The input samples with only the selected features.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.TransformerMixin:
 |  
 |  fit_transform(self, X, y=None, **fit_params)
 |      Fit to data, then transform it.
 |      
 |      Fits transformer to X and y with optional parameters fit_params
 |      and returns a transformed version of X.
 |      
 |      Parameters
 |      ----------
 |      X : numpy array of shape [n_samples, n_features]
 |          Training set.
 |      
 |      y : numpy array of shape [n_samples]
 |          Target values.
 |      
 |      Returns
 |      -------
 |      X_new : numpy array of shape [n_samples, n_features_new]
 |          Transformed array.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

Create a dictionary of allowed parameter ranges for n_estimators and max_depth (or include more of the parameters you would like to tune) and conduct a grid search with cross validation using the GridSearchCV function as before:


In [219]:
# Conduct a grid search with 10-fold cross-validation using the dictionary of parameters
n_estimators = np.arange(1, 30, 5)
max_depth    = np.arange(1, 100, 5)
parameters   = [{'n_estimators': n_estimators, 'max_depth': max_depth}]

gridCV = GridSearchCV(RandomForestClassifier(), param_grid=parameters, cv=10)
gridCV.fit(xTrain, yTrain)

# Print the optimal parameters
best_n_estim   = gridCV.best_params_['n_estimators']
best_max_depth = gridCV.best_params_['max_depth']

print ("Best parameters: n_estimators=", best_n_estim,", max_depth=", best_max_depth)


('Best parameters: n_estimators=', 21, ', max_depth=', 86)

Finally, testing our independent XTest dataset using the optimised model:


In [220]:
# Build the classifier using the optimal parameters detected by grid search
clfRDF = RandomForestClassifier(n_estimators=best_n_estim, max_depth=best_max_depth)
clfRDF.fit(XTrain, yTrain)
predRF = clfRDF.predict(XTest)

print(metrics.classification_report(yTest, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2))


             precision    recall  f1-score   support

          0       0.38      0.22      0.28       135
          1       0.64      0.79      0.71       237

avg / total       0.55      0.59      0.55       372

('Overall Accuracy:', 0.59)

We can also graphically represent the results of the grid search using a heatmap:


In [224]:
# Create a heatmap like the one you made when you applied GridSearchCV to KNN
scores = np.zeros((len(n_estimators), len(max_depth)))
for score in gridCV.grid_scores_:
    ne = score[0]['n_estimators']
    md = score[0]['max_depth']
    i = np.argmax(n_estimators == ne)
    j = np.argmax(max_depth == md)
    scores[i,j] = score[1]
print scores
data = [
    Heatmap(
        x = n_neighbors,
        y = weights,
        z = scores.T,
        colorscale='Jet',
        reversescale=True,
        colorbar = dict(
            title = "Create a Plotly heatmap based on the scores of gridCV for the combination of n_estimators and max_depth",
            len = 4,
            nticks=10
        )
    )
]

layout = Layout(
    xaxis = dict(title = "Number of estimators", tickvals = n_neighbors),
    yaxis = dict(title = "Max depth"),
    height= 750
)

fig = dict(data = data, layout = layout)

iplot(fig)


[[ 0.78834081  0.83856502  0.832287    0.82959641  0.85919283  0.84394619
   0.83766816  0.85650224  0.84215247  0.84484305  0.83408072  0.832287
   0.82959641  0.82959641  0.83139013  0.83049327  0.84304933  0.81883408
   0.84304933  0.82152466]
 [ 0.83497758  0.87085202  0.87533632  0.88251121  0.88340807  0.8690583
   0.86995516  0.8735426   0.86726457  0.87264574  0.88340807  0.87982063
   0.87713004  0.87623318  0.8690583   0.87802691  0.87623318  0.88609865
   0.87264574  0.88699552]
 [ 0.83766816  0.8735426   0.88699552  0.88789238  0.88340807  0.89327354
   0.88520179  0.88878924  0.88789238  0.8941704   0.88789238  0.88071749
   0.88699552  0.88878924  0.8896861   0.88878924  0.8896861   0.89327354
   0.88878924  0.8941704 ]
 [ 0.84125561  0.88430493  0.89596413  0.89865471  0.8941704   0.88878924
   0.88609865  0.89596413  0.89775785  0.89327354  0.88699552  0.89596413
   0.88340807  0.88699552  0.89058296  0.89506726  0.88520179  0.88430493
   0.89327354  0.86816143]
 [ 0.84843049  0.87802691  0.89506726  0.89237668  0.8896861   0.89506726
   0.88699552  0.89686099  0.88430493  0.89327354  0.89147982  0.88878924
   0.88520179  0.89865471  0.89237668  0.8896861   0.89327354  0.90403587
   0.88878924  0.89147982]
 [ 0.84663677  0.88161435  0.8896861   0.8941704   0.88340807  0.8896861
   0.88789238  0.89596413  0.88789238  0.89237668  0.88878924  0.90224215
   0.89686099  0.88699552  0.8941704   0.8941704   0.88878924  0.89058296
   0.88609865  0.88430493]]

Bonus Activity 7: Parallelisation

The scikit-learn implementation of Random Forests also features the parallel construction of the trees and the parallel computation of the predictions through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of the machine. If n_jobs=-1 then all cores available on the machine are used.


In [226]:
# Change the value of n_jobs and estimate the excution time of fit
import timeit

n_jobs_to_try = np.arange(1,9)
elapsed_times = []

for jobs in n_jobs_to_try:
    start_time = timeit.default_timer() 
    rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=jobs)
    rf.fit(XTrain, yTrain)
    elapsed = timeit.default_timer() - start_time
    elapsed_times.append(elapsed)

In [227]:
# Plot a graph
data = [
    Bar(
        x=n_jobs_to_try,
        y=elapsed_times
    )
]

fig = dict(data=data, layout=Layout(yaxis=dict(title='seconds')))

iplot(fig)



In [ ]: