Getting ready for Machine Learning with Python and Scikit-Learn

Notebook version: 1.1 (Sep 26, 2015)

Author: Jesús Cid Sueiro (jcid@tsc.uc3m.es)

Changes: v.1.0 - First version

Part of this notebook is an adaptation of material created by Jason Brownlee (see his machinelearningmastery site)


In [1]:
# INITIALIZATION
# To visualize plots in the notebook
%matplotlib inline 

# Import some libraries that will be necessary for working with data and displaying plots
import numpy as np   # For numerical computations
import urllib        # To load data from a url

# Scikit-learn packages
from sklearn import datasets
from sklearn import preprocessing
from sklearn import cross_validation

# For plots and graphical results
import matplotlib                 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   
import pylab

# To read matlab files
import scipy.io

# That's the default image size for this interactive session
pylab.rcParams['figure.figsize'] = 9, 6

1. Loading data

The first thing we need to start a machine learning project is data. Data may be stored in different formats. We will see some utilities in Python to load datasets.

1.1 Load data from CSV

It is very common for you to have a dataset as a CSV file on your local workstation or on a remote server.

This recipe shows you how to load a CSV file from a URL, in this case the Pima Indians diabetes classification dataset from the UCI Machine Learning Repository.

From the prepared X and y variables, you can train a machine learning model.


In [2]:
# Load the Pima Indians diabetes dataset from CSV URL
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"

# download the file
raw_data = urllib.urlopen(url)

# load the CSV data as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")

print(dataset.shape)
# separate the input features (first 8 columns) from the target attribute (last column)
X_pima = dataset[:,0:8]
y_pima = dataset[:,8]


(768, 9)

1.2 Load data from .mat files (standard Matlab output)

Here you can find other two example datasets stored in matlab files provided with this notebook:

The first one is an adaptation of the `STOCK` dataset, taken originally from the StatLib Repository. The goal of this problem is to predict the stock value of a given airplane company, given the values of nine other companies on the same day.

The second one is the `CONCRETE` dataset, taken from the Machine Learning Repository at the University of California, Irvine. The goal of this task is to predict the compressive strength of cement mixtures based on eight observed variables related to the composition of the mixture and the age of the material. To use it, just uncomment the CONCRETE block and comment out the STOCK block in the cell below; remember that you must run the cell again to see the changes.

In this case, data are already split in train and test datasets.


In [3]:
# Let us start by loading the data into the workspace, and visualizing the dimensions of all matrices

# STOCK
# Use this code block to use the airplanes dataset.
data = scipy.io.loadmat('stock.mat')
X_tr = data['xTrain']
X_tst = data['xTest']
y_tr = data['sTrain']
y_tst = data['sTest']

# CONCRETE DATASET. 
# Use this code block to use the concrete dataset.
#data = scipy.io.loadmat('concrete.mat')
#X_tr = data['X_tr']
#X_tst = data['X_tst']
#y_tr = data['S_tr']
#y_tst = data['S_tst']

print X_tr.shape
print y_tr.shape
print X_tst.shape
print y_tst.shape

X = X_tr
y = y_tr


(380, 9)
(380, 1)
(190, 9)
(190, 1)

1.3. Packaged datasets

The scikit-learn library is packaged with datasets. These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work.

This recipe demonstrates how to load the famous Iris flowers dataset.


In [4]:
# Load a dataset from scikit-learn. Uncomment the dataset you need.
#dataset = datasets.load_boston()     # Load and return the boston house-prices dataset (regression).
dataset = datasets.load_diabetes()   # Load and return the diabetes dataset (regression).
#dataset = datasets.load_linnerud()   # Load and return the linnerud dataset (multivariate regression).

The dataset is loaded into a Python dictionary with several keys


In [5]:
print dataset.keys()


['data', 'target']

The number of keys depends on the dataset, but at least you should see:

  • data: A matrix of observations
  • target: The values of the target variable for the given problem

We will load them into variables X and y, which is the usual notation for input and target variables in machine learning.

In [6]:
X_pkg = dataset['data']
y_pkg = dataset['target']

1.4. Sample generators

sklearn provides some tools to generate artificial datasets. These are very useful to test machine learning algorithms under controlled settings of size and complexity.

For regression problems you can use make_regression, which produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the variance).

Other regression generators generate functions deterministically from randomized features. make_sparse_uncorrelated produces a target as a linear combination of four features with fixed coefficients.

Finally, other generators encode explicitly non-linear relations:

  • make_friedman1 is related by polynomial and sine transforms;
  • make_friedman2 includes feature multiplication and reciprocation; and
  • make_friedman3 is similar with an arctan transformation on the target.

The syntax of make_regression is:

`sklearn.datasets.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)`
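
Just to see some of these generators in action, the cell below calls two of them. It is a minimal sketch, and the argument values are arbitrary choices:

In [ ]:
# Quick illustration of two sample generators (argument values are arbitrary)
X_f1, y_f1 = datasets.make_friedman1(n_samples=100, n_features=10, noise=0.1)
X_su, y_su = datasets.make_sparse_uncorrelated(n_samples=100, n_features=10)
print X_f1.shape, y_f1.shape
print X_su.shape, y_su.shape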

Exercise 1:

Generate an artificial dataset with 200 samples, 2 informative features and a single target. Name the input and target variables X_art and y_art. Print the sizes of the input matrix and the target array.


In [7]:
# Solution
X_art, y_art = datasets.make_regression(
    n_samples=200, n_features=2, n_informative=2, n_targets=1)
print X_art.shape
print y_art.shape


(200, 2)
(200,)

1.5. Scatter plots

When the instances of the dataset are multidimensional, they cannot be visualized directly, but we can get a first rough idea about the regression task if we plot each of the one-dimensional variables against the target data. These representations are known as scatter plots.

1.5.1. 2D plots

plot and scatter can be used for 2D plots.


In [8]:
# Select a dataset
#X_all = X_pima
#y_all = y_pima
X_all = X_pkg
y_all = y_pkg
#X_all = X_art
#y_all = y_art
#X_all = X_tr
#y_all = y_tr

nrows = 4
ncols = 1 + (X_all.shape[1]-1)/nrows

# Some adjustment for the subplot.
pylab.subplots_adjust(hspace=0.2)

# Plot all variables
for idx in range(X_all.shape[1]):
    ax = plt.subplot(nrows,ncols,idx+1)
    ax.scatter(X_all[:,idx], y_all)    # <-- This is the key command
    ax.get_xaxis().set_ticks([])
    ax.get_yaxis().set_ticks([])
    plt.ylabel('Target')


1.5.2. 3D plots

With the addition of a third coordinate, plot and scatter can be used for 3D plotting.
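
As a minimal sketch, using synthetic random points (so that the exercises below are left open), a 3D scatter plot can be produced as follows:

In [ ]:
# Minimal 3D scatter sketch with synthetic random points (illustration only)
x0 = np.random.randn(100)
x1 = np.random.randn(100)
z = x0 + x1 + 0.1*np.random.randn(100)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x0, x1, z, zdir='z', s=20, c='b', depthshade=True)
plt.xlabel('$x_0$')
plt.ylabel('$x_1$')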

Exercise 2:

Take the dataset generated in exercise 1 and

  • Visualize the target vs each feature.
  • Generate a 3D plot of the target variable vs both features.

Exercise 3:

Select the diabetes dataset. Visualize the target versus components 2 and 4.


In [9]:
# SOLUTION
x0 = X_all[:,2]
x1 = X_all[:,4]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x0, x1, y_all, zdir=u'z', s=20, c=u'b', depthshade=True)
plt.xlabel('$x_2$')
plt.ylabel('$x_4$')


Out[9]:
<matplotlib.text.Text at 0x10a4badd0>

2. Data preparation

Your data must be prepared before you can build models. The data preparation process can involve three steps:

  • data selection,
  • data preprocessing and
  • data transformation

We will explore some data preparation techniques here.

2.1. Data rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

  • Normalization: rescaling instances to have unit norm.
  • Standardization: shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

In [10]:
# Normalize the data attributes (rescale each instance to unit norm)
norm_X = preprocessing.normalize(X_all)

# Standardize the data attributes (zero mean and unit variance per attribute)
stand_X = preprocessing.scale(X_all)

Exercise: Compute the maximum, minimum, mean and standard deviation of each feature in X_all. Then, check the columnwise maximum and minimum values in norm_X and the mean and standard deviation in stand_X.


In [11]:
# SOLUTION:
print "Original dataset:"
print np.linalg.norm(X_all[0:6], axis=1)   # Norms of the first 6 instances
print X_all.mean(axis=0)                   # Column-wise means
print X_all.std(axis=0)                    # Column-wise standard deviations

print "Normalized instances:"
print np.linalg.norm(norm_X[0:6], axis=1)  # Should be all ones

print "Standardized instances:"
print stand_X.mean(axis=0)                 # Should be (numerically) zero
print stand_X.std(axis=0)                  # Should be all ones


Original dataset:
[ 0.11861433  0.16138499  0.12975063  0.12351008  0.08575931  0.20538758]
[ -3.63428493e-16   1.30834257e-16  -8.04534920e-16   1.28165452e-16
  -8.83531559e-17   1.32702421e-16  -4.57464634e-16   3.77730150e-16
  -3.83085422e-16  -3.41288202e-16]
[ 0.04756515  0.04756515  0.04756515  0.04756515  0.04756515  0.04756515
  0.04756515  0.04756515  0.04756515  0.04756515]
Normalized instances:
[ 1.  1.  1.  1.  1.  1.]
Standardized instances:
[ -9.54490383e-18  -8.38946810e-17   2.41134413e-17   2.05968977e-17
  -5.92788764e-17  -5.45064245e-17   5.32505161e-17   2.71778578e-16
   2.95138474e-18  -2.02515229e-17]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
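
As a complement to the solution above, a minimal sketch of the column-wise maximum and minimum checks mentioned in the exercise (output not shown):

In [ ]:
# Column-wise maximum and minimum of the original and the normalized data
print X_all.max(axis=0)
print X_all.min(axis=0)
print norm_X.max(axis=0)
print norm_X.min(axis=0)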

3. Training and test splits

A key concept in machine learning is generalization: the ability of a regression algorithm to keep a good performance on data not used during training. To evaluate the performance of a training algorithm, we need to set aside some data from the learning process.

The cross_validation package is useful for this purpose.

The simplest form of data splitting can be done with the train_test_split method. It provides two datasets:

  • Training dataset: To be used for training machine learning models
  • Test dataset: to be used for evaluation purposes.

It is of crucial importance that the data split provides two subsets with the same statistical properties. The best way to remove any bias from the data split is to shuffle the dataset before splitting.


In [12]:
## Skip this cell if you use the `stock` or the `concrete` datasets (they are already split into training and test sets).
X_tr, X_tst, y_tr, y_tst = cross_validation.train_test_split(
    X_all, y_all, test_size=0.4, random_state=0)

print X_tr.shape
print X_tst.shape
print y_tr.shape
print y_tst.shape


(265, 10)
(177, 10)
(265,)
(177,)
Exercise:

Verify, visually, that the data splits are not biased. To do so, plot 3D scatter plots of the target vs different pairs of input components.


In [13]:
i = 2
j = 4

x0_tr = X_tr[:,i]
x1_tr = X_tr[:,j]
x0_tst = X_tst[:,i]
x1_tst = X_tst[:,j]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x0_tr, x1_tr, y_tr, zdir=u'z', s=20, c=u'b', depthshade=True)
ax.scatter(x0_tst, x1_tst, y_tst, zdir=u'z', s=20, c=u'r', depthshade=True)


Out[13]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x10ab69f50>