Notebook version: 1.1 (Sep 26, 2015)
Author: Jesús Cid Sueiro (jcid@tsc.uc3m.es)
Changes: v.1.0 - First version
Part of this notebook is an adaptation of material created by Jason Brownlee (see his machinelearningmastery site)
In [1]:
# INITIALIZATION
# To visualize plots in the notebook
%matplotlib inline
# Import some libraries that will be necessary for working with data and displaying plots
import numpy as np # For numerical computations
import urllib # To load data from a url
# Sci-kit learn packages
from sklearn import datasets
from sklearn import preprocessing
from sklearn import cross_validation
# For plots and graphical results
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pylab
# To read matlab files
import scipy.io
# That's the default image size for this interactive session
pylab.rcParams['figure.figsize'] = 9, 6
The first thing we need to start a machine learning project is data. Data may be stored in different formats. We will see some utilities in Python to load datasets.
It is very common for you to have a dataset as a CSV file on your local workstation or on a remote server.
This recipe shows you how to load a CSV file from a URL, in this case the Pima Indians diabetes classification dataset from the UCI Machine Learning Repository.
From the prepared X and y variables, you can train a machine learning model.
In [2]:
# Load the Pima Indians diabetes dataset from CSV URL
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV data as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X_pima = dataset[:,0:8]   # the first 8 columns are the input attributes; column 8 is the class label
y_pima = dataset[:,8]
Here you can find two other example datasets stored in Matlab files provided with this notebook:
The first one is an adaptation of the `STOCK` dataset, taken originally from the StatLib Repository. The goal of this problem is to predict the stock value of a given airplane company, given the values of 9 other companies on the same day.
The second one is the `CONCRETE` dataset, taken from the Machine Learning Repository at the University of California, Irvine. The goal of the CONCRETE task is to predict the compressive strength of cement mixtures based on eight observed variables related to the composition of the mixture and the age of the material. To use it, just uncomment the CONCRETE block of code and comment out the STOCK block in the cell below. Remember that you must run the cell again to see the changes.
In this case, the data are already split into train and test sets.
In [3]:
# Let us start by loading the data into the workspace, and visualizing the dimensions of all matrices
# STOCK
# Use this code block to use the airplanes dataset.
data = scipy.io.loadmat('stock.mat')
X_tr = data['xTrain']
X_tst = data['xTest']
y_tr = data['sTrain']
y_tst = data['sTest']
# CONCRETE DATASET.
# Use this code block to use the concrete dataset.
#data = scipy.io.loadmat('concrete.mat')
#X_tr = data['X_tr']
#X_tst = data['X_tst']
#y_tr = data['S_tr']
#y_tst = data['S_tst']
print X_tr.shape
print y_tr.shape
print X_tst.shape
print y_tst.shape
X = X_tr
y = y_tr
The scikit-learn library is packaged with datasets. These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work.
This recipe demonstrates how to load some of the regression datasets packaged with scikit-learn.
In [4]:
# Load a dataset from scikit-learn. Uncomment the dataset you need.
#dataset = datasets.load_boston() # Load and return the boston house-prices dataset (regression).
dataset = datasets.load_diabetes() # Load and return the diabetes dataset (regression).
#dataset = datasets.load_linnerud() # Load and return the linnerud dataset (multivariate regression).
The dataset is loaded into a Python dictionary-like object with several keys.
In [5]:
print dataset.keys()
The number of keys depends on the dataset, but at least you should see:
- data: a matrix of observations
- target: the values of the target variable for the given problem
We will load them into variables X and y, which is the usual notation for input and target variables in machine learning.
In [6]:
X_pkg = dataset['data']
y_pkg = dataset['target']
scikit-learn also provides some tools to generate artificial datasets. These are very useful to test machine learning algorithms under controlled size and complexity.
For regression problems you can use make_regression, which produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the variance).
Other regression generators generate functions deterministically from randomized features. make_sparse_uncorrelated produces a target as a linear combination of four features with fixed coefficients.
Finally, other generators (such as make_friedman1, make_friedman2 and make_friedman3) encode explicitly non-linear relations.
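As an illustration (a minimal sketch added here, with arbitrary sample sizes), the following cell generates data with two of these generators, make_sparse_uncorrelated and make_friedman1:
# Sketch: two other artificial dataset generators from sklearn.datasets
# make_sparse_uncorrelated: the target is a fixed linear combination of the first four features
X_su, y_su = datasets.make_sparse_uncorrelated(n_samples=200, n_features=10)
print X_su.shape
print y_su.shape
# make_friedman1: the target is an explicitly non-linear function of the first five features, plus noise
X_fr, y_fr = datasets.make_friedman1(n_samples=200, n_features=10, noise=0.5)
print X_fr.shape
print y_fr.shape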
The syntax of make_regression is:
`sklearn.datasets.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)`
In [7]:
# SOLUTION (Exercise 1): generate an artificial regression dataset with make_regression
X_art, y_art = datasets.make_regression(
n_samples=200, n_features=2, n_informative=2, n_targets=1)
print X_art.shape
print y_art.shape
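As a side note (a minimal sketch with arbitrary parameter values, not required for the rest of the notebook), the noise argument adds Gaussian noise to the targets, and coef=True additionally returns the coefficients of the underlying linear model:
# Sketch: noisy targets, and recovery of the coefficients of the underlying linear model
X_n, y_n, w = datasets.make_regression(
    n_samples=200, n_features=2, noise=10.0, coef=True, random_state=0)
print X_n.shape
print y_n.shape
print w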
When the instances of the dataset are multidimensional, they cannot be visualized directly, but we can get a first rough idea about the regression task if we plot each of the one-dimensional variables against the target data. These representations are known as scatter plots.
plot and scatter can be used for 2D plots.
In [8]:
# Select a dataset
#X_all = X_pima
#y_all = y_pima
X_all = X_pkg
y_all = y_pkg
#X_all = X_art
#y_all = y_art
#X_all = X_tr
#y_all = y_tr
nrows = 4
ncols = 1 + (X_all.shape[1]-1)/nrows
# Some adjustment for the subplot.
pylab.subplots_adjust(hspace=0.2)
# Plot all variables
for idx in range(X_all.shape[1]):
ax = plt.subplot(nrows,ncols,idx+1)
ax.scatter(X_all[:,idx], y_all) # <-- This is the key command
ax.get_xaxis().set_ticks([])
ax.get_yaxis().set_ticks([])
plt.ylabel('Target')
With the addition of a third coordinate, plot and scatter can be used for 3D plotting.
Exercise: Take the dataset generated in exercise 1 and visualize it in 3D. Then, select the diabetes dataset and visualize the target versus components 2 and 4.
In [9]:
# SOLUTION
x0 = X_all[:,2]
x1 = X_all[:,4]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x0, x1, y_all, zdir=u'z', s=20, c=u'b', depthshade=True)
plt.xlabel('$x_2$')
plt.ylabel('$x_4$')
Your data must be prepared before you can build models. The data preparation process can involve several steps; here we will explore some of them, in particular feature scaling and data splitting.
Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.
Many machine learning methods expect, or are more effective if, the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.
In [10]:
# normalize the data: by default, preprocessing.normalize scales each instance (row) to unit norm
norm_X = preprocessing.normalize(X_all)
# standardize the data: preprocessing.scale gives each attribute (column) zero mean and unit variance
stand_X = preprocessing.scale(X_all)
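Another scaling option, shown here only as a minimal sketch (MinMaxScaler is not used elsewhere in this notebook), rescales each attribute to the [0, 1] range:
# Sketch: min-max scaling of each attribute (column) to the [0, 1] range
minmax_X = preprocessing.MinMaxScaler().fit_transform(X_all)
print minmax_X.min(axis=0)
print minmax_X.max(axis=0)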
Exercise: Compute the mean and standard deviation of each feature in X_all, and the norm of some of its instances. Then, check the norms of the instances in norm_X, and the column-wise mean and standard deviation in stand_X.
In [11]:
# SOLUTION:
print "Original dataset:"
# Norm of the first instances, and column-wise mean and standard deviation
print np.linalg.norm(X_all[0:6], axis=1)
print X_all.mean(axis=0)
print X_all.std(axis=0)
print "Normalized instances:"
# After normalize(), every instance (row) has unit norm
print np.linalg.norm(norm_X[0:6], axis=1)
print "Standardized instances:"
# After scale(), every attribute (column) has approximately zero mean and unit standard deviation
print stand_X.mean(axis=0)
print stand_X.std(axis=0)
A key concept in machine learning is generalization: the ability of a regression algorithm to maintain good performance on data not used during training. To evaluate the performance of a training algorithm, we need to hold out some data from the learning process.
The cross_validation package is useful for this purpose.
The simplest form of data splitting can be done with the train_test_split method, which splits the data into two subsets: one for training and one for testing.
It is of crucial importance that the data split provides two subsets with the same statistical properties. The best way to remove any bias from the data split is to shuffle the dataset before splitting.
In [12]:
## Remove this code to use the `stock` or the `concrete` datasets.
X_tr, X_tst, y_tr, y_tst = cross_validation.train_test_split(
X_all, y_all, test_size=0.4, random_state=0)
print X_tr.shape
print X_tst.shape
print y_tr.shape
print y_tst.shape
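train_test_split produces a single random split. The cross_validation package also provides iterators for repeated splits, such as KFold; the following is a minimal sketch (assuming the pre-0.18 cross_validation API imported at the top of the notebook):
# Sketch: 5-fold cross-validation indices with cross_validation.KFold
kf = cross_validation.KFold(X_all.shape[0], n_folds=5, shuffle=True, random_state=0)
for train_index, test_index in kf:
    X_f_tr, X_f_tst = X_all[train_index], X_all[test_index]
    y_f_tr, y_f_tst = y_all[train_index], y_all[test_index]
    print X_f_tr.shape, X_f_tst.shape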
In [13]:
# Components (features) to visualize
i = 2
j = 4
x0_tr = X_tr[:,i]
x1_tr = X_tr[:,j]
x0_tst = X_tst[:,i]
x1_tst = X_tst[:,j]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x0_tr, x1_tr, y_tr, zdir=u'z', s=20, c=u'b', depthshade=True)      # training data (blue)
ax.scatter(x0_tst, x1_tst, y_tst, zdir=u'z', s=20, c=u'r', depthshade=True)   # test data (red)