This notebook was originally put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. [Peter Prettenhofer](https://github.com/pprett) adapted it for PyCon Ukraine 2014. Source and license info is on [GitHub](https://github.com/pprett/sklearn_pycon2014/).

Part 1: Some Background

In this section we'll go through some preliminary topics and helpful background for the content in this tutorial.

By the end of this section you should:

  • Know what sort of tasks qualify as Machine Learning problems.
  • See some simple examples of machine learning.
  • Know the basics of creating and manipulating numpy arrays.
  • Know the basics of scatter plots in matplotlib.

What is Machine Learning?

In this section we will begin to explore the basic principles of machine learning. Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Machine Learning can be considered a subfield of Artificial Intelligence, since these algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing, rather than just storing and retrieving data items like a database system would do.

We'll take a look at two very simple machine learning tasks here. The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm may be used to draw a dividing boundary between the two clusters of points:


In [1]:
# Start matplotlib inline mode, so figures will appear in the notebook
%matplotlib inline

In [2]:
# Import the example plot from the fig_code directory
from fig_code import plot_sgd_separator
plot_sgd_separator()


This may seem like a trivial task, but it is a simple version of a very important concept. By drawing this separating line, we have learned a model which can generalize to new data: if you were to drop another point onto the plane which is unlabeled, this algorithm could now predict whether it's a blue or a red point.
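As a rough sketch of what happens behind this figure (not the tutorial's actual figure code, and on made-up data), scikit-learn's SGDClassifier can be fit on labeled two-dimensional points and then asked to predict the label of a new point:

import numpy as np
from sklearn.linear_model import SGDClassifier

# two synthetic 2D clusters standing in for the blue and red points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),
               rng.normal(+1, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SGDClassifier(loss="hinge")
clf.fit(X, y)

# predict the class of a new, unlabeled point dropped onto the plane
print(clf.predict([[0.8, 1.2]]))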

If you'd like to see the source code used to generate this, you can either open the code in the fig_code directory, or you can load the code using the %load magic command:


In [3]:
# Uncomment the %load command to load the contents of the file
# %load fig_code/sgd_separator.py

The next simple task we'll look at is a regression task: a simple best-fit line to a set of data:


In [4]:
from fig_code import plot_linear_regression
plot_linear_regression()


Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: here, we might be given an x-value, and the model would allow us to predict the y-value. This might seem like a trivial problem, but it is a basic example of a type of operation that is fundamental to machine learning tasks.
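A minimal sketch of the same idea using scikit-learn's LinearRegression (again on made-up data, not the figure's):

import numpy as np
from sklearn.linear_model import LinearRegression

# noisy samples scattered around a straight line
rng = np.random.RandomState(0)
x = 10 * rng.rand(50)
y = 2 * x + 1 + rng.randn(50)

model = LinearRegression()
model.fit(x[:, np.newaxis], y)   # sklearn expects a 2D feature array

# predict the y-value for a new x-value
print(model.predict([[5.0]]))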

Numpy

Manipulating numpy arrays is an important part of doing machine learning (or, really, any type of scientific computation) in Python. This will likely be review for most: we'll quickly go through some of the most important features.


In [5]:
import numpy as np

# Generating a random array
X = np.random.random((3, 5))  # a 3 x 5 array

print(X)


[[ 0.52655585  0.89821643  0.30328048  0.89753047  0.84247393]
 [ 0.22313948  0.61437939  0.17672726  0.49997471  0.86305494]
 [ 0.3780648   0.38028548  0.71141325  0.31985413  0.88466264]]

In [6]:
# Accessing elements

# get a single element
print(X[0, 0])

# get a row
print(X[1])

# get a column
print(X[:, 1])


0.526555847789
[ 0.22313948  0.61437939  0.17672726  0.49997471  0.86305494]
[ 0.89821643  0.61437939  0.38028548]

In [7]:
# Transposing an array
print(X.T)


[[ 0.52655585  0.22313948  0.3780648 ]
 [ 0.89821643  0.61437939  0.38028548]
 [ 0.30328048  0.17672726  0.71141325]
 [ 0.89753047  0.49997471  0.31985413]
 [ 0.84247393  0.86305494  0.88466264]]

In [8]:
# Turning a row vector into a column vector
y = np.linspace(0, 12, 5)
print(y)

# make into a column vector
print(y[:, np.newaxis])


[  0.   3.   6.   9.  12.]
[[  0.]
 [  3.]
 [  6.]
 [  9.]
 [ 12.]]
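An equivalent way to get a column vector is reshape, where -1 tells numpy to infer that dimension:

# (5, 1), same as y[:, np.newaxis]
print(y.reshape(-1, 1).shape)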

Advanced: Numpy arrays under the hood

A numpy array is basically a contiguous memory segment plus some metadata that describes how each cell should be interpreted (the dtype, which could be np.float32, np.int8, or even a struct) and how to get from one cell to the next along a given dimension (the strides). (See the SciPy community's NumPy Reference on array objects.)


In [9]:
# get the dtype
print(X.dtype)

# get information on the strides
print(X.strides)


float64
(40, 8)
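As a quick illustration (reusing the 3 x 5 array X from above): the byte offset of element X[i, j] is i * strides[0] + j * strides[1], and transposing simply swaps the strides instead of copying any data:

print(X.T.strides)    # (8, 40): the strides of X, swapped
print(X.T.base is X)  # True: the transpose is just a view on X's memory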

The strides indicate whether the data is laid out in column-major order (Fortran-style) or row-major order (C-style). This matters in sklearn because certain algorithms require data to adhere to a specific layout; if the input does not, it is converted, which entails an additional memory copy.


In [10]:
# strides if X were Fortran-ordered
print(np.asfortranarray(X).strides)

# row vs col layout is stored in flags
print(X.flags)


(8, 24)
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
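One practical consequence: if an estimator needs Fortran-ordered input, converting once up front avoids repeated copies. A quick check (X is still the C-ordered array from above):

X_f = np.asfortranarray(X)            # one explicit copy, made here
print(np.asfortranarray(X_f) is X_f)  # True: already Fortran-ordered, no new copy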

Scipy Sparse Matrices

We won't make very much use of these in this tutorial, but sparse matrices are very nice in some situations. For example, in some machine learning tasks, especially those associated with textual analysis, the data may be mostly zeros. Storing all these zeros is very inefficient. We can create and manipulate sparse matrices as follows:


In [11]:
# Create a random array with a lot of zeros
X = np.random.random((10, 5))
print(X)


[[ 0.72087503  0.29931276  0.05533782  0.71804016  0.26632313]
 [ 0.15514544  0.45129768  0.21649749  0.60010997  0.83139827]
 [ 0.22309905  0.60213978  0.39682797  0.46406945  0.1562427 ]
 [ 0.00652021  0.1844778   0.17403563  0.38911899  0.37640774]
 [ 0.87273247  0.42189666  0.80295558  0.81097606  0.24399733]
 [ 0.37931303  0.08709536  0.82464719  0.33713986  0.72897514]
 [ 0.63521622  0.89695921  0.0139867   0.57545056  0.75869627]
 [ 0.35195562  0.78881092  0.9979826   0.25417757  0.41708792]
 [ 0.28172527  0.57306924  0.52066063  0.02701989  0.6482881 ]
 [ 0.42422506  0.96665594  0.4014633   0.8177171   0.8006717 ]]

In [12]:
X[X < 0.7] = 0
print(X)


[[ 0.72087503  0.          0.          0.71804016  0.        ]
 [ 0.          0.          0.          0.          0.83139827]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.87273247  0.          0.80295558  0.81097606  0.        ]
 [ 0.          0.          0.82464719  0.          0.72897514]
 [ 0.          0.89695921  0.          0.          0.75869627]
 [ 0.          0.78881092  0.9979826   0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.96665594  0.          0.8177171   0.8006717 ]]

In [13]:
from scipy import sparse

# turn X into a csr (Compressed-Sparse-Row) matrix
X_csr = sparse.csr_matrix(X)
print(X_csr)


  (0, 0)	0.720875026916
  (0, 3)	0.718040159836
  (1, 4)	0.831398274925
  (4, 0)	0.872732466535
  (4, 2)	0.802955583412
  (4, 3)	0.810976056704
  (5, 2)	0.824647186002
  (5, 4)	0.728975135546
  (6, 1)	0.896959209297
  (6, 4)	0.758696268848
  (7, 1)	0.788810922462
  (7, 2)	0.997982603294
  (9, 1)	0.966655941324
  (9, 3)	0.817717096293
  (9, 4)	0.800671700412

In [14]:
# convert the sparse matrix to a dense array
print(X_csr.toarray())


[[ 0.72087503  0.          0.          0.71804016  0.        ]
 [ 0.          0.          0.          0.          0.83139827]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.87273247  0.          0.80295558  0.81097606  0.        ]
 [ 0.          0.          0.82464719  0.          0.72897514]
 [ 0.          0.89695921  0.          0.          0.75869627]
 [ 0.          0.78881092  0.9979826   0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.96665594  0.          0.8177171   0.8006717 ]]

In [15]:
# Sparse matrices support linear algebra:
y = np.random.random(X_csr.shape[1])
z1 = X_csr.dot(y)
z2 = X.dot(y)
np.allclose(z1, z2)


Out[15]:
True
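The memory savings can be inspected directly: the CSR matrix stores only the nonzero values plus two index arrays (the exact numbers depend on how many entries survived the thresholding above):

print(X.nbytes)  # dense storage: 10 * 5 * 8 bytes
print(X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes)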

The CSR representation can be very efficient for computations, but it is not as good for adding elements. For that, the LIL (List of Lists) representation is better:


In [16]:
# Create an empty LIL matrix and add some items
X_lil = sparse.lil_matrix((5, 5))

for i, j in np.random.randint(0, 5, (15, 2)):
    X_lil[i, j] = i + j

print(X_lil)
print(X_lil.toarray())


  (0, 4)	4.0
  (1, 0)	1.0
  (1, 1)	2.0
  (1, 4)	5.0
  (2, 1)	3.0
  (2, 4)	6.0
  (3, 0)	3.0
  (3, 2)	5.0
  (3, 4)	7.0
  (4, 2)	6.0
  (4, 4)	8.0
[[ 0.  0.  0.  0.  4.]
 [ 1.  2.  0.  0.  5.]
 [ 0.  3.  0.  0.  6.]
 [ 3.  0.  5.  0.  7.]
 [ 0.  0.  6.  0.  8.]]

Often, once an LIL matrix is created, it is useful to convert it to CSR format (many scikit-learn algorithms require CSR or CSC format):


In [17]:
X_csr = X_lil.tocsr()
print(X_csr)


  (0, 4)	4.0
  (1, 0)	1.0
  (1, 1)	2.0
  (1, 4)	5.0
  (2, 1)	3.0
  (2, 4)	6.0
  (3, 0)	3.0
  (3, 2)	5.0
  (3, 4)	7.0
  (4, 2)	6.0
  (4, 4)	8.0

There are several other sparse formats that can be useful for various problems:

  • CSC (compressed sparse column)
  • BSR (block sparse row)
  • COO (coordinate)
  • DIA (diagonal)
  • DOK (dictionary of keys)

The scipy.sparse submodule also has a lot of functions for sparse matrices, including linear algebra, sparse solvers, graph algorithms, and much more.
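For example, the COO format is handy for assembling a matrix from (row, column, value) triplets and then converting it to CSR for computation. A minimal sketch with made-up entries:

import numpy as np
from scipy import sparse

rows = np.array([0, 1, 3])
cols = np.array([2, 0, 3])
vals = np.array([4.0, 5.0, 6.0])

X_coo = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 4))
print(X_coo.tocsr().toarray())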

Pandas

Numpy arrays and scipy sparse matrices are meant to store homogeneous data: every value has the same data type. In data mining applications, however, data is often heterogeneous: columns might have different data types (e.g. numeric, categorical, datetime, text). Pandas is a library that addresses this and other shortcomings of numpy. It introduces a data structure called the DataFrame that allows columns of mixed type and labeled axes, and has native support for missing values and time series.

When scikit-learn was created, Pandas was not yet around, so its support for DataFrames is limited. That being said, Pandas is an important tool for data munging and feature preprocessing that should be in the toolbox of any ML Pythoneer.
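A minimal sketch of such a mixed-type DataFrame (on hypothetical data):

import pandas as pd

df = pd.DataFrame({
    'city': ['Kyiv', 'Lviv', 'Odesa'],       # text
    'population': [2.8e6, 7.3e5, None],      # numeric, with a missing value
    'visited': pd.to_datetime(['2014-04-16', '2014-04-17', '2014-04-18']),  # datetime
})

print(df.dtypes)
print(df['population'].fillna(0))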

Matplotlib

Another important part of machine learning is visualization of data. The most common tool for this in Python is matplotlib. It is an extremely flexible package, but we will go over some basics here.

First, something specific to the IPython notebook: we can turn on matplotlib's inline mode, which will make plots show up inline in the notebook.


In [18]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [19]:
# plotting a line

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x));



In [20]:
# scatter-plot points

x = np.random.normal(size=500)
y = np.random.normal(size=500)
plt.scatter(x, y);



In [21]:
# showing images
x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]

im = y * np.sin(x) * np.cos(y)
print(im.shape)


(100, 100)

In [22]:
# imshow - note that origin is at the top-left!
plt.imshow(im);



In [23]:
# Contour plot - note that origin here is at the bottom-left!
plt.contour(im);
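To make the two agree, imshow accepts an origin keyword:

# place the origin at the bottom-left, matching the contour plot
plt.imshow(im, origin='lower');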