Machine Learning 101: General Concepts


Machine learning involves building programs with tunable parameters that are adjusted automatically to improve their behaviour by adapting to previously seen data.

The ML algorithms in the scikit-learn library take as input a two-dimensional numpy array $x$ of shape (n_samples, n_features): one row per sample, one column per feature.
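
For illustration, here is a hypothetical toy array in that layout (3 samples, 2 features; the values are made up and not part of the iris example):


In [ ]:
import numpy as np

# Each row is one sample, each column one feature.
x_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
x_toy.shape   # (3, 2), i.e. 3 samples with 2 features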

Iris example

We consider the iris data set. Each sample in this data set has 4 features:

    1. sepal length
    2. sepal width
    3. petal length
    4. petal width

From these four measurements we want to classify each flower as one of three species:

  • Iris Setosa

  • Iris Versicolour

  • Iris Virginica

First let's load the data:


In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
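
For the iris data this gives 150 samples with 4 features each; a quick check:


In [ ]:
print(n_samples, n_features)   # 150 4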

The features of each flower are stored row-wise in the data attribute of the dataset. Here is the first sample:


In [3]:
iris.data[0]


Out[3]:
array([ 5.1,  3.5,  1.4,  0.2])

The information about the class of each sample is stored in the target attribute, whilst the class names are stored in the target_names attribute:


In [6]:
iris.target


Out[6]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
iris.target_names


Out[7]:
array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')

Supervised and Unsupervised Learning

  • Supervised learning: the data comes with both features and labels, and the goal is to learn a mapping from features to labels.

  • Unsupervised learning: the data has no labels; the goal is to find structure in the features alone (see the clustering sketch below).
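
As a minimal sketch of the unsupervised side, we can cluster the iris samples while ignoring the labels entirely; k-means groups the samples by feature similarity alone. The choice n_clusters=3 is our assumption here, made only because we happen to know there are three species:


In [ ]:
from sklearn.cluster import KMeans

# Group the samples into 3 clusters using only the features;
# the true labels in iris.target are never consulted.
kmeans = KMeans(n_clusters=3)
cluster_labels = kmeans.fit_predict(iris.data)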

Classification

This is a supervised learning problem. We want to predict the value of a categorical variable, given some input variables.

Let's say we want to predict the species of an individual flower, given its petal and sepal measurements:


In [8]:
x, y = iris.data, iris.target

We can then use a support vector machine (SVM) with a linear kernel:


In [9]:
from sklearn.svm import LinearSVC
clf = LinearSVC()

clf is a statistical model with hyperparameters that control the learning algorithm:


In [10]:
clf


Out[10]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
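
For example, the C parameter sets the strength of regularization (in scikit-learn, smaller C means stronger regularization). A sketch of overriding the default, using an arbitrary illustrative value:


In [ ]:
# C=10.0 is purely illustrative, not a tuned or recommended value.
clf_less_reg = LinearSVC(C=10.0)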

We can train the SVM with the data:


In [11]:
clf = clf.fit(x, y)
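
As a quick sanity check, we can compute the mean accuracy on the training data itself; note that this overestimates how well the model generalizes, since it has already seen these samples:


In [ ]:
clf.score(x, y)   # mean accuracy on the training set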

Now the model is trained. We can use it to predict the outcome of some unseen data.


In [13]:
x_new = [[5.0, 3.4, 2.2, 0.6]]
clf.predict(x_new)


Out[13]:
array([0])

The output is 0, which is the index of the first iris class, 'setosa'.
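
Since predict returns class indices, we can map them straight back to names by indexing into target_names:


In [ ]:
iris.target_names[clf.predict(x_new)]   # -> array(['setosa'], ...)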

We can also use a classifier that predicts the probability of each outcome, such as logistic regression:


In [15]:
from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression().fit(x, y)
clf2.predict_proba(x_new)


Out[15]:
array([[  8.59949100e-01,   1.39710432e-01,   3.40468528e-04]])

We read this as: an 86% probability of belonging to the first class, a 14% probability for the second class, and so on.
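
The three probabilities sum to one, and taking the most probable class reproduces the hard prediction:


In [ ]:
import numpy as np

probs = clf2.predict_proba(x_new)
print(probs.sum())           # 1.0, up to floating point error
print(np.argmax(probs[0]))   # 0, i.e. 'setosa' again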

Regression

Regression is the task of predicting the value of a continuous variable given some input variables. We will explore a detailed example of regression in a later notebook.
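
As a small preview, here is a sketch on synthetic data (not the iris set): fitting a line to noisy points generated from y = 2x + 1.


In [ ]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic 1D regression problem: y = 2x + 1 plus Gaussian noise.
rng = np.random.RandomState(0)
x_reg = rng.uniform(0, 10, size=(50, 1))   # shape (n_samples, n_features)
y_reg = 2 * x_reg.ravel() + 1 + rng.normal(size=50)

reg = LinearRegression().fit(x_reg, y_reg)
reg.coef_, reg.intercept_                  # close to (2, 1)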

Dimensionality Reduction and Visualization

Dimensionality reduction involves deriving a set of new artificial features that is smaller than the original feature set while retaining most of the variance of the original data. The most common technique for dimensionality reduction is Principal Component Analysis (PCA).

PCA can be computed using a truncated Singular Value Decomposition of the feature matrix $x$.


In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)   # project onto the two leading components
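
A sketch of visualizing the projection, assuming matplotlib is available; each point is one flower, coloured by its true class:


In [ ]:
import matplotlib.pyplot as plt

# Scatter plot of the two principal components, coloured by species index.
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y)
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()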