Machine learning (ML) involves building programs with tunable parameters that are adjusted automatically to improve their behaviour by adapting to previously seen data.
The ML algorithms in the scikit-learn library take as input a NumPy array $x$ of shape (n_samples, n_features).
First let's load the data:
In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
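The cell above computes the shape but does not display it; evaluating the tuple directly shows the 150 samples and 4 features of the iris dataset:
In [ ]:
# The iris data matrix has 150 rows (samples) and 4 columns (features)
(n_samples, n_features)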
The features of each flower are stored row-wise in the data attribute of the dataset:
In [3]:
iris.data[0]
Out[3]:
array([5.1, 3.5, 1.4, 0.2])
The information about the class of each sample is stored in the target attribute, whilst the class names are stored in the target_names attribute:
In [6]:
iris.target
Out[6]:
In [7]:
iris.target_names
Out[7]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
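Since target holds integer indices into target_names, NumPy fancy indexing maps each sample to a human-readable species name. A minimal sketch using the objects loaded above:
In [ ]:
# Map each integer class label to its species name via fancy indexing
iris.target_names[iris.target][:5]  # the first fifty samples are all 'setosa'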
Supervised learning: data has both features and labels.
Unsupervised learning: data has no labels.
This is a supervised learning problem: we want to predict the value of a categorical variable given some input variables. Let's say we want to guess the class of an individual flower, given the measurements of its petals and sepals:
In [8]:
x, y = iris.data, iris.target
We can then use an SVM with a linear kernel:
In [9]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf is a statistical model whose hyperparameters control the learning algorithm:
In [10]:
clf
Out[10]:
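These hyperparameters can be set in the constructor and inspected afterwards; for instance, C controls the strength of regularisation in LinearSVC (a small sketch, with C=10.0 chosen purely for illustration):
In [ ]:
# Hyperparameters are fixed at construction time and can be listed afterwards
clf_strong = LinearSVC(C=10.0)  # larger C means weaker regularisation
clf_strong.get_params()         # dictionary of all hyperparameter values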
We can train the SVM with the data:
In [11]:
clf = clf.fit(x, y)
Now the model is trained. We can use it to predict the class of previously unseen data:
In [13]:
x_new = [[5.0, 3.4, 2.2, 0.6]]
clf.predict(x_new)
Out[13]:
array([0])
The output is 0, which is the index of the first iris class, 'setosa'.
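As with the targets earlier, the predicted index can be mapped back to a species name with target_names; a one-line sketch using the objects defined above:
In [ ]:
# Translate the predicted class index into the species name
iris.target_names[clf.predict(x_new)]  # array(['setosa'], ...)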
We can also predict class probabilities using a probabilistic classifier such as logistic regression:
In [15]:
from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression().fit(x, y)
clf2.predict_proba(x_new)
Out[15]:
We read this as: an 86% probability of being in the first class, a 13% probability of being in the second class, and so on.
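Taking the argmax over each row of probabilities recovers the hard prediction returned by predict, which is a handy sanity check (a minimal sketch, assuming numpy is imported as np):
In [ ]:
import numpy as np
# The most probable class should agree with the hard prediction
np.argmax(clf2.predict_proba(x_new), axis=1) == clf2.predict(x_new)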
Dimensionality reduction involves deriving a set of new artificial features that is smaller than the original feature set while retaining most of the variance of the original data. The most common technique for dimensionality reduction is Principal Component Analysis (PCA).
PCA can be computed using a truncated Singular Value Decomposition (SVD) of the feature matrix $x$.
In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
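The cell above only constructs the estimator; to actually reduce the data we fit the PCA to x and transform it. explained_variance_ratio_ (a standard attribute of a fitted PCA) then reports the fraction of variance retained by each component:
In [ ]:
# Project the iris features onto the first two principal components
x_pca = pca.fit_transform(x)
print(x_pca.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per component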