One definition: "Machine learning is the semi-automated extraction of knowledge from data"
Supervised learning: Making predictions using data
Unsupervised learning: Extracting structure from data
In [2]:
import pandas as pd
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# display the first 5 rows
data.head()
Out[2]:
In [2]:
data.shape
Out[2]:
What are the features?
What is the response?
What else do we know?
In [3]:
# conventional way to import seaborn
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
sns.pairplot(data, x_vars=['TV','radio','newspaper'], y_vars='sales', height=7, aspect=0.7, kind='reg')
Out[3]:
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n$
In this case:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
The $\beta$ values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
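To make the "least squares" criterion concrete, here is a minimal sketch (an addition, not part of the original notebook) that estimates the same coefficients directly with NumPy's least-squares solver, assuming the Advertising DataFrame `data` loaded above:

import numpy as np
# design matrix with an explicit intercept column, followed by the three features
X_mat = np.column_stack([np.ones(len(data)), data[['TV', 'radio', 'newspaper']].values])
y_vec = data['sales'].values
# least squares: choose beta to minimize the sum of squared errors ||X_mat @ beta - y_vec||^2
beta, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta)  # [intercept, coefficient for TV, radio, newspaper]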
In [45]:
feature_cols = ['TV', 'radio', 'newspaper']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# equivalent command to do this in one line
X = data[['TV', 'radio', 'newspaper']]
# print the first 5 rows
X.head()
Out[45]:
In [46]:
print(type(X))
print(X.shape)
In [47]:
# select a Series from the DataFrame
y = data['sales']
# equivalent command that works if there are no spaces in the column name
y = data.sales
# print the first 5 values
y.head()
Out[47]:
In [48]:
# check the type and shape of y
print(type(y))
print(y.shape)
In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [50]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
In [51]:
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
Out[51]:
In [52]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
In [53]:
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))
Out[53]:
In [54]:
y_pred = linreg.predict(X_test)
print(y_pred)
We need an evaluation metric in order to compare our predictions with the actual values!
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
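As a sanity check, the formula above can also be computed by hand with NumPy; this is a small sketch (an addition to the notebook) assuming the `y_test` and `y_pred` arrays from the cells above, and it should agree with the scikit-learn result in the next cell.

import numpy as np
# RMSE computed directly from the formula above
errors = np.asarray(y_test) - np.asarray(y_pred)
print(np.sqrt(np.mean(errors ** 2)))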
In [55]:
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [56]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris
iris = load_iris()
type(iris)
Out[56]:
In [57]:
# print the feature names and the number of observations
print(iris.feature_names)
print(len(iris.data))
In [58]:
# print integers representing the species of each observation
print(iris.target)
print(len(iris.target))
In [59]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
In [60]:
# store feature matrix in "X"
X = iris.data
# store response vector in "y"
y = iris.target
In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
In [63]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(y_pred)
In [64]:
print(metrics.accuracy_score(y_test, y_pred))
Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification. An SVM draws a boundary between clusters of data, choosing the boundary that maximizes the margin between the two sets of points. As the plot below shows, many different lines could separate the data:
In [72]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from scipy import stats
seaborn.set()
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
# draw three lines that could separate the data, each with a different margin
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
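To see the maximum margin rather than just candidate lines, here is a hedged sketch (not from the original notebook) that fits a linear SVC to the same blob data and draws its decision boundary, margins, and support vectors:

from sklearn.svm import SVC
# fit a linear SVM to the two-class blob data generated above
svc = SVC(kernel='linear')
svc.fit(X, y)
# evaluate the decision function on a grid so we can draw the boundary and margins
xx, yy = np.meshgrid(np.linspace(-1, 3.5, 200), np.linspace(-1, 6, 200))
Z = svc.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
# levels -1, 0, 1 correspond to the two margins and the maximum-margin boundary
plt.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
# circle the support vectors: the points that define the margin
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='k');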
In [73]:
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
In [74]:
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
Out[74]:
In [75]:
y_pred=clf.predict(X_test)
print(y_pred)
In [76]:
print(metrics.accuracy_score(y_test, y_pred))
Previously we saw a powerful discriminative classifier, Support Vector Machines. Here we'll motivate another powerful, non-parametric algorithm: Random Forests, an ensemble of decision trees. We'll first fit a single decision tree to the handwritten digits data and then compare it with a forest.
In [11]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
Out[11]:
In [12]:
X = digits.data
y = digits.target
print(X.shape)
print(y.shape)
In [13]:
# set up the figure
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
In [14]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=11)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
In [15]:
metrics.accuracy_score(ytest, ypred)
Out[15]:
In [16]:
plt.imshow(metrics.confusion_matrix(ytest, ypred),
           interpolation='nearest', cmap=plt.cm.binary)
plt.grid(False)
plt.colorbar()
plt.xlabel("predicted label")
plt.ylabel("true label");
In [18]:
from sklearn.ensemble import RandomForestClassifier
In [19]:
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
In [20]:
metrics.accuracy_score(ytest, ypred)
Out[20]:
In [21]:
plt.imshow(metrics.confusion_matrix(ytest, ypred),
           interpolation='nearest', cmap=plt.cm.binary)
plt.grid(False)
plt.colorbar()
plt.xlabel("predicted label")
plt.ylabel("true label");
K Means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
K Means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.
Let's look at how KMeans operates on a simple set of blobs like the ones we generated for the SVM example. To emphasize that this is unsupervised, we won't use the true labels to color the points:
In [77]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=50);
By eye, it is relatively easy to pick out the four clusters. If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points. Fortunately, there is a well-known Expectation Maximization (EM) procedure which scikit-learn implements, so that KMeans can be solved relatively quickly.
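Before calling scikit-learn, it can help to see the two EM steps written out by hand. This is a minimal NumPy sketch (an addition to the notebook) of the assign/update loop on the blob data generated above; it uses a fixed number of iterations and skips edge cases such as empty clusters:

import numpy as np
rng = np.random.RandomState(0)
centers = X[rng.choice(len(X), 4, replace=False)]  # initialize from 4 random data points
for _ in range(10):  # a handful of iterations is enough for this data
    # E-step: assign each point to its nearest center
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # M-step: move each center to the mean of the points assigned to it
    centers = np.array([X[labels == k].mean(axis=0) for k in range(4)])
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow');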
In [78]:
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');
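The fitted estimator also exposes the learned centers; overlaying them shows that each center sits at the mean of its cluster, as described above (a short follow-up, not part of the original notebook):

centers = est.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.6);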
Let's use scikit-learn for K-means clustering on the Iris dataset.
In [85]:
from sklearn import datasets, cluster
import numpy as np
import matplotlib.pyplot as plt
In [80]:
np.random.seed(2)
In [81]:
# load data
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
In [82]:
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
labels = k_means.labels_
In [83]:
# check how many samples match the true species labels
# (note: k-means cluster labels are arbitrary, so this only works if the
# cluster numbering happens to line up with the species encoding)
correct_labels = sum(y_iris == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y_iris.size))
In [86]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting style defaults
import seaborn; seaborn.set()
In [87]:
np.random.seed(1)
X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o')
plt.axis('equal');
We can see that there is a definite trend in the data. What PCA seeks to do is to find the Principal Axes in the data, and explain how important those axes are in describing the data distribution:
In [88]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_)
print(pca.components_)
In [89]:
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)
plt.axis('equal');
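Those explained-variance numbers tell us how much of the data each axis describes; a common follow-up is to keep only the dominant axis and reconstruct the data from it. A short sketch (an addition to the notebook) using the same `X`:

# keep only the most important principal axis, then project back to 2D
pca_1 = PCA(n_components=1)
X_reduced = pca_1.fit_transform(X)
X_approx = pca_1.inverse_transform(X_reduced)
print("reduced shape:", X_reduced.shape)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)
plt.plot(X_approx[:, 0], X_approx[:, 1], 'o', alpha=0.8)
plt.axis('equal');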