Machine learning is a set of algorithms that enable computers to make and improve predictions or behaviors based on data, without that ability being explicitly programmed. It involves models with tunable parameters that adapt their values to the available data. These models can then generalize what they have learned and make predictions about new, unseen data.
Consider fitting a line through data. Any middle schooler could eyeball this data and draw a reasonable line through it; however, this task is not simple for a machine. And when we move to more complicated datasets and multiple dimensions, your middle schooler will give up.
In [1]:
from IPython.display import Image, display
display(Image(filename='Reg1.png'))
display(Image(filename='Reg2.png'))
In [2]:
from IPython.display import Image, display
display(Image(filename='Cluster0.png'))
display(Image(filename='Cluster1.png'))
Scikit-Learn (http://scikit-learn.org) is a Python package that builds on NumPy and SciPy to apply popular machine learning algorithms to small and medium-sized datasets.
Referring back to the machine learning models, every model in scikit-learn is a Python class with a uniform interface. Each model you create is an instance of such a class, and the general method of application is very similar from one model to the next.
a. Import class from module. (Here "abc" is an arbitrary algorithm.)
b. Instantiate estimator object
c. Fit model to training data
d. Use fitted model to predict
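The four steps above can be sketched as follows. This is a minimal illustration, with `LinearRegression` standing in for the arbitrary "abc" algorithm and a small made-up dataset (`X_train`, `y_train`, `X_new` are names chosen here, not part of the original notebook).

```python
# a. Import the estimator class from its module
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data, made up for illustration: y = 2x + 1 with no noise
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# b. Instantiate the estimator object
model = LinearRegression()

# c. Fit the model to the training data
model.fit(X_train, y_train)

# d. Use the fitted model to predict on new, unseen inputs
X_new = np.array([[4.0], [5.0]])
y_new = model.predict(X_new)
print(y_new)  # approximately 9 and 11, since the data lie exactly on y = 2x + 1
```

Swapping in another estimator usually means changing only the import and the constructor call; the `fit` and `predict` calls keep the same shape.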
Now, we'll move from this (seemingly) abstract overview to actual application.
To motivate this discussion, let's start with a concrete problem: that of the infinite scroll.
The goal of clustering is to find an arrangement in the data such that items in the same group (or cluster) are more similar to each other than to those from different clusters.
The prototype-based K-Means algorithm is quite popular. In prototype-based clustering, each group is represented by a prototype. In K-Means, the prototype is the mean (or centroid) of the points in the cluster.
In [3]:
%matplotlib inline
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
X, y = make_blobs(n_samples=200, n_features=2, centers=6, cluster_std=0.8, shuffle=True, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
Out[3]:
Steps in the K-means algorithm:
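The usual textbook formulation of K-means (often called Lloyd's algorithm) alternates between assigning points to the nearest centroid and recomputing the centroids. The sketch below is a plain NumPy illustration under that assumption, not scikit-learn's implementation; the function name `kmeans_sketch` and the demo data are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    """Illustrative K-means loop; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have essentially stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs, centered near (0, 0) and (10, 10)
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(demo, k=2)
print(centroids)
```

The returned labels come from the last assignment step, so they match the final centroids up to the tolerance `tol`.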
In [5]:
# import the KMeans class from the cluster module
from sklearn.cluster import KMeans
In [6]:
#instantiate the model
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, tol=1e-04, random_state=0)
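The arguments to the algorithm:
n_clusters: the number of clusters (and centroids) to form.
init: how the initial centroids are chosen; 'random' picks them at random from the data.
n_init: the number of times the algorithm is run with different initial centroids; the run with the lowest within-cluster sum of squares is kept.
max_iter: the maximum number of iterations in a single run.
tol: the convergence tolerance; a run stops early once the change in the centroids between iterations falls below this relative tolerance.
random_state: the seed for the random number generator, which makes the results reproducible.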
In [7]:
#fitting the model to the data
y_km = km.fit_predict(X)
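To see what `fit_predict` leaves behind, the fitted estimator can be inspected directly. The sketch below repeats the data generation and model settings from the cells above so that it runs on its own; `labels_`, `cluster_centers_` and `inertia_` are attributes scikit-learn sets on a fitted `KMeans` object.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data and model settings as in the cells above
X, y = make_blobs(n_samples=200, n_features=2, centers=6,
                  cluster_std=0.8, shuffle=True, random_state=0)
km = KMeans(n_clusters=3, init='random', n_init=10,
            max_iter=300, tol=1e-04, random_state=0)
y_km = km.fit_predict(X)

print(y_km.shape)                 # one cluster label per sample
print(np.unique(y_km))            # the distinct labels that were assigned
print(km.cluster_centers_.shape)  # one centroid per cluster, one column per feature
print(km.inertia_)                # within-cluster sum of squared distances
```

Note that the data were generated around 6 centers while the model is asked for 3 clusters, so each fitted cluster is expected to absorb more than one blob.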
In [8]:
plt.scatter(X[y_km==0,0], X[y_km ==0,1], s=50, c='lightgreen', marker='o', label='Group A')
plt.scatter(X[y_km ==1,0], X[y_km ==1,1], s=50, c='orange', marker='o', label='Group B')
# a black edge keeps the white markers visible on a white background
plt.scatter(X[y_km ==2,0], X[y_km ==2,1], s=50, c='white', edgecolors='black', marker='o', label='Group C')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1], s=50, marker='o', c='black', label='Centers')
plt.legend()
Out[8]:
In [9]:
display(Image(filename='1.png'))
In [10]:
from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
X, y = iris.data, iris.target
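Before plotting, it can help to look at what `load_iris` actually returns. A short sketch, using only attributes that the returned object provides (`data`, `target`, `feature_names`, `target_names`):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)            # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)         # names of the 4 measurement columns
print(iris.target_names)          # names of the 3 species
print(np.bincount(iris.target))   # number of samples per species
```

The column order in `feature_names` (sepal length, sepal width, petal length, petal width) is what the column indices 0 to 3 refer to in the scatter plots.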
In [10]:
f, axarr = plt.subplots(2, 2)
# plt.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap takes the same arguments
cmap = plt.get_cmap('RdYlBu', 3)
axarr[0, 0].scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target, cmap=cmap)
axarr[0, 0].set_title('Sepal length versus width')
axarr[0, 1].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap=cmap)
axarr[0, 1].set_title('Sepal width versus Petal Length')
axarr[1, 0].scatter(iris.data[:, 2], iris.data[:, 3], c=iris.target, cmap=cmap)
axarr[1, 0].set_title('Petal length versus width')
axarr[1, 1].scatter(iris.data[:, 0], iris.data[:, 2], c=iris.target, cmap=cmap)
axarr[1, 1].set_title('Sepal length versus Petal length')
# hide tick labels on the top row (x) and the right column (y) to reduce clutter
plt.setp([a.get_xticklabels() for a in axarr[0, :]], visible=False);
plt.setp([a.get_yticklabels() for a in axarr[:, 1]], visible=False);
In [1]:
#Instantiate and fit the model here
In [11]:
x = np.arange(100)
eps = 50 * np.random.randn(100)
y = 2 * x + eps
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
Out[11]:
In [12]:
from sklearn.linear_model import LinearRegression
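# the normalize argument was removed in scikit-learn 1.2; plain least squares needs no scaling here
model = LinearRegression()
# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
X = x[:, np.newaxis]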
In [13]:
model.fit(X,y)
Out[13]:
In [14]:
X_fit=x[:,np.newaxis]
y_pred=model.predict(X_fit)
In [15]:
plt.scatter(x,y)
plt.plot(X_fit,y_pred,linewidth=2)
plt.xlabel("X")
plt.ylabel("Y")
Out[15]:
In [16]:
print(model.coef_)
print(model.intercept_)
# So a unit change in X is associated with a ___ change in Y.
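One way to check the interpretation of the slope is to predict at two inputs that are one unit apart and compare the difference with `coef_`. The sketch below regenerates similar data with a fixed seed (an assumption made here so the run is reproducible; the original cell is unseeded) and cross-checks the fit against `np.polyfit`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same recipe as above (y = 2x + noise), but seeded for reproducibility
rng = np.random.default_rng(0)
x = np.arange(100)
y = 2 * x + 50 * rng.standard_normal(100)

model = LinearRegression().fit(x[:, np.newaxis], y)
slope, intercept = model.coef_[0], model.intercept_

# Predictions at inputs one unit apart differ by exactly the slope
p = model.predict(np.array([[10.0], [11.0]]))
print(slope, intercept, p[1] - p[0])

# np.polyfit with deg=1 solves the same least-squares problem
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print(slope_np, intercept_np)
```

With noise this large the estimated slope will be near, but not exactly, the true value of 2.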
In [11]:
import pandas as pd
data=pd.read_csv('addata.csv', index_col=0)
data.head(5)
Out[11]:
In [18]:
#from sklearn.linear_model import LinearRegression
from sklearn import linear_model
In [79]:
clf=linear_model.LinearRegression()
In [80]:
feature_cols=["TV","Radio","Newspaper"]
X=data[feature_cols]
y=data["Sales"]
In [81]:
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
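To see what the split produces, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical stand-ins for the advertising data), importing from `sklearn.model_selection`, where `train_test_split` lives in current scikit-learn releases. With no `test_size` given, a quarter of the rows is held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each; row i is [2*i, 2*i + 1] and has target i
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2): the default split is 75% / 25%
print(y_te)                    # targets of the held-out rows

# Rows and targets are shuffled together, so they stay aligned
print(np.array_equal(X_tr[:, 0] // 2, y_tr))
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the cell yields the same train and test rows.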
In [20]:
#Fit the model and print the coefficients here
In [21]:
#Make predictions for the test dataset here
In [84]:
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE
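RMSE is the square root of the mean of the squared prediction errors. The sketch below computes it by hand on a few made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical) and compares the result with `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions for illustration
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4; the mean is 1.5
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)  # both equal sqrt(1.5), about 1.2247
```

Because RMSE is in the same units as the target, it can be read as a typical size of the prediction error.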