Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
Now that we have explored the data structures provided by the Pandas library, we will investigate how to learn from them using Scikit-learn.
Scikit-learn is one of the most celebrated and widely used machine learning libraries. It features a complete set of efficiently implemented machine learning algorithms for classification, regression, and clustering. Scikit-learn is designed to operate on NumPy, SciPy, and Pandas data structures.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Machine learning is the task of predicting properties from data. A dataset consists of several examples, or samples, and the associated target properties may be fully available, partially available, or not available at all; we respectively call these settings supervised, semi-supervised, and unsupervised. Each example is made of one or several features, or attributes, which can be of different types (real numbers, discrete values, strings, Booleans, etc.).
Learning problems can thus be broadly divided into a few categories: supervised problems with discrete targets (classification) or continuous ones (regression), and unsupervised problems such as clustering and dimensionality reduction.
A flowchart for choosing the right estimator can be found on the Scikit-learn website.
The package features:
Learning modules: Classification, Regression, Clustering, Dimensionality reduction
Annex modules: Model selection, Preprocessing
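All these modules share the same estimator interface: a model is first created with its hyperparameters, then trained with fit, and used through predict (or transform for preprocessing). As a minimal sketch (the two estimators below are just illustrative choices):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X = np.random.rand(20, 2)           # 20 samples with 2 features each
t = X @ np.array([1.0, 2.0])        # a continuous target for the regressor

reg = LinearRegression().fit(X, t)  # supervised: fit needs the features X and the target t
clu = KMeans(n_clusters=3).fit(X)   # unsupervised: fit needs the features X only
print(reg.predict(X[:2]), clu.labels_[:2])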
We will illustrate this process on a simple linear model $$ y = a x + b + \nu$$ where $a$ is the slope, $b$ the intercept, and $\nu$ some Gaussian noise.
In [2]:
a = np.random.randn()*5            # Randomly draw the slope
b = np.random.rand()*10            # Randomly draw the intercept
m = 50                             # Number of points
x = np.random.rand(m,1)*10         # Randomly draw the abscissas
y = a*x + b + np.random.randn(m,1) # y = ax + b + Gaussian noise
plt.scatter(x, y)
Out[2]:
1. Selecting and adjusting a model
As we want to fit a linear model $y=ax+b$ through the data, we import the linear regression estimator from Scikit-learn with from sklearn.linear_model import LinearRegression.
As our model has a non-null value at the origin, it needs an intercept. This can be tuned, along with several other parameters; see Scikit-learn's linear_model documentation.
In [3]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
print(model)
This completes our model specification. Notice that we have only described the model; no learning or fitting has been done yet.
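At this stage the model object only stores its hyperparameters; they can be inspected with get_params, a method shared by all Scikit-learn estimators:

print(model.get_params())   # e.g. {'fit_intercept': True, ...}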
2. Fitting the model to the data
Fitting our model to the data $(x,y)$ is done using the fit method.
In [4]:
model.fit(x,y)
Out[4]:
Once the model is fitted, one can observe the learned coefficients:
In [8]:
print("Learned coefficients: a = {:.6f} \t b = {:.6f}".format(float(model.coef_),float(model.intercept_)))
print("True coefficients: a = {:.6f} \t b = {:.6f}".format(a,b))
3. Predicting from this fitted model
Given a feature matrix, the predict method returns the outputs predicted by the fitted model.
In [9]:
xFit = np.linspace(-2,12,21).reshape(-1, 1)   # 21 equally spaced abscissas, as a (21, 1) column since predict expects 2-D input
In [10]:
yFit = model.predict(xFit)
In [11]:
plt.scatter(x, y , label="data")
plt.plot(xFit, yFit , label="model")
plt.legend()
Out[11]:
Scikit-learn can take several formats as input (i.e. passed to fit and predict), notably NumPy arrays, SciPy sparse matrices, and Pandas DataFrames. In all cases:
The examples/samples of the dataset are stored as rows.
The features are the columns.
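For instance, the same model could be fitted directly from a Pandas DataFrame (a minimal sketch built from the arrays above; the column names are just illustrative):

df = pd.DataFrame({"x": x.ravel(), "y": y.ravel()})
model_df = LinearRegression().fit(df[["x"]], df["y"])  # df[["x"]] keeps the 2-D samples-by-features shape
print(model_df.coef_, model_df.intercept_)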
In order to validate our model, it is customary to split the dataset into training and testing subsets. This can be done manually, as sketched below, but there is also a dedicated method.
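Manually, one could for instance shuffle the indices and cut (a sketch; the 80/20 proportion is arbitrary):

perm = np.random.permutation(m)   # random permutation of the sample indices
split = int(0.8*m)                # 80% of the samples for training...
xTr, yTr = x[perm[:split]], y[perm[:split]]
xTe, yTe = x[perm[split:]], y[perm[split:]]   # ...and the remaining 20% for testing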
In [12]:
try:
    from sklearn.model_selection import train_test_split # sklearn > ...
except ImportError:
    from sklearn.cross_validation import train_test_split # sklearn < ...
xTrain, xTest, yTrain, yTest = train_test_split(x,y)
In [13]:
print(xTrain.shape,yTrain.shape)
print(xTest.shape,yTest.shape)
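By default, train_test_split holds out 25% of the samples for testing; this proportion can be changed with its test_size argument.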
Let us use this train/test split to compare the linear model with an intercept to the one without.
In [14]:
from sklearn.linear_model import LinearRegression
model1 = LinearRegression(fit_intercept=True)
model2 = LinearRegression(fit_intercept=False)
model1.fit(xTrain,yTrain)             # fit both models on the training set only
yPre1 = model1.predict(xTest)
error1 = np.linalg.norm(yTest-yPre1)  # Euclidean norm of the test residuals
model2.fit(xTrain,yTrain)
yPre2 = model2.predict(xTest)
error2 = np.linalg.norm(yTest-yPre2)
print("Testing error with intercept:", error1, "\t without intercept:", error2)
In [15]:
plt.scatter(xTrain, yTrain , label="Train data")
plt.scatter(xTest, yTest , color= 'k' , label="Test data")
plt.plot(xTest, yPre1 , color='r', label="model w/ intercept (err = {:.1f})".format(error1))
plt.plot(xTest, yPre2 , color='m', label="model w/o intercept (err = {:.1f})".format(error2))
plt.legend()
Out[15]:
In order to quantitatively evaluate models, Scikit-learn provides a wide range of metrics; we will see some of them in the following examples.
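For instance, the comparison above can be rephrased with the mean squared error from the sklearn.metrics module (a minimal sketch):

from sklearn.metrics import mean_squared_error
print(mean_squared_error(yTest, yPre1))   # model with intercept
print(mean_squared_error(yTest, yPre2))   # model without intercept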
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()