Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
Outline
a) Classification
b) Regression
c) Exercises
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
%matplotlib inline
# we create 40 separable points in R^2 around 2 centers (random_state=6 is a seed so that the set is separable)
X, y = make_blobs(n_samples=40, n_features=2, centers=2 , random_state=6)
print(X[:5,:],y[:5]) # print the first 5 points and labels
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[1]:
Support Vector Machines (SVM) are based on learning a vector $w$ and an intercept $b$ such that the hyperplane $w^T x - b = 0$ separates the data, i.e. a point $a$ belongs to one class if $w^T a - b > 0$ and to the other otherwise.
They were later extended to kernel methods, where the separating surface becomes $\kappa(w, a) - b = 0$ and $\kappa$ is a kernel, typically linear ($\kappa(w,a) = w^T a$, as above), polynomial, or Gaussian (RBF).
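As a quick preview (this snippet is an addition, not a cell of the original notebook), SVC can be fitted with each of its built-in kernels on the blobs generated above; the training accuracy is printed only for illustration:
from sklearn.svm import SVC
for name in ["linear", "poly", "rbf", "sigmoid"]:    # the built-in kernels of SVC
    model = SVC(kernel=name)
    model.fit(X, y)                                  # X, y: the blobs generated above
    print(name, model.score(X, y))                   # mean training accuracy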
In [2]:
from sklearn.svm import SVC # Support vector classifier i.e. Classifier by SVM
modelSVMLinear = SVC(kernel="linear")
modelSVMLinear.fit(X,y)
Out[2]:
The following illustration can be found in the Python Data Science Handbook by Jake VanderPlas.
In [3]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)

    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
In [4]:
plt.scatter(X[:, 0], X[:, 1], c=y , cmap=plt.cm.Paired)
plot_svc_decision_function(modelSVMLinear)
We clearly see that the linear SVM seeks to maximize the margin between the separating hyperplane and the two well-separated classes of the data.
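Since the kernel is linear, the hyperplane parameters can be read directly from the fitted model (a small sketch added here for illustration; note that Scikit-Learn's convention is $w^T x + b = 0$ with $b$ stored in intercept_, a sign flip with respect to the text above):
w = modelSVMLinear.coef_[0]          # normal vector of the separating hyperplane
b = modelSVMLinear.intercept_[0]     # sklearn convention: w^T x + b = 0
print("w =", w, "  b =", b)
print("support vectors:\n", modelSVMLinear.support_vectors_)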
In [5]:
# we create points in R^2 around 2 centers (random_state=48443 is a seed so that the set is *not* separable)
X, y = make_blobs(n_samples=100, n_features=2, centers=2 , random_state=48443)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[5]:
Let us use the same linear SVM classifier. Obviously, some points are now misclassified; the model is thus learned not by maximizing the margin (which no longer exists) but by minimizing a penalty over misclassified data. This penalty takes the form of an allowance margin controlled by a parameter $C$: the smaller $C$, the more inclusive the margin. Finding a good value for $C$ is up to the data scientist.
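One common way to choose $C$ is cross-validation; below is a minimal sketch (added here, with an arbitrary grid of candidate values, and assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}               # arbitrary candidate values
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)                                          # X, y: the non-separable blobs above
print("best C:", search.best_params_["C"])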
In [6]:
try:
    from sklearn.model_selection import train_test_split # sklearn > ...
except ImportError:
    from sklearn.cross_validation import train_test_split # sklearn < ...
XTrain, XTest, yTrain, yTest = train_test_split(X,y,test_size = 0.5) # split data in two
model1 = SVC(kernel="linear",C=0.01)
model1.fit(XTrain,yTrain)
model2 = SVC(kernel="linear",C=100)
model2.fit(XTrain,yTrain)
Out[6]:
In [7]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain , cmap=plt.cm.Paired)
plot_svc_decision_function(model1)
plt.title("C = 0.01")
Out[7]:
In [8]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain , cmap=plt.cm.Paired)
plot_svc_decision_function(model2)
plt.title("C = 100")
Out[8]:
To choose a value of $C$, or more generally to assess the performance of a classifier, one can use Scikit Learn's classification metrics, for instance the confusion matrix.
In [9]:
from sklearn.metrics import confusion_matrix
yFit1 = model1.predict(XTest)
yFit2 = model2.predict(XTest)
mat1 = confusion_matrix(yTest, yFit1)
mat2 = confusion_matrix(yTest, yFit2)
print('Model with C = 0.01')
print(mat1)
print("Model with C = 100")
print(mat2)
It can also be plotted in a fancier way with seaborn.
In [10]:
import seaborn as sns
sns.heatmap(mat1, square=True, annot=True ,cbar=False)
plt.ylabel('true label')
plt.xlabel('predicted label')
Out[10]:
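The confusion matrix is not the only available metric; for instance (an added sketch, not a cell of the original notebook), the accuracy and the per-class precision/recall can be computed from the same predictions:
from sklearn.metrics import accuracy_score, classification_report
print("accuracy with C = 0.01: {:.2f}".format(accuracy_score(yTest, yFit1)))
print("accuracy with C = 100 : {:.2f}".format(accuracy_score(yTest, yFit2)))
print(classification_report(yTest, yFit2))     # precision, recall, f1-score per class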
In [11]:
from sklearn.datasets import make_moons
X,y = make_moons(noise=0.1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[11]:
In [12]:
modelLinear = SVC(kernel="linear")
modelLinear.fit(X,y)
modelRbf = SVC(kernel="rbf")
modelRbf.fit(X,y)
Out[12]:
In [13]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plot_svc_decision_function(modelLinear)
plot_svc_decision_function(modelRbf)
plt.title("The two models superposed")
Out[13]:
Let us compare the linear and RBF training errors using the zero-one loss (the proportion of misclassified examples).
In [14]:
from sklearn.metrics import zero_one_loss
yFitLinear = modelLinear.predict(X)
yFitRbf = modelRbf.predict(X)
print("0/1 loss -- Linear: {:.3f} Rbf: {:.3f}".format(zero_one_loss(y, yFitLinear),zero_one_loss(y, yFitRbf)))
When there are multiple classes (as in the iris dataset of the Pandas notebook), different strategies can be adopted, typically one-vs-rest (one classifier per class, against all the others) or one-vs-one (one classifier per pair of classes).
The corresponding algorithms can be found in the multiclass module documentation.
We are going to illustrate this on the iris 3-class classification problem, using only the 2 petal features (width and length) so that the feature vector is 2D and easy to visualize.
In [15]:
import pandas as pd
import numpy as np
iris = pd.read_csv('data/iris.csv')
classes = pd.DataFrame(iris["species"])
features = iris.drop(["species","sepal_length","sepal_width"],axis=1)
In [16]:
classes.sample(6)
Out[16]:
In [17]:
features.sample(6)
Out[17]:
In [18]:
XTrain, XTest, yTrain, yTest = train_test_split(features,classes,test_size = 0.5)
In [19]:
from sklearn.multiclass import OneVsRestClassifier
yPred = OneVsRestClassifier(SVC()).fit(XTrain, yTrain).predict(XTest)
In [20]:
print(yPred) # Note that the classes are strings, not numbers, but everything went as expected
In [21]:
class_labels = ['setosa', 'versicolor', 'virginica']  # confusion_matrix sorts the labels alphabetically
sns.heatmap(confusion_matrix(yTest, yPred), square=True, annot=True, cbar=False, xticklabels=class_labels, yticklabels=class_labels)
plt.ylabel('true label')
plt.xlabel('predicted label')
Out[21]:
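For comparison (an added sketch), the one-vs-one strategy mentioned above can be used in exactly the same way on the same split:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
yPredOvO = OneVsOneClassifier(SVC()).fit(XTrain, yTrain).predict(XTest)
print("one-vs-one accuracy: {:.2f}".format(accuracy_score(yTest, yPredOvO)))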
The main classifiers from Scikit Learn are: Linear SVM, RBF SVM (as already seen), Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, and QDA.
They are used as follows:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]
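To compare them quickly, one can loop over the list and print each test accuracy; a minimal sketch (our addition, assuming the imports and the classifiers list above have been executed, and reusing moons data and a simple split as earlier in the notebook):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
Xm, ym = make_moons(noise=0.1, random_state=0)
XmTrain, XmTest, ymTrain, ymTest = train_test_split(Xm, ym, test_size=0.5)
for clf in classifiers:
    clf.fit(XmTrain, ymTrain)
    print("{:32s} test accuracy: {:.2f}".format(type(clf).__name__, clf.score(XmTest, ymTest)))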
Let us now consider the problem of predicting real values from a set of features.
We will use the student performance dataset: the goal is to predict the final grade G3 from the other features, whose meaning is given in the dataset documentation.
In [22]:
import pandas as pd
import numpy as np
student = pd.read_csv('data/student-mat.csv')
student.head()
Out[22]:
In [23]:
target = pd.DataFrame(student["G3"])
features = student.drop(["G3"],axis=1)
One immediate problem here is that the features are not numeric (not floats). Thankfully, Scikit Learn provides encoders to convert categorical (aka nominal, discrete) features to numerical ones.
In [24]:
from sklearn.preprocessing import LabelEncoder
lenc = LabelEncoder()
num_features = features.apply(lenc.fit_transform)
In [25]:
num_features.head()
Out[25]:
Even the numerical features were encoded; since we are going to normalize them anyway, this is not really an issue.
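As an alternative to integer label encoding (a sketch added for reference, not used in the rest of the notebook), the categorical columns could be one-hot encoded, for instance with pandas:
oneHotFeatures = pd.get_dummies(features)   # expands each categorical column into 0/1 indicator columns
print(oneHotFeatures.shape)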
The normalization is done per feature by removing the mean and scaling to unit variance; in addition, we add an intercept (a dummy feature equal to one).
In [26]:
from sklearn.preprocessing import StandardScaler, add_dummy_feature
scaler = StandardScaler()
normFeatures = add_dummy_feature(scaler.fit_transform(num_features))
In [27]:
preproData = pd.DataFrame(normFeatures , columns=[ "intercept" ] + list(num_features.columns) )
In [28]:
preproData.describe().T
Out[28]:
The lasso problem consists in finding a regressor $w$ that minimizes $$ \frac{1}{2 n_{\text{samples}}} \|X w - y\|_2^2 + \alpha \|w\|_1 $$
and it is popular for prediction as it simultaneously selects features thanks to the $\ell_1$ term: the greater $\alpha$, the fewer selected features.
In [29]:
try:
    from sklearn.model_selection import train_test_split # sklearn > ...
except ImportError:
    from sklearn.cross_validation import train_test_split # sklearn < ...
from sklearn.linear_model import Lasso
XTrain, XTest, yTrain, yTest = train_test_split(preproData,target,test_size = 0.25)
model = Lasso(alpha=0.1)
model.fit(XTrain,yTrain)
Out[29]:
We can observe the regressor $w$ provided by the model; notice its sparsity.
In [30]:
model.coef_
Out[30]:
We can observe which coefficients are set to $0$ and which ones are positively or negatively correlated with the final grade.
In [31]:
print("Value Feature")
for idx,val in enumerate(model.coef_):
print("{:6.3f} {}".format(val,preproData.columns[idx]))
Let us take a look at our predictions.
In [32]:
targetPred = model.predict(XTest)
print("Predicted True")
for idx,val in enumerate(targetPred):
print("{:4.1f} {:.0f}".format(val,float(yTest.iloc[idx])))
In [33]:
n_test = 15
alpha_tab = np.logspace(-10,1,base=2,num = n_test)
print(alpha_tab)
In [34]:
trainError = np.zeros(n_test)
testError = np.zeros(n_test)
featureNum = np.zeros(n_test)
for idx,alpha in enumerate(alpha_tab):
    model = Lasso(alpha=alpha)
    model.fit(XTrain,yTrain)
    yPredTrain = model.predict(XTrain)
    yPredTest = model.predict(XTest)
    trainError[idx] = np.linalg.norm(yPredTrain-yTrain["G3"].values)/yTrain.count()
    testError[idx] = np.linalg.norm(yPredTest-yTest["G3"].values)/yTest.count()
    featureNum[idx] = sum(model.coef_!=0)
alpha_opt = alpha_tab[np.argmin(testError)]
In [35]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
plt.subplot(311)
plt.xscale("log")
plt.plot(alpha_tab, trainError,label="train error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.legend()
plt.xticks([])
plt.axvline(x=alpha_opt)
plt.ylabel("error")
plt.subplot(312)
plt.xscale("log")
plt.plot(alpha_tab, testError,'r',label="test error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
#plt.ylim([0.19, 0.21])
plt.legend()
plt.axvline(x=alpha_opt)
plt.xticks([])
plt.ylabel("error")
plt.subplot(313)
plt.xscale("log")
plt.scatter(alpha_tab, featureNum)
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.ylim([0,28])
plt.axvline(x=alpha_opt)
plt.ylabel("nb. of features")
plt.xlabel("alpha")
Out[35]:
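This kind of sweep over $\alpha$ can also be automated: Scikit-Learn provides LassoCV, which selects $\alpha$ by cross-validation on the training set (an added sketch, reusing the grid of values defined above):
from sklearn.linear_model import LassoCV
modelCV = LassoCV(alphas=alpha_tab, cv=5)
modelCV.fit(XTrain, np.ravel(yTrain))        # LassoCV expects a 1D target
print("alpha selected by cross-validation:", modelCV.alpha_)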
The exercises use the Titanic dataset, whose features are described below:

Feature | Definition | Comment |
---|---|---|
PassengerId | ID | numeric |
Survival | Survival of the passenger | 0 = No, 1 = Yes (target to predict) |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Full name w/ Mr., Mrs., etc. | string |
Sex | Sex | male or female |
Age | Age in years | numeric |
SibSp | # of siblings / spouses aboard the Titanic | numeric |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | quite messy |
Fare | Passenger fare | |
Cabin | Cabin number | letter + number (e.g. C85), often missing |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
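As a starting point for the exercise, one might load the data and begin with a simple preprocessing; a minimal sketch (the file path and the exact column names are assumptions based on the table above and may need adapting):
import pandas as pd
titanic = pd.read_csv('data/titanic.csv')                                  # assumed location of the dataset
target = titanic["Survived"]                                               # the survival column is usually named "Survived" in the CSV
features = titanic.drop(["Survived", "Name", "Ticket", "Cabin"], axis=1)   # drop the messier text columns to start
print(features.head())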
Package check and Styling
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()