Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
Outline
a) Classification
b) Regression
c) Exercises
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
%matplotlib inline
# we create 40 separable points in R^2 around 2 centers (random_state=6 is a seed so that the set is separable)
X, y = make_blobs(n_samples=40, n_features=2, centers=2 , random_state=6)
print(X[:5,:],y[:5]) # print the first 5 points and labels
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[1]:
Support Vector Machines (SVM) are based on learning a vector $w$ and an intercept $b$ such that the hyperplane $w^T x - b = 0$ separates the data, i.e. a point $a$ belongs to one class if $w^T a - b > 0$ and to the other otherwise.
They were later extended to kernel methods, where the separating surface becomes $\kappa(w, a) - b = 0$ and $\kappa$ is a kernel, typically linear ($\kappa(w,a) = w^T a$, as above), polynomial, or Gaussian (RBF).
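As a quick preview (this snippet is an addition, not a cell of the original notebook), SVC can be fitted with each of its built-in kernels on the blobs generated above; the training accuracy is printed only for illustration:
from sklearn.svm import SVC
for name in ["linear", "poly", "rbf", "sigmoid"]:    # the built-in kernels of SVC
    model = SVC(kernel=name)
    model.fit(X, y)                                  # X, y: the blobs generated above
    print(name, model.score(X, y))                   # mean training accuracy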
In [2]:
from sklearn.svm import SVC # Support vector classifier i.e. Classifier by SVM
modelSVMLinear = SVC(kernel="linear")
modelSVMLinear.fit(X,y)
Out[2]:
The following illustration can be found in the Python Data Science Handbook by Jake VanderPlas.
In [3]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)

    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
In [4]:
plt.scatter(X[:, 0], X[:, 1], c=y , cmap=plt.cm.Paired)
plot_svc_decision_function(modelSVMLinear)
We clearly see that the linear SVM seeks to maximize the margin between the separating hyperplane and the two well-separated classes of the data.
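Since the kernel is linear, the hyperplane parameters can be read directly from the fitted model (a small sketch added here for illustration; note that Scikit-Learn's convention is $w^T x + b = 0$ with $b$ stored in intercept_, a sign flip with respect to the text above):
w = modelSVMLinear.coef_[0]          # normal vector of the separating hyperplane
b = modelSVMLinear.intercept_[0]     # sklearn convention: w^T x + b = 0
print("w =", w, "  b =", b)
print("support vectors:\n", modelSVMLinear.support_vectors_)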
In [5]:
# we create points in R^2 around 2 centers (random_state=48443 is a seed so that the set is *not* separable)
X, y = make_blobs(n_samples=100, n_features=2, centers=2 , random_state=48443)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[5]:
Let us use the same linear SVM classifier. Obviously, some points are now misclassified; the model is thus learned not by maximizing the margin (which no longer exists) but by minimizing a penalty over misclassified data. This penalty takes the form of an allowance margin controlled by a parameter $C$: the smaller $C$, the more inclusive the margin. Finding a good value for $C$ is up to the data scientist.
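One common way to choose $C$ is cross-validation; below is a minimal sketch (added here, with an arbitrary grid of candidate values, and assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}               # arbitrary candidate values
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)                                          # X, y: the non-separable blobs above
print("best C:", search.best_params_["C"])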
In [6]:
try:
    from sklearn.model_selection import train_test_split # sklearn > ...
except ImportError:
    from sklearn.cross_validation import train_test_split # sklearn < ...
XTrain, XTest, yTrain, yTest = train_test_split(X,y,test_size = 0.5) # split data in two
model1 = SVC(kernel="linear",C=0.01)
model1.fit(XTrain,yTrain)
model2 = SVC(kernel="linear",C=100)
model2.fit(XTrain,yTrain)
Out[6]:
In [7]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain , cmap=plt.cm.Paired)
plot_svc_decision_function(model1)
plt.title("C = 0.01")
Out[7]:
In [8]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain , cmap=plt.cm.Paired)
plot_svc_decision_function(model2)
plt.title("C = 100")
Out[8]:
To choose a value of $C$, or more generally to assess the performance of a classifier, one can use Scikit Learn's classification metrics, for instance the confusion matrix.
In [9]:
from sklearn.metrics import confusion_matrix
yFit1 = model1.predict(XTest)
yFit2 = model2.predict(XTest)
mat1 = confusion_matrix(yTest, yFit1)
mat2 = confusion_matrix(yTest, yFit2)
print('Model with C = 0.01')
print(mat1)
print("Model with C = 100")
print(mat2)
It can also be plotted in a fancier way with seaborn.
In [10]:
import seaborn as sns
sns.heatmap(mat1, square=True, annot=True ,cbar=False)
plt.ylabel('true label')
plt.xlabel('predicted label')
Out[10]:
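The confusion matrix is not the only available metric; for instance (an added sketch, not a cell of the original notebook), the accuracy and the per-class precision/recall can be computed from the same predictions:
from sklearn.metrics import accuracy_score, classification_report
print("accuracy with C = 0.01: {:.2f}".format(accuracy_score(yTest, yFit1)))
print("accuracy with C = 100 : {:.2f}".format(accuracy_score(yTest, yFit2)))
print(classification_report(yTest, yFit2))     # precision, recall, f1-score per class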
In [11]:
from sklearn.datasets import make_moons
X,y = make_moons(noise=0.1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
Out[11]:
In [12]:
modelLinear = SVC(kernel="linear")
modelLinear.fit(X,y)
modelRbf = SVC(kernel="rbf")
modelRbf.fit(X,y)
Out[12]:
In [13]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plot_svc_decision_function(modelLinear)
plot_svc_decision_function(modelRbf)
plt.title("The two models superposed")
Out[13]:
Let us compare the linear and RBF training errors using the zero-one loss (the proportion of misclassified examples).
In [14]:
from sklearn.metrics import zero_one_loss
yFitLinear = modelLinear.predict(X)
yFitRbf = modelRbf.predict(X)
print("0/1 loss -- Linear: {:.3f} Rbf: {:.3f}".format(zero_one_loss(y, yFitLinear),zero_one_loss(y, yFitRbf)))
When there are multiple classes (as in the iris dataset of the Pandas notebook), different strategies can be adopted, typically one-vs-rest (one classifier per class, against all the others) or one-vs-one (one classifier per pair of classes).
The corresponding algorithms can be found in the multiclass module documentation.
We are going to illustrate this on the iris 3-class classification problem, using only the 2 petal features (width and length) so that the feature vector is 2D and easy to visualize.
In [15]:
import pandas as pd
import numpy as np
iris = pd.read_csv('data/iris.csv')
classes = pd.DataFrame(iris["species"])
features = iris.drop(["species","sepal_length","sepal_width"],axis=1)
In [16]:
classes.sample(6)
Out[16]:
In [17]:
features.sample(6)
Out[17]:
In [18]:
XTrain, XTest, yTrain, yTest = train_test_split(features,classes,test_size = 0.5)
In [19]:
from sklearn.multiclass import OneVsRestClassifier
yPred = OneVsRestClassifier(SVC()).fit(XTrain, yTrain).predict(XTest)
In [20]:
print(yPred) # Note that the classes are strings, not numbers, but everything went as expected
In [21]:
class_labels = ['setosa', 'versicolor', 'virginica']  # confusion_matrix sorts the labels alphabetically
sns.heatmap(confusion_matrix(yTest, yPred), square=True, annot=True, cbar=False, xticklabels=class_labels, yticklabels=class_labels)
plt.ylabel('true label')
plt.xlabel('predicted label')
Out[21]:
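For comparison (an added sketch), the one-vs-one strategy mentioned above can be used in exactly the same way on the same split:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
yPredOvO = OneVsOneClassifier(SVC()).fit(XTrain, yTrain).predict(XTest)
print("one-vs-one accuracy: {:.2f}".format(accuracy_score(yTest, yPredOvO)))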
The main classifiers from Scikit Learn are: Linear SVM, RBF SVM (as already seen), Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, and QDA.
They are used as follows:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]
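To compare them quickly, one can loop over the list and print each test accuracy; a minimal sketch (our addition, assuming the imports and the classifiers list above have been executed, and reusing moons data and a simple split as earlier in the notebook):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
Xm, ym = make_moons(noise=0.1, random_state=0)
XmTrain, XmTest, ymTrain, ymTest = train_test_split(Xm, ym, test_size=0.5)
for clf in classifiers:
    clf.fit(XmTrain, ymTrain)
    print("{:32s} test accuracy: {:.2f}".format(type(clf).__name__, clf.score(XmTest, ymTest)))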
Let us now consider the problem of predicting real values from a set of features.
We will use the student performance dataset: the goal is to predict the final grade G3 from the other features, whose meaning is given in the dataset documentation.
In [22]:
import pandas as pd
import numpy as np
student = pd.read_csv('data/student-mat.csv')
student.head()
Out[22]:
In [23]:
target = pd.DataFrame(student["G3"])
features = student.drop(["G3"],axis=1)
One immediate problem here is that the features are not numeric (not floats). Thankfully, Scikit Learn provides encoders to convert categorical (aka nominal, discrete) features to numerical ones.
In [24]:
from sklearn.preprocessing import LabelEncoder
lenc = LabelEncoder()
num_features = features.apply(lenc.fit_transform)
In [25]:
num_features.head()
Out[25]:
Even the numerical features were encoded; since we are going to normalize them anyway, this is not really an issue.
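As an alternative to integer label encoding (a sketch added for reference, not used in the rest of the notebook), the categorical columns could be one-hot encoded, for instance with pandas:
oneHotFeatures = pd.get_dummies(features)   # expands each categorical column into 0/1 indicator columns
print(oneHotFeatures.shape)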
The normalization is done per feature by removing the mean and scaling to unit variance; in addition, we add an intercept (a dummy feature equal to one).
In [26]:
from sklearn.preprocessing import StandardScaler, add_dummy_feature
scaler = StandardScaler()
normFeatures = add_dummy_feature(scaler.fit_transform(num_features))
In [27]:
preproData = pd.DataFrame(normFeatures , columns=[ "intercept" ] + list(num_features.columns) )
In [28]:
preproData.describe().T
Out[28]:
The lasso problem consists in finding a regressor $w$ that minimizes $$ \frac{1}{2 n_{\text{samples}}} \|X w - y\|_2^2 + \alpha \|w\|_1 $$
and it is popular for prediction as it simultaneously selects features thanks to the $\ell_1$ term: the greater $\alpha$, the fewer selected features.
In [29]:
try:
    from sklearn.model_selection import train_test_split # sklearn > ...
except ImportError:
    from sklearn.cross_validation import train_test_split # sklearn < ...
from sklearn.linear_model import Lasso
XTrain, XTest, yTrain, yTest = train_test_split(preproData,target,test_size = 0.25)
model = Lasso(alpha=0.1)
model.fit(XTrain,yTrain)
Out[29]:
We can observe the regressor $w$ provided by the model; notice its sparsity.
In [30]:
model.coef_
Out[30]:
We can observe which coefficients are set to $0$ and which ones are positively or negatively correlated with the final grade.
In [31]:
print("Value Feature")
for idx,val in enumerate(model.coef_):
print("{:6.3f} {}".format(val,preproData.columns[idx]))
Let us take a look at our predictions.
In [32]:
targetPred = model.predict(XTest)
print("Predicted True")
for idx,val in enumerate(targetPred):
print("{:4.1f} {:.0f}".format(val,float(yTest.iloc[idx])))
In [33]:
n_test = 15
alpha_tab = np.logspace(-10,1,base=2,num = n_test)
print(alpha_tab)
In [34]:
trainError = np.zeros(n_test)
testError = np.zeros(n_test)
featureNum = np.zeros(n_test)
for idx,alpha in enumerate(alpha_tab):
    model = Lasso(alpha=alpha)
    model.fit(XTrain,yTrain)
    yPredTrain = model.predict(XTrain)
    yPredTest = model.predict(XTest)
    trainError[idx] = np.linalg.norm(yPredTrain-yTrain["G3"].values)/yTrain.count()
    testError[idx] = np.linalg.norm(yPredTest-yTest["G3"].values)/yTest.count()
    featureNum[idx] = sum(model.coef_!=0)
alpha_opt = alpha_tab[np.argmin(testError)]
In [35]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
plt.subplot(311)
plt.xscale("log")
plt.plot(alpha_tab, trainError,label="train error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.legend()
plt.xticks([])
plt.axvline(x=alpha_opt)
plt.ylabel("error")
plt.subplot(312)
plt.xscale("log")
plt.plot(alpha_tab, testError,'r',label="test error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
#plt.ylim([0.19, 0.21])
plt.legend()
plt.axvline(x=alpha_opt)
plt.xticks([])
plt.ylabel("error")
plt.subplot(313)
plt.xscale("log")
plt.scatter(alpha_tab, featureNum)
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.ylim([0,28])
plt.axvline(x=alpha_opt)
plt.ylabel("nb. of features")
plt.xlabel("alpha")
Out[35]:
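This kind of sweep over $\alpha$ can also be automated: Scikit-Learn provides LassoCV, which selects $\alpha$ by cross-validation on the training set (an added sketch, reusing the grid of values defined above):
from sklearn.linear_model import LassoCV
modelCV = LassoCV(alphas=alpha_tab, cv=5)
modelCV.fit(XTrain, np.ravel(yTrain))        # LassoCV expects a 1D target
print("alpha selected by cross-validation:", modelCV.alpha_)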
The exercises use the Titanic dataset, whose features are described below:

Feature | Definition | Comment |
---|---|---|
PassengerId | ID | numeric |
Survival | Survival of the passenger | 0 = No, 1 = Yes (target to predict) |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Full name w/ Mr., Mrs., etc. | string |
Sex | Sex | male or female |
Age | Age in years | numeric |
SibSp | # of siblings / spouses aboard the Titanic | numeric |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | quite messy |
Fare | Passenger fare | |
Cabin | Cabin number | letter + number (e.g. C85), often missing |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
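As a starting point for the exercise, one might load the data and begin with a simple preprocessing; a minimal sketch (the file path and the exact column names are assumptions based on the table above and may need adapting):
import pandas as pd
titanic = pd.read_csv('data/titanic.csv')                                  # assumed location of the dataset
target = titanic["Survived"]                                               # the survival column is usually named "Survived" in the CSV
features = titanic.drop(["Survived", "Name", "Ticket", "Cabin"], axis=1)   # drop the messier text columns to start
print(features.head())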
Package check and Styling
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()