Introduction to Python for Data Sciences

Franck Iutzeler
Fall 2018
Warning: In this session, we will work through examples of how to handle popular learning problems using standard algorithms. Many other problems and algorithms exist, so this course is by no means exhaustive.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
%matplotlib inline

# we create 40 separable points in R^2 around 2 centers (random_state=6 is a seed so that the set is separable)
X, y = make_blobs(n_samples=40, n_features=2, centers=2 , random_state=6)

print(X[:5,:],y[:5]) # print the first 5 points and labels

plt.scatter(X[:, 0], X[:, 1], c=y,  cmap=plt.cm.Paired)


(array([[  6.37734541, -10.61510727],
       [  6.50072722,  -3.82403586],
       [  4.29225906,  -8.99220442],
       [  7.39169472,  -3.1266933 ],
       [  7.64306311, -10.02356892]]), array([1, 0, 1, 0, 1]))
Out[1]:
<matplotlib.collections.PathCollection at 0x7fc8c6fb2810>

Support Vector Machines (SVM) are based on learning a vector $w$ and an intercept $b$ such that the hyperplane $w^T x - b = 0$ separates the data, i.e. a point $a$ belongs to one class if $w^T a - b > 0$ and to the other otherwise.

They were later extended to kernel methods, where the separating surface is now $\kappa(w, a) - b = 0$ and $\kappa$ is a kernel, typically:

  • linear: $\kappa(x,y)= x^T y$ (original SVM)
  • polynomial: $\kappa(x,y)= (x^T y)^d$
  • Gaussian radial basis function (rbf): $\kappa(x,y)= \exp( - \gamma \| x - y \|^2 )$
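In scikit-learn, the kernel is selected through the kernel argument of SVC; a minimal sketch (the hyperparameter values are illustrative, not tuned):

from sklearn.svm import SVC

# same classifier class, three different kernels (illustrative hyperparameters)
svc_linear = SVC(kernel="linear")           # linear kernel (original SVM)
svc_poly   = SVC(kernel="poly", degree=3)   # polynomial kernel of degree 3
svc_rbf    = SVC(kernel="rbf", gamma=1.0)   # Gaussian rbf kernel with gamma = 1

for clf in (svc_linear, svc_poly, svc_rbf):
    clf.fit(X, y)                           # X, y: the blobs generated above
    print(clf.kernel, clf.score(X, y))      # training accuracy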

In [2]:
from sklearn.svm import SVC # Support vector classifier i.e. Classifier by SVM

modelSVMLinear = SVC(kernel="linear")
modelSVMLinear.fit(X,y)


Out[2]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The following helper function comes from the Python Data Science Handbook by Jake VanderPlas.


In [3]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

In [4]:
plt.scatter(X[:, 0], X[:, 1], c=y ,  cmap=plt.cm.Paired)
plot_svc_decision_function(modelSVMLinear)



We clearly see that the linear SVM seeks to maximize the margin between the separating hyperplane and the two well-separated classes in the data.
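For a linear kernel, the learned hyperplane is directly available from the fitted model, and the margin width is $2/\|w\|$; a quick check (a sketch reusing modelSVMLinear from above):

import numpy as np

w = modelSVMLinear.coef_[0]        # normal vector of the separating hyperplane
b = modelSVMLinear.intercept_[0]   # intercept term
print("w =", w, ", b =", b)
print("margin width =", 2 / np.linalg.norm(w))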

Non-separable data

In real cases, the data is usually not linearly separable as before.


In [5]:
# we create points in R^2 around 2 centers (random_state=48443 is a seed so that the set is *not* separable)
X, y = make_blobs(n_samples=100, n_features=2, centers=2 , random_state=48443)

plt.scatter(X[:, 0], X[:, 1], c=y,  cmap=plt.cm.Paired)


Out[5]:
<matplotlib.collections.PathCollection at 0x7fc8c41a57d0>

Let us use the same linear SVM classifier. Obviously, some points will be misclassified, so the model is learned not by maximizing the margin (which no longer exists) but by minimizing a penalty on misclassified data. This penalty takes the form of a soft margin controlled by a parameter $C$: the smaller $C$, the more permissive the margin. Finding a good value for $C$ is up to the data scientist.


In [6]:
try:
    from sklearn.model_selection import train_test_split    # newer sklearn versions
except ImportError:
    from sklearn.cross_validation import train_test_split   # older sklearn versions
    
XTrain, XTest, yTrain, yTest = train_test_split(X,y,test_size = 0.5) # split data in two

model1 = SVC(kernel="linear",C=0.01)
model1.fit(XTrain,yTrain)

model2 = SVC(kernel="linear",C=100)
model2.fit(XTrain,yTrain)


Out[6]:
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [7]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain ,  cmap=plt.cm.Paired)
plot_svc_decision_function(model1)
plt.title("C = 0.01")


Out[7]:
<matplotlib.text.Text at 0x7fc8c41329d0>

In [8]:
plt.scatter(XTrain[:, 0], XTrain[:, 1], c=yTrain ,  cmap=plt.cm.Paired)
plot_svc_decision_function(model2)
plt.title("C = 100")


Out[8]:
<matplotlib.text.Text at 0x7fc8c4069590>

To decide which value of $C$ to use, or more generally to assess the performance of a classifier, one can use Scikit-Learn's classification metrics, for instance the confusion matrix.


In [9]:
from sklearn.metrics import confusion_matrix


yFit1 = model1.predict(XTest)
yFit2 = model2.predict(XTest)


mat1 = confusion_matrix(yTest, yFit1)
mat2 = confusion_matrix(yTest, yFit2)

print('Model with C = 0.01')
print(mat1)
print("Model with C = 100")
print(mat2)


Model with C = 0.01
[[19  1]
 [ 7 23]]
Model with C = 100
[[19  1]
 [ 6 24]]

It can also be plotted in a fancier way with seaborn.


In [10]:
import seaborn as sns

sns.heatmap(mat1, square=True, annot=True ,cbar=False)
plt.ylabel('true label')
plt.xlabel('predicted label')


Out[10]:
<matplotlib.text.Text at 0x7fc8ba863490>
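Beyond inspecting such metrics by hand, a more systematic way to choose $C$ is cross-validation on the training set; below is a minimal sketch using GridSearchCV (the grid of candidate values is illustrative):

import numpy as np
from sklearn.model_selection import GridSearchCV   # use sklearn.grid_search on very old versions
from sklearn.svm import SVC

param_grid = {"C": np.logspace(-3, 3, 7)}           # candidate values for C
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(XTrain, yTrain)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)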

Kernels

When the separation between classes is not linear, kernels may be used to draw separating curves instead of lines. The most popular is the Gaussian rbf.


In [11]:
from sklearn.datasets import make_moons

X,y = make_moons(noise=0.1)
plt.scatter(X[:, 0], X[:, 1], c=y,  cmap=plt.cm.Paired)


Out[11]:
<matplotlib.collections.PathCollection at 0x7fc8c4204dd0>

In [12]:
modelLinear = SVC(kernel="linear")
modelLinear.fit(X,y)

modelRbf = SVC(kernel="rbf")
modelRbf.fit(X,y)


Out[12]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
plt.scatter(X[:, 0], X[:, 1], c=y,  cmap=plt.cm.Paired)
plot_svc_decision_function(modelLinear)
plot_svc_decision_function(modelRbf)
plt.title("The two models superposed")


Out[13]:
<matplotlib.text.Text at 0x7fc8ba7b1910>

Let us compare the linear and rbf training error using the zero one loss (the proportion of misclassified examples).


In [14]:
from sklearn.metrics import zero_one_loss

yFitLinear = modelLinear.predict(X)
yFitRbf = modelRbf.predict(X)

print("0/1 loss -- Linear: {:.3f}      Rbf: {:.3f}".format(zero_one_loss(y, yFitLinear),zero_one_loss(y, yFitRbf)))


0/1 loss -- Linear: 0.140      Rbf: 0.040
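These are training errors, which can be optimistic; a hedged sketch of the same comparison on a held-out split of the moons data (train_test_split, SVC and zero_one_loss were imported above):

# compare test error on a held-out split (the split itself is illustrative)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5)

for name, model in [("Linear", SVC(kernel="linear")), ("Rbf", SVC(kernel="rbf"))]:
    model.fit(Xtr, ytr)
    print("{} test 0/1 loss: {:.3f}".format(name, zero_one_loss(yte, model.predict(Xte))))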

Multiple classes

When there are multiple classes (as in the iris dataset of the Pandas notebook), different strategies can be adopted:

  • Transforming the multiclass problem into a binary one by looking at the one-vs-rest problem (for each class, construct a binary classifier between it and the rest) or the one-vs-one problem (where each pair of classes is considered separately). After this transformation, standard binary classifiers can be used.
  • Using dedicated algorithms such as decision trees

The corresponding algorithms can be found in the multiclass module documentation.
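For instance, scikit-learn's one-vs-one strategy is exposed as OneVsOneClassifier and is used exactly like the OneVsRestClassifier below; a minimal sketch:

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

# both wrappers turn any binary classifier into a multiclass one
ovr = OneVsRestClassifier(SVC(kernel="linear"))  # one binary problem per class (class vs. rest)
ovo = OneVsOneClassifier(SVC(kernel="linear"))   # one binary problem per pair of classes
# both expose the usual fit / predict interface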

We are going to illustrate this on the iris 3-class classification problem, using only the 2 petal features (width and length) so that the feature vector is 2D and easy to visualize.


In [15]:
import pandas as pd
import numpy as np

iris = pd.read_csv('data/iris.csv')
classes = pd.DataFrame(iris["species"])
features = iris.drop(["species","sepal_length","sepal_width"],axis=1)

In [16]:
classes.sample(6)


Out[16]:
species
69 versicolor
74 versicolor
110 virginica
143 virginica
130 virginica
133 virginica

In [17]:
features.sample(6)


Out[17]:
petal_length petal_width
126 4.8 1.8
34 1.5 0.2
42 1.3 0.2
45 1.4 0.3
84 4.5 1.5
23 1.7 0.5

In [18]:
XTrain, XTest, yTrain, yTest = train_test_split(features,classes,test_size = 0.5)

In [19]:
from sklearn.multiclass import OneVsRestClassifier

yPred = OneVsRestClassifier(SVC()).fit(XTrain, yTrain).predict(XTest)

In [20]:
print(yPred)  # Note that the classes are not numbers but strings; everything still works as expected


['virginica' 'versicolor' 'virginica' 'virginica' 'setosa' 'setosa'
 'setosa' 'versicolor' 'versicolor' 'setosa' 'virginica' 'versicolor'
 'virginica' 'setosa' 'virginica' 'setosa' 'setosa' 'virginica' 'setosa'
 'setosa' 'versicolor' 'setosa' 'virginica' 'setosa' 'setosa' 'virginica'
 'setosa' 'setosa' 'setosa' 'virginica' 'virginica' 'setosa' 'versicolor'
 'versicolor' 'setosa' 'virginica' 'setosa' 'virginica' 'virginica'
 'versicolor' 'virginica' 'virginica' 'versicolor' 'virginica' 'versicolor'
 'versicolor' 'setosa' 'virginica' 'virginica' 'versicolor' 'virginica'
 'setosa' 'setosa' 'versicolor' 'setosa' 'setosa' 'versicolor' 'virginica'
 'virginica' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'virginica' 'virginica' 'setosa' 'versicolor' 'versicolor' 'versicolor'
 'setosa' 'versicolor' 'virginica' 'versicolor' 'setosa' 'virginica']

In [21]:
class_labels = ['setosa', 'versicolor', 'virginica']  # confusion_matrix sorts string labels alphabetically
sns.heatmap(confusion_matrix(yTest, yPred, labels=class_labels), square=True, annot=True, cbar=False, xticklabels=class_labels, yticklabels=class_labels)
plt.ylabel('true label')
plt.xlabel('predicted label')


Out[21]:
<matplotlib.text.Text at 0x7fc8c43497d0>

Other classifiers

The main classifiers provided by Scikit-Learn are: Linear SVM, RBF SVM (as already seen), Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, and QDA.

Typical usage is:

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()]
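A hedged sketch of how such a list can be used: fit each classifier on a common train/test split and print its test accuracy (here on synthetic moons data; the split and parameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for clf in classifiers:
    clf.fit(Xtr, ytr)
    print("{:30s} test accuracy: {:.3f}".format(type(clf).__name__, clf.score(Xte, yte)))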

Let us consider the problem of predicting real values from a set of features.

We will consider the student performance dataset. The goal is to predict the final grade from the other information; from the documentation:

# Attributes of the student-mat.csv (Math course) dataset:
 1 sex - student's sex (binary: "F" - female or "M" - male)
 2 age - student's age (numeric: from 15 to 22)
 3 address - student's home address type (binary: "U" - urban or "R" - rural)
 4 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
 5 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
 6 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
 7 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
 8 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
 9 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
10 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
11 schoolsup - extra educational support (binary: yes or no)
12 famsup - family educational support (binary: yes or no)
13 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
14 activities - extra-curricular activities (binary: yes or no)
15 nursery - attended nursery school (binary: yes or no)
16 higher - wants to take higher education (binary: yes or no)
17 internet - Internet access at home (binary: yes or no)
18 romantic - with a romantic relationship (binary: yes or no)
19 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
20 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
21 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
22 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
23 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
24 health - current health status (numeric: from 1 - very bad to 5 - very good)
25 absences - number of school absences (numeric: from 0 to 93)
26 G1 - first period grade (numeric: from 0 to 20)
27 G2 - second period grade (numeric: from 0 to 20)
28 G3 - final grade (numeric: from 0 to 20, output target)

In [22]:
import pandas as pd
import numpy as np

student = pd.read_csv('data/student-mat.csv')
student.head()


Out[22]:
sex age address famsize Pstatus Medu Fedu traveltime studytime failures ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 F 18 U GT3 A 4 4 2 2 0 ... 4 3 4 1 1 3 6 5 6 6
1 F 17 U GT3 T 1 1 1 2 0 ... 5 3 3 1 1 3 4 5 5 6
2 F 15 U LE3 T 1 1 1 2 3 ... 4 3 2 2 3 3 10 7 8 10
3 F 15 U GT3 T 4 2 1 3 0 ... 3 2 2 1 1 5 2 15 14 15
4 F 16 U GT3 T 3 3 1 2 0 ... 4 3 2 1 2 5 4 6 10 10

5 rows × 28 columns


In [23]:
target = pd.DataFrame(student["G3"])
features = student.drop(["G3"],axis=1)

One immediate problem here is that not all features are numeric (floats). Thankfully, Scikit-Learn provides encoders to convert categorical (a.k.a. nominal, discrete) features to numerical ones.


In [24]:
from sklearn.preprocessing import LabelEncoder

lenc = LabelEncoder()
num_features = features.apply(lenc.fit_transform)

In [25]:
num_features.head()


Out[25]:
sex age address famsize Pstatus Medu Fedu traveltime studytime failures ... romantic famrel freetime goout Dalc Walc health absences G1 G2
0 0 3 1 0 0 4 4 1 1 0 ... 0 3 2 3 0 0 2 6 2 3
1 0 2 1 0 1 1 1 0 1 0 ... 0 4 2 2 0 0 2 4 2 2
2 0 0 1 1 1 1 1 0 1 3 ... 0 3 2 1 1 2 2 10 4 5
3 0 0 1 0 1 4 2 0 2 0 ... 1 2 1 1 0 0 4 2 12 11
4 0 1 1 0 1 3 3 0 1 0 ... 0 3 2 1 0 1 4 4 3 7

5 rows × 27 columns

Even the numerical features were re-encoded; since we are going to normalize anyway, this is not really important.
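An alternative, mentioned here only as a side note, is to one-hot encode just the categorical (string) columns, for instance with pandas.get_dummies, leaving the numeric columns untouched (a hedged sketch, not what is used below):

# one-hot encode only the object (string) columns of the student features
cat_cols = features.select_dtypes(include=["object"]).columns
num_features_alt = pd.get_dummies(features, columns=cat_cols)
num_features_alt.head()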

The normalization is done by removing the mean and scaling to unit variance, feature by feature; in addition, we add an intercept (a constant dummy feature).


In [26]:
from sklearn.preprocessing import StandardScaler, add_dummy_feature

scaler = StandardScaler()
normFeatures = add_dummy_feature(scaler.fit_transform(num_features))

In [27]:
preproData = pd.DataFrame(normFeatures , columns=[ "intercept" ] + list(num_features.columns) )

In [28]:
preproData.describe().T


Out[28]:
count mean std min 25% 50% 75% max
intercept 395.0 1.000000e+00 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000
sex 395.0 -7.195369e-17 1.001268 -0.948176 -0.948176 -0.948176 1.054656 1.054656
age 395.0 1.439074e-16 1.001268 -1.330954 -0.546287 0.238380 1.023046 4.161713
address 395.0 -1.618958e-16 1.001268 -1.867789 0.535392 0.535392 0.535392 0.535392
famsize 395.0 6.295948e-17 1.001268 -0.636941 -0.636941 -0.636941 1.570004 1.570004
Pstatus 395.0 -1.349132e-17 1.001268 -2.938392 0.340322 0.340322 0.340322 0.340322
Medu 395.0 5.396527e-17 1.001268 -2.514630 -0.685387 0.229234 1.143856 1.143856
Fedu 395.0 -1.439074e-16 1.001268 -2.320084 -0.479857 -0.479857 0.440257 1.360371
traveltime 395.0 3.597685e-17 1.001268 -0.643249 -0.643249 -0.643249 0.792251 3.663251
studytime 395.0 4.946817e-17 1.001268 -1.235351 -1.235351 -0.042286 -0.042286 2.343844
failures 395.0 -1.349132e-17 1.001268 -0.449944 -0.449944 -0.449944 -0.449944 3.589323
schoolsup 395.0 5.396527e-17 1.001268 -0.385040 -0.385040 -0.385040 -0.385040 2.597133
famsup 395.0 -9.893633e-17 1.001268 -1.257656 -1.257656 0.795130 0.795130 0.795130
paid 395.0 -2.698264e-17 1.001268 -0.919671 -0.919671 -0.919671 1.087346 1.087346
activities 395.0 -9.893633e-17 1.001268 -1.017881 -1.017881 0.982433 0.982433 0.982433
nursery 395.0 -2.698264e-17 1.001268 -1.968894 0.507899 0.507899 0.507899 0.507899
higher 395.0 2.248553e-16 1.001268 -4.330127 0.230940 0.230940 0.230940 0.230940
internet 395.0 -1.798842e-17 1.001268 -2.232677 0.447893 0.447893 0.447893 0.447893
romantic 395.0 -8.994212e-17 1.001268 -0.708450 -0.708450 -0.708450 1.411533 1.411533
famrel 395.0 -1.394103e-16 1.001268 -3.287804 0.062194 0.062194 1.178860 1.178860
freetime 395.0 1.079305e-16 1.001268 -2.240828 -0.236010 -0.236010 0.766399 1.768808
goout 395.0 -1.214219e-16 1.001268 -1.896683 -0.997295 -0.097908 0.801479 1.700867
Dalc 395.0 7.195369e-17 1.001268 -0.540699 -0.540699 -0.540699 0.583385 3.955638
Walc 395.0 8.094791e-17 1.001268 -1.003789 -1.003789 -0.226345 0.551100 2.105989
health 395.0 7.195369e-17 1.001268 -1.839649 -0.399289 0.320890 1.041070 1.041070
absences 395.0 5.396527e-17 1.001268 -0.842509 -0.842509 -0.221630 0.399249 4.279742
G1 395.0 -1.214219e-16 1.001268 -2.385787 -0.877487 0.027493 0.630813 2.440773
G2 395.0 2.248553e-17 1.001268 -2.229107 -0.517187 0.053452 0.624092 2.336011

Regression and Feature selection with the Lasso

The lasso problem consists in finding a regressor $w$ that minimizes $$ \frac{1}{2 n_{\text{samples}}} \|X w - y \|^2_2 + \alpha \|w\|_1 $$

and is popular for prediction as it simultaneously selects features thanks to the $\ell_1$ term: the greater $\alpha$, the fewer selected features.


In [29]:
try:
    from sklearn.model_selection import train_test_split    # newer sklearn versions
except ImportError:
    from sklearn.cross_validation import train_test_split   # older sklearn versions
    
    
from sklearn.linear_model import Lasso

XTrain, XTest, yTrain, yTest = train_test_split(preproData,target,test_size = 0.25)

model = Lasso(alpha=0.1)
model.fit(XTrain,yTrain)


Out[29]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

We can observe the regressor $w$ provided by the model; notice its sparsity.


In [30]:
model.coef_


Out[30]:
array([ 0.        ,  0.        , -0.04135809, -0.        , -0.        ,
       -0.04011519,  0.        , -0.        ,  0.        , -0.02139503,
       -0.        ,  0.05544144,  0.        , -0.        , -0.        ,
        0.        ,  0.2031513 , -0.        , -0.11892191,  0.19272755,
        0.        , -0.        ,  0.        ,  0.        ,  0.        ,
        0.28213827,  0.14129135,  3.84468349])

We can observe which coefficients are set to $0$ and which features are positively/negatively correlated with the target.


In [31]:
print("Value    Feature")
for idx,val in enumerate(model.coef_):
    print("{:6.3f}      {}".format(val,preproData.columns[idx]))


Value    Feature
 0.000      intercept
 0.000      sex
-0.041      age
-0.000      address
-0.000      famsize
-0.040      Pstatus
 0.000      Medu
-0.000      Fedu
 0.000      traveltime
-0.021      studytime
-0.000      failures
 0.055      schoolsup
 0.000      famsup
-0.000      paid
-0.000      activities
 0.000      nursery
 0.203      higher
-0.000      internet
-0.119      romantic
 0.193      famrel
 0.000      freetime
-0.000      goout
 0.000      Dalc
 0.000      Walc
 0.000      health
 0.282      absences
 0.141      G1
 3.845      G2
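The selected (non-zero) features can also be extracted programmatically; a minimal sketch:

# names of the features kept by the lasso (non-zero coefficients)
selected = preproData.columns[model.coef_ != 0]
print(list(selected))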

Let us take a look at our predictions.


In [32]:
targetPred = model.predict(XTest)

print("Predicted    True")
for idx,val in enumerate(targetPred):
    print("{:4.1f}          {:.0f}".format(val,float(yTest.iloc[idx])))


Predicted    True
10.8          10
15.3          15
 2.4          0
13.0          14
 9.2          10
18.7          19
 8.4          10
14.1          14
 7.3          8
12.0          11
13.6          15
 5.8          7
14.5          15
19.7          20
 9.7          11
13.0          14
18.6          18
 7.4          9
 7.0          9
 1.5          0
11.4          11
 7.2          0
18.8          18
14.7          15
11.2          13
11.1          13
 9.5          12
 4.7          5
 1.6          0
12.0          13
 8.2          9
 5.9          9
 6.6          8
 8.5          10
 7.7          6
 3.6          5
 5.7          0
10.4          11
 5.3          6
 8.8          9
18.7          18
10.8          12
16.7          16
 5.1          8
13.3          14
11.5          12
10.6          11
11.1          11
 1.7          0
14.1          14
11.5          11
10.6          12
 9.6          10
11.4          12
13.9          14
15.8          16
10.3          12
10.6          10
 9.9          11
 8.2          10
 8.0          8
12.6          13
 8.9          9
 3.4          0
12.8          13
13.8          13
15.3          15
 5.7          0
 6.0          8
 3.0          0
10.6          11
15.0          15
 5.1          8
13.3          13
10.1          10
 8.2          8
 1.9          0
11.5          12
 8.1          10
10.8          12
 8.6          9
 1.6          0
10.0          10
10.4          11
 7.4          8
 5.7          8
 6.5          8
 6.8          0
 5.8          6
 5.9          8
 8.7          10
 8.2          9
10.4          10
 9.3          0
14.8          16
 9.3          10
13.1          14
10.0          11
16.4          16

Regularization path

Selecting a good parameter $\alpha$ is the role of the data scientist. For instance, an easy way to do so is the following.


In [33]:
n_test = 15
alpha_tab = np.logspace(-10,1,base=2,num = n_test)
print(alpha_tab)


[  9.76562500e-04   1.68354067e-03   2.90233260e-03   5.00346363e-03
   8.62569933e-03   1.48702368e-02   2.56354799e-02   4.41941738e-02
   7.61883534e-02   1.31344580e-01   2.26430916e-01   3.90354591e-01
   6.72950096e-01   1.16012939e+00   2.00000000e+00]

In [34]:
trainError = np.zeros(n_test)
testError = np.zeros(n_test)
featureNum = np.zeros(n_test)

for idx,alpha in enumerate(alpha_tab):
    model = Lasso(alpha=alpha)
    model.fit(XTrain,yTrain)
    yPredTrain = model.predict(XTrain)
    yPredTest = model.predict(XTest)
    
    trainError[idx] = np.linalg.norm(yPredTrain - yTrain["G3"].values) / len(yTrain)
    testError[idx] = np.linalg.norm(yPredTest - yTest["G3"].values) / len(yTest)
    featureNum[idx] = sum(model.coef_!=0)

    
alpha_opt = alpha_tab[np.argmin(testError)]

In [35]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline


plt.subplot(311)
plt.xscale("log")
plt.plot(alpha_tab, trainError,label="train error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.legend()
plt.xticks([])
plt.axvline(x=alpha_opt)
plt.ylabel("error")

plt.subplot(312)
plt.xscale("log")
plt.plot(alpha_tab, testError,'r',label="test error")
plt.xlim([min(alpha_tab),max(alpha_tab)])
#plt.ylim([0.19, 0.21])
plt.legend()
plt.axvline(x=alpha_opt)
plt.xticks([])
plt.ylabel("error")

plt.subplot(313)
plt.xscale("log")
plt.scatter(alpha_tab, featureNum)
plt.xlim([min(alpha_tab),max(alpha_tab)])
plt.ylim([0,28])
plt.axvline(x=alpha_opt)
plt.ylabel("nb. of features")
plt.xlabel("alpha")


Out[35]:
<matplotlib.text.Text at 0x7fc8ba526950>
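scikit-learn can also automate this search: LassoCV fits the model along a grid of $\alpha$ values and picks the best one by cross-validation (a minimal sketch; the number of folds is illustrative):

from sklearn.linear_model import LassoCV

modelCV = LassoCV(alphas=alpha_tab, cv=5)
modelCV.fit(XTrain, yTrain["G3"].values)    # LassoCV expects a 1D target
print("alpha selected by cross-validation:", modelCV.alpha_)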
Exercise 4.2.1: a very popular binary classification exercise is survival prediction for the Titanic shipwreck on Kaggle.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


The data - taken from Kaggle - is located in data/titanic/train.csv and has the following form:
Feature      Definition                                  Comment
PassengerId  ID                                          numeric
Survival     Survival of the passenger                   0 = No, 1 = Yes (target to predict)
Pclass       Ticket class                                1 = 1st, 2 = 2nd, 3 = 3rd
Name         Full name w/ Mr., Mrs., etc.                string
Sex          Sex                                         male or female
Age          Age in years                                numeric
SibSp        # of siblings / spouses aboard the Titanic  numeric
Parch        # of parents / children aboard the Titanic
Ticket       Ticket number                               quite messy
Fare         Passenger fare
Cabin        Cabin number                                letter + number (e.g. C85), often missing
Embarked     Port of Embarkation                         C = Cherbourg, Q = Queenstown, S = Southampton
  • Load the dataset and preprocess the features (you may remove features that seem uninteresting to you).
  • Perform binary classification to predict the survival of a passenger from their information and validate your approach; a possible starting point is sketched below.
  • Perform some feature engineering to improve the performance of your classifier (see e.g. here).
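A possible starting point for the first two items (a minimal, hedged sketch: column names are those of the Kaggle csv and may differ slightly from the table's labels; the chosen features, encoding, and classifier are only one option among many):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

titanic = pd.read_csv('data/titanic/train.csv')

# keep a few simple features, encode Sex, fill missing ages with the median
data = titanic[["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]].copy()
data["Sex"] = (data["Sex"] == "female").astype(int)
data["Age"] = data["Age"].fillna(data["Age"].median())

X = data.drop("Survived", axis=1)
y = data["Survived"]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25)

clf = SVC(kernel="rbf").fit(Xtr, ytr)
print(confusion_matrix(yte, clf.predict(Xte)))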

In [ ]:

Exercise 4.2.2: a very popular regression exercise is the house price prediction in Ames, Iowa on Kaggle.

The data - taken from Kaggle - is located in data/house_prices/train.csv.
  • Try to reach the best accuracy in terms of mean absolute error on the log of the prices ($\mathrm{Error} = \frac{1}{n} \sum_{i=1}^n | \log(\text{predicted}_i) - \log(\text{true}_i) |$); a sketch of this metric is given below.
  • Which features (original or engineered) are the most relevant?
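A minimal sketch of that error metric, assuming two arrays of (positive) predicted and true prices:

import numpy as np

def mean_abs_log_error(predicted, true):
    """Mean absolute error on the log of the prices."""
    predicted, true = np.asarray(predicted, dtype=float), np.asarray(true, dtype=float)
    return np.mean(np.abs(np.log(predicted) - np.log(true)))

print(mean_abs_log_error([210000, 145000], [200000, 150000]))  # illustrative values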

In [ ]:


Package Check and Styling



In [ ]:
import lib.notebook_setting as nbs

packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)

nbs.cssStyling()