Mustafa Atik - SWE 577


In [7]:
from sklearn import svm, linear_model, neighbors, ensemble
from sklearn import cross_validation, grid_search
from sklearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pylab as plt
import time
from IPython.display import YouTubeVideo
%matplotlib inline

Support vector machines (SVM)

SVM is a method that is widely used for classification and regression tasks. Its most important characteristic is that it looks for the decision boundary that separates the classes with the largest possible margin, i.e. the boundary lies at an equal distance from the nearest points of the two classes. Because the resulting "street" between the classes is as wide as possible, the method is sometimes called the widest street approach, or a large margin classifier.

SVM has other nice properties as well: its optimization problem is convex, it is theoretically well motivated, and it can handle nonlinear problems through kernels.

Let's have a look at the space shown below.


In [2]:
x1 = [1, 2, 3, 4, 5, 6]
y1 = [12, 18, 14, 8, 10, 6]

x2 = [4, 5, 7, 8]
y2 = [47, 37, 22, 24]
plt.plot(x1, y1, "o", color="yellow")
plt.plot(x2, y2, "x", color="green")

plt.plot(range(3, 8),[ i * (-14.2) + 100 for i in range(3, 8)], color="blue")
#plt.plot(range(12),[ i * - 4.4 + 45 for i in range(12)], color="blue")
plt.plot(range(9),[ i * - 2.0 + 24 for i in range(9)], color="red")
plt.plot(range(9),[ i * - 4.6 + 44 for i in range(9)], color="k")

plt.title("Figure 1: possible hyperplanes")


Out[2]:
[Figure 1: possible hyperplanes]

As can be seen, all three lines separate the classes perfectly, and we could draw many more straight lines that do the same. But only one of them gives us the minimum misclassification error on new samples. So which line is that?


In [3]:
x1 = [1, 2, 3, 4, 5, 6]
y1 = [12, 18, 14, 8, 10, 6]

x2 = [4, 5, 7, 8]
y2 = [47, 37, 22, 24]
plt.plot(x1, y1, "o", color="yellow")
plt.plot(x2, y2, "x", color="green")
plt.plot([4], [40], "x", color="red")
plt.plot([2], [22], "o", color="red")

plt.plot(range(3, 8),[ i * (-14.2) + 100 for i in range(3, 8)], color="blue")
plt.plot(range(9),[ i * - 4.6 + 44 for i in range(9)], color="k")
plt.plot(range(9),[ i * - 2.0 + 24 for i in range(9)], color="red")
plt.title("Figure 2: misclassification")


Out[3]:
[Figure 2: misclassification]

Since the red and blue lines pass too close to one of the classes, they are prone to misclassifying new samples, as the two red points illustrate. The black line, which keeps the largest margin to the nearest training points of both classes, intuitively does a better job at classification.

So the next question is: how do we get the black line?


In [4]:
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y1 = [12.0, 18.0, 14.0, 8.0, 10.0, 6.0]

x2 = [4, 5, 7, 8]
y2 = [47, 37, 22, 24]
plt.plot(x1, y1, "o", color="yellow")
plt.plot(x2, y2, "x", color="green")

plt.plot([4], [40], "x", color="red")
plt.plot([2], [22], "o", color="red")

plt.plot(range(9),[ i * - 4.6 + 54 for i in range(9)], ":", color="blue")
plt.plot(range(9),[ i * - 4.6 + 44 for i in range(9)], color="blue")
plt.plot(range(8),[ i * - 4.6 + 34 for i in range(8)], ":", color="blue")
plt.title("Figure 3: maximum margin")


Out[4]:
[Figure 3: maximum margin]

Let w be a vector perpendicular to the median line of the street; we do not know its length yet. Let u be an unknown sample. We project u onto w, which gives the distance of the sample in the direction perpendicular to the street: if the projection is long enough, u ends up on the right side of the street; if it is short, u stays on the left side. So we can formulate the decision rule like this:

$$ \vec{w} \cdot \vec{u} \geq c $$

Setting $$ c = -b $$ this can be rewritten for the training samples as: $$ \vec{w} \cdot \vec{x}_{+} + b \geq 1 \ \ \ \text{for + samples}$$ $$ \vec{w} \cdot \vec{x}_{-} + b \leq -1 \ \ \ \text{for - samples}$$ To make it mathematically convenient, introduce a label for each sample and combine the two inequalities into one constraint: $$ y_{i} \in \left \{ -1, 1 \right \} \ \ \ \text{and} \ \ \ y_{i}(\vec{w} \cdot \vec{x}_{i} + b) - 1 \geq 0 $$ The constraint holds with equality exactly when the sample lies on the edge of the street.

$$ width\ of\ the\ street = (\vec{x}_{+} - \vec{x}_{-}) \cdot \frac{\vec{w}}{\left \| \vec{w} \right \|}$$ For the support vectors on the edges of the street the constraint holds with equality, so $$ \vec{w} \cdot \vec{x}_{+} = 1 - b \ \ \ \text{and} \ \ \ \vec{w} \cdot \vec{x}_{-} = -1 - b $$ and the width simplifies to: $$ width\ of\ the\ street = \frac{2}{\left \| \vec{w} \right \|}$$

So we want to maximize the width of the street $$ \frac{2}{\left \| \vec{w} \right \|} $$ which is equivalent to minimizing the following (the square and the factor of one half only make the derivatives cleaner): $$ \frac{1}{2} \left \| \vec{w} \right \|^{2} $$

So far we have an objective to minimize and one constraint per training sample. By using Lagrange multipliers, we can combine them into a single expression whose extremum can be found without handling the constraints separately.

$$ L = \frac{1}{2} \left \| \vec{w} \right \|^{2} - \sum_{i} \lambda_{i} \left [ y_{i} (\vec{w} \cdot \vec{x}_{i} + b) - 1 \right ] $$

After taking the derivatives of L with respect to w and b and setting them to zero: $$ \frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i} \lambda_{i} y_{i} \vec{x}_{i} = 0 \ \ \Rightarrow \ \ \vec{w} = \sum_{i} \lambda_{i} y_{i} \vec{x}_{i} $$ $$ \frac{\partial L}{\partial b} = - \sum_{i} \lambda_{i} y_{i} = 0 $$ Substituting w back into the decision rule, a new sample u is classified as + when: $$ \sum_{i} \lambda_{i} y_{i} \, \vec{x}_{i} \cdot \vec{u} + b \geq 0 $$ Note that the decision depends on the training samples only through dot products with u; this observation is what makes the kernel trick below possible.
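As a sanity check (this cell is an addition, not part of the original derivation), we can fit a linear SVC on the toy points from the figures above and verify that scikit-learn's decision function matches the dual form: dual_coef_ stores the products of the multipliers and the labels for the support vectors, and intercept_ stores b. The -1/+1 labels assigned below are an assumption made just for this illustration.

In [ ]:
# Toy points from the figures: yellow circles as class -1, green crosses as class +1
X_toy = np.array([[1, 12], [2, 18], [3, 14], [4, 8], [5, 10], [6, 6],   # class -1
                  [4, 47], [5, 37], [7, 22], [8, 24]])                  # class +1
y_toy = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, 1])

# A very large C approximates the hard-margin problem derived above
lin = svm.SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

u = np.array([[4.0, 25.0]])                          # an "unknown" sample
# decision rule: sum_i lambda_i * y_i * (x_i . u) + b
manual = np.dot(lin.dual_coef_, np.dot(lin.support_vectors_, u.T)) + lin.intercept_
print(manual.ravel())                  # should match the built-in value below
print(lin.decision_function(u))        # sklearn's decision function for u
print(lin.predict(u))                  # >= 0 means the + class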

Kernel trick

SVM can also handle non-linear data with the help of the kernel trick. Since the decision rule above depends on the data only through dot products, each dot product can be replaced by a kernel function that computes the dot product in a higher-dimensional space, in which the problem becomes linearly separable, without ever constructing that space explicitly. The most popular kernels are:

linear: $$\langle x, x'\rangle$$
polynomial: $$(\gamma \langle x, x'\rangle + r)^d$$ where d is specified by the keyword degree and r by coef0.
rbf: $$\exp(-\gamma \|x-x'\|^2)$$ where $$\gamma$$ is specified by the keyword gamma and must be greater than 0.
sigmoid: $$\tanh(\gamma \langle x,x'\rangle + r)$$ where r is specified by coef0.
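As a small illustration of what this lifting buys us (an addition to the original notebook; the dataset parameters below are chosen just for this demo), the circle-shaped data used later is not separable by a line in two dimensions, but appending the squared radius as a third feature makes it linearly separable, which is roughly what the RBF kernel achieves implicitly.

In [ ]:
from sklearn.datasets.samples_generator import make_circles

Xc, yc = make_circles(200, factor=.3, noise=.05, random_state=0)

# Explicit feature map: append x1^2 + x2^2 as a third coordinate
Xc3 = np.c_[Xc, (Xc ** 2).sum(axis=1)]

print(svm.SVC(kernel="linear").fit(Xc, yc).score(Xc, yc))    # poor: no separating line in 2D
print(svm.SVC(kernel="linear").fit(Xc3, yc).score(Xc3, yc))  # good: linearly separable in 3D
print(svm.SVC(kernel="rbf").fit(Xc, yc).score(Xc, yc))       # good: the kernel lifts implicitly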


In [5]:
YouTubeVideo("3liCbRZPrZA")


Out[5]:

In [5]:
from matplotlib.colors import ListedColormap
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_estimator(estimator, X, y, title=""):
    
    try:
        X, y = X.values, y.values
    except AttributeError:
        pass
    
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.axis('tight')
    plt.axis('off')
    plt.title(title)
    plt.tight_layout()

In [6]:
def classify(clfs, X, y):
    for title, clf in clfs:
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X, y, test_size=0.3, random_state=0)
        
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        plot_estimator(clf, X, y, "{} - score: {}".format(title, score))
        plt.show()

In [7]:
def timeit(clfs, X, y):
    results = []
    for title, clf in clfs:
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X, y, test_size=0.3, random_state=0)
        
        trial = 500
        
        startTime = time.time()
        for i in range(trial):
            clf.fit(X_train, y_train)
        trainingTime = time.time() - startTime
        print "{} times training {} with the data that has shape {} took {}".format(
            trial, title, X_train.shape, trainingTime)
        
        startTime = time.time()
        for i in range(trial):
            for sample in X_test:  # predict one sample at a time to measure per-sample cost
                prediction = clf.predict(sample.reshape(1, -1))
        predictionTime = time.time() - startTime
        
        print "{} prediction with {} took {}".format(trial * X_test.shape[0], title, predictionTime)
        print
        
        results.append([title, trainingTime, predictionTime])
    return results

In [8]:
def compareWRTScore(X, y):
    return classify(
        [
            ("Logistic Regression", linear_model.LogisticRegression()),
            ("SVM with Poly Kernel", svm.SVC(kernel="poly", degree=3)),
            ("SVM with RBF Kernel", svm.SVC(kernel="rbf", gamma=10)),
            ("SVM with Linear Kernel", svm.SVC(kernel="linear")),
            ("KNN with k=5", neighbors.KNeighborsClassifier()),
            ("Random Forest", ensemble.RandomForestClassifier()),
        ],
        X, y
    )
def compareWRTTime(X, y):
    return timeit(
        [
            ("Logistic Regression", linear_model.LogisticRegression()),
            ("SVM with Poly Kernel", svm.SVC(kernel="poly", degree=3)),
            ("SVM with RBF Kernel", svm.SVC(kernel="rbf", gamma=10)),
            ("SVM with Linear Kernel", svm.SVC(kernel="linear")),
            ("KNN with k=5", neighbors.KNeighborsClassifier()),
            ("Random Forest", ensemble.RandomForestClassifier()),
        ],
        X, y
    )

Comparing classifiers

1. Iris data


In [9]:
iris = datasets.load_iris()
compareWRTScore(iris.data[:,:2], iris.target)


2. Wine Classification


In [10]:
wine = pd.read_table("data/wine/wine.txt", sep='\s+')
attributes = ['Alcohol',
            'Malic acid',
            'Ash',
            'Alcalinity of ash',
            'Magnesium',
            'Total phenols',
            'Flavanoids',
            'Nonflavanoid phenols',
            'Proanthocyanins',
            'Color intensity',
            'Hue',
            'OD280/OD315 of diluted wines',
            'Proline']

grape = wine.pop('region')
y = grape.values
wine.columns = attributes
X = wine[['Alcohol', 'Proline']].values
compareWRTScore(X, y)


3. Non-linearly separable data


In [11]:
from sklearn.datasets.samples_generator import make_circles
X, y = make_circles(100, factor=.3, noise=.3)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
compareWRTScore(X, y)


Benchmark


In [12]:
results = compareWRTTime(X, y)


500 times training Logistic Regression with the data that has shape (70, 2) took 0.116724014282
15000 prediction with Logistic Regression took 0.513123035431

500 times training SVM with Poly Kernel with the data that has shape (70, 2) took 0.121850967407
15000 prediction with SVM with Poly Kernel took 0.686012983322

500 times training SVM with RBF Kernel with the data that has shape (70, 2) took 0.181604862213
15000 prediction with SVM with RBF Kernel took 0.704919099808

500 times training SVM with Linear Kernel with the data that has shape (70, 2) took 0.119120121002
15000 prediction with SVM with Linear Kernel took 0.68003487587

500 times training KNN with k=5 with the data that has shape (70, 2) took 0.0970478057861
15000 prediction with KNN with k=5 took 3.72002601624

500 times training Random Forest with the data that has shape (70, 2) took 6.46779680252
15000 prediction with Random Forest took 8.27655291557


In [13]:
results = np.array(results)
index = np.arange(results.shape[0])

bar_width = 0.30
opacity = 0.6

# np.array stored every entry as a string, so cast the timings back to float
rects1 = plt.bar(index, results[:, 1].astype(float), bar_width, color="r", alpha=opacity)
rects2 = plt.bar(index + bar_width, results[:, 2].astype(float), bar_width, color="b", alpha=opacity)
plt.legend((rects1[0], rects2[0]), ('Training', 'Prediction'), loc=2)
plt.xticks(index + bar_width, results[:,0], rotation=70)
plt.grid(True)



In [14]:
results[0, 1], results[0, 2]


Out[14]:
('0.116724014282', '0.513123035431')
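The timings come back as strings because np.array promotes a list that mixes strings and floats to a single string dtype; this is also why the bar heights are cast back to float above. A one-line check (added here for illustration):

In [ ]:
# Mixing a string with floats forces the whole array into a string dtype
np.array([["Logistic Regression", 0.12, 0.51]]).dtype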

4. Case study: Predict survivors in Titanic data


In [15]:
df = pd.read_csv("data/tinanic/train.csv")
"""
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

"""
df.sample(6)


Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
305 306 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S
578 579 0 3 Caram, Mrs. Joseph (Maria Elias) female NaN 1 0 2689 14.4583 NaN C
645 646 1 1 Harper, Mr. Henry Sleeper male 48.00 1 0 PC 17572 76.7292 D33 C
675 676 0 3 Edvardsson, Mr. Gustaf Hjalmar male 18.00 0 0 349912 7.7750 NaN S
246 247 0 3 Lindahl, Miss. Agda Thorilda Viktoria female 25.00 0 0 347071 7.7750 NaN S
217 218 0 2 Jacobsohn, Mr. Sidney Samuel male 42.00 1 0 243847 27.0000 NaN S

In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
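Age, Cabin and Embarked have missing values that will need attention before modelling. A quick way to count them (a small addition to the original notebook):

In [ ]:
# Number of missing values per column
df.isnull().sum()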

Embarked feature


In [17]:
sns.countplot(data=df, hue="Survived", x="Embarked")


Out[17]:
[countplot of Embarked, colored by Survived]

In [18]:
sns.barplot(data=df, x="Embarked", y="Survived")


Out[18]:
[barplot of mean Survived by Embarked]

In [19]:
sns.countplot(data=df, x="Age")


Out[19]:
[countplot of Age]

In [20]:
sns.boxplot(data=df, x="Survived", y="Age")
sns.stripplot(x="Survived", y="Age", data=df, jitter=True, edgecolor="gray")


Out[20]:
[boxplot and stripplot of Age by Survived]

In [21]:
sns.FacetGrid(df, hue="Survived", size=6).map(sns.kdeplot, "Age").add_legend()


Out[21]:
[KDE plot of Age for each Survived class]

Sex feature

First, let's look at which gender is dominant in the population, using a countplot.


In [22]:
sns.countplot(data=df, x="Sex")


Out[22]:
[countplot of Sex]

In [23]:
sns.countplot(data=df, hue="Survived", x="Sex")


Out[23]:
[countplot of Sex, colored by Survived]

According to the Sex vs. Survived chart, most of the men did not survive while the majority of the women did. The following chart also supports this claim by showing that roughly 70% of the women survived.


In [24]:
sns.barplot(data=df, x="Sex", y="Survived")


Out[24]:
[barplot of mean Survived by Sex]
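The barplot shows the proportion graphically; the exact survival rate per gender can be computed directly (a quick check added to the original analysis):

In [ ]:
# Mean of the 0/1 Survived column per gender = survival rate
df.groupby("Sex")["Survived"].mean()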

The inference is that the Sex feature can be used in a classification task to determine whether a given passenger survived or not.

Pclass feature

This stands for Passenger Class; there are three classes (1 = 1st, 2 = 2nd, 3 = 3rd). We can guess that the first-class passengers were the most likely to survive thanks to their privileged status. This guess is based on domain knowledge: at that time, class distinctions among people were far more pronounced and severe than they are now. Let's have a look at the data to see whether the guess holds.


In [25]:
sns.countplot(data=df, hue="Survived", x="Pclass")


Out[25]:
[countplot of Pclass, colored by Survived]
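The countplot already hints at the answer; the exact survival rate per class (a quick check added here) makes it explicit:

In [ ]:
# Survival rate per passenger class
df.groupby("Pclass")["Survived"].mean()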

In [26]:
sns.countplot(data=df[df['Pclass'] == 1], hue="Survived", x="Sex")


Out[26]:
[countplot of Sex, colored by Survived, first-class passengers only]

The chart above confirms the guess: unfortunately, passenger class plays a crucial role in survival.


In [27]:
sns.countplot(data=df[df['Pclass'] == 3], hue="Survived", x="Sex")


Out[27]:
[countplot of Sex, colored by Survived, third-class passengers only]

In [28]:
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=df);



In [30]:
def titanicFit(df):

    X = df[["Sex", "Age", "Pclass", "Embarked"]]
    y = df["Survived"]

    X.Age.fillna(X.Age.mean(), inplace=True)

    X.Sex.replace(to_replace="male", value=1, inplace=True)
    X.Sex.replace(to_replace="female", value=0, inplace=True)

    X.Embarked.replace(to_replace="S", value=1, inplace=True)
    X.Embarked.replace(to_replace="C", value=2, inplace=True)
    X.Embarked.replace(to_replace="Q", value=3, inplace=True)

    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X, y, test_size=0.3, random_state=0)

    clf = svm.SVC(kernel="rbf")
    parameters = [
        {
            "kernel" :["linear"]
        }, {
            "kernel" :["rbf"], 
            "C":[1, 10, 100], 
            '"gamma":[0.001, 0.002, 0.01]}
        ]'

    clf = grid_search.GridSearchCV(
        svm.SVC(), param_grid=parameters, cv=5).fit(X, y)
    return clf
    #print clf.score(X_test, y_test)

clf = titanicFit(df[df.Embarked.isnull() == False])

In [36]:
clf.grid_scores_


Out[36]:
[mean: 0.78628, std: 0.01847, params: {'kernel': 'linear'},
 mean: 0.63555, std: 0.01238, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.001},
 mean: 0.63555, std: 0.00904, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.002},
 mean: 0.74691, std: 0.02952, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.01},
 mean: 0.78628, std: 0.02249, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.001},
 mean: 0.79078, std: 0.02213, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.002},
 mean: 0.80427, std: 0.03027, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.01},
 mean: 0.79078, std: 0.02041, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001},
 mean: 0.79415, std: 0.02246, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.002},
 mean: 0.80540, std: 0.01660, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.01}]
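To pull out the winning configuration (a small addition; it assumes clf is the fitted GridSearchCV object returned above):

In [ ]:
# Best cross-validation score and the parameters that achieved it
print(clf.best_score_)
print(clf.best_params_)
# best_estimator_ is an SVC refit on the whole training data with those parameters
clf.best_estimator_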
