In [30]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import cross_validation, svm, grid_search
%matplotlib inline

In [11]:
df = pd.read_csv("data/tinanic/train.csv")
"""
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

"""
df.sample(6)


Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
591 592 1 1 Stephenson, Mrs. Walter Bertram (Martha Eustis) female 52 1 0 36947 78.2667 D20 C
13 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
319 320 1 1 Spedden, Mrs. Frederic Oakley (Margaretta Corn... female 40 1 1 16966 134.5000 E34 C
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.5500 NaN S
390 391 1 1 Carter, Mr. William Ernest male 36 1 2 113760 120.0000 B96 B98 S
106 107 1 3 Salkjelsvik, Miss. Anna Kristine female 21 0 0 343120 7.6500 NaN S

In [12]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

Embarked feature


In [13]:
sns.countplot(data=df, hue="Survived", x="Embarked")


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x110a4d650>

In [14]:
sns.barplot(data=df, x="Embarked", y="Survived")


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b94ac10>

In [15]:
sns.countplot(data=df, x="Age")


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x110339f50>

In [16]:
sns.boxplot(data=df, x="Survived", y="Age")
sns.stripplot(
    x="Survived", y="Age", data=df, jitter=True, edgecolor="gray", alpha=0.25)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x110e54910>

In [17]:
sns.FacetGrid(df, hue="Survived", size=6).map(sns.kdeplot, "Age").add_legend()


Out[17]:
<seaborn.axisgrid.FacetGrid at 0x1112e7cd0>

Sex feature

First, let's have a look at which gender is dominant in the population by a countplot.


In [18]:
sns.countplot(data=df, x="Sex")


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11142aad0>

In [19]:
sns.countplot(data=df, hue="Survived", x="Sex")


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x11179fd50>

According to sex vs. survived chart, most of men did not survived while the majority of women did. The following chart also supports this claim by showing us that 70% of women survived.


In [20]:
sns.barplot(data=df, x="Sex", y="Survived")


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1118fac10>

The inference is that this sex feature can be used in a classification task to determine whether a given person survived or not.

Pclass feature

This stands for Passenger Class. There are three classes as 1 = 1st; 2 = 2nd; 3 = 3rd. We can make a guess saying most probably the first class passengers survived thanks to their nobility. This guess is based on the domain knowledge; in that time classes among the people is more obvious and severe than now. Let's have a look at the data to see the truth.


In [21]:
sns.countplot(data=df, hue="Survived", x="Pclass")


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x111a51510>

In [22]:
sns.countplot(data=df[df['Pclass'] == 1], hue="Survived", x="Sex")


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x111b279d0>

The chart above corrects the guess: unfortunatelly, passenger class plays a crucial role.


In [23]:
sns.countplot(data=df[df['Pclass'] == 3], hue="Survived", x="Sex")


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x111c78bd0>

In [24]:
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=df);



In [31]:
def titanicFit(df):

    X = df[["Sex", "Age", "Pclass", "Embarked"]]
    y = df["Survived"]

    X.Age.fillna(X.Age.mean(), inplace=True)

    X.Sex.replace(to_replace="male", value=1, inplace=True)
    X.Sex.replace(to_replace="female", value=0, inplace=True)

    X.Embarked.replace(to_replace="S", value=1, inplace=True)
    X.Embarked.replace(to_replace="C", value=2, inplace=True)
    X.Embarked.replace(to_replace="Q", value=3, inplace=True)

    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X, y, test_size=0.3, random_state=0)

    clf = svm.SVC(kernel="rbf")
    parameters = [
        {
            "kernel" :["linear"]
        }, {
            "kernel" :["rbf"], "C":[1, 10, 100], "gamma":[0.001, 0.002, 0.01]}
        ]

    clf = grid_search.GridSearchCV(
        svm.SVC(), param_grid=parameters, cv=5).fit(X, y)
    return clf
    #print clf.score(X_test, y_test)

clf = titanicFit(df[df.Embarked.isnull() == False])

In [32]:
clf.grid_scores_


Out[32]:
[mean: 0.78628, std: 0.01847, params: {'kernel': 'linear'},
 mean: 0.63555, std: 0.01238, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.001},
 mean: 0.63555, std: 0.00904, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.002},
 mean: 0.74691, std: 0.02952, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.01},
 mean: 0.78628, std: 0.02249, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.001},
 mean: 0.79078, std: 0.02213, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.002},
 mean: 0.80427, std: 0.03027, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.01},
 mean: 0.79078, std: 0.02041, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001},
 mean: 0.79415, std: 0.02246, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.002},
 mean: 0.80540, std: 0.01660, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.01}]

In [ ]: