In [30]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
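# sklearn.cross_validation and sklearn.grid_search are the pre-0.18 module names;
# scikit-learn 0.20 removed them in favour of sklearn.model_selection.
from sklearn import cross_validation, svm, grid_search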
%matplotlib inline



In [11]:

    
df = pd.read_csv("data/tinanic/train.csv")
"""
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

"""
df.sample(6)









    Out[11]:






  
    
      
     PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket    Fare      Cabin    Embarked
591  592          1         1       Stephenson, Mrs. Walter Bertram (Martha Eustis)    female  52   1      0      36947     78.2667   D20      C
13   14           0         3       Andersson, Mr. Anders Johan                        male    39   1      5      347082    31.2750   NaN      S
319  320          1         1       Spedden, Mrs. Frederic Oakley (Margaretta Corn...  female  40   1      1      16966     134.5000  E34      C
159  160          0         3       Sage, Master. Thomas Henry                         male    NaN  8      2      CA. 2343  69.5500   NaN      S
390  391          1         1       Carter, Mr. William Ernest                         male    36   1      2      113760    120.0000  B96 B98  S
106  107          1         3       Salkjelsvik, Miss. Anna Kristine                   female  21   0      0      343120    7.6500    NaN      S



In [12]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
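
The non-null counts above show that `Age`, `Cabin`, and `Embarked` have missing entries (714, 204, and 889 of 891 rows are filled). A minimal sketch of how the gaps could be tallied directly; the small `toy` frame and the helper name `missing_summary` are made up for illustration and stand in for the real `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and fractions, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "fraction": counts / len(frame),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Hypothetical stand-in for the Titanic frame (not real passenger data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

print(missing_summary(toy))
```

On the real data the call would be `missing_summary(df)`. With 687 of 891 `Cabin` values missing, that column is probably too sparse to use without further work.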

`Embarked` feature



In [13]:

    
sns.countplot(data=df, hue="Survived", x="Embarked")









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x110a4d650>



In [14]:

    
sns.barplot(data=df, x="Embarked", y="Survived")









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b94ac10>



In [15]:

    
sns.countplot(data=df, x="Age")









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x110339f50>



In [16]:

    
sns.boxplot(data=df, x="Survived", y="Age")
sns.stripplot(
    x="Survived", y="Age", data=df, jitter=True, edgecolor="gray", alpha=0.25)









    Out[16]:





<matplotlib.axes._subplots.AxesSubplot at 0x110e54910>



In [17]:

    
sns.FacetGrid(df, hue="Survived", size=6).map(sns.kdeplot, "Age").add_legend()









    Out[17]:





<seaborn.axisgrid.FacetGrid at 0x1112e7cd0>

`Sex` feature

First, a count plot shows which sex is more common among the passengers.



In [18]:

    
sns.countplot(data=df, x="Sex")









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x11142aad0>



In [19]:

    
sns.countplot(data=df, hue="Survived", x="Sex")









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x11179fd50>

According to the sex vs. survival chart, most men did not survive, while the majority of women did. The following chart supports this by showing that roughly 70% of women survived.



In [20]:

    
sns.barplot(data=df, x="Sex", y="Survived")









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x1118fac10>

The inference is that `Sex` is a useful feature for a classifier that predicts whether a given passenger survived.
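
The bar heights in the chart are the per-group mean of `Survived`, so the same figure can be read off as a number. A sketch, using a made-up six-row frame in place of the real `df` (the helper `survival_rate_by` is a hypothetical name):

```python
import pandas as pd


def survival_rate_by(frame, column):
    """Mean of the 0/1 Survived column within each value of `column`."""
    return frame.groupby(column)["Survived"].mean()


# Made-up rows for illustration only, not real passenger records.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

On the real data, `survival_rate_by(df, "Sex")` would give the exact rates behind the bars.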

`Pclass` feature

This stands for passenger class, which takes three values: 1 = 1st, 2 = 2nd, 3 = 3rd. A reasonable guess is that first-class passengers were more likely to survive because of their social standing. The guess rests on domain knowledge: class divisions were sharper and more rigid at that time than they are today. Let's look at the data to check.



In [21]:

    
sns.countplot(data=df, hue="Survived", x="Pclass")









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x111a51510>



In [22]:

    
sns.countplot(data=df[df['Pclass'] == 1], hue="Survived", x="Sex")









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x111b279d0>

The chart above confirms the guess: unfortunately, passenger class plays a crucial role.



In [23]:

    
sns.countplot(data=df[df['Pclass'] == 3], hue="Survived", x="Sex")









    Out[23]:





<matplotlib.axes._subplots.AxesSubplot at 0x111c78bd0>



In [24]:

    
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=df);
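
The grouped bar chart can also be expressed as a table of survival rates by sex and class. The sketch below uses `pivot_table` on a made-up frame; `survival_table` is a hypothetical helper, and on the real data the call would be `survival_table(df)`.

```python
import pandas as pd


def survival_table(frame):
    """Survival rate for every (Sex, Pclass) combination."""
    return frame.pivot_table(
        index="Sex", columns="Pclass", values="Survived", aggfunc="mean")


# Made-up rows for illustration only, not real passenger records.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 1, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

table = survival_table(toy)
print(table)
```

Each cell is the mean of `Survived` for that group, which is the quantity the bars display.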



In [31]:

    
def titanicFit(df):
    # Work on a copy so the caller's frame is untouched and pandas does not
    # warn about assigning to a slice.
    X = df[["Sex", "Age", "Pclass", "Embarked"]].copy()
    y = df["Survived"]

    # Impute missing ages with the mean age.
    X["Age"] = X["Age"].fillna(X["Age"].mean())

    # Encode the categorical columns as integers.
    X["Sex"] = X["Sex"].map({"male": 1, "female": 0})
    X["Embarked"] = X["Embarked"].map({"S": 1, "C": 2, "Q": 3})

    # Candidates: a linear kernel, plus an RBF kernel over a small grid
    # of C and gamma values.
    parameters = [
        {"kernel": ["linear"]},
        {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.001, 0.002, 0.01]},
    ]

    # 5-fold cross-validated grid search over all labelled rows.
    clf = grid_search.GridSearchCV(
        svm.SVC(), param_grid=parameters, cv=5).fit(X, y)
    return clf

# Rows with a missing Embarked value are dropped before fitting.
clf = titanicFit(df[df.Embarked.notnull()])
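
The grid search above runs its cross-validation on all labelled rows. One variant holds out a test set, tunes on the training part only, and reports the score on the unseen part. The sketch below assumes a recent scikit-learn, where `train_test_split` and `GridSearchCV` live in `sklearn.model_selection`; the synthetic data from `make_classification` and the name `fit_with_holdout` are stand-ins, not part of the original notebook.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split


def fit_with_holdout(X, y, random_state=0):
    """Grid-search an SVM on 70% of the data and score it on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state)

    parameters = [
        {"kernel": ["linear"]},
        {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.001, 0.002, 0.01]},
    ]

    # Cross-validation sees only the training part.
    search = GridSearchCV(svm.SVC(), param_grid=parameters, cv=5)
    search.fit(X_train, y_train)

    # GridSearchCV refits the best estimator by default, so score() uses it.
    return search, search.score(X_test, y_test)


# Synthetic stand-in data: 200 samples, 4 numeric features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
search, test_accuracy = fit_with_holdout(X, y)
print(search.best_params_, test_accuracy)
```

The held-out accuracy is usually a more honest estimate than the best cross-validation mean, because the test rows played no part in choosing the hyperparameters.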



In [32]:

    
clf.grid_scores_









    Out[32]:





[mean: 0.78628, std: 0.01847, params: {'kernel': 'linear'},
 mean: 0.63555, std: 0.01238, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.001},
 mean: 0.63555, std: 0.00904, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.002},
 mean: 0.74691, std: 0.02952, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.01},
 mean: 0.78628, std: 0.02249, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.001},
 mean: 0.79078, std: 0.02213, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.002},
 mean: 0.80427, std: 0.03027, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.01},
 mean: 0.79078, std: 0.02041, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001},
 mean: 0.79415, std: 0.02246, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.002},
 mean: 0.80540, std: 0.01660, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.01}]
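
Reading the scores above, the best mean accuracy (0.80540) belongs to the RBF kernel with `C=100` and `gamma=0.01`. A small sketch that picks the winner programmatically from the printed values; the list is copied by hand from the output, and `scores` is just an illustrative name.

```python
# (mean accuracy, std, params) copied from the grid-search output above.
scores = [
    (0.78628, 0.01847, {"kernel": "linear"}),
    (0.63555, 0.01238, {"kernel": "rbf", "C": 1, "gamma": 0.001}),
    (0.63555, 0.00904, {"kernel": "rbf", "C": 1, "gamma": 0.002}),
    (0.74691, 0.02952, {"kernel": "rbf", "C": 1, "gamma": 0.01}),
    (0.78628, 0.02249, {"kernel": "rbf", "C": 10, "gamma": 0.001}),
    (0.79078, 0.02213, {"kernel": "rbf", "C": 10, "gamma": 0.002}),
    (0.80427, 0.03027, {"kernel": "rbf", "C": 10, "gamma": 0.01}),
    (0.79078, 0.02041, {"kernel": "rbf", "C": 100, "gamma": 0.001}),
    (0.79415, 0.02246, {"kernel": "rbf", "C": 100, "gamma": 0.002}),
    (0.80540, 0.01660, {"kernel": "rbf", "C": 100, "gamma": 0.01}),
]

# Pick the row with the highest mean cross-validation accuracy.
best_mean, best_std, best_params = max(scores, key=lambda row: row[0])
print(best_mean, best_params)

# Check whether the winner sits at the largest C and gamma that were tried.
rbf = [p for _, _, p in scores if p["kernel"] == "rbf"]
at_edge = (best_params["C"] == max(p["C"] for p in rbf)
           and best_params["gamma"] == max(p["gamma"] for p in rbf))
print(at_edge)
```

The fitted search object also exposes the winner directly as `clf.best_params_` and `clf.best_score_`. Because the best point lies on the edge of the grid, larger values of `C` and `gamma` may be worth trying. The top few settings differ by less than one standard deviation, so the ranking should be read with caution.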


