In [30]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
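from sklearn import svm, model_selection
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# their contents now live in sklearn.model_selection, so alias the old names to it.
cross_validation = grid_search = model_selection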
%matplotlib inline
In [11]:
df = pd.read_csv("data/tinanic/train.csv")
"""
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES):
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in years; fractional if age is less than one (1).
If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e. sibsp and parch),
some relations were ignored. The following are the definitions used
for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:  Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:  Mother or Father of Passenger Aboard Titanic
Child:   Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village; however,
the definitions do not support such relations.
"""
df.sample(6)
Out[11]:
In [12]:
df.info()
In [13]:
sns.countplot(data=df, hue="Survived", x="Embarked")
Out[13]:
In [14]:
sns.barplot(data=df, x="Embarked", y="Survived")
Out[14]:
In [15]:
sns.countplot(data=df, x="Age")
Out[15]:
In [16]:
sns.boxplot(data=df, x="Survived", y="Age")
sns.stripplot(
    x="Survived", y="Age", data=df, jitter=True, edgecolor="gray", alpha=0.25)
Out[16]:
In [17]:
# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9.
sns.FacetGrid(df, hue="Survived", height=6).map(sns.kdeplot, "Age").add_legend()
Out[17]:
First, let's use a count plot to see which sex is more common among the passengers.
In [18]:
sns.countplot(data=df, x="Sex")
Out[18]:
In [19]:
sns.countplot(data=df, hue="Survived", x="Sex")
Out[19]:
According to the Sex vs. Survived chart, most men did not survive, while the majority of women did. The following chart supports this by showing that about 70% of women survived.
In [20]:
sns.barplot(data=df, x="Sex", y="Survived")
Out[20]:
This suggests that Sex is a useful feature for classifying whether a given passenger survived.
Pclass stands for passenger class. There are three classes: 1 = 1st, 2 = 2nd, 3 = 3rd. A reasonable guess is that first-class passengers were the most likely to survive thanks to their social standing. This guess is based on domain knowledge: at that time, class divisions were more visible and more rigid than they are now. Let's look at the data to check.
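The rate that the bar plot displays can also be computed directly, which makes the comparison between the sexes explicit. The following is a minimal sketch, assuming a small made-up frame (`toy` is hypothetical and is not the Titanic data) with the same `Sex` and `Survived` column names:

```python
import pandas as pd

# Hypothetical toy data for illustration only, not the real passenger list.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

Running the same `groupby` on `df` instead of `toy` should give the per-sex rates that the bar plot draws.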
In [21]:
sns.countplot(data=df, hue="Survived", x="Pclass")
Out[21]:
In [22]:
sns.countplot(data=df[df['Pclass'] == 1], hue="Survived", x="Sex")
Out[22]:
The chart above confirms the guess: unfortunately, passenger class plays a crucial role.
In [23]:
sns.countplot(data=df[df['Pclass'] == 3], hue="Survived", x="Sex")
Out[23]:
In [24]:
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=df);
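The grouped bar plot of survival by sex and class corresponds to a table of mean survival per (Sex, Pclass) cell. As a rough sketch, assuming a tiny made-up frame (`toy` is hypothetical, not the Titanic data), such a table could be built with `pivot_table`:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male",
                 "female", "female", "female", "female"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Rows are sexes, columns are passenger classes,
# and each cell holds the mean of Survived (the survival rate).
rates = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rates)
```

Applied to `df`, the same call should reproduce the bar heights as numbers, which is easier to quote than reading values off the chart.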
In [31]:
from sklearn.model_selection import GridSearchCV, train_test_split

def titanicFit(df):
    # Work on a copy so the caller's DataFrame is not modified.
    X = df[["Sex", "Age", "Pclass", "Embarked"]].copy()
    y = df["Survived"]
    # Fill missing ages with the mean age.
    X["Age"] = X["Age"].fillna(X["Age"].mean())
    # Encode the categorical columns as integers.
    X["Sex"] = X["Sex"].map({"male": 1, "female": 0})
    X["Embarked"] = X["Embarked"].map({"S": 1, "C": 2, "Q": 3})
    # Hold out 30% of the rows so the tuned model is scored on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    parameters = [
        {"kernel": ["linear"]},
        {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.001, 0.002, 0.01]},
    ]
    # 5-fold cross-validated grid search on the training split only.
    clf = GridSearchCV(
        svm.SVC(), param_grid=parameters, cv=5).fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    return clf

# Drop the rows with a missing port of embarkation before fitting.
clf = titanicFit(df[df.Embarked.notnull()])
In [32]:
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ replaces it.
clf.cv_results_
Out[32]:
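In scikit-learn 0.20 and later, a fitted `GridSearchCV` exposes its per-candidate scores through the `cv_results_` dictionary, which is easier to read as a DataFrame. The sketch below is only an illustration: it assumes synthetic data from `make_classification` as a stand-in for the Titanic features, and a smaller grid than the one above so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the Titanic features).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

parameters = [
    {"kernel": ["linear"]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid=parameters, cv=5).fit(X, y)

# One row per parameter combination, with the cross-validated scores.
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))

# The best combination and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

Sorting by `rank_test_score` puts the best candidate first; `best_params_` and `best_score_` give the same information without building the table.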
In [ ]: