Titanic Feature Engineering

Table of Contents

  • Overview
  • Feature Engineering and Imputation
    • Title
    • Family Size
    • Fares
    • Ages
    • Embarked
    • Cabins
  • Initial Modeling

In [45]:
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib notebook

train = pd.read_csv('train.csv', index_col='PassengerId')
test = pd.read_csv('test.csv', index_col='PassengerId')

tr_len = len(train)
# Stack train (without the target) on top of test so features are engineered once.
df = pd.concat([train.drop('Survived', axis=1), test])

Title

We'll extract title information from the Name feature, and then merge some of the titles together.

  • Merge 'Mme' into 'Mrs'
  • Merge 'Mlle' and 'Ms' into 'Miss'
  • Merge 'Lady', 'the Countess', and 'Dona' into 'fNoble'
  • Merge 'Don', 'Sir', and 'Jonkheer' into 'mNoble'
  • Merge 'Col', 'Capt', and 'Major' into 'mil'

In [46]:
# Title is the text between ", " and the first period in Name.
df['Title'] = df['Name'].str.extract(r',\s(.*?)\.', expand=False)

# Merge rare titles into broader groups; titles not listed are left unchanged.
df['Title'] = df['Title'].replace({
    'Mme': 'Mrs',
    'Mlle': 'Miss', 'Ms': 'Miss',
    'Lady': 'fNoble', 'the Countess': 'fNoble', 'Dona': 'fNoble',
    'Don': 'mNoble', 'Sir': 'mNoble', 'Jonkheer': 'mNoble',
    'Col': 'mil', 'Capt': 'mil', 'Major': 'mil',
})
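
To make the extraction step concrete, here is a minimal sketch on a handful of names written in the dataset's "Surname, Title. Given names" format. The `names` series and the shortened mapping are illustrative only, not part of the notebook's pipeline.

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" format (illustrative).
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Aubart, Mme. Leontine Pauline',
])

# Lazily capture everything between ", " and the first period.
titles = names.str.extract(r',\s(.*?)\.', expand=False)

# A single mapping merges the rare titles; unmapped titles pass through unchanged.
merged = titles.replace({'Mme': 'Mrs', 'the Countess': 'fNoble'})
print(merged.tolist())  # ['Mr', 'Miss', 'fNoble', 'Mrs']
```

The lazy `.*?` matters for multi-word titles such as 'the Countess', since it stops at the first period rather than the last one in the name.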

Family Size

We'll create a FamSize feature indicating family size: the number of siblings and spouses aboard (SibSp), plus the number of parents and children aboard (Parch), plus one for the passenger.


In [47]:
df['FamSize'] = df['SibSp'] + df['Parch'] + 1

Fares

We'll create a TicketSize feature counting the passengers who share each ticket, and divide Fare by it to get a per-passenger AdjFare. We then impute the lone missing value with the median AdjFare of its Pclass.


In [48]:
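# Count the passengers sharing each ticket (across train and test).
df['TicketSize'] = df['Ticket'].map(df['Ticket'].value_counts())
# Split the ticket's fare evenly among its passengers.
df['AdjFare'] = df['Fare'].div(df['TicketSize'])
# Fill the missing fare with the median AdjFare of the passenger's class.
df['AdjFare'] = df['AdjFare'].fillna(df.groupby('Pclass')['AdjFare'].transform('median'))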
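
The fare adjustment can be sketched on a toy frame. All values below are made up for illustration, and the sketch assumes, as the adjustment above does, that Fare is the total price of a ticket shared by everyone travelling on it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two passengers share ticket 'A1',
# and one third-class fare is missing.
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Ticket': ['A1', 'A1', 'B2', 'C3', 'D4'],
    'Fare':   [100.0, 100.0, 8.0, 10.0, np.nan],
})

# Number of passengers travelling on each ticket.
toy['TicketSize'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Per-passenger fare: the listed fare split across the ticket's passengers.
toy['AdjFare'] = toy['Fare'] / toy['TicketSize']

# Fill missing per-passenger fares with the median of the same class.
class_median = toy.groupby('Pclass')['AdjFare'].transform('median')
toy['AdjFare'] = toy['AdjFare'].fillna(class_median)
print(toy['AdjFare'].tolist())  # [50.0, 50.0, 8.0, 10.0, 9.0]
```

Because `transform` returns a result aligned to the original index, the filled column can be assigned straight back without any re-indexing.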

Ages

We'll impute missing ages with the median age of passengers who share the same Sex and Title.


In [49]:
# Fill missing ages with the median age of the passenger's Sex and Title group.
df['FilledAge'] = df['Age'].fillna(df.groupby(['Sex', 'Title'])['Age'].transform('median'))
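
One thing worth checking with group-wise imputation is whether any group has no known ages at all, because its median is then NaN and the gap survives. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the ('female', 'Mrs') group has no known age.
toy = pd.DataFrame({
    'Sex':   ['male', 'male', 'male', 'female', 'female', 'female'],
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Mrs'],
    'Age':   [30.0, 40.0, np.nan, 20.0, np.nan, np.nan],
})

# Median age of each (Sex, Title) group, aligned to the original rows.
group_median = toy.groupby(['Sex', 'Title'])['Age'].transform('median')
toy['FilledAge'] = toy['Age'].fillna(group_median)

# Count the values that are still missing after imputation.
remaining = int(toy['FilledAge'].isna().sum())
print(toy['FilledAge'].tolist(), remaining)
```

Here the 'Mr' and 'Miss' gaps are filled with 35.0 and 20.0, while the single 'Mrs' row stays missing. On the full data the df.info() output reports 1309 non-null FilledAge values, so every group had at least one known age.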

Embarked

Following our strategy of using ticket numbers, we fill both missing values with 'S' (Southampton).


In [50]:
df['Embarked'] = df['Embarked'].fillna('S')

Cabins

For now, we only create an indicator variable marking whether the cabin is known.


In [52]:
df['CabinKnown'] = df['Cabin'].notnull().astype(int)

Modeling

Let's split the combined data back into train and test sets, drop the unneeded variables, and fit a Random Forest model to gauge feature importance.


In [77]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

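pdf = df.copy()

# Encode the categorical columns as integers (fit_transform refits the encoder each time).
le = LabelEncoder()
pdf['Sex'] = le.fit_transform(pdf['Sex'])
pdf['Embarked'] = le.fit_transform(pdf['Embarked'])
pdf['Title'] = le.fit_transform(pdf['Title'])

# Leave CabinKnown and Embarked out of this first model.
pdf.drop(['CabinKnown', 'Embarked'], axis=1, inplace=True)

# Split back into the original train and test rows, then drop the raw columns
# that the engineered features replace.
p_test = pdf[tr_len:]
p_train = pdf[:tr_len].join(train[['Survived']]).drop(['Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin'], axis=1)

# Hold out 25% of the training rows (the train_test_split default) for scoring.
X_train, X_test, y_train, y_test = train_test_split(p_train.drop('Survived', axis=1), p_train['Survived'], random_state=236)

clf = RandomForestClassifier(n_estimators=1000, max_depth=7, max_features=4)
clf.fit(X_train, y_train)
# Despite the label, this is accuracy on a single held-out split, not a cross-validated score.
print('CV Score: {}'.format(clf.score(X_test, y_test)))
pd.Series(clf.feature_importances_, index=X_train.columns)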


CV Score: 0.8340807174887892
Out[77]:
Pclass        0.071959
Sex           0.364403
Title         0.099324
FamSize       0.078954
TicketSize    0.067330
AdjFare       0.178269
FilledAge     0.139761
dtype: float64
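
The score printed above comes from one train/test split, so it can shift with the split and with the forest's random seed. A k-fold estimate is one way to get a steadier number. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification` as a stand-in for the prepared features; the sample size, the smaller `n_estimators`, and the fixed `random_state` are assumptions made to keep the example self-contained and quick, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features (7 columns, like X_train above).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           random_state=0)

# Same depth and feature settings as above, with fewer trees to run quickly.
clf = RandomForestClassifier(n_estimators=100, max_depth=7, max_features=4,
                             random_state=0)

# Five-fold cross-validated accuracy: one score per held-out fold.
scores = cross_val_score(clf, X, y, cv=5)
print('Mean accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))
```

With the real data, passing `p_train.drop('Survived', axis=1)` and `p_train['Survived']` in place of `X` and `y` would give the corresponding estimate for this feature set.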

In [78]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 16 columns):
Pclass        1309 non-null int64
Name          1309 non-null object
Sex           1309 non-null object
Age           1046 non-null float64
SibSp         1309 non-null int64
Parch         1309 non-null int64
Ticket        1309 non-null object
Fare          1308 non-null float64
Cabin         295 non-null object
Embarked      1309 non-null object
Title         1309 non-null object
FamSize       1309 non-null int64
TicketSize    1309 non-null int64
AdjFare       1309 non-null float64
FilledAge     1309 non-null float64
CabinKnown    1309 non-null int32
dtypes: float64(4), int32(1), int64(5), object(6)
memory usage: 208.7+ KB