In this competition we have data about the Titanic's passengers. The data is split into two files: train and test. In the train file, the "Survived" column shows whether the passenger survived or not.
At first I explore the data, modify it and create some new features; then I select the most important of them and make a prediction using a Random Forest.
In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.feature_selection import SelectFromModel
In [2]:
#Age is read as float, because later I'll need more precision for calculations.
df_train = pd.read_csv('../input/train.csv', dtype={'Age': np.float64})
df_test = pd.read_csv('../input/test.csv', dtype={'Age': np.float64})
In [3]:
df_train.describe(include='all')
Out[3]:
In [4]:
df_test.describe(include='all')
Out[4]:
In [5]:
df_train.info()
There are 891 rows in the train data and 418 in the test data. There are missing values in the Age, Cabin and Embarked columns in train, and in the Age, Fare and Cabin columns in test. Name, Sex, Ticket, Cabin and Embarked are categorical variables. Name contains both a name and a title. Cabin and Ticket consist of letters and numbers. Let's deal with each column step by step.
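For a quick check of the missing values, the counts per column can be printed directly (a small sketch using the frames loaded above):
#Count missing values in each column of both frames.
print(df_train.isnull().sum())
print(df_test.isnull().sum())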
In [6]:
df_train.pivot_table('PassengerId', 'Pclass', 'Survived', 'count').plot(kind='bar', stacked=True)
Out[6]:
Pclass. It seems that Pclass is useful and requires no changes. Passengers in Pclass 3 have lower chances of survival. This is reasonable, as passengers with more expensive tickets were berthed on higher decks and thus could reach the lifeboats faster.
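To put a number on this, the survival rate per class can be computed directly from the training frame (a one-line sketch):
#Average survival rate for each passenger class.
df_train.groupby('Pclass')['Survived'].mean()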
Names contain useful information, although they can't be used as they are. One way to use them is grouping people by family name - maybe families have a better chance of survival? But that is complicated, and there is a better way to create a feature for families (see below). Another way is extracting the title from the name and using it. Let's try.
In [7]:
df_train['Title'] = df_train['Name'].apply(lambda x: re.search(r' ([a-zA-Z]+)\.', x).group(1))
df_test['Title'] = df_test['Name'].apply(lambda x: re.search(r' ([a-zA-Z]+)\.', x).group(1))
df_train['Title'].value_counts()
Out[7]:
There are many titles; in fact, using them as they are is a bad idea - I tried it and the accuracy got worse. A better idea is grouping them by social status or something similar. I found several ways to group them; here is the one I chose.
In [8]:
titles = {'Capt': 'Officer',
          'Col': 'Officer',
          'Major': 'Officer',
          'Jonkheer': 'Royalty',
          'Don': 'Royalty',
          'Sir': 'Royalty',
          'Dr': 'Officer',
          'Rev': 'Officer',
          'Countess': 'Royalty',
          'Dona': 'Royalty',
          'Mme': 'Mrs',
          'Mlle': 'Miss',
          'Ms': 'Mrs',
          'Mr': 'Mr',
          'Mrs': 'Mrs',
          'Miss': 'Miss',
          'Master': 'Master',
          'Lady': 'Royalty'
          }
for k, v in titles.items():
    df_train.loc[df_train['Title'] == k, 'Title'] = v
    df_test.loc[df_test['Title'] == k, 'Title'] = v
#New frequencies.
df_train['Title'].value_counts()
Out[8]:
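As an aside, the same mapping can be done in one line with pandas' Series.replace, which, like the loop above, leaves any title that isn't a dictionary key untouched (a sketch, not a change to the notebook's flow):
#Equivalent one-liner: replace mapped titles, leave unmapped ones as they are.
df_train['Title'] = df_train['Title'].replace(titles)
df_test['Title'] = df_test['Title'].replace(titles)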
Missing values for Age should be filled. A simple overall mean/median isn't good enough, so I tried grouping by several combinations of other columns and chose the median by Sex, Pclass and Title.
In [9]:
print(df_train.groupby(['Sex', 'Pclass', 'Title', ])['Age'].median())
In [10]:
df_train['Age'] = df_train.groupby(['Sex','Pclass','Title'])['Age'].apply(lambda x: x.fillna(x.median()))
df_test['Age'] = df_test.groupby(['Sex','Pclass','Title'])['Age'].apply(lambda x: x.fillna(x.median()))
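In newer pandas versions the groupby-apply pattern above can be fragile about the returned index shape; an equivalent formulation of the same fill uses transform (a sketch, not a different imputation):
#Same imputation via transform: fill each missing Age with its group median.
df_train['Age'] = df_train['Age'].fillna(
    df_train.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform('median'))
df_test['Age'] = df_test['Age'].fillna(
    df_test.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform('median'))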
Sex. At first I wanted to divide passengers into males, females and children, but that increased overfitting. I also tried replacing the values with 1 and 0 (instead of creating dummies), which also worked worse. So I do nothing here.
In [11]:
df_train.groupby(['Pclass', 'Sex'])['Survived'].value_counts(normalize=True)
Out[11]:
SibSp and Parch: the number of siblings/spouses and parents/children aboard - basically the number of family members. If we sum them, we get the size of the family. At first I created a single feature showing whether the person had any family aboard, but it wasn't good enough. Then I tried several variants and settled on four groups: 0 relatives, 1-2, 3, and 5 or more. From the table below we can see that such a grouping makes sense.
In [12]:
df_train['Family'] = df_train['Parch'] + df_train['SibSp']
df_test['Family'] = df_test['Parch'] + df_test['SibSp']
In [13]:
df_train.groupby(['Family'])['Survived'].value_counts(normalize=True)
Out[13]:
In [14]:
def FamilySize(x):
    """
    Transform raw family size into a categorical group.
    """
    if x == 1 or x == 2:
        return 'little'
    elif x == 3:
        return 'medium'
    elif x >= 5:
        return 'big'
    else:
        #Sizes not matched above (0 and, as written, 4) end up here as 'single'.
        return 'single'
df_train['Family'] = df_train['Family'].apply(FamilySize)
df_test['Family'] = df_test['Family'].apply(FamilySize)
In [15]:
df_train.groupby(['Pclass', 'Family'])['Survived'].mean()
Out[15]:
Ticket. The ticket can't be used as it is: it contains a prefix and a number. Using the ticket number doesn't make sense, but the prefix could be useful.
In [16]:
def Ticket_Prefix(x):
    """
    Extract the ticket prefix. Tickets consist of one to three space-separated parts.
    """
    parts = x.split()
    if len(parts) == 3:
        return parts[0] + parts[1]
    elif len(parts) == 2:
        return parts[0]
    else:
        return 'None'
df_train['TicketPrefix'] = df_train['Ticket'].apply(Ticket_Prefix)
df_test['TicketPrefix'] = df_test['Ticket'].apply(Ticket_Prefix)
In [17]:
#There are many similar prefixes, but combining them doesn't yield a significantly better result.
df_train.TicketPrefix.unique()
Out[17]:
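The remark above about similar prefixes can be probed with a small normalization, for example stripping dots and slashes and upper-casing so that variants like 'A/5' and 'A.5.' collapse together (a sketch only; as noted, combining them didn't noticeably help):
#Normalize prefixes by removing dots/slashes and upper-casing, then count them.
df_train['TicketPrefix'].str.replace(r'[./]', '', regex=True).str.upper().value_counts()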
Fare. There is only one missing Fare value, and it is in the test set. Fill it with the median Fare for its Pclass.
In [18]:
ax = plt.subplot()
ax.set_ylabel('Average Fare')
df_train.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(7, 4), ax=ax)
df_test['Fare'] = df_test.groupby(['Pclass'])['Fare'].apply(lambda x: x.fillna(x.median()))
Cabin. I thought about ignoring this feature, but it turned out to be quite significant, and the most important signal was whether there was any information about the Cabin at all. So I fill NA with an 'Unknown' value and use the first letter of the Cabin number as a feature.
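Before the NA values are filled, that "is the cabin known at all" signal can be looked at directly (a side-effect-free sketch; the cabin_known name is mine and isn't used later):
#Survival rate for passengers with and without a recorded cabin.
cabin_known = df_train['Cabin'].notnull().astype(int)
df_train['Survived'].groupby(cabin_known).mean()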
In [19]:
df_train.Cabin.fillna('Unknown',inplace=True)
df_test.Cabin.fillna('Unknown',inplace=True)
df_train['Cabin'] = df_train['Cabin'].map(lambda x: x[0])
df_test['Cabin'] = df_test['Cabin'].map(lambda x: x[0])
In [20]:
#Now let's see. Most of the cabins aren't filled.
f, ax = plt.subplots(figsize=(7, 3))
sns.countplot(y='Cabin', data=df_train, color='c')
Out[20]:
In [21]:
#Other cabins vary in number.
sns.countplot(y='Cabin', data=df_train[df_train.Cabin != 'U'], color='c')
Out[21]:
In [22]:
#Factorplot shows that most people, for whom there is no info on Cabin, didn't survive.
sns.factorplot('Survived', col='Cabin', col_wrap=4, data=df_train[df_train.Cabin == 'U'], kind='count', size=2.5, aspect=.8)
Out[22]:
In [23]:
#For passengers with known Cabins survival rate varies.
sns.factorplot('Survived', col='Cabin', col_wrap=4, data=df_train[df_train.Cabin != 'U'], kind='count', size=2.5, aspect=.8)
Out[23]:
In [24]:
df_train.groupby('Cabin')[['Survived']].mean()
Out[24]:
Embarked. I simply fill the missing values with the most common port.
In [25]:
#The most common port of embarkation.
MedEmbarked = df_train.groupby('Embarked').count()['PassengerId'].idxmax()
df_train.Embarked.fillna(MedEmbarked, inplace=True)
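An equivalent fill uses the column's mode directly (a one-line sketch of the same thing):
#Same fill using the most frequent value of the column.
df_train['Embarked'].fillna(df_train['Embarked'].mode()[0], inplace=True)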
In [26]:
#This is how the data looks now.
df_train.head()
Out[26]:
For most algorithms it is better to have only numerical data, so the categorical variables should be converted. In some cases normalizing the numerical data is also necessary, but here it gave worse results. I noticed that some categorical columns have different unique values in train and test. I could deal with this by combining values into subgroups, but I decided to do feature selection first (below), and the selected features were present in both train and test.
In [27]:
#Drop unnecessary columns.
to_drop = ['Ticket', 'Name', 'SibSp', 'Parch']
for i in to_drop:
    df_train.drop([i], axis=1, inplace=True)
    df_test.drop([i], axis=1, inplace=True)
In [28]:
#Pclass is in fact a categorical variable, though its type isn't object.
for col in df_train.columns:
    if df_train[col].dtype == 'object' or col == 'Pclass':
        dummies = pd.get_dummies(df_train[col], drop_first=False)
        dummies = dummies.add_prefix('{}_'.format(col))
        df_train.drop(col, axis=1, inplace=True)
        df_train = df_train.join(dummies)
for col in df_test.columns:
    if df_test[col].dtype == 'object' or col == 'Pclass':
        dummies = pd.get_dummies(df_test[col], drop_first=False)
        dummies = dummies.add_prefix('{}_'.format(col))
        df_test.drop(col, axis=1, inplace=True)
        df_test = df_test.join(dummies)
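For reference, here is a minimal sketch of one way to force the two frames to share the same dummy columns (not used in this notebook, since the feature selection below only keeps columns that exist in both; feature_cols and df_test_aligned are names of my own):
#Hypothetical aligned copy: reindex test to the train feature columns, filling dummies unseen in test with 0.
feature_cols = [c for c in df_train.columns if c != 'Survived']
df_test_aligned = df_test.reindex(columns=feature_cols, fill_value=0)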
In [29]:
#This is how the data looks now.
df_train.head()
Out[29]:
In [30]:
X_train = df_train.drop('Survived',axis=1)
Y_train = df_train['Survived']
X_test = df_test
Now, feature selection. This code ranks the features by their importance for a Random Forest. At first I simply used "n_estimators = 200"; then I switched to the more optimal parameters found below.
In [31]:
clf = RandomForestClassifier(n_estimators=15,
                             criterion='gini',
                             max_features='sqrt',
                             max_depth=None,
                             min_samples_split=7,
                             min_weight_fraction_leaf=0.0,
                             max_leaf_nodes=18)
clf = clf.fit(X_train, Y_train)
indices = np.argsort(clf.feature_importances_)[::-1]
print('Feature ranking:')
for f in range(X_train.shape[1]):
    print('%d. feature %d %s (%f)' % (f + 1, indices[f], X_train.columns[indices[f]], clf.feature_importances_[indices[f]]))
Feature selection with sklearn's SelectFromModel, based on the importance weights.
In [32]:
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(X_train)
train_new.shape
Out[32]:
In [33]:
best_features = X_train.columns[indices[0:train_new.shape[1]]]
X = X_train[best_features]
Xt = X_test[best_features]
best_features
Out[33]:
Usually SelectFromModel gives 13-15 features. Sex is the most important, which isn't surprising - as we know, most places in the lifeboats were given to women. Fare and Pclass show that differences in wealth mattered. Age, of course, is important. Family size and titles are also significant, as expected. The absence of info about the Cabin is indeed significant. And for some reason PassengerId is also important. Maybe a data leak?
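The data-leak question about PassengerId can be probed directly by comparing cross-validated accuracy with and without it among the selected features (a quick sketch; it assumes PassengerId actually made the cut, and errors='ignore' keeps the drop safe if it didn't):
#Compare cross-validated accuracy with and without PassengerId among the selected features.
without_id = best_features.drop(['PassengerId'], errors='ignore')
print('with PassengerId:   ', cross_val_score(clf, X, Y_train, cv=StratifiedKFold(5)).mean())
print('without PassengerId:', cross_val_score(clf, X_train[without_id], Y_train, cv=StratifiedKFold(5)).mean())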
In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, Y_train, test_size=0.33, random_state=44)
I saw the next part of the code here: https://www.kaggle.com/creepykoala/titanic/study-of-tree-and-forest-algorithms - it is a great way to see how the parameters influence the score of a Random Forest.
In [35]:
plt.figure(figsize=(15, 10))

#N Estimators
plt.subplot(3, 3, 1)
feature_param = range(1, 21)
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(n_estimators=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(scores, '.-')
plt.axis('tight')
plt.title('N Estimators')
plt.grid();

#Criterion
plt.subplot(3, 3, 2)
feature_param = ['gini', 'entropy']
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(criterion=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(scores, '.-')
plt.title('Criterion')
plt.xticks(range(len(feature_param)), feature_param)
plt.grid();

#Max Features
plt.subplot(3, 3, 3)
feature_param = ['auto', 'sqrt', 'log2', None]
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(max_features=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(scores, '.-')
plt.axis('tight')
plt.title('Max Features')
plt.xticks(range(len(feature_param)), feature_param)
plt.grid();

#Max Depth
plt.subplot(3, 3, 4)
feature_param = range(1, 21)
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(max_depth=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(feature_param, scores, '.-')
plt.axis('tight')
plt.title('Max Depth')
plt.grid();

#Min Samples Split (integer values must be at least 2, so start at 2)
plt.subplot(3, 3, 5)
feature_param = range(2, 21)
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(min_samples_split=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(feature_param, scores, '.-')
plt.axis('tight')
plt.title('Min Samples Split')
plt.grid();

#Min Weight Fraction Leaf
plt.subplot(3, 3, 6)
feature_param = np.linspace(0, 0.5, 10)
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(min_weight_fraction_leaf=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(feature_param, scores, '.-')
plt.axis('tight')
plt.title('Min Weight Fraction Leaf')
plt.grid();

#Max Leaf Nodes
plt.subplot(3, 3, 7)
feature_param = range(2, 21)
scores = []
for feature in feature_param:
    clf = RandomForestClassifier(max_leaf_nodes=feature)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
plt.plot(feature_param, scores, '.-')
plt.axis('tight')
plt.title('Max Leaf Nodes')
plt.grid();
Now, based on these graphs, I tune the model. Normally you would put all the parameters and their candidate values into a single GridSearchCV run. My PC isn't powerful enough for that, so I split the parameters into two groups and run the two GridSearchCV passes repeatedly until I'm satisfied with the result. This balances quality against speed.
In [36]:
forest = RandomForestClassifier(max_depth=50,
                                min_samples_split=7,
                                min_weight_fraction_leaf=0.0,
                                max_leaf_nodes=18)
parameter_grid = {'n_estimators': [15, 100, 200],
                  'criterion': ['gini', 'entropy'],
                  'max_features': ['auto', 'sqrt', 'log2', None]
                  }
grid_search = GridSearchCV(forest, param_grid=parameter_grid, cv=StratifiedKFold(5))
grid_search.fit(X, Y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
In [37]:
forest = RandomForestClassifier(n_estimators=200,
                                criterion='entropy',
                                max_features=None)
parameter_grid = {
    'max_depth': [None, 50],
    'min_samples_split': [7, 11],
    'min_weight_fraction_leaf': [0.0, 0.2],
    'max_leaf_nodes': [18, 20],
}
grid_search = GridSearchCV(forest, param_grid=parameter_grid, cv=StratifiedKFold(5))
grid_search.fit(X, Y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
In [38]:
#My optimal parameters.
clf = RandomForestClassifier(n_estimators=200,
                             criterion='entropy',
                             max_features=None,
                             max_depth=50,
                             min_samples_split=7,
                             min_weight_fraction_leaf=0.0,
                             max_leaf_nodes=18)
clf.fit(X, Y_train)
Y_pred_RF = clf.predict(Xt)
clf.score(X_test,y_test)
Out[38]:
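Note that clf above was fit on all of X, which includes the rows in X_test, so the score in the previous cell is somewhat optimistic. A less biased estimate can be obtained with cross-validation on the selected features (a small sketch using the helpers already imported above):
#Cross-validated accuracy on the selected features; no row is scored by a model that saw it during training.
cv_scores = cross_val_score(clf, X, Y_train, cv=StratifiedKFold(5))
print(cv_scores.mean(), cv_scores.std())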
In [39]:
submission = pd.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': Y_pred_RF
})
submission.to_csv('titanic.csv', index=False)
I didn't aim for a perfect model in this project; I just wanted to apply my skills. The best result I got was 0.80861. The reachable maximum accuracy is about 82-85%, so I think my result is good enough.