In most data science tutorials I have seen, the data clean-up is treated casually, as an annoying obstacle on the way to the sexy machine learning bit. I was curious to see what difference, if any, a more cautious approach to data cleanup would make in my Kaggle ranking. My plan is to first follow DataCamp's tutorial naively and submit the resulting test-set labels as a benchmark, and then to use a more elaborate data cleanup process and see whether taking the extra time actually moves my ranking up, or perhaps down.
This data set can be used to train a machine learning algorithm to classify passengers of the Titanic's first and last voyage as having survived the disaster or not. To that end, the data set available at Kaggle contains various data types pertaining to each passenger. The label to be predicted is the Survived feature (0 for died, 1 for survived). The data comes pre-divided into a training set that includes survival labels and a test set that does not. The goal is to produce survival predictions for the test set and upload the result to Kaggle for scoring.
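For concreteness, what Kaggle ultimately scores is a two-column CSV pairing each test-set PassengerId with a 0/1 Survived prediction. A minimal sketch of that format (the IDs and the all-zero predictions are placeholders, and the filename is arbitrary):

import pandas as pd

# Illustrative placeholder submission: three test-set passengers, all predicted 'died'.
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 0, 0]})
submission.to_csv('titanic_submission.csv', index=False)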
In [1]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as pl
%matplotlib inline
In [2]:
sb.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 1.5})
In [3]:
trainset = '/home/madhatter106/DATA/Titanic/train.csv'
testset = '/home/madhatter106/DATA/Titanic/test.csv'
dfTrain = pd.read_csv(trainset)
dfTest = pd.read_csv(testset)
dfTrain.head(2)
Out[3]:
In [4]:
dfTest.head(2)
Out[4]:
In [5]:
print(dfTrain.describe())
print('-' * 40)
print(dfTrain.info())
print('-' * 40)
print(dfTrain.isnull().sum())
Problems in the training set: missing values in Age (177), Cabin (687), and Embarked (2).
In [6]:
print(dfTest.describe())
print('-' * 40)
print(dfTest.info())
print('-' * 40)
print(dfTest.isnull().sum())
Problems in the test set: missing values in Age (86), Fare (1), and Cabin (327).
First, I'm going to get rid of PassengerId, Ticket, and Cabin. While dropping the Name data is tempting, I'm going to hold on to it for now, in case I can use titles to help infer missing Age data.
In [7]:
dfTrain.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
dfTest.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
Before we impute/correct any of the Fare, Age, and Embarked data, let's see whether each appears to be a factor in survival.
In [8]:
sb.factorplot(y='Age', x='Survived', data=dfTrain, aspect=3)
Out[8]:
Age is clearly a factor; what about Fare?
In [9]:
sb.factorplot(y='Fare', x='Survived', data=dfTrain, aspect=3);
Fare is also clearly a factor, but is that because it's a proxy for class?
In [10]:
sb.factorplot(x='Survived', y='Fare', hue='Pclass', data=dfTrain, aspect=3);
Interestingly, Fare appears to have an effect in $1^{st}$ class only. Let's look at 'Embarked'.
In [11]:
sb.countplot(x='Embarked', hue='Survived', data=dfTrain);
'Embarked', 'Age', and 'Fare' all seem to have an effect on survival, so I'll clean up all three features.
In [12]:
dfTemp = pd.concat((dfTrain, dfTest), join='inner')
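Note that with join='inner', concat keeps only the columns common to both frames, so Survived (present only in the training set) is excluded from dfTemp. A quick sanity check (a sketch; the expected shape assumes the column drops above):

# Survived exists only in dfTrain, so the inner join should exclude it.
print('Survived' in dfTemp.columns)  # expected: False
print(dfTemp.shape)                  # expected: (1309, 8)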
In [13]:
print(dfTemp.describe())
print('-' * 50)
print(dfTemp.info())
In [14]:
dfTemp[dfTemp.Fare == 0]
Out[14]:
It makes more sense to me to impute the class-dependent median fare of tickets bought in Southampton:
In [15]:
Fares_S_AllCl_Non0 = dfTemp.loc[(dfTemp.Fare != 0) & (dfTemp.Embarked == 'S'), ['Pclass', 'Fare']]
for i in range(1, 4):
    dfTrain.loc[(dfTrain.Fare == 0) & (dfTrain.Pclass == i), 'Fare'] = Fares_S_AllCl_Non0[Fares_S_AllCl_Non0.Pclass == i].Fare.median()
    dfTest.loc[(dfTest.Fare == 0) & (dfTest.Pclass == i), 'Fare'] = Fares_S_AllCl_Non0[Fares_S_AllCl_Non0.Pclass == i].Fare.median()
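A quick check (a sketch) that the loop caught every zero fare:

# Sanity check: both frames should now be free of zero fares.
assert (dfTrain.Fare == 0).sum() == 0
assert (dfTest.Fare == 0).sum() == 0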
In [16]:
dfTrain.loc[dfTrain.Embarked.isnull(), ['Name', 'Fare', 'Pclass']]
Out[16]:
In [17]:
sb.factorplot(x='Pclass', y='Fare', hue='Embarked', data=dfTemp, aspect=2)
Out[17]:
This suggests that 'S' is a relatively safe bet. But I still wonder: what is the most frequent value in 'Embarked'?
In [18]:
dfTrain['Embarked'].value_counts()
Out[18]:
Now I'm fairly confident 'S' is the right value to impute for 'Embarked'.
In [19]:
dfTrain['Embarked'].fillna('S', inplace=True)
Based on the factor plot above of fare against class, it seems safe to impute the missing Fare value in the test set from the passenger's class.
In [20]:
pclass4fare = dfTest.loc[dfTest.Fare.isnull(), 'Pclass'].values[0]
In [21]:
msgClassMedianFare = dfTest[dfTest.Pclass == pclass4fare].Fare.dropna().median()
In [22]:
dfTest.loc[dfTest.Fare.isnull(), 'Fare'] = msgClassMedianFare
Age is one of those things that, to a first approximation, can be estimated by how a person is addressed. First I am going to catalogue the titles present in the names and create another feature, "Title". To do this I need an inventory of all possible titles present in the dataset. Titles appear to be the second word in the name string, ending with a '.'.
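As an aside, pandas can pull these out in one line with a regular expression; a sketch, assuming the title is always the word immediately preceding a period:

# One-line alternative: capture the word preceding a '.' in each name.
titles = dfTemp.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.unique())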
In [23]:
nameset = set()
for name in dfTemp.Name.values:
    for subname in name.split(' '):
        # Grab the part followed by a '.', verifying it is an abbreviation, not an initial.
        if '.' in subname and len(subname) > 2:
            nameset.add(subname)
In [24]:
print(nameset)
In [25]:
def createTitle(name):
    # Intersect the name's words with the known titles; strip the trailing '.'.
    return list(set(name.split(' ')) & nameset)[0][:-1]

dfTrain['Title'] = dfTrain.Name.apply(createTitle)
dfTest['Title'] = dfTest.Name.apply(createTitle)
dfTemp['Title'] = dfTemp.Name.apply(createTitle)
Does title help distinguish age?
In [26]:
f = pl.figure(figsize=(12, 8))
ax = f.add_subplot(111)
violin = sb.violinplot(x='Title', y='Age', data=dfTemp, ax=ax, scale='area')
for item in violin.get_xticklabels():
    item.set_rotation(45)
Not awesome. Still, some titles come with a wide age range; others, like Master, have a markedly narrower range, and that could still be informative for my Age imputation. Also, I probably don't need all of the titles. Which titles correspond to missing ages?
In [27]:
dfTemp.Title[dfTemp.Age.isnull()].unique()
Out[27]:
Clearly I don't need to use all of the 'Title' data. Based on the graph above, I will pack Ms and Miss together and impute their missing ages from the median Age of Miss. I will impute the remaining missing ages directly from the median Age of their corresponding title.
In [28]:
# Re-title Ms as Miss in the combined set (dfTrain and dfTest keep 'Ms'; handled below)
dfTemp.loc[dfTemp.Title == 'Ms', 'Title'] = 'Miss'
In [29]:
dfTrain.loc[(dfTrain.Age.isnull()) &
            ((dfTrain.Title == 'Miss') |
             (dfTrain.Title == 'Ms')), 'Age'] = dfTemp.loc[dfTemp.Title == 'Miss', 'Age'].median()
dfTest.loc[(dfTest.Age.isnull()) &
           ((dfTest.Title == 'Miss') |
            (dfTest.Title == 'Ms')), 'Age'] = dfTemp.loc[dfTemp.Title == 'Miss', 'Age'].median()
In [30]:
for title in ['Mr', 'Mrs', 'Master', 'Dr']:
    dfTrain.loc[(dfTrain.Age.isnull()) &
                (dfTrain.Title == title), 'Age'] = dfTemp.loc[dfTemp.Title == title, 'Age'].median()
    dfTest.loc[(dfTest.Age.isnull()) &
               (dfTest.Title == title), 'Age'] = dfTemp.loc[dfTemp.Title == title, 'Age'].median()
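For reference, the two cells above can be condensed with a groupby; a sketch that folds 'Ms' into 'Miss' on the fly, mirroring the logic used so far:

# Compact equivalent (sketch): map each title to its combined-set median age.
# dfTrain/dfTest still contain 'Ms', which dfTemp has already folded into 'Miss'.
title_medians = dfTemp.groupby('Title')['Age'].median()
for df in (dfTrain, dfTest):
    titles = df['Title'].replace({'Ms': 'Miss'})
    df['Age'] = df['Age'].fillna(titles.map(title_medians))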
In [31]:
print(dfTrain.info())
print('-' * 50)
print(dfTest.info())
Now I can simplify the data by dropping the 'Title' and 'Name' columns from both sets.
In [32]:
dfTrain.drop(['Name', 'Title'], axis=1, inplace=True)
dfTest.drop(['Name', 'Title'], axis=1, inplace=True)
In [33]:
# One-hot encoding non-hierarchical categorical labels
dfTrain = pd.concat([dfTrain, pd.get_dummies(dfTrain[['Sex', 'Embarked']])], axis=1)
dfTest = pd.concat([dfTest, pd.get_dummies(dfTest[['Sex', 'Embarked']])], axis=1)
Now we don't need 'Sex' or 'Embarked' any more, so we drop them from both sets.
In [34]:
dfTrain.drop(['Sex', 'Embarked'], axis=1, inplace=True)
dfTest.drop(['Sex', 'Embarked'], axis=1, inplace=True)
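One caveat worth flagging: calling get_dummies separately on the training and test sets can produce mismatched columns if a category happens to be absent from one of them (it doesn't here, but it's cheap insurance). A hedged sketch of guarding against that:

# Align the test set's columns to the training set's feature columns
# (Survived is the label and exists only in the training set).
feature_cols = dfTrain.columns.drop('Survived')
dfTest = dfTest.reindex(columns=feature_cols, fill_value=0)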
This concludes this part of the pre-processing: the data cleanup. Here's what the data sets look like:
In [35]:
print(dfTrain.info())
print('-' * 50)
print(dfTest.info())
I'm going to pickle both DataFrames for safekeeping until the next blog post...
In [36]:
dfTrain.to_pickle('/home/madhatter106/DATA/Titanic/dfTrainCln_I.pkl')
dfTest.to_pickle('/home/madhatter106/DATA/Titanic/dfTestCln_I.pkl')
In the next notebook, I'll be looking to do some additional post-cleanup pre-processing. Until next time!