Note! Work in Progress - This notebook is not yet finished
A Python implementation of an exploration of the Titanic dataset that closely follows the excellent Exploring Survival on the Titanic notebook by Megan L. Risdal, found at https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic/notebook. Data preprocessing largely follows her approach, though the predictive modeling attempts to explore more models than just the random forest she used.
As an aside, this also serves as an interesting look at how some of the tasks performed in her notebook might be done in Python and, in a way, shows both languages' relative strengths and weaknesses.
In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
In [40]:
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')
train.head(3)
test.head(3)
Out[40]:
In [4]:
# print(train.shape)
# print(test.shape)
print('Number of features: {}'.format(test.shape[1]))
print('Training samples: {}'.format(train.shape[0]))
print('Test samples: {}'.format(test.shape[0]))
print('Total number of samples: {}'.format(train.shape[0]+test.shape[0]))
The data contains the following features:

- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: passenger name
- Sex: sex
- Age: age in years
- SibSp: number of siblings / spouses aboard
- Parch: number of parents / children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
It's time to explore the dataset to get a general idea of what it's like.
In [5]:
# First, combine datasets
total = pd.concat([train, test])
# View summary statistics
total.describe()
Out[5]:
Most numerical data appear to be fairly complete, with the exception of fare (which only has one missing value) and age (which has 263 missing values). We can deal with the missing values later.
Let's also visualize the data with histograms to get a sense of its general distribution.
In [6]:
# Generate histograms
sns.set_color_codes('muted')
total.hist(color='g')
plt.tight_layout()
plt.show()
Note that since PassengerId was read in as the index, it does not appear among the histograms; it is an arbitrary identifier with no predictive value anyway. The rest of the data is quite interesting, with most passengers being fairly young (around 20 to 30 years of age) and most traveling without much family.
Pclass serves as a proxy for the passengers' socioeconomic status. Interestingly, second class has the fewest passengers, though not many fewer than first class.
Looking at the data, the ticket number does not appear to be very informative, so we drop that column.
In [9]:
totalwithoutnas = total.dropna()
scattermatrix = sns.pairplot(totalwithoutnas)
plt.show()
In [11]:
total.drop('Ticket', axis=1, inplace=True)
A number of the variables in the data present opportunities to generate further meaningful features. One feature that appears to carry a lot of information is the passengers' names. As in Megan's notebook, we can extract titles (indicative of both gender and marital status) and families (given by shared surnames, under the assumption that unrelated people sharing a surname is rare enough to be negligible).
In [12]:
# Split names into surname ("Braund, Mr. Owen Harris" -> "Braund") and title ("Mr")
Surnames = pd.DataFrame(total['Name'].str.split(",").tolist(), columns=['Surname', 'Rest'])
Titles = pd.DataFrame(Surnames['Rest'].str.split(".").tolist(), columns=['Title', 'Rest1', 'Rest2'])
Surnames.drop('Rest', axis=1, inplace=True)
Titles = pd.DataFrame(Titles['Title'])
# str.strip() is not in-place, so assign the results back
Surnames['Surname'] = Surnames['Surname'].str.strip()
Titles['Title'] = Titles['Title'].str.strip()
# Reindex to match total's PassengerId index (1 through 1309) before assigning
total['Surname'] = Surnames.set_index(total.index)['Surname']
total['Title'] = Titles.set_index(total.index)['Title']
total.head()
Out[12]:
Let's tabulate our titles against sex to see the frequency of the various titles.
In [13]:
pd.crosstab(total['Sex'], total['Title'])
Out[13]:
We see that with the exception of Master, Mr, Miss, and Mrs, the other titles are relatively rare. We can group rare titles together to simplify our analysis. Also note that Mlle and Ms are synonymous with Miss, and Mme is synonymous with Mrs.
In [14]:
raretitles = ['Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer']
# .ix is deprecated; use .loc with a boolean mask instead
total.loc[total['Title'].str.contains('Mlle|Ms|Miss'), 'Title'] = 'Miss'
total.loc[total['Title'].str.contains('Mme|Mrs'), 'Title'] = 'Mrs'
total.loc[total['Title'].str.contains('|'.join(raretitles)), 'Title'] = 'Rare Title'
pd.crosstab(total['Sex'], total['Title'])
Out[14]:
In [17]:
total['Surname'].nunique()
Out[17]:
We have 875 unique surnames.
Family size may have an impact on survival. To this end, we create a family size attribute and plot the relationship.
In [25]:
total['FamilySize'] = total['SibSp'] + total['Parch'] + 1
total['Family'] = total['Surname'] + "_" + total['FamilySize'].apply(str)
total.head(1)
Out[25]:
In [45]:
# Plot family size
famsizebarplot = sns.countplot(x='FamilySize', hue='Survived', data=total.loc[:891])
famsizebarplot.set_xlabel('Family Size')
plt.show()
The chart above shows an interesting phenomenon: passengers traveling alone and families of more than 4 people had a noticeably lower chance of survival than those in small (2 to 4 person) families.
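To put numbers on this, here is a quick check (not in the original notebook) of survival rates by family size; the test rows have NaN for Survived and are excluded from the mean automatically:
In [ ]:
# Mean of the 0/1 Survived column per family size = survival rate
total.groupby('FamilySize')['Survived'].mean()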
In [49]:
# Categorize family size
total['FamSizeCat'] = 'small'
total.loc[(total['FamilySize'] == 1), 'FamSizeCat'] = 'singleton'
total.loc[(total['FamilySize'] > 4), 'FamSizeCat'] = 'large'
# Create mosaic plot
# To be done in the future using statsmodels
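In the meantime, here is a minimal sketch of how that mosaic plot might be drawn, assuming statsmodels is installed (its mosaic function lives in statsmodels.graphics.mosaicplot):
In [ ]:
from statsmodels.graphics.mosaicplot import mosaic
# Sketch: family size category against survival, training rows only
mosaic(total.loc[:891], ['FamSizeCat', 'Survived'])
plt.show()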
We first check which columns have missing values.
In [52]:
total.isnull().sum()
Out[52]:
It appears that Age, Cabin, Embarked, and Fare have missing values. Let's first work on Embarked and Fare, given that there are few enough NaNs for us to manually work out what the values should have been. As for Cabin, with more than 75% of the 1309 samples missing, we can probably just drop the column. It might have been useful, since location on the ship could plausibly influence a passenger's chance of survival, but the data is too sparse on this particular attribute.
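As a quick sanity check on that 75% figure:
In [ ]:
# Fraction of Cabin values that are missing across all samples
total['Cabin'].isnull().mean()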
In [59]:
total[(total['Embarked'].isnull()) | (total['Fare'].isnull())]
Out[59]:
Interestingly, Miss Icard and Mrs. Stone shared the same cabin, both survived, paid the same fare, and traveled in the same class. Mr. Storey, whose fare is missing, was in third class and embarked from Southampton.
Visualizing the fares by embarkation location may shed some light on where the two first class ladies embarked.
In [64]:
sns.boxplot(x='Embarked',y='Fare',data=train.dropna(),hue='Pclass')
plt.tight_layout()
plt.show()
In [87]:
trainwithoutnas = train.dropna()
print("Mean fares for passengers traveling in first class:")
print(trainwithoutnas[trainwithoutnas['Pclass']==1].groupby('Embarked')['Fare'].mean())
print("\nMedian fares for passengers traveling in first class:")
print(trainwithoutnas[trainwithoutnas['Pclass']==1].groupby('Embarked')['Fare'].median())
The $80 fare paid by both ladies is close to the mean fare paid by first class passengers embarking from Southampton, but it also aligns nicely with the median fare of those embarking from Cherbourg. Perhaps a swarm plot will better show how the passengers are distributed.
In [92]:
sns.swarmplot(x='Embarked',y='Fare',data=train.dropna(),hue='Pclass')
plt.show()
This is a tough call. Looking at the spread of the points, however, it seems that those who embarked from Southampton generally paid lower fares. The Cherbourg mean appears to be pulled up by the extreme outliers who paid more than $500 for their tickets, with the majority of first class passengers indeed paying around $80. As such, we classify the two ladies as having embarked from Cherbourg (C).
In [98]:
total.loc[[62, 830], 'Embarked'] = "C"
total.loc[[62, 830], 'Embarked']
Out[98]:
The swarm plot also shows that third class passengers embarking from Southampton paid roughly the same fare. It is therefore reasonable to use the mean fare of third class passengers from Southampton as Mr. Storey's fare value.
In [112]:
total.loc[1044,'Fare'] = total[(total['Embarked']=="S") & (total['Pclass']==3)]['Fare'].mean()
total.loc[1044, ['Name','Fare']]
Out[112]:
We can perform MICE imputation, as in Megan's notebook, via the fancyimpute package.
In [191]:
AgeHistogram = total['Age'].hist(bins=20, edgecolor="black")
AgeHistogram.set_xlabel("Age")
AgeHistogram.set_ylabel("Count")
AgeHistogram.set_title("Age (Prior to Missing Value Imputation)")
plt.show()
In [132]:
import fancyimpute
total.isnull().sum()
Out[132]:
In [202]:
totalforMICE = total.drop(['Survived','Cabin','FamSizeCat','Family','Name','Surname'], axis=1)
# totalforMICE.fillna(np.nan)
totalforMICE['Sex'] = pd.get_dummies(totalforMICE['Sex'])['male']
dummycodedTitles = pd.get_dummies(totalforMICE['Title']).drop('Rare Title', axis=1)
totalforMICE = pd.merge(totalforMICE, dummycodedTitles, left_index=True, right_index=True, how='outer')
totalforMICE = totalforMICE.drop(['Title'],axis=1)
dummycodedEmbarked = pd.get_dummies(totalforMICE['Embarked'])[['C','Q']]
totalforMICE = totalforMICE.join(dummycodedEmbarked).drop(['Embarked'],axis=1)
dummycodedPclass = pd.get_dummies(totalforMICE['Pclass']).drop(3, axis=1)
totalforMICE = totalforMICE.join(dummycodedPclass).drop('Pclass',axis=1)
MICEdtotal = fancyimpute.MICE().complete(totalforMICE.values.astype(float))
In [203]:
MICEdtotal = pd.DataFrame(MICEdtotal, columns=totalforMICE.columns)
MICEdtotal.isnull().sum()
Out[203]:
We see that the MICE'd data has no more missing Age values. Plotting these values in the histogram:
In [204]:
MICEAgeHistogram = MICEdtotal['Age'].hist(bins=20, edgecolor="black")
MICEAgeHistogram.set_xlabel("Age")
MICEAgeHistogram.set_ylabel("Count")
MICEAgeHistogram.set_title("Age (After Missing Value Imputation)")
plt.show()
In [236]:
AgeHists, AgeHistAxes = plt.subplots(nrows=1,ncols=2, figsize=(10,5), sharey=True)
AgeHistAxes[0].hist(total['Age'].dropna(), bins=20, edgecolor='black', density=True)
AgeHistAxes[0].set_xlabel("Age")
AgeHistAxes[0].set_ylabel("Density")
AgeHistAxes[0].set_title("Age Density (Original Data)")
AgeHistAxes[1].hist(MICEdtotal['Age'], bins=20, edgecolor='black', density=True)
AgeHistAxes[1].set_xlabel("Age")
AgeHistAxes[1].set_ylabel("Density")
AgeHistAxes[1].set_title("Age Density (After MICE)")
AgeHists.tight_layout()
AgeHists
Out[236]:
Most imputed ages fall in the 20 to 30 year-old range, which makes sense given the distribution of ages in the data we started with. Note that the fancyimpute version of MICE uses Bayesian Ridge Regression. The density is not perfectly preserved, but it is close enough to proceed with the analysis.
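As an aside (and not what was run above), a comparable imputation could be sketched with scikit-learn's IterativeImputer, available from scikit-learn 0.21 onward behind an experimental import, which also uses BayesianRidge as its default estimator:
In [ ]:
# Alternative sketch, assuming scikit-learn >= 0.21 is installed
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

sklearnMICEd = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(totalforMICE.values.astype(float)),
    columns=totalforMICE.columns, index=totalforMICE.index)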
We use the new Age column with the imputed values for our analysis.
In [237]:
newtotal = total.copy()
# MICEdtotal has a default integer index, so assign by position to align with the PassengerId index
newtotal['Age'] = MICEdtotal['Age'].values
We can create some additional categorical columns based on our complete age feature -- whether the person is a child (under 18) and whether a person is a mother (female, over 18, with children, and without the title "Miss").
In [327]:
AgeandSexHist = sns.FacetGrid(newtotal.iloc[0:891,:], col = 'Sex', hue='Survived', size=5)
# AgeandSexHist.map(sns.distplot, 'Age', kde=False, hist_kws={'edgecolor':'black','stacked':True})
AgeandSexHist.map(plt.hist, 'Age', alpha=0.5, bins=20)
AgeandSexHist.add_legend()
# plt.close('all')
plt.show()
In [342]:
AgeandSexHist, AgeandSexHistAxes = plt.subplots(nrows=1,ncols=2, figsize=(10,5), sharey=True)
AgeandSexHistAxes[0].hist([newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'male') & (newtotal['Survived'] == 1)],
                           newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'male') & (newtotal['Survived'] == 0)]],
                          stacked=True, edgecolor='black', label=['Survived', 'Did Not Survive'], bins=24)
AgeandSexHistAxes[1].hist([newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'female') & (newtotal['Survived'] == 1)],
                           newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'female') & (newtotal['Survived'] == 0)]],
                          stacked=True, edgecolor='black', bins=24)
AgeandSexHistAxes[0].set_title('Survival By Age for Males')
AgeandSexHistAxes[1].set_title('Survival By Age for Females')
for i in range(2):
    AgeandSexHistAxes[i].set_xlabel('Age')
AgeandSexHistAxes[0].set_ylabel('Count')
AgeandSexHistAxes[0].legend()
plt.show()
In [343]:
# Create the 'Child' variable
newtotal['Child'] = 1
newtotal.loc[newtotal['Age']>=18, 'Child'] = 0
pd.crosstab(newtotal['Child'],newtotal['Survived'])
Out[343]:
In [345]:
# Create the 'Mother' variable
newtotal['Mother'] = 0
newtotal.loc[(newtotal['Sex']=='female') & (newtotal['Parch'] > 0) & (newtotal['Age']>18) & (newtotal['Title'] != "Miss"), 'Mother'] = 1
pd.crosstab(newtotal['Mother'], newtotal['Survived'])
Out[345]:
Let's take a look at the dataset once again.
In [346]:
newtotal.head()
Out[346]:
In [349]:
newtotal.shape
Out[349]:
We ensure that all important categorical variables are dummy coded.
In [356]:
dummycodedFamSizeCat = pd.get_dummies(newtotal['FamSizeCat']).drop('large',axis=1)
newtotal = newtotal.drop(['Title', 'Embarked', 'Pclass', 'Cabin', 'Name', 'Family', 'Surname', 'FamSizeCat'], axis=1)
newtotal['Sex'] = pd.get_dummies(newtotal['Sex'])['male']
newtotal = newtotal.join(dummycodedEmbarked)
newtotal = newtotal.join(dummycodedPclass)
newtotal = newtotal.join(dummycodedTitles)
newtotal = newtotal.join(dummycodedFamSizeCat)
newtotal.head()
Out[356]:
After we split the data back into training and test sets, our data set will be ready to use for modeling.
In [358]:
newtrain = newtotal.loc[:891,:]
newtest = newtotal.loc[892:,:]
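Although the modeling section is still to come, here is a minimal sketch (a placeholder, not a final analysis) of how the classifiers imported at the top might be compared via cross-validation:
In [ ]:
# Sketch: 5-fold cross-validated accuracy for each imported classifier
X = newtrain.drop('Survived', axis=1)
y = newtrain['Survived']
for model in [LogisticRegression(), DecisionTreeClassifier(),
              RandomForestClassifier(), GaussianNB(), SVC()]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print('{}: {:.3f} (+/- {:.3f})'.format(type(model).__name__,
                                           scores.mean(), scores.std()))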
Note! Work in Progress - This notebook is not yet finished