Titanic Data Exploration

Table of Contents

  • Overview
  • Initial Exploration and Plotting
  • Exploratory Analysis by Variable
    • Names
    • Families
    • Tickets
    • Fares
    • Cabins
    • Embarkment
    • Ages

In [334]:
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline

train = pd.read_csv('train.csv', index_col='PassengerId')
test = pd.read_csv('test.csv', index_col='PassengerId')

Initial Exploration and Plotting


In [3]:
train.head()


Out[3]:
PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

In [4]:
train.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

In [5]:
test.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB

In [362]:
plt.figure(1, figsize=(6, 6))
sns.barplot(x='Sex', y='Survived', data=train)
plt.show()



In [343]:
s_ages = train.loc[train['Survived'] == 1, 'Age'].dropna()
d_ages = train.loc[train['Survived'] == 0, 'Age'].dropna()
s_fares = train.loc[train['Survived'] == 1, 'Fare'].add(1).apply(np.log).dropna()
d_fares = train.loc[train['Survived'] == 0, 'Fare'].add(1).apply(np.log).dropna()

plt.figure(2, figsize=(12, 8))
plt.subplot(231)
sns.barplot(x='Pclass', y='Survived', data=train)
plt.subplot(234)
sns.barplot(x='Embarked', y='Survived', data=train)
plt.subplot(233)
sns.barplot(x='SibSp', y='Survived', data=train)
plt.subplot(236)
sns.barplot(x='Parch', y='Survived', data=train)
plt.subplot(232)
sns.distplot(d_ages, color='C0')
sns.distplot(s_ages, color='C1')
plt.subplot(235)
sns.distplot(d_fares, color='C0')
sns.distplot(s_fares, color='C1')
plt.show()


Exploratory Analysis and Feature Engineering

Here, we'll explore the features of the dataset. Since Sex and Pclass are rather clear-cut and have been explored in many other kernels, we will not explore them for now. We'll explore the related features SibSp and Parch together, as a "family size" feature group. Because Age has a large number of missing values, we will explore it last; after looking at the other features, we may come up with strategies for imputation.

Finally, we'll create several derived features if necessary.

Names

This doesn't seem like a very promising feature, but take a look:


In [351]:
train['Name'].head()


Out[351]:
PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

The names are formatted very consistently, in the form (last), (title). (first) (middle). Since there are only a handful of distinct titles (versus the largely unique names), we'll extract this information:


In [352]:
# Raw string avoids invalid-escape warnings; captures the text between ", " and the first "."
train['Title'] = train['Name'].str.extract(r',\s(.*?)[.]', expand=False)
print(train['Title'].unique())


['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer']

In [353]:
# Same pattern as for the training set
test['Title'] = test['Name'].str.extract(r',\s(.*?)[.]', expand=False)
print(test['Title'].unique())


['Mr' 'Mrs' 'Miss' 'Master' 'Ms' 'Col' 'Rev' 'Dr' 'Dona']
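
To see what the pattern captures, here is a minimal sketch on a few names copied from the data in this notebook. The variable names (names, pattern, titles) are for illustration only and are not part of the original cells.

```python
import pandas as pd

# Names copied from the notebook's own output tables.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Crosby, Capt. Edward Gifford',
    'Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")',
])

# ",\s" anchors on the comma after the last name; "(.*?)" lazily captures
# everything up to the first literal "." that follows.
pattern = r',\s(.*?)[.]'
titles = names.str.extract(pattern, expand=False)

print(titles.tolist())
```

Because the capture is lazy, it stops at the first period after the comma, so a last name containing spaces (such as "Duff Gordon") does not affect the result.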

To start with, let's get an idea of how many passengers are holding each title.


In [366]:
plt.figure(3, figsize=(14, 4))
plt.subplot(121)
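# Pass the Series as x explicitly; newer seaborn releases treat the first positional argument as data
sns.countplot(x=train.loc[train['Sex'] == 'female', 'Title'])
plt.subplot(122)
sns.countplot(x=train.loc[train['Sex'] == 'male', 'Title'])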
plt.show()


The low counts for most of the titles suggest grouping the rarer ones. We'll do so as follows (there are no hard rules, so we'll use some judgment):

  • Merge Mme. into Mrs. and Mlle. into Miss.
  • Merge Lady, the Countess, and Dona (from the test set) into a category of noblewomen.
  • Merge Don, Sir, and Jonkheer into a category of noblemen.
  • Merge Col, Capt, and Major into a category of military.

For 'Ms.', we'll look at the woman's age, and also check her party.


In [121]:
train[train['Title'] == 'Ms']


Out[121]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
PassengerId
444 1 2 Reynaldo, Ms. Encarnacion female 28.0 0 0 230434 13.0 NaN S Ms

Since she is relatively young and traveling alone, we'll throw her in with the "Miss" group.


In [359]:
title_map = {'Mr': 'Mr',
             'Mrs': 'Mrs',
             'Miss': 'Miss',
             'Master': 'Master',
             'Dr': 'Dr',
             'Rev': 'Rev',
             'Don': 'mnoble',
             'Sir': 'mnoble',
             'Jonkheer': 'mnoble',
             'Lady': 'fnoble',
             'the Countess': 'fnoble',
             'Dona': 'fnoble',
             'Col': 'mil',
             'Capt': 'mil',
             'Major': 'mil',
             'Mme': 'Mrs',
             'Mlle': 'Miss',
             'Ms': 'Miss'}

train['AdjTitle'] = train['Title'].map(title_map)
test['AdjTitle'] = test['Title'].map(title_map)
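
One thing worth checking after a dictionary-based map is that no title was left out, since Series.map returns NaN for any key missing from the dictionary. A minimal sketch, using a deliberately shortened, hypothetical mapping (demo_map) so that one title falls through:

```python
import pandas as pd

# Hypothetical, shortened mapping for illustration only;
# the notebook's title_map covers every title seen in train and test.
demo_map = {'Mr': 'Mr', 'Mme': 'Mrs', 'Col': 'mil'}

titles = pd.Series(['Mr', 'Mme', 'Col', 'Dona'])
adj = titles.map(demo_map)

# Titles with no entry in the mapping come back as NaN.
unmapped = titles[adj.isna()].unique().tolist()
print(unmapped)
```

Running the same check on train['Title'] and test['Title'] with the full title_map should give an empty list if the mapping is complete.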

Let's see how these titles did:


In [365]:
plt.figure(4, figsize=(8, 4))
plt.subplot(121)
sns.barplot(x='AdjTitle', y='Survived', data=train[train['Sex'] == 'female'])
plt.subplot(122)
sns.barplot(x='AdjTitle', y='Survived', data=train[train['Sex'] == 'male'])
plt.show()


For women, it seems pretty clear-cut: the women with nobility titles survived (as did women on the whole). The men with titles (all except Rev.) seem to do better on average, but it's highly variable. Since the gender-based model where all women live and all men die attains over 76% accuracy, the hard part of our model seems to be picking out the few male survivors.
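
The gender-only baseline is easy to state in code. A minimal sketch, where gender_baseline_accuracy is a hypothetical helper and the five rows are the ones shown by train.head(); a sample this small is not expected to reproduce the accuracy quoted for the full data:

```python
import pandas as pd

# The five passengers shown by train.head() at the top of the notebook.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1, 0],
})

def gender_baseline_accuracy(df):
    """Predict survival for every woman and death for every man."""
    pred = (df['Sex'] == 'female').astype(int)
    return (pred == df['Survived']).mean()

acc = gender_baseline_accuracy(sample)
print(acc)
```

Calling gender_baseline_accuracy(train) would give the training-set figure for the whole dataset.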

Family Size

Here we'll work with the SibSp and Parch features, which involve family size. To look for lone travelers, we'll first look at the distribution of the features added together, separated by gender:


In [364]:
train['FamSize'] = train['SibSp'] + train['Parch']
test['FamSize'] = test['SibSp'] + test['Parch']

plt.figure(5, figsize=(8, 4))
plt.subplot(121)
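# Pass the Series as x explicitly; newer seaborn releases treat the first positional argument as data
sns.countplot(x=train.loc[train['Sex'] == 'female', 'FamSize'])
plt.subplot(122)
sns.countplot(x=train.loc[train['Sex'] == 'male', 'FamSize'])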
plt.show()



In [367]:
train['FamSize'] = train['SibSp'] + train['Parch']
test['FamSize'] = test['SibSp'] + test['Parch']

plt.figure(6, figsize=(12, 8))
plt.subplot(231)
# Top row: women by class; bottom row: men by class
sns.countplot(x=train.loc[(train['Sex'] == 'female') & (train['Pclass'] == 1), 'FamSize'])
plt.subplot(234)
sns.countplot(x=train.loc[(train['Sex'] == 'male') & (train['Pclass'] == 1), 'FamSize'])
plt.subplot(232)
sns.countplot(x=train.loc[(train['Sex'] == 'female') & (train['Pclass'] == 2), 'FamSize'])
plt.subplot(235)
sns.countplot(x=train.loc[(train['Sex'] == 'male') & (train['Pclass'] == 2), 'FamSize'])
plt.subplot(233)
sns.countplot(x=train.loc[(train['Sex'] == 'female') & (train['Pclass'] == 3), 'FamSize'])
plt.subplot(236)
sns.countplot(x=train.loc[(train['Sex'] == 'male') & (train['Pclass'] == 3), 'FamSize'])
plt.show()


How did this impact survival?


In [368]:
plt.figure(7, figsize=(8, 4))
plt.subplot(121)
sns.barplot(x='FamSize', y='Survived', data=train[train['Sex'] == 'female'])
plt.subplot(122)
sns.barplot(x='FamSize', y='Survived', data=train[train['Sex'] == 'male'])
plt.show()



In [369]:
plt.figure(9, figsize=(12, 8))
plt.subplot(231)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'female') & (train['Pclass'] == 1)])
plt.subplot(234)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'male') & (train['Pclass'] == 1)])
plt.subplot(232)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'female') & (train['Pclass'] == 2)])
plt.subplot(235)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'male') & (train['Pclass'] == 2)])
plt.subplot(233)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'female') & (train['Pclass'] == 3)])
plt.subplot(236)
sns.barplot(x='FamSize', y='Survived', data=train[(train['Sex'] == 'male') & (train['Pclass'] == 3)])
plt.show()


From these plots, we can draw the following conclusions, first overall and then by passenger class:

  • Most men traveled alone. Those with families were generally in smaller ones. A large number of men traveled alone in third class; they had very low survival chances.
  • Many women traveled alone, but not as many as men. Larger groups consisted of mostly women.
  • In first and second class:
    • Women seem to have roughly the same survival chance, independent of family size.
    • Men with larger family sizes seem to have relatively higher chances of survival.
  • In third class:
    • Women and men seem to have relatively higher chances of survival, up to a family size of 3.
    • Women and men with a family size of 4 or higher had drastically lower odds of survival.
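
The comparisons in the list above can also be read off as numbers rather than bar heights. A minimal sketch, using a handful of rows copied from the ticket tables later in this notebook (far too few to reproduce the plots); the function name survival_rate_table is hypothetical:

```python
import pandas as pd

# Rows copied from the notebook's output (Carter, Taussig, Ford, Hakkarainen).
sample = pd.DataFrame({
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3, 3],
    'Sex':      ['male', 'female', 'male', 'female',
                 'male', 'female', 'female', 'female', 'male'],
    'FamSize':  [3, 3, 2, 2, 4, 4, 4, 1, 1],
})

def survival_rate_table(df):
    """Mean survival per (Pclass, Sex) row and FamSize column."""
    return df.pivot_table(index=['Pclass', 'Sex'], columns='FamSize',
                          values='Survived', aggfunc='mean')

rates = survival_rate_table(sample)
print(rates)
```

Applied to the full train frame, the same table gives one survival rate per class, sex, and family size, which makes the small-sample cells (few passengers per cell) easier to spot than in the plots.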

Tickets

One thing we can do with ticket numbers is scan for duplicates:


In [283]:
ticket_dupes = train[(train['Ticket'].duplicated(keep=False))].set_index('Ticket', append=True).swaplevel(0, 1).sort_index()
ticket_dupes


Out[283]:
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
Ticket PassengerId
110152 258 1 1 Cherry, Miss. Gladys female 30.00 0 0 86.5000 B77 S Miss Miss False 3 28.833333 3.395626 Miss 0
505 1 1 Maioni, Miss. Roberta female 16.00 0 0 86.5000 B79 S Miss Miss False 3 28.833333 3.395626 Miss 0
760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.00 0 0 86.5000 B77 S the Countess fnoble False 3 28.833333 3.395626 fnoble 0
110413 263 0 1 Taussig, Mr. Emil male 52.00 1 1 79.6500 E67 S Mr Mr False 3 26.550000 3.316003 Mr 2
559 1 1 Taussig, Mrs. Emil (Tillie Mandelbaum) female 39.00 1 1 79.6500 E67 S Mrs Mrs False 3 26.550000 3.316003 Mrs 2
586 1 1 Taussig, Miss. Ruth female 18.00 0 2 79.6500 E68 S Miss Miss False 3 26.550000 3.316003 Miss 2
110465 111 0 1 Porter, Mr. Walter Chamberlain male 47.00 0 0 52.0000 C110 S Mr Mr False 2 26.000000 3.295837 Mr 0
476 0 1 Clifford, Mr. George Quincy male NaN 0 0 52.0000 A14 S Mr Mr False 2 26.000000 3.295837 Mr 0
111361 330 1 1 Hippach, Miss. Jean Gertrude female 16.00 0 1 57.9792 B18 C Miss Miss False 2 28.989600 3.400851 Miss 1
524 1 1 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) female 44.00 0 1 57.9792 B18 C Mrs Mrs False 2 28.989600 3.400851 Mrs 1
113505 167 1 1 Chibnall, Mrs. (Edith Martha Bowerman) female NaN 0 1 55.0000 E33 S Mrs Mrs False 2 27.500000 3.349904 Mrs 1
357 1 1 Bowerman, Miss. Elsie Edith female 22.00 0 1 55.0000 E33 S Miss Miss False 2 27.500000 3.349904 Miss 1
113572 62 1 1 Icard, Miss. Amelie female 38.00 0 0 80.0000 B28 NaN Miss Miss False 2 40.000000 3.713572 Miss 0
830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.00 0 0 80.0000 B28 NaN Mrs Mrs False 2 40.000000 3.713572 Mrs 0
113760 391 1 1 Carter, Mr. William Ernest male 36.00 1 2 120.0000 B96 B98 S Mr Mr False 4 30.000000 3.433987 Mr 3
436 1 1 Carter, Miss. Lucile Polk female 14.00 1 2 120.0000 B96 B98 S Miss Miss False 4 30.000000 3.433987 Miss 3
764 1 1 Carter, Mrs. William Ernest (Lucile Polk) female 36.00 1 2 120.0000 B96 B98 S Mrs Mrs False 4 30.000000 3.433987 Mrs 3
803 1 1 Carter, Master. William Thornton II male 11.00 1 2 120.0000 B96 B98 S Master Master True 4 30.000000 3.433987 Master 3
113776 152 1 1 Pears, Mrs. Thomas (Edith Wearne) female 22.00 1 0 66.6000 C2 S Mrs Mrs False 2 33.300000 3.535145 Mrs 1
337 0 1 Pears, Mr. Thomas Clinton male 29.00 1 0 66.6000 C2 S Mr Mr False 2 33.300000 3.535145 Mr 1
113781 298 0 1 Allison, Miss. Helen Loraine female 2.00 1 2 151.5500 C22 C26 S Miss Miss True 4 37.887500 3.660673 Miss 3
306 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 151.5500 C22 C26 S Master Master True 4 37.887500 3.660673 Master 3
499 0 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 151.5500 C22 C26 S Mrs Mrs False 4 37.887500 3.660673 Mrs 3
709 1 1 Cleaver, Miss. Alice female 22.00 0 0 151.5500 NaN S Miss Miss False 4 37.887500 3.660673 Miss 0
113789 36 0 1 Holverson, Mr. Alexander Oskar male 42.00 1 0 52.0000 NaN S Mr Mr False 2 26.000000 3.295837 Mr 1
384 1 1 Holverson, Mrs. Alexander Oskar (Mary Aline To... female 35.00 1 0 52.0000 NaN S Mrs Mrs False 2 26.000000 3.295837 Mrs 1
113798 271 0 1 Cairns, Mr. Alexander male NaN 0 0 31.0000 NaN S Mr Mr False 2 15.500000 2.803360 Mr 0
843 1 1 Serepeca, Miss. Augusta female 30.00 0 0 31.0000 NaN C Miss Miss False 2 15.500000 2.803360 Miss 0
113803 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00 1 0 53.1000 C123 S Mrs Mrs False 2 26.550000 3.316003 Mrs 1
138 0 1 Futrelle, Mr. Jacques Heath male 37.00 1 0 53.1000 C123 S Mr Mr False 2 26.550000 3.316003 Mr 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PC 17758 506 0 1 Penasco y Castellana, Mr. Victor de Satode male 18.00 1 0 108.9000 C65 C Mr Mr False 2 54.450000 4.015482 Mr 1
PC 17760 270 1 1 Bissette, Miss. Amelia female 35.00 0 0 135.6333 C99 S Miss Miss False 3 45.211100 3.833220 Miss 0
326 1 1 Young, Miss. Marie Grice female 36.00 0 0 135.6333 C32 C Miss Miss False 3 45.211100 3.833220 Miss 0
374 0 1 Ringhini, Mr. Sante male 22.00 0 0 135.6333 NaN C Mr Mr False 3 45.211100 3.833220 Mr 0
PC 17761 538 1 1 LeRoy, Miss. Bertha female 30.00 0 0 106.4250 NaN C Miss Miss False 2 53.212500 3.992912 Miss 0
545 0 1 Douglas, Mr. Walter Donald male 50.00 1 0 106.4250 C86 C Mr Mr False 2 53.212500 3.992912 Mr 1
PP 9549 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 16.7000 G6 S Miss Miss True 2 8.350000 2.235376 Miss 2
395 1 3 Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengt... female 24.00 0 2 16.7000 G6 S Mrs Mrs False 2 8.350000 2.235376 Mrs 2
S.C./PARIS 2079 818 0 2 Mallet, Mr. Albert male 31.00 1 1 37.0042 NaN C Mr Mr False 2 18.502100 2.970522 Mr 2
828 1 2 Mallet, Master. Andre male 1.00 0 2 37.0042 NaN C Master Master True 2 18.502100 2.970522 Master 2
S.O./P.P. 3 773 0 2 Mack, Mrs. (Mary) female 57.00 0 0 10.5000 E77 S Mrs Mrs False 2 5.250000 1.832581 Mrs 0
842 0 2 Mudd, Mr. Thomas Charles male 16.00 0 0 10.5000 NaN S Mr Mr False 2 5.250000 1.832581 Mr 0
S.O.C. 14879 73 0 2 Hood, Mr. Ambrose Jr male 21.00 0 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 0
121 0 2 Hickman, Mr. Stanley George male 21.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2
386 0 2 Davies, Mr. Charles Henry male 18.00 0 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 0
656 0 2 Hickman, Mr. Leonard Mark male 24.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2
666 0 2 Hickman, Mr. Lewis male 32.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2
SC/Paris 2123 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 41.5792 NaN C Miss Miss True 3 13.859733 2.698655 Miss 3
609 1 2 Laroche, Mrs. Joseph (Juliette Marie Louise La... female 22.00 1 2 41.5792 NaN C Mrs Mrs False 3 13.859733 2.698655 Mrs 3
686 0 2 Laroche, Mr. Joseph Philippe Lemercier male 25.00 1 2 41.5792 NaN C Mr Mr False 3 13.859733 2.698655 Mr 3
STON/O2. 3101279 143 1 3 Hakkarainen, Mrs. Pekka Pietari (Elin Matilda ... female 24.00 1 0 15.8500 NaN S Mrs Mrs False 2 7.925000 2.188856 Mrs 1
404 0 3 Hakkarainen, Mr. Pekka Pietari male 28.00 1 0 15.8500 NaN S Mr Mr False 2 7.925000 2.188856 Mr 1
W./C. 6607 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 23.4500 NaN S Mr Mr False 2 11.725000 2.543569 Mr 3
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 23.4500 NaN S Miss Miss False 2 11.725000 2.543569 Miss 3
W./C. 6608 87 0 3 Ford, Mr. William Neal male 16.00 1 3 34.3750 NaN S Mr Mr False 4 8.593750 2.261112 Mr 4
148 0 3 Ford, Miss. Robina Maggie "Ruby" female 9.00 2 2 34.3750 NaN S Miss Miss True 4 8.593750 2.261112 Miss 4
437 0 3 Ford, Miss. Doolina Margaret "Daisy" female 21.00 2 2 34.3750 NaN S Miss Miss False 4 8.593750 2.261112 Miss 4
737 0 3 Ford, Mrs. Edward (Margaret Ann Watson) female 48.00 1 3 34.3750 NaN S Mrs Mrs False 4 8.593750 2.261112 Mrs 4
WE/P 5735 541 1 1 Crosby, Miss. Harriet R female 36.00 0 2 71.0000 B22 S Miss Miss False 2 35.500000 3.597312 Miss 2
746 0 1 Crosby, Capt. Edward Gifford male 70.00 1 1 71.0000 B22 S Capt mil False 2 35.500000 3.597312 mil 2

344 rows × 18 columns
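
The tables in this section show columns (TicketSize, AdjFare, LogFare) whose definitions do not appear in the cells above. The printed values are consistent with TicketSize being the number of training-set passengers sharing a ticket, AdjFare being Fare divided by TicketSize, and LogFare being log(1 + AdjFare). That is an inference from the numbers, not the author's confirmed code. A minimal sketch under that assumption:

```python
import numpy as np
import pandas as pd

# Three rows for ticket 110152 from the table above, plus one row from
# train.head(); in this small demo the last ticket has a single holder.
demo = pd.DataFrame({
    'Ticket': ['110152', '110152', '110152', '373450'],
    'Fare':   [86.5, 86.5, 86.5, 8.05],
})

# Assumed definitions (inferred from the printed values, not from the source):
demo['TicketSize'] = demo.groupby('Ticket')['Ticket'].transform('count')
demo['AdjFare'] = demo['Fare'] / demo['TicketSize']   # per-person fare
demo['LogFare'] = np.log1p(demo['AdjFare'])           # log(1 + AdjFare)

print(demo)
```

For ticket 110152 this reproduces the values printed above (AdjFare 28.833333, LogFare 3.395626).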

We can check whether holders of duplicate tickets are likely to share cabins, fares, family size and embark location.


In [290]:
dupe_counts = ticket_dupes.reset_index().groupby('Ticket')[['Fare', 'Cabin', 'Embarked', 'FamSize']].nunique()
dupe_counts.describe()


Out[290]:
             Fare       Cabin    Embarked     FamSize
count  134.000000  134.000000  134.000000  134.000000
mean     1.007463    0.537313    1.007463    1.201493
std      0.086387    0.742448    0.150001    0.402620
min      1.000000    0.000000    0.000000    1.000000
25%      1.000000    0.000000    1.000000    1.000000
50%      1.000000    0.000000    1.000000    1.000000
75%      1.000000    1.000000    1.000000    1.000000
max      2.000000    3.000000    2.000000    2.000000

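It seems like most of them did: the median number of distinct values per ticket is 1 for Fare, Embarked, and FamSize. (nunique skips missing values, which is why Cabin, mostly unrecorded, often counts as 0.) Let's take a look at the fares: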


In [304]:
ticket_dupes.loc[dupe_counts[dupe_counts['Fare'] > 1].index.values]


Out[304]:
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
Ticket PassengerId
7534 139 0 3 Osen, Mr. Olaf Elon male 16.0 0 0 9.2167 NaN S Mr Mr False 2 4.60835 1.724257 Mr 0
877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 9.8458 NaN S Mr Mr False 2 4.92290 1.778826 Mr 0

Only one pair of fares differs (and not by much). For all we know, this could be an entry error, but let's ignore this for now. Let's look at embark locations:


In [306]:
ticket_dupes.loc[dupe_counts[dupe_counts['Embarked'] > 1].index.values]


Out[306]:
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
Ticket PassengerId
113798 271 0 1 Cairns, Mr. Alexander male NaN 0 0 31.0000 NaN S Mr Mr False 2 15.5000 2.80336 Mr 0
843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 31.0000 NaN C Miss Miss False 2 15.5000 2.80336 Miss 0
PC 17760 270 1 1 Bissette, Miss. Amelia female 35.0 0 0 135.6333 C99 S Miss Miss False 3 45.2111 3.83322 Miss 0
326 1 1 Young, Miss. Marie Grice female 36.0 0 0 135.6333 C32 C Miss Miss False 3 45.2111 3.83322 Miss 0
374 0 1 Ringhini, Mr. Sante male 22.0 0 0 135.6333 NaN C Mr Mr False 3 45.2111 3.83322 Mr 0

Only two! Though these could be mistakes, it is plausible that they did board at different locations, since they do not appear related to each other. Let's look at the last two variables:


In [307]:
ticket_dupes.loc[dupe_counts[dupe_counts['FamSize'] > 1].index.values]


Out[307]:
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
Ticket PassengerId
113781 298 0 1 Allison, Miss. Helen Loraine female 2.00 1 2 151.5500 C22 C26 S Miss Miss True 4 37.887500 3.660673 Miss 3
306 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 151.5500 C22 C26 S Master Master True 4 37.887500 3.660673 Master 3
499 0 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 151.5500 C22 C26 S Mrs Mrs False 4 37.887500 3.660673 Mrs 3
709 1 1 Cleaver, Miss. Alice female 22.00 0 0 151.5500 NaN S Miss Miss False 4 37.887500 3.660673 Miss 0
11767 311 1 1 Hays, Miss. Margaret Bechstein female 24.00 0 0 83.1583 C54 C Miss Miss False 2 41.579150 3.751365 Miss 0
880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.00 0 1 83.1583 C50 C Mrs Mrs False 2 41.579150 3.751365 Mrs 1
12749 521 1 1 Perreault, Miss. Anne female 30.00 0 0 93.5000 B73 S Miss Miss False 2 46.750000 3.865979 Miss 0
821 1 1 Hays, Mrs. Charles Melville (Clara Jennings Gr... female 52.00 1 1 93.5000 B69 S Mrs Mrs False 2 46.750000 3.865979 Mrs 2
13502 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.00 1 0 77.9583 D7 S Miss Miss False 3 25.986100 3.295322 Miss 1
628 1 1 Longley, Miss. Gretchen Fiske female 21.00 0 0 77.9583 D9 S Miss Miss False 3 25.986100 3.295322 Miss 0
766 1 1 Hogeboom, Mrs. John C (Anna Andrews) female 51.00 1 0 77.9583 D11 S Mrs Mrs False 3 25.986100 3.295322 Mrs 1
16966 320 1 1 Spedden, Mrs. Frederic Oakley (Margaretta Corn... female 40.00 1 1 134.5000 E34 C Mrs Mrs False 2 67.250000 4.223177 Mrs 2
338 1 1 Burns, Miss. Elizabeth Margaret female 41.00 0 0 134.5000 E40 C Miss Miss False 2 67.250000 4.223177 Miss 0
17421 307 1 1 Fleming, Miss. Margaret female NaN 0 0 110.8833 NaN C Miss Miss False 4 27.720825 3.357622 Miss 0
551 1 1 Thayer, Mr. John Borland Jr male 17.00 0 2 110.8833 C70 C Mr Mr False 4 27.720825 3.357622 Mr 2
582 1 1 Thayer, Mrs. John Borland (Marian Longstreth M... female 39.00 1 1 110.8833 C68 C Mrs Mrs False 4 27.720825 3.357622 Mrs 2
699 0 1 Thayer, Mr. John Borland male 49.00 1 1 110.8833 C68 C Mr Mr False 4 27.720825 3.357622 Mr 2
19877 291 1 1 Barber, Miss. Ellen "Nellie" female 26.00 0 0 78.8500 NaN S Miss Miss False 2 39.425000 3.699448 Miss 0
742 0 1 Cavendish, Mr. Tyrell William male 36.00 1 0 78.8500 C46 S Mr Mr False 2 39.425000 3.699448 Mr 1
19928 246 0 1 Minahan, Dr. William Edward male 44.00 2 0 90.0000 C78 Q Dr Dr False 2 45.000000 3.828641 Dr 2
413 1 1 Minahan, Miss. Daisy E female 33.00 1 0 90.0000 C78 Q Miss Miss False 2 45.000000 3.828641 Miss 1
24160 690 1 1 Madill, Miss. Georgette Alexandra female 15.00 0 1 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 1
731 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 0 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 0
780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... female 43.00 0 1 211.3375 B3 S Mrs Mrs False 3 70.445833 4.268940 Mrs 1
243847 218 0 2 Jacobsohn, Mr. Sidney Samuel male 42.00 1 0 27.0000 NaN S Mr Mr False 2 13.500000 2.674149 Mr 1
601 1 2 Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr... female 24.00 2 1 27.0000 NaN S Mrs Mrs False 2 13.500000 2.674149 Mrs 3
248727 597 1 2 Leitch, Miss. Jessie Wills female NaN 0 0 33.0000 NaN S Miss Miss False 3 11.000000 2.484907 Miss 0
721 1 2 Harper, Miss. Annie Jessie "Nina" female 6.00 0 1 33.0000 NaN S Miss Miss True 3 11.000000 2.484907 Miss 1
849 0 2 Harper, Rev. John male 28.00 0 1 33.0000 NaN S Rev Rev False 3 11.000000 2.484907 Rev 1
29106 408 1 2 Richards, Master. William Rowe male 3.00 1 1 18.7500 NaN S Master Master True 3 6.250000 1.981001 Master 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
371110 518 0 3 Ryan, Mr. Patrick male NaN 0 0 24.1500 NaN Q Mr Mr False 3 8.050000 2.202765 Mr 0
769 0 3 Moran, Mr. Daniel J male NaN 1 0 24.1500 NaN Q Mr Mr False 3 8.050000 2.202765 Mr 1
A/4 48871 566 0 3 Davies, Mr. Alfred J male 24.00 2 0 24.1500 NaN S Mr Mr False 2 12.075000 2.570702 Mr 2
812 0 3 Lester, Mr. James male 39.00 0 0 24.1500 NaN S Mr Mr False 2 12.075000 2.570702 Mr 0
PC 17485 310 1 1 Francatelli, Miss. Laura Mabel female 30.00 0 0 56.9292 E36 C Miss Miss False 2 28.464600 3.383190 Miss 0
600 1 1 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49.00 1 0 56.9292 A20 C Sir mnoble False 2 28.464600 3.383190 mnoble 1
PC 17569 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 146.5208 B78 C Mrs Mrs False 2 73.260400 4.307578 Mrs 1
196 1 1 Lurette, Miss. Elise female 58.00 0 0 146.5208 B80 C Miss Miss False 2 73.260400 4.307578 Miss 0
PC 17572 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.00 1 0 76.7292 D33 C Mrs Mrs False 3 25.576400 3.280024 Mrs 1
646 1 1 Harper, Mr. Henry Sleeper male 48.00 1 0 76.7292 D33 C Mr Mr False 3 25.576400 3.280024 Mr 1
682 1 1 Hassab, Mr. Hammad male 27.00 0 0 76.7292 D49 C Mr Mr False 3 25.576400 3.280024 Mr 0
PC 17582 269 1 1 Graham, Mrs. William Thompson (Edith Junkins) female 58.00 0 1 153.4625 C125 S Mrs Mrs False 3 51.154167 3.954204 Mrs 1
333 0 1 Graham, Mr. George Edward male 38.00 0 1 153.4625 C91 S Mr Mr False 3 51.154167 3.954204 Mr 1
610 1 1 Shutes, Miss. Elizabeth W female 40.00 0 0 153.4625 C125 S Miss Miss False 3 51.154167 3.954204 Miss 0
PC 17611 335 1 1 Frauenthal, Mrs. Henry William (Clara Heinshei... female NaN 1 0 133.6500 NaN S Mrs Mrs False 2 66.825000 4.216931 Mrs 1
661 1 1 Frauenthal, Dr. Henry William male 50.00 2 0 133.6500 NaN S Dr Dr False 2 66.825000 4.216931 Dr 2
PC 17755 259 1 1 Ward, Miss. Anna female 35.00 0 0 512.3292 NaN C Miss Miss False 3 170.776400 5.146194 Miss 0
680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.00 0 1 512.3292 B51 B53 B55 C Mr Mr False 3 170.776400 5.146194 Mr 1
738 1 1 Lesurer, Mr. Gustave J male 35.00 0 0 512.3292 B101 C Mr Mr False 3 170.776400 5.146194 Mr 0
PC 17757 381 1 1 Bidois, Miss. Rosalie female 42.00 0 0 227.5250 NaN C Miss Miss False 4 56.881250 4.058393 Miss 0
558 0 1 Robbins, Mr. Victor male NaN 0 0 227.5250 NaN C Mr Mr False 4 56.881250 4.058393 Mr 0
701 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.00 1 0 227.5250 C62 C64 C Mrs Mrs False 4 56.881250 4.058393 Mrs 1
717 1 1 Endres, Miss. Caroline Louise female 38.00 0 0 227.5250 C45 C Miss Miss False 4 56.881250 4.058393 Miss 0
PC 17761 538 1 1 LeRoy, Miss. Bertha female 30.00 0 0 106.4250 NaN C Miss Miss False 2 53.212500 3.992912 Miss 0
545 0 1 Douglas, Mr. Walter Donald male 50.00 1 0 106.4250 C86 C Mr Mr False 2 53.212500 3.992912 Mr 1
S.O.C. 14879 73 0 2 Hood, Mr. Ambrose Jr male 21.00 0 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 0
121 0 2 Hickman, Mr. Stanley George male 21.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2
386 0 2 Davies, Mr. Charles Henry male 18.00 0 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 0
656 0 2 Hickman, Mr. Leonard Mark male 24.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2
666 0 2 Hickman, Mr. Lewis male 32.00 2 0 73.5000 NaN S Mr Mr False 5 14.700000 2.753661 Mr 2

72 rows × 18 columns


In [308]:
ticket_dupes.loc[dupe_counts[dupe_counts['Cabin'] > 1].index.values]


Out[308]:
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
Ticket PassengerId
110152 258 1 1 Cherry, Miss. Gladys female 30.0 0 0 86.5000 B77 S Miss Miss False 3 28.833333 3.395626 Miss 0
505 1 1 Maioni, Miss. Roberta female 16.0 0 0 86.5000 B79 S Miss Miss False 3 28.833333 3.395626 Miss 0
760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 86.5000 B77 S the Countess fnoble False 3 28.833333 3.395626 fnoble 0
110413 263 0 1 Taussig, Mr. Emil male 52.0 1 1 79.6500 E67 S Mr Mr False 3 26.550000 3.316003 Mr 2
559 1 1 Taussig, Mrs. Emil (Tillie Mandelbaum) female 39.0 1 1 79.6500 E67 S Mrs Mrs False 3 26.550000 3.316003 Mrs 2
586 1 1 Taussig, Miss. Ruth female 18.0 0 2 79.6500 E68 S Miss Miss False 3 26.550000 3.316003 Miss 2
110465 111 0 1 Porter, Mr. Walter Chamberlain male 47.0 0 0 52.0000 C110 S Mr Mr False 2 26.000000 3.295837 Mr 0
476 0 1 Clifford, Mr. George Quincy male NaN 0 0 52.0000 A14 S Mr Mr False 2 26.000000 3.295837 Mr 0
11767 311 1 1 Hays, Miss. Margaret Bechstein female 24.0 0 0 83.1583 C54 C Miss Miss False 2 41.579150 3.751365 Miss 0
880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 83.1583 C50 C Mrs Mrs False 2 41.579150 3.751365 Mrs 1
12749 521 1 1 Perreault, Miss. Anne female 30.0 0 0 93.5000 B73 S Miss Miss False 2 46.750000 3.865979 Miss 0
821 1 1 Hays, Mrs. Charles Melville (Clara Jennings Gr... female 52.0 1 1 93.5000 B69 S Mrs Mrs False 2 46.750000 3.865979 Mrs 2
13502 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 77.9583 D7 S Miss Miss False 3 25.986100 3.295322 Miss 1
628 1 1 Longley, Miss. Gretchen Fiske female 21.0 0 0 77.9583 D9 S Miss Miss False 3 25.986100 3.295322 Miss 0
766 1 1 Hogeboom, Mrs. John C (Anna Andrews) female 51.0 1 0 77.9583 D11 S Mrs Mrs False 3 25.986100 3.295322 Mrs 1
16966 320 1 1 Spedden, Mrs. Frederic Oakley (Margaretta Corn... female 40.0 1 1 134.5000 E34 C Mrs Mrs False 2 67.250000 4.223177 Mrs 2
338 1 1 Burns, Miss. Elizabeth Margaret female 41.0 0 0 134.5000 E40 C Miss Miss False 2 67.250000 4.223177 Miss 0
17421 307 1 1 Fleming, Miss. Margaret female NaN 0 0 110.8833 NaN C Miss Miss False 4 27.720825 3.357622 Miss 0
551 1 1 Thayer, Mr. John Borland Jr male 17.0 0 2 110.8833 C70 C Mr Mr False 4 27.720825 3.357622 Mr 2
582 1 1 Thayer, Mrs. John Borland (Marian Longstreth M... female 39.0 1 1 110.8833 C68 C Mrs Mrs False 4 27.720825 3.357622 Mrs 2
699 0 1 Thayer, Mr. John Borland male 49.0 1 1 110.8833 C68 C Mr Mr False 4 27.720825 3.357622 Mr 2
24160 690 1 1 Madill, Miss. Georgette Alexandra female 15.0 0 1 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 1
731 1 1 Allen, Miss. Elisabeth Walton female 29.0 0 0 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 0
780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... female 43.0 0 1 211.3375 B3 S Mrs Mrs False 3 70.445833 4.268940 Mrs 1
35273 216 1 1 Newell, Miss. Madeleine female 31.0 1 0 113.2750 D36 C Miss Miss False 3 37.758333 3.657346 Miss 1
394 1 1 Newell, Miss. Marjorie female 23.0 1 0 113.2750 D36 C Miss Miss False 3 37.758333 3.657346 Miss 1
660 0 1 Newell, Mr. Arthur Webster male 58.0 0 2 113.2750 D48 C Mr Mr False 3 37.758333 3.657346 Mr 2
PC 17485 310 1 1 Francatelli, Miss. Laura Mabel female 30.0 0 0 56.9292 E36 C Miss Miss False 2 28.464600 3.383190 Miss 0
600 1 1 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49.0 1 0 56.9292 A20 C Sir mnoble False 2 28.464600 3.383190 mnoble 1
PC 17569 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 146.5208 B78 C Mrs Mrs False 2 73.260400 4.307578 Mrs 1
196 1 1 Lurette, Miss. Elise female 58.0 0 0 146.5208 B80 C Miss Miss False 2 73.260400 4.307578 Miss 0
PC 17572 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 76.7292 D33 C Mrs Mrs False 3 25.576400 3.280024 Mrs 1
646 1 1 Harper, Mr. Henry Sleeper male 48.0 1 0 76.7292 D33 C Mr Mr False 3 25.576400 3.280024 Mr 1
682 1 1 Hassab, Mr. Hammad male 27.0 0 0 76.7292 D49 C Mr Mr False 3 25.576400 3.280024 Mr 0
PC 17582 269 1 1 Graham, Mrs. William Thompson (Edith Junkins) female 58.0 0 1 153.4625 C125 S Mrs Mrs False 3 51.154167 3.954204 Mrs 1
333 0 1 Graham, Mr. George Edward male 38.0 0 1 153.4625 C91 S Mr Mr False 3 51.154167 3.954204 Mr 1
610 1 1 Shutes, Miss. Elizabeth W female 40.0 0 0 153.4625 C125 S Miss Miss False 3 51.154167 3.954204 Miss 0
PC 17593 140 0 1 Giglio, Mr. Victor male 24.0 0 0 79.2000 B86 C Mr Mr False 2 39.600000 3.703768 Mr 0
790 0 1 Guggenheim, Mr. Benjamin male 46.0 0 0 79.2000 B82 B84 C Mr Mr False 2 39.600000 3.703768 Mr 0
PC 17755 259 1 1 Ward, Miss. Anna female 35.0 0 0 512.3292 NaN C Miss Miss False 3 170.776400 5.146194 Miss 0
680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 512.3292 B51 B53 B55 C Mr Mr False 3 170.776400 5.146194 Mr 1
738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 512.3292 B101 C Mr Mr False 3 170.776400 5.146194 Mr 0
PC 17757 381 1 1 Bidois, Miss. Rosalie female 42.0 0 0 227.5250 NaN C Miss Miss False 4 56.881250 4.058393 Miss 0
558 0 1 Robbins, Mr. Victor male NaN 0 0 227.5250 NaN C Mr Mr False 4 56.881250 4.058393 Mr 0
701 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0 1 0 227.5250 C62 C64 C Mrs Mrs False 4 56.881250 4.058393 Mrs 1
717 1 1 Endres, Miss. Caroline Louise female 38.0 0 0 227.5250 C45 C Miss Miss False 4 56.881250 4.058393 Miss 0
PC 17760 270 1 1 Bissette, Miss. Amelia female 35.0 0 0 135.6333 C99 S Miss Miss False 3 45.211100 3.833220 Miss 0
326 1 1 Young, Miss. Marie Grice female 36.0 0 0 135.6333 C32 C Miss Miss False 3 45.211100 3.833220 Miss 0
374 0 1 Ringhini, Mr. Sante male 22.0 0 0 135.6333 NaN C Mr Mr False 3 45.211100 3.833220 Mr 0

We have many more duplicate values here; it's plausible that different families shared tickets or brought servants or maids along. Also, when thinking about family size, it's worth remembering that unmarried partners do not count toward SibSp.

Fares

We see that Fare is a highly right-skewed variable.


In [315]:
plt.figure(10)
sns.distplot(train['Fare'])
plt.show()
st.skew(train['Fare'])


Out[315]:
4.7792532923723545

Let's look at the outlier values above 200:


In [323]:
train[train['Fare'] > 200]


Out[323]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
PassengerId
28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S Mr Mr False 4 65.750000 4.200954 Mr 5
89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.0000 C23 C25 C27 S Miss Miss False 4 65.750000 4.200954 Miss 5
119 0 1 Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C Mr Mr False 2 123.760400 4.826395 Mr 1
259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C Miss Miss False 3 170.776400 5.146194 Miss 0
300 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0 1 PC 17558 247.5208 B58 B60 C Mrs Mrs False 2 123.760400 4.826395 Mrs 1
312 1 1 Ryerson, Miss. Emily Borie female 18.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C Miss Miss False 2 131.187500 4.884221 Miss 4
342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.0000 C23 C25 C27 S Miss Miss False 4 65.750000 4.200954 Miss 5
378 0 1 Widener, Mr. Harry Elkins male 27.0 0 2 113503 211.5000 C82 C Mr Mr False 1 211.500000 5.358942 Mr 2
381 1 1 Bidois, Miss. Rosalie female 42.0 0 0 PC 17757 227.5250 NaN C Miss Miss False 4 56.881250 4.058393 Miss 0
439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S Mr Mr False 4 65.750000 4.200954 Mr 5
528 0 1 Farthing, Mr. John male NaN 0 0 PC 17483 221.7792 C95 S Mr Mr False 1 221.779200 5.406181 Mr 0
558 0 1 Robbins, Mr. Victor male NaN 0 0 PC 17757 227.5250 NaN C Mr Mr False 4 56.881250 4.058393 Mr 0
680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C Mr Mr False 3 170.776400 5.146194 Mr 1
690 1 1 Madill, Miss. Georgette Alexandra female 15.0 0 1 24160 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 1
701 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0 1 0 PC 17757 227.5250 C62 C64 C Mrs Mrs False 4 56.881250 4.058393 Mrs 1
717 1 1 Endres, Miss. Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 C Miss Miss False 4 56.881250 4.058393 Miss 0
731 1 1 Allen, Miss. Elisabeth Walton female 29.0 0 0 24160 211.3375 B5 S Miss Miss False 3 70.445833 4.268940 Miss 0
738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C Mr Mr False 3 170.776400 5.146194 Mr 0
743 1 1 Ryerson, Miss. Susan Parker "Suzette" female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C Miss Miss False 2 131.187500 4.884221 Miss 4
780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... female 43.0 0 1 24160 211.3375 B3 S Mrs Mrs False 3 70.445833 4.268940 Mrs 1

We see that almost all of these high fares belong to passengers who share a cabin or a ticket with someone else. Let's test the theory that 'Fare' is a group fare covering every passenger on the same ticket number, rather than a fare per passenger:


In [371]:
train['TicketSize'] = train['Ticket'].value_counts()[train['Ticket']].values
test['TicketSize'] = test['Ticket'].value_counts()[test['Ticket']].values
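
Counting tickets separately in each file would undercount any group whose members are split between train and test. A minimal sketch, assuming such split groups exist, counts on the concatenated ticket column instead. The toy frames and ticket strings here are illustrative only, not rows from the dataset:

```python
import pandas as pd

# Hypothetical toy data: ticket 'A1' is shared across both files.
train_toy = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2']}, index=[1, 2, 3])
test_toy = pd.DataFrame({'Ticket': ['A1', 'C3']}, index=[4, 5])

# Count each ticket over train and test together.
all_tickets = pd.concat([train_toy['Ticket'], test_toy['Ticket']])
ticket_counts = all_tickets.value_counts()

# Map the combined counts back onto each frame.
train_toy['TicketSize'] = train_toy['Ticket'].map(ticket_counts)
test_toy['TicketSize'] = test_toy['Ticket'].map(ticket_counts)

print(train_toy)
print(test_toy)
```

Whether the combined count changes the picture for the real data is not checked here; it is simply a variant worth comparing against the per-file counts.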

In [372]:
plt.figure(11, figsize=(12, 4))
plt.subplot(131)
sns.regplot(x='TicketSize', y='Fare', data=train[train['Pclass'] == 1])
plt.subplot(132)
sns.regplot(x='TicketSize', y='Fare', data=train[train['Pclass'] == 2])
plt.subplot(133)
sns.regplot(x='TicketSize', y='Fare', data=train[train['Pclass'] == 3])
plt.show()


Let's assume the relationship is linear. We'll divide each fare by its ticket size and look at the skew within each class:


In [373]:
train['AdjFare'] = train['Fare'].div(train['TicketSize'])
g = sns.FacetGrid(train, col='Pclass')
g = g.map(plt.hist, 'AdjFare')
plt.show()
train.groupby('Pclass')['AdjFare'].apply(st.skew)


Out[373]:
Pclass
1    3.120576
2    1.040021
3    2.319343
Name: AdjFare, dtype: float64

This is still somewhat right-skewed, most noticeably in first and third class. If we want, we can later apply a square root transform; for now, we will leave Fare as is.
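
To get a feel for how much a transform can reduce skew, here is a small sketch on synthetic data. The lognormal sample is only a stand-in for a right-skewed fare-like variable, not the Titanic fares, and its parameters are arbitrary assumptions. Note that pandas' `Series.skew` applies a bias correction, so its values differ slightly from `st.skew`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: lognormal stand-in, not real fares).
rng = np.random.default_rng(0)
fare_like = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# Compare skewness before and after two common transforms.
skews = {
    'raw': fare_like.skew(),
    'sqrt': np.sqrt(fare_like).skew(),
    'log1p': np.log1p(fare_like).skew(),
}
for name, value in skews.items():
    print(f'{name:>5}: {value:.3f}')
```

On data like this, the square root reduces the skew and the log reduces it further; which transform suits the real AdjFare column would need to be checked on that column directly.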

Cabins

Only 204 of the 891 training passengers have known cabin information. We'll create a feature called CabinKnown that indicates whether the cabin is given. Let's see if a known cabin is related to survival:


In [11]:
train['CabinKnown'] = train['Cabin'].notnull()
pd.crosstab(train['CabinKnown'], train['Survived'])


Out[11]:
Survived 0 1
CabinKnown
False 481 206
True 68 136

In [39]:
plt.figure(2)
sns.barplot(x='CabinKnown', y='Survived', data=train)
plt.show()
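
The crosstab counts can be turned into survival rates directly. This sketch rebuilds the table from the counts shown in Out[11] and normalizes each row:

```python
import pandas as pd

# Counts copied from the CabinKnown x Survived crosstab above.
counts = pd.DataFrame(
    {0: [481, 68], 1: [206, 136]},
    index=pd.Index([False, True], name='CabinKnown'),
)
counts.columns.name = 'Survived'

# Divide each row by its total to get per-group proportions.
rates = counts.div(counts.sum(axis=1), axis=0)
survival_rate = rates[1]
print(survival_rate.round(3))
```

That works out to roughly 30% survival when the cabin is unknown versus roughly 67% when it is known. With `train` loaded, `pd.crosstab(train['CabinKnown'], train['Survived'], normalize='index')` should give the same proportions.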

Let's also search for duplicate cabins, since that may indicate party size and help impute missing values.


In [14]:
# Rows whose cabin string appears more than once (ignoring missing cabins)
dup_cabins = train['Cabin'].duplicated(keep=False) & train['Cabin'].notnull()
train[dup_cabins].set_index('Cabin', append=True).swaplevel(0, 1).sort_index()


Out[14]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked CabinKnown
Cabin PassengerId
B18 330 1 1 Hippach, Miss. Jean Gertrude female 16.0 0 1 111361 57.9792 C True
524 1 1 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) female 44.0 0 1 111361 57.9792 C True
B20 691 1 1 Dick, Mr. Albert Adrian male 31.0 1 0 17474 57.0000 S True
782 1 1 Dick, Mrs. Albert Adrian (Vera Gillespie) female 17.0 1 0 17474 57.0000 S True
B22 541 1 1 Crosby, Miss. Harriet R female 36.0 0 2 WE/P 5735 71.0000 S True
746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 S True
B28 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 NaN True
830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 NaN True
B35 370 1 1 Aubart, Mme. Leontine Pauline female 24.0 0 0 PC 17477 69.3000 C True
642 1 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 C True
B49 292 1 1 Bishop, Mrs. Dickinson H (Helen Walton) female 19.0 1 0 11967 91.0792 C True
485 1 1 Bishop, Mr. Dickinson H male 25.0 1 0 11967 91.0792 C True
B5 690 1 1 Madill, Miss. Georgette Alexandra female 15.0 0 1 24160 211.3375 S True
731 1 1 Allen, Miss. Elisabeth Walton female 29.0 0 0 24160 211.3375 S True
B51 B53 B55 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 C True
873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 S True
B57 B59 B63 B66 312 1 1 Ryerson, Miss. Emily Borie female 18.0 2 2 PC 17608 262.3750 C True
743 1 1 Ryerson, Miss. Susan Parker "Suzette" female 21.0 2 2 PC 17608 262.3750 C True
B58 B60 119 0 1 Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 C True
300 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0 1 PC 17558 247.5208 C True
B77 258 1 1 Cherry, Miss. Gladys female 30.0 0 0 110152 86.5000 S True
760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 110152 86.5000 S True
B96 B98 391 1 1 Carter, Mr. William Ernest male 36.0 1 2 113760 120.0000 S True
436 1 1 Carter, Miss. Lucile Polk female 14.0 1 2 113760 120.0000 S True
764 1 1 Carter, Mrs. William Ernest (Lucile Polk) female 36.0 1 2 113760 120.0000 S True
803 1 1 Carter, Master. William Thornton II male 11.0 1 2 113760 120.0000 S True
C123 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S True
138 0 1 Futrelle, Mr. Jacques Heath male 37.0 1 0 113803 53.1000 S True
C124 332 0 1 Partner, Mr. Austen male 45.5 0 0 113043 28.5000 S True
712 0 1 Klaber, Mr. Herman male NaN 0 0 113028 26.5500 S True
... ... ... ... ... ... ... ... ... ... ... ... ...
E101 304 1 2 Keane, Miss. Nora A female NaN 0 0 226593 12.3500 Q True
718 1 2 Troutt, Miss. Edwina Celia "Winnie" female 27.0 0 0 34218 10.5000 S True
E121 752 1 3 Moor, Master. Meier male 6.0 0 1 392096 12.4750 S True
824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 392096 12.4750 S True
E24 702 1 1 Silverthorne, Mr. Spencer Victor male 35.0 0 0 PC 17475 26.2875 S True
708 1 1 Calderhead, Mr. Edward Pennington male 42.0 0 0 PC 17476 26.2875 S True
E25 513 1 1 McGough, Mr. James Robert male 36.0 0 0 PC 17473 26.2875 S True
573 1 1 Flynn, Mr. John Irwin ("Irving") male 36.0 0 0 PC 17474 26.3875 S True
E33 167 1 1 Chibnall, Mrs. (Edith Martha Bowerman) female NaN 0 1 113505 55.0000 S True
357 1 1 Bowerman, Miss. Elsie Edith female 22.0 0 1 113505 55.0000 S True
E44 435 0 1 Silvey, Mr. William Baird male 50.0 1 0 13507 55.9000 S True
578 1 1 Silvey, Mrs. William Baird (Alice Munger) female 39.0 1 0 13507 55.9000 S True
E67 263 0 1 Taussig, Mr. Emil male 52.0 1 1 110413 79.6500 S True
559 1 1 Taussig, Mrs. Emil (Tillie Mandelbaum) female 39.0 1 1 110413 79.6500 S True
E8 725 1 1 Chambers, Mr. Norman Campbell male 27.0 1 0 113806 53.1000 S True
810 1 1 Chambers, Mrs. Norman Campbell (Bertha Griggs) female 33.0 1 0 113806 53.1000 S True
F G73 76 0 3 Moen, Mr. Sigurd Hansen male 25.0 0 0 348123 7.6500 S True
716 0 3 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 348124 7.6500 S True
F2 149 0 2 Navratil, Mr. Michel ("Louis M Hoffman") male 36.5 0 2 230080 26.0000 S True
194 1 2 Navratil, Master. Michel M male 3.0 1 1 230080 26.0000 S True
341 1 2 Navratil, Master. Edmond Roger male 2.0 1 1 230080 26.0000 S True
F33 67 1 2 Nye, Mrs. (Elizabeth Ramell) female 29.0 0 0 C.A. 29395 10.5000 S True
346 1 2 Brown, Miss. Amelia "Mildred" female 24.0 0 0 248733 13.0000 S True
517 1 2 Lemore, Mrs. (Amelia Milley) female 34.0 0 0 C.A. 34260 10.5000 S True
F4 184 1 2 Becker, Master. Richard F male 1.0 2 1 230136 39.0000 S True
619 1 2 Becker, Miss. Marion Louise female 4.0 2 1 230136 39.0000 S True
G6 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 S True
206 0 3 Strom, Miss. Telma Matilda female 2.0 0 1 347054 10.4625 S True
252 0 3 Strom, Mrs. Wilhelm (Elna Matilda Persson) female 29.0 1 1 347054 10.4625 S True
395 1 3 Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengt... female 24.0 0 2 PP 9549 16.7000 S True

103 rows × 11 columns

Embark Location

Here are the two missing values for embarkment:


In [329]:
train[train['Embarked'].isnull()]


Out[329]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title PTitle Child TicketSize AdjFare LogFare AdjTitle FamSize
PassengerId
62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN Miss Miss False 2 40.0 3.713572 Miss 0
830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN Mrs Mrs False 2 40.0 3.713572 Mrs 0

Let's go by the ticket number: tickets follow several different patterns, so it's likely that the ticket number format correlates with embarkment location. We'll look at all tickets that begin with '113' and see if there is a pattern:


In [333]:
train.loc[train['Ticket'].str.startswith('113'), 'Embarked'].value_counts()


Out[333]:
S    41
C     4
Name: Embarked, dtype: int64

The overwhelming majority of '113' tickets (41 of 45) boarded at Southampton. 'S' it is!
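
Filling the two gaps is then a one-liner. A minimal sketch on a small frame that copies only the Ticket and Embarked values of passengers 62, 830 and 4 from the tables above:

```python
import numpy as np
import pandas as pd

# Three rows taken from the notebook output; other columns omitted.
toy = pd.DataFrame(
    {'Ticket': ['113572', '113572', '113803'],
     'Embarked': [np.nan, np.nan, 'S']},
    index=pd.Index([62, 830, 4], name='PassengerId'),
)

# Impute the missing embarkment location with Southampton.
toy['Embarked'] = toy['Embarked'].fillna('S')
print(toy)
```

On the full data the equivalent would be `train['Embarked'] = train['Embarked'].fillna('S')`.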

Age

Let's start by taking a look at the distribution of ages. Since survival and gender are so strongly correlated, we'll split up age by gender as well.


In [212]:
fs_ages = train.loc[(train['Survived'] == 1) & (train['Sex'] == "female"), 'Age'].dropna()
fd_ages = train.loc[(train['Survived'] == 0) & (train['Sex'] == "female"), 'Age'].dropna()
ms_ages = train.loc[(train['Survived'] == 1) & (train['Sex'] == "male"), 'Age'].dropna()
md_ages = train.loc[(train['Survived'] == 0) & (train['Sex'] == "male"), 'Age'].dropna()

plt.figure(10, figsize=(9, 9))
plt.subplot(211)
sns.distplot(fs_ages, bins=range(81), kde=False, color='C1')
sns.distplot(fd_ages, bins=range(81), kde=False, color='C0', axlabel='Female Age')
plt.subplot(212)
sns.distplot(ms_ages, bins=range(81), kde=False, color='C1')
sns.distplot(md_ages, bins=range(81), kde=False, color='C0', axlabel='Male Age')
plt.show()

There's an obvious dichotomy in both graphs. Males in their teens or older had a very poor survival rate compared with younger males. It seems that, back in the day, teenage boys were not considered "children."

For females, age seems to matter much less. There appears to be a cutoff somewhere around ages 11 to 15, with a survival rate of about 50/50 (very young children dependent on others?).

To find a good cutoff point for "child" versus "adult," we can zoom in on ages strictly between 10 and 15.


In [148]:
train.loc[(train['Age'] < 15) & (train['Age'] > 10), ['Age', 'Survived', 'Sex']].sort_values(['Sex', 'Age'])


Out[148]:
Age Survived Sex
PassengerId
543 11.0 0 female
447 13.0 1 female
781 13.0 1 female
10 14.0 1 female
15 14.0 0 female
40 14.0 1 female
436 14.0 1 female
112 14.5 0 female
60 11.0 0 male
732 11.0 0 male
803 11.0 1 male
126 12.0 1 male
684 14.0 0 male
687 14.0 0 male

We can treat anyone aged 12 or below as a "child" and anyone aged 13 or above as an "adult." This captures the border cases of two 13-year-old girls surviving and a 12-year-old boy surviving.


In [149]:
train['Child'] = train['Age'] <= 12
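
One caveat with this comparison: `NaN <= 12` evaluates to False, so the 177 training passengers with no recorded age are all flagged as non-children. A minimal sketch of an alternative that keeps "unknown" distinct, using pandas' nullable boolean dtype (the toy ages are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative ages, including one missing value.
ages = pd.Series([4.0, 12.0, 13.0, np.nan], name='Age')

# Plain comparison: the missing age silently becomes False.
child_plain = ages <= 12

# Nullable variant: the missing age stays missing.
child_nullable = (ages <= 12).astype('boolean')
child_nullable[ages.isna()] = pd.NA

print(child_plain.tolist())
print(child_nullable.tolist())
```

Whether to keep the flag nullable or to impute ages first is a design choice; the second option fits naturally with the title-based imputation idea in the next section.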

Missing Ages

We observed that there are no missing titles. One way to impute missing ages is to check how ages are distributed within each title:


In [203]:
train.loc[train['AdjTitle'] == 'Master', 'Age'].describe()


Out[203]:
count    36.000000
mean      4.574167
std       3.619872
min       0.420000
25%       1.000000
50%       3.500000
75%       8.000000
max      12.000000
Name: Age, dtype: float64

In [204]:
train.loc[train['AdjTitle'] == 'Mr', 'Age'].describe()


Out[204]:
count    398.000000
mean      32.368090
std       12.708793
min       11.000000
25%       23.000000
50%       30.000000
75%       39.000000
max       80.000000
Name: Age, dtype: float64

So any male with the title 'Master' is no more than 12! This puts him in the (luckier) basket of male children, increasing his survival odds. Similarly, most (but not all) males who are 'Mr' are above 12, making them less likely to be lucky.


In [198]:
train[train['Title'] == 'Miss']['Age'].describe()


Out[198]:
count    146.000000
mean      21.773973
std       12.990292
min        0.750000
25%       14.125000
50%       21.000000
75%       30.000000
max       63.000000
Name: Age, dtype: float64
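
One possible way to act on these per-title distributions is to fill each missing age with the median age of the passenger's title. This is a sketch under that assumption: the choice of the median is not something the analysis above settles, and the toy values are illustrative rather than taken from the dataset:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing age per title.
toy = pd.DataFrame({
    'AdjTitle': ['Master', 'Master', 'Master', 'Mr', 'Mr', 'Mr'],
    'Age': [2.0, 6.0, np.nan, 30.0, 40.0, np.nan],
})

# Median age per title, broadcast back to every row of that title.
title_median = toy.groupby('AdjTitle')['Age'].transform('median')

# Only the missing ages are replaced; known ages are kept as they are.
toy['AgeFilled'] = toy['Age'].fillna(title_median)
print(toy)
```

Because every 'Master' in the training data is 12 or younger, an imputation like this would also place those boys on the "child" side of the cutoff chosen earlier.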