Titanic: Machine Learning from Disaster - Data Wrangling

Homepage: https://github.com/tien-le/kaggle-titanic

unbelivable ... to achieve 1.000. How did they do this?

Just curious, how did they cheat the score? ANS: maybe, we have the information existing in https://www.encyclopedia-titanica.org/titanic-victims/

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

Binary classification
Python and R basics

References

https://www.kaggle.com/c/titanic

https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/

https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/

https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic

Data overview

The data has been split into two groups:

training set (train.csv)
test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable	Definition	Key
eq	qe	qe
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton
Variable	Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

Data Exploration

Five Steps: Variable Identification, Uni-variate Analysis, Bi-variate Analysis, Missing Values Imputation, Outlier Treament

Step 1. Variable Identification

Identify Preditor (input) variables + Target (output) variables
Identify the data type and category of variables



In [3]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import random



In [4]:

    
trn_corpus = pd.read_csv("data/train.csv")

#889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S --> containing NaN
trn_corpus.set_index("PassengerId", inplace=True)
trn_corpus.info()
trn_corpus.describe()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB






    Out[4]:







  
    
      
      Survived
      Pclass
      Age
      SibSp
      Parch
      Fare
    
  
  
    
      count
      891.000000
      891.000000
      714.000000
      891.000000
      891.000000
      891.000000
    
    
      mean
      0.383838
      2.308642
      29.699118
      0.523008
      0.381594
      32.204208
    
    
      std
      0.486592
      0.836071
      14.526497
      1.102743
      0.806057
      49.693429
    
    
      min
      0.000000
      1.000000
      0.420000
      0.000000
      0.000000
      0.000000
    
    
      25%
      0.000000
      2.000000
      20.125000
      0.000000
      0.000000
      7.910400
    
    
      50%
      0.000000
      3.000000
      28.000000
      0.000000
      0.000000
      14.454200
    
    
      75%
      1.000000
      3.000000
      38.000000
      1.000000
      0.000000
      31.000000
    
    
      max
      1.000000
      3.000000
      80.000000
      8.000000
      6.000000
      512.329200



In [5]:

    
trn_corpus.head()









    Out[5]:







  
    
      
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S



In [6]:

    
tst_corpus = pd.read_csv("data/test.csv")

tst_corpus.set_index("PassengerId", inplace=True)
tst_corpus.info()
tst_corpus.describe()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB






    Out[6]:







  
    
      
      Pclass
      Age
      SibSp
      Parch
      Fare
    
  
  
    
      count
      418.000000
      332.000000
      418.000000
      418.000000
      417.000000
    
    
      mean
      2.265550
      30.272590
      0.447368
      0.392344
      35.627188
    
    
      std
      0.841838
      14.181209
      0.896760
      0.981429
      55.907576
    
    
      min
      1.000000
      0.170000
      0.000000
      0.000000
      0.000000
    
    
      25%
      1.000000
      21.000000
      0.000000
      0.000000
      7.895800
    
    
      50%
      3.000000
      27.000000
      0.000000
      0.000000
      14.454200
    
    
      75%
      3.000000
      39.000000
      1.000000
      0.000000
      31.500000
    
    
      max
      3.000000
      76.000000
      8.000000
      9.000000
      512.329200



In [7]:

    
tst_corpus.head()









    Out[7]:







  
    
      
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      892
      3
      Kelly, Mr. James
      male
      34.5
      0
      0
      330911
      7.8292
      NaN
      Q
    
    
      893
      3
      Wilkes, Mrs. James (Ellen Needs)
      female
      47.0
      1
      0
      363272
      7.0000
      NaN
      S
    
    
      894
      2
      Myles, Mr. Thomas Francis
      male
      62.0
      0
      0
      240276
      9.6875
      NaN
      Q
    
    
      895
      3
      Wirz, Mr. Albert
      male
      27.0
      0
      0
      315154
      8.6625
      NaN
      S
    
    
      896
      3
      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
      female
      22.0
      1
      1
      3101298
      12.2875
      NaN
      S

Adding Column "Survived" from file "gender_submission.csv"



In [8]:

    
expected_labels = pd.read_csv("data/gender_submission.csv")

expected_labels.set_index("PassengerId", inplace=True)
expected_labels.info()
expected_labels.describe()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 1 columns):
Survived    418 non-null int64
dtypes: int64(1)
memory usage: 6.5 KB






    Out[8]:







  
    
      
      Survived
    
  
  
    
      count
      418.000000
    
    
      mean
      0.363636
    
    
      std
      0.481622
    
    
      min
      0.000000
    
    
      25%
      0.000000
    
    
      50%
      0.000000
    
    
      75%
      1.000000
    
    
      max
      1.000000



In [9]:

    
expected_labels.head()









    Out[9]:







  
    
      
      Survived
    
    
      PassengerId
      
    
  
  
    
      892
      0
    
    
      893
      1
    
    
      894
      0
    
    
      895
      0
    
    
      896
      1



In [10]:

    
trn_corpus.index.names









    Out[10]:





FrozenList(['PassengerId'])



In [11]:

    
expected_labels.index.names









    Out[11]:





FrozenList(['PassengerId'])



In [12]:

    
#pd.merge(tst_corpus, expected_labels, how="inner", on="PassengerId")

tst_corpus_having_expected_label = pd.concat([tst_corpus, expected_labels], axis=1, join='inner')
tst_corpus_having_expected_label.head()









    Out[12]:







  
    
      
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Survived
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      892
      3
      Kelly, Mr. James
      male
      34.5
      0
      0
      330911
      7.8292
      NaN
      Q
      0
    
    
      893
      3
      Wilkes, Mrs. James (Ellen Needs)
      female
      47.0
      1
      0
      363272
      7.0000
      NaN
      S
      1
    
    
      894
      2
      Myles, Mr. Thomas Francis
      male
      62.0
      0
      0
      240276
      9.6875
      NaN
      Q
      0
    
    
      895
      3
      Wirz, Mr. Albert
      male
      27.0
      0
      0
      315154
      8.6625
      NaN
      S
      0
    
    
      896
      3
      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
      female
      22.0
      1
      1
      3101298
      12.2875
      NaN
      S
      1



In [13]:

    
print("Columns name: ", trn_corpus.columns)
print("Num of columns: ", len(trn_corpus.columns))
print("Num of rows: ", len(trn_corpus.index)) #trn_corpus.shape[0]

trn_corpus_size = len(trn_corpus.index)









    



Columns name:  Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Num of columns:  11
Num of rows:  891



In [14]:

    
print("Columns name: ", tst_corpus.columns)
print("Num of columns: ", len(tst_corpus.columns))
print("Num of rows: ", len(tst_corpus.index)) #tst_corpus.shape[0]

tst_corpus_size = len(tst_corpus.index)









    



Columns name:  Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')
Num of columns:  10
Num of rows:  418

Overview of Data using visualization



In [15]:

    
#sns.pairplot(trn_corpus.dropna())



In [16]:

    
#sns.pairplot(tst_corpus.dropna())

Concatenating trn_corpus and tst_corpus_having_expected_label using append



In [17]:

    
df = trn_corpus.append(tst_corpus_having_expected_label)

print("Columns name: ", df.columns)
print("Num of columns: ", len(df.columns))
print("Num of rows: ", len(df.index)) #trn_corpus.shape[0]

print("Sum of trn_corpus_size and tst_corpus_size: ", trn_corpus_size + tst_corpus_size)









    



Columns name:  Index(['Age', 'Cabin', 'Embarked', 'Fare', 'Name', 'Parch', 'Pclass', 'Sex',
       'SibSp', 'Survived', 'Ticket'],
      dtype='object')
Num of columns:  11
Num of rows:  1309
Sum of trn_corpus_size and tst_corpus_size:  1309



In [18]:

    
df.info()
df.describe()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1046 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Fare        1308 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    1309 non-null int64
Ticket      1309 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB






    Out[18]:







  
    
      
      Age
      Fare
      Parch
      Pclass
      SibSp
      Survived
    
  
  
    
      count
      1046.000000
      1308.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
    
    
      mean
      29.881138
      33.295479
      0.385027
      2.294882
      0.498854
      0.377387
    
    
      std
      14.413493
      51.758668
      0.865560
      0.837836
      1.041658
      0.484918
    
    
      min
      0.170000
      0.000000
      0.000000
      1.000000
      0.000000
      0.000000
    
    
      25%
      21.000000
      7.895800
      0.000000
      2.000000
      0.000000
      0.000000
    
    
      50%
      28.000000
      14.454200
      0.000000
      3.000000
      0.000000
      0.000000
    
    
      75%
      39.000000
      31.275000
      0.000000
      3.000000
      1.000000
      1.000000
    
    
      max
      80.000000
      512.329200
      9.000000
      3.000000
      8.000000
      1.000000



In [ ]:

Answer for Step 1:

1. Preditor (input) Variables and Data type

PassengerId 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object

2. Target (output) Variables and Data Type

Survived 891 non-null int64

3. Category of Variables

Continuous variables
- PassengerId 891 non-null int64 #primary key
- Age 714 non-null float64
- Fare 891 non-null float64
Categorial variables
- Name 891 non-null object
- Sex 891 non-null object
- Ticket 891 non-null object
- Cabin 204 non-null object
- Embarked 889 non-null object # embarked -- Port of Embarkation -- C = Cherbourg, Q = Queenstown, S = Southampton
- SibSp 891 non-null int64 # # of siblings / spouses aboard the Titanic -- [1 0 3 4 2 5 8] ; 7 items
- Parch 891 non-null int64 # # of parents / children aboard the Titanic -- [0 1 2 5 3 4 6] ; 7 items
- Survived 891 non-null int64 #survival -- Survival -- 0 = No, 1 = Yes
- Pclass 891 non-null int64 #pclass -- Ticket class -- 1 = 1st, 2 = 2nd, 3 = 3rd

Verify the unique data in each variables



In [19]:

    
#df.head()



In [20]:

    
#print("PassengerId:", df["PassengerId"].unique(), ";", df["PassengerId"].nunique(), "items")
print("Survived:", df["Survived"].unique(), ";", df["Survived"].nunique(), "items")
print("Pclass:", df["Pclass"].unique(), ";", df["Pclass"].nunique(), "Pclass")
#print("Name:", df["Name"].unique(), ";", df["Name"].nunique(), "items")
print("Sex:", df["Sex"].unique(), ";", df["Sex"].nunique(), "items")
#print("Age:", df["Age"].unique(), ";", df["Age"].nunique(), "items")
print("SibSp:", df["SibSp"].unique(), ";", df["SibSp"].nunique(), "items")
print("Parch:", df["Parch"].unique(), ";", df["Parch"].nunique(), "items")
#print("Ticket:", df["Ticket"].unique(), ";", df["Ticket"].nunique(), "items") # 681 items
#print("Fare:", df["Fare"].unique(), ";", df["Fare"].nunique(), "items") # 248 items
#print("Cabin:", df["Cabin"].unique(), ";", df["Cabin"].nunique(), "items") # 147 items
print("Embarked:", df["Embarked"].unique(), ";", df["Embarked"].nunique(), "items")









    



Survived: [0 1] ; 2 items
Pclass: [3 1 2] ; 3 Pclass
Sex: ['male' 'female'] ; 2 items
SibSp: [1 0 3 4 2 5 8] ; 7 items
Parch: [0 1 2 5 3 4 6 9] ; 8 items
Embarked: ['S' 'C' 'Q' nan] ; 3 items

Bonus - Step 1:

Ref: https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/



In [21]:

    
trn_corpus.describe()

So we use read_csv since that is the form (comma separated values), the data is in. Pandas automatically gave the columns names from the header and inferred the data types. For large data sets it is recommended that you specify the data types manually.

Notice that the age, cabin and embarked columns have null values. Also we apparently have some free-loaders because the minimum fare is 0. We might think that these are babies, so let’s check that:



In [22]:

    
trn_corpus[['Age','Fare']][trn_corpus.Fare < 5]









    Out[22]:







  
    
      
      Age
      Fare
    
    
      PassengerId
      
      
    
  
  
    
      180
      36.0
      0.0000
    
    
      264
      40.0
      0.0000
    
    
      272
      25.0
      0.0000
    
    
      278
      NaN
      0.0000
    
    
      303
      19.0
      0.0000
    
    
      379
      20.0
      4.0125
    
    
      414
      NaN
      0.0000
    
    
      467
      NaN
      0.0000
    
    
      482
      NaN
      0.0000
    
    
      598
      49.0
      0.0000
    
    
      634
      NaN
      0.0000
    
    
      675
      NaN
      0.0000
    
    
      733
      NaN
      0.0000
    
    
      807
      39.0
      0.0000
    
    
      816
      NaN
      0.0000
    
    
      823
      38.0
      0.0000

These guys are surely old enough to know better! But notice that there is a jump from a fare of 0 to 4, so there is something going on here, most likely these are errors, so let’s replace them by the mean fare for their class, and do the same for null values.



In [23]:

    
df.nunique()









    Out[23]:





Age           98
Cabin        186
Embarked       3
Fare         281
Name        1307
Parch          8
Pclass         3
Sex            2
SibSp          7
Survived       2
Ticket       929
dtype: int64



In [24]:

    
df["Fare"].fillna(0.0, inplace = True)



In [25]:

    
df[df["Fare"].isnull()]









    Out[25]:







  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
    
    
      PassengerId



In [26]:

    
#first we set those fares of 0 to nan ==> Not used
#trn_corpus.Fare = trn_corpus.Fare.map(lambda x: np.nan if x==0 else x)
#df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)



In [27]:

    
#not that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus



In [28]:

    
df.nunique()









    Out[28]:





Age           98
Cabin        186
Embarked       3
Fare         281
Name        1307
Parch          8
Pclass         3
Sex            2
SibSp          7
Survived       2
Ticket       929
dtype: int64



In [29]:

    
trn_corpus.nunique()









    Out[29]:





Survived      2
Pclass        3
Name        891
Sex           2
Age          88
SibSp         7
Parch         7
Ticket      681
Fare        248
Cabin       147
Embarked      3
dtype: int64



In [30]:

    
#df.head()



In [31]:

    
#trn_corpus.head()



In [32]:

    
classmeans_trn_corpus.query('Pclass == 3')



In [33]:

    
classmeans_trn_corpus.xs(3)["Fare"]









    Out[33]:





13.675550101832997



In [34]:

    
classmeans_trn_corpus.query('Pclass == 3')

Step 2. Uni-variate Analysis

In this step, we explore the variables one by one. It depends on the variable type: Continuous or Categorial.

Continuous Variables

--> Understanding the central tendency and spread of the variables.

Central Tendency: mean, mode, median, min, max
Measure of Dispersion: range, Quartile, IQR (Interquartile Range), Variance, Standard Deviation, Skewness, Kurtosis
Visualization Methods: Histogram, Box Plot

Continuous variables
- Age 714 non-null float64
- Fare 891 non-null float64

Central Tendency: mean, mode, median, min, max



In [35]:

    
print("Central Tendency - for Age")
trn_corpus["Age"].describe()









    



Central Tendency - for Age






    Out[35]:





count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64



In [36]:

    
trn_corpus_Age_dropna = trn_corpus["Age"].dropna()



In [37]:

    
#Ref: https://docs.python.org/3/library/statistics.html
import statistics

corpus_stat = trn_corpus_Age_dropna.copy()
    
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Age")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)

print("Mode (most common value) of discrete data = ", statistics.mode(trn_corpus["Age"]))

print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Age_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.

print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat)) 
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))

print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Age"].mean())
print("Max = ", trn_corpus["Age"].max())
print("Min = ", trn_corpus["Age"].min())
print("Count = ", trn_corpus["Age"].count())









    



====================================
====================================
Averages and measures of central location - Age
These functions calculate an average or typical value from a population or sample.
------------------------------------
Mode (most common value) of discrete data =  24.0
Arithmetic mean (“average”) of data =  29.6991176471
Median (middle value) of data =  28.0
Median, or 50th percentile, of grouped data =  28.3
Low median of data =  28.0
High median of data =  28.0
------------------------------------
Arithmetic mean (“average”) of data =  29.6991176471
Max =  80.0
Min =  0.42
Count =  714



In [38]:

    
print("Central Tendency - for Fare")
trn_corpus["Fare"].describe()









    



Central Tendency - for Fare






    Out[38]:





count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64



In [39]:

    
trn_corpus_Fare_dropna = trn_corpus["Fare"].dropna()



In [40]:

    
#Ref: https://docs.python.org/3/library/statistics.html
import statistics

corpus_stat = trn_corpus_Fare_dropna.copy()
    
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Age")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)

print("Mode (most common value) of discrete data = ", statistics.mode(trn_corpus["Fare"]))

print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Age_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.

print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat)) 
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))

print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Fare"].mean())
print("Max = ", trn_corpus["Fare"].max())
print("Min = ", trn_corpus["Fare"].min())
print("Count = ", trn_corpus["Fare"].count())









    



====================================
====================================
Averages and measures of central location - Age
These functions calculate an average or typical value from a population or sample.
------------------------------------
Mode (most common value) of discrete data =  8.05
Arithmetic mean (“average”) of data =  32.2042079686
Median (middle value) of data =  14.4542
Median, or 50th percentile, of grouped data =  14.7399142857
Low median of data =  14.4542
High median of data =  14.4542
------------------------------------
Arithmetic mean (“average”) of data =  32.2042079686
Max =  512.3292
Min =  0.0
Count =  891

Measure of Dispersion: range, Quartile, IQR (Interquartile Range), Variance, Standard Deviation, Skewness, Kurtosis

Ref: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/generic.py#L5665-L5968

    For numeric data, the result's index will include ``count``,
    ``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
    upper percentiles. By default the lower percentile is ``25`` and the
    upper percentile is ``75``. The ``50`` percentile is the
    same as the median.



In [41]:

    
print("=" * 36)
print("=" * 36)

corpus_stat = trn_corpus_Age_dropna.copy()

print("Measures of spread - Age")
print("""These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.""")

print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))

print("-" * 36)
corpus_stat = trn_corpus["Age"].copy()

print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
print("Quartile 25%, 50%, 75% = ", corpus_stat.describe()[['25%','50%','75%']][0], 
      corpus_stat.describe()[['25%','50%','75%']][1], 
      corpus_stat.describe()[['25%','50%','75%']][2])
print(corpus_stat.describe()[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", 
      corpus_stat.describe()[['25%','50%','75%']][2] - corpus_stat.describe()[['25%','50%','75%']][0])
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())

print("Skewness = ", corpus_stat.skew()) 
print("Kurtosis = ", corpus_stat.kurtosis())









    



====================================
====================================
Measures of spread - Age
These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.
------------------------------------
Population standard deviation of data =  14.516321150817316
Population variance of data =  210.723579754
Sample standard deviation of data =  14.526497332334042
Sample variance of data =  211.019124746
------------------------------------
Range = max - min =  79.58
Quartile 25%, 50%, 75% =  20.125 28.0 38.0
25%    20.125
50%    28.000
75%    38.000
Name: Age, dtype: float64
IQR (Interquartile Range) = Q3-Q1 =  17.875
Variance =  211.019124746
Standard Deviation =  14.5264973323
Skewness =  0.389107782301
Kurtosis =  0.178274153642

Comments:

Skewness > 0 ==> Positively skewed or Skewed to the right
Kurtosis > 0 ==> Fatter tail (Leptokurtic). Ref: http://www.investopedia.com/terms/l/leptokurtic.asp



In [42]:

    
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f12913630>



In [43]:

    
trn_corpus["Age"].describe()









    Out[43]:





count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64



In [44]:

    
print("=" * 36)
print("=" * 36)

corpus_stat = trn_corpus_Fare_dropna.copy()

print("Measures of spread - Fare")
print("""These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.""")

print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))

print("-" * 36)
corpus_stat = trn_corpus["Fare"].copy()

print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
print("Quartile 25%, 50%, 75% = ", corpus_stat.describe()[['25%','50%','75%']][0], 
      corpus_stat.describe()[['25%','50%','75%']][1], 
      corpus_stat.describe()[['25%','50%','75%']][2])
print(corpus_stat.describe()[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", 
      corpus_stat.describe()[['25%','50%','75%']][2] - corpus_stat.describe()[['25%','50%','75%']][0])
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())

print("Skewness = ", corpus_stat.skew()) 
print("Kurtosis = ", corpus_stat.kurtosis())









    



====================================
====================================
Measures of spread - Fare
These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.
------------------------------------
Population standard deviation of data =  49.66553444477411
Population variance of data =  2466.66531169
Sample standard deviation of data =  49.6934285971809
Sample variance of data =  2469.43684574
------------------------------------
Range = max - min =  512.3292
Quartile 25%, 50%, 75% =  7.9104 14.4542 31.0
25%     7.9104
50%    14.4542
75%    31.0000
Name: Fare, dtype: float64
IQR (Interquartile Range) = Q3-Q1 =  23.0896
Variance =  2469.43684574
Standard Deviation =  49.6934285972
Skewness =  4.78731651967
Kurtosis =  33.3981408809

Comments:

Skewness > 0 ==> Positively skewed or Skewed to the right
Kurtosis > 0 ==> Fatter tail (Leptokurtic). Ref: http://www.investopedia.com/terms/l/leptokurtic.asp



In [45]:

    
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0fe290b8>



In [46]:

    
trn_corpus["Fare"].describe()









    Out[46]:





count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

Visualization Methods: Histogram, Box Plot



In [47]:

    
trn_corpus_Age_dropna.head()









    Out[47]:





PassengerId
1    22.0
2    38.0
3    26.0
4    35.0
5    35.0
Name: Age, dtype: float64



In [48]:

    
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)









    Out[48]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0fe357f0>



In [49]:

    
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Age_dropna, fit=norm, kde=False)



In [50]:

    
#ax = sns.distplot(trn_corpus_Age_dropna, vertical=True, color="y")



In [51]:

    
ax = sns.distplot(trn_corpus_Age_dropna, rug=True, rug_kws={"color": "g"},
                  kde_kws={"color": "k", "lw": 3, "label": "KDE"},
                  hist_kws={"histtype": "step", "linewidth": 3,
                  "alpha": 1, "color": "g"})



In [52]:

    
sns.boxplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)









    Out[52]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0daaccc0>

For boxplots, the assumption when using a hue variable is that it is nested within the x or y variable. This means that by default, the boxes for different levels of hue will be offset, as you can see above. If your hue variable is not nested, you can set the dodge parameter to disable offsetting: Ref: http://seaborn.pydata.org/tutorial/categorical.html



In [53]:

    
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Age", hue="survival", data=trn_corpus, dodge=False);



In [54]:

    
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)



In [55]:

    
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True)



In [56]:

    
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True, inner="stick", palette="Set3");



In [57]:

    
#sns.violinplot(x="Survived", y="Age", data=trn_corpus, inner=None)
#sns.swarmplot(x="Survived", y="Age", data=trn_corpus, color="w", alpha=.5);



In [58]:

    
ax = sns.distplot(trn_corpus_Fare_dropna, rug=True, rug_kws={"color": "g"},
                  kde_kws={"color": "k", "lw": 3, "label": "KDE"},
                  hist_kws={"histtype": "step", "linewidth": 3,
                  "alpha": 1, "color": "g"})



In [59]:

    
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)









    Out[59]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0cd91b00>



In [60]:

    
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Fare_dropna, fit=norm, kde=False)



In [61]:

    
sns.boxplot(x="Survived", y="Fare", hue="Sex", data=trn_corpus)









    Out[61]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0dc64f60>



In [62]:

    
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Fare", hue="survival", data=trn_corpus, dodge=False);

Categorial Variables



In [63]:

    
sns.countplot(x = "Sex", data = trn_corpus)









    Out[63]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0dba8240>



In [64]:

    
sns.barplot(x = "Sex", y = "Survived", data = trn_corpus, estimator=np.std)









    Out[64]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0c77ac88>

Step 3. Bi-variate Analysis

Continuous & Continuous

Categorial & Categorial

Categorial & Continuous

Step 4. Missing/Special Value Treatment

Missing Value Treatment

Column "Age" - Missing Value

Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.



In [65]:

    
# Duplicate one column Age in order to Fillna with meanAge of each Title (After having Title)
trn_corpus["AgeUsingMeanTitle"] = trn_corpus["Age"] 

meanAge_trn_corpus = np.mean(trn_corpus["Age"])
trn_corpus["Age"] = trn_corpus["Age"].fillna(meanAge_trn_corpus)

trn_corpus.head()









    Out[65]:







  
    
      
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      survival
      AgeUsingMeanTitle
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
      True
      22.0
    
    
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
      True
      38.0
    
    
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
      True
      26.0
    
    
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
      True
      35.0
    
    
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S
      True
      35.0



In [66]:

    
# Duplicate one column Age in order to Fillna with meanAge of each Title (After having Title)
df["AgeUsingMeanTitle"] = df["Age"] 

meanAge_df = np.mean(df["Age"])
df["Age"] = df["Age"].fillna(meanAge_df)

df.head()









    Out[66]:







  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22.0
      NaN
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      A/5 21171
      22.0
    
    
      2
      38.0
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      PC 17599
      38.0
    
    
      3
      26.0
      NaN
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      STON/O2. 3101282
      26.0
    
    
      4
      35.0
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      113803
      35.0
    
    
      5
      35.0
      NaN
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      373450
      35.0

Column "Cabin" - Missing Value

Now for the cabin, since the majority of values are missing, it might be best to treat that as a piece of information itself, so we’ll set these to be ‘Unknown’.



In [67]:

    
#trn_corpus["Cabin"] = trn_corpus["Cabin"].fillna('Unknown') # because we will check Nan in the next step

Column "Embarked" - Missing Value

We set feature embarked having NaN to be the majority of column Embarked.



In [68]:

    
trn_corpus["Embarked"].describe()









    Out[68]:





count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object



In [69]:

    
trn_corpus["Embarked"].describe()["top"]









    Out[69]:





'S'



In [70]:

    
df["Embarked"].describe()["top"]









    Out[70]:





'S'



In [71]:

    
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].describe()["top"])

df.head()









    Out[71]:







  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22.0
      NaN
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      A/5 21171
      22.0
    
    
      2
      38.0
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      PC 17599
      38.0
    
    
      3
      26.0
      NaN
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      STON/O2. 3101282
      26.0
    
    
      4
      35.0
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      113803
      35.0
    
    
      5
      35.0
      NaN
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      373450
      35.0



In [72]:

    
df["Embarked"].unique()









    Out[72]:





array(['S', 'C', 'Q'], dtype=object)

Special Value Treatment --> Ex: Fare = 0.0



In [73]:

    
#not that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus



In [74]:

    
#not that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_tst_corpus = tst_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_tst_corpus



In [75]:

    
#not that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_df = df.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_df



In [76]:

    
classmeans_trn_corpus.xs(3)["Fare"]









    Out[76]:





13.675550101832997



In [77]:

    
#Ref: https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/

#Remove Primary key (index)
trn_corpus.reset_index(inplace=True)
tst_corpus.reset_index(inplace=True)
df.reset_index(inplace=True)



In [78]:

    
trn_corpus.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId          891 non-null int64
Survived             891 non-null int64
Pclass               891 non-null int64
Name                 891 non-null object
Sex                  891 non-null object
Age                  891 non-null float64
SibSp                891 non-null int64
Parch                891 non-null int64
Ticket               891 non-null object
Fare                 891 non-null float64
Cabin                204 non-null object
Embarked             889 non-null object
survival             891 non-null bool
AgeUsingMeanTitle    714 non-null float64
dtypes: bool(1), float64(3), int64(5), object(5)
memory usage: 91.4+ KB



In [79]:

    
list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0]["PassengerId"])
#list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0].index) #because we did set_index to df
print(list_passenger_id_having_Fare_zero)

#so apply acts on dataframes, either row-wise or column-wise, axis=1 means rows
trn_corpus["Fare"] = trn_corpus[['Fare', 'Pclass']].apply(lambda x: classmeans_trn_corpus.xs(x['Pclass'])["Fare"]
                                                        if x['Fare']==0.0 else x['Fare'], axis=1 )

#trn_corpus[trn_corpus["PassengerId"].apply(lambda x: x in list_passenger_id_having_Fare_zero)]
trn_corpus[trn_corpus.index.isin(list_passenger_id_having_Fare_zero)]









    



[180, 264, 272, 278, 303, 414, 467, 482, 598, 634, 675, 733, 807, 816, 823]






    Out[79]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      survival
      AgeUsingMeanTitle
    
  
  
    
      180
      181
      0
      3
      Sage, Miss. Constance Gladys
      female
      29.699118
      8
      2
      CA. 2343
      69.550
      NaN
      S
      True
      NaN
    
    
      264
      265
      0
      3
      Henry, Miss. Delia
      female
      29.699118
      0
      0
      382649
      7.750
      NaN
      Q
      True
      NaN
    
    
      272
      273
      1
      2
      Mellinger, Mrs. (Elizabeth Anne Maidment)
      female
      41.000000
      0
      1
      250644
      19.500
      NaN
      S
      True
      41.0
    
    
      278
      279
      0
      3
      Rice, Master. Eric
      male
      7.000000
      4
      1
      382652
      29.125
      NaN
      Q
      True
      7.0
    
    
      303
      304
      1
      2
      Keane, Miss. Nora A
      female
      29.699118
      0
      0
      226593
      12.350
      E101
      Q
      True
      NaN
    
    
      414
      415
      1
      3
      Sundman, Mr. Johan Julian
      male
      44.000000
      0
      0
      STON/O 2. 3101269
      7.925
      NaN
      S
      True
      44.0
    
    
      467
      468
      0
      1
      Smart, Mr. John Montgomery
      male
      56.000000
      0
      0
      113792
      26.550
      NaN
      S
      True
      56.0
    
    
      482
      483
      0
      3
      Rouse, Mr. Richard Henry
      male
      50.000000
      0
      0
      A/5 3594
      8.050
      NaN
      S
      True
      50.0
    
    
      598
      599
      0
      3
      Boulos, Mr. Hanna
      male
      29.699118
      0
      0
      2664
      7.225
      NaN
      C
      True
      NaN
    
    
      634
      635
      0
      3
      Skoog, Miss. Mabel
      female
      9.000000
      3
      2
      347088
      27.900
      NaN
      S
      True
      9.0
    
    
      675
      676
      0
      3
      Edvardsson, Mr. Gustaf Hjalmar
      male
      18.000000
      0
      0
      349912
      7.775
      NaN
      S
      True
      18.0
    
    
      733
      734
      0
      2
      Berriman, Mr. William John
      male
      23.000000
      0
      0
      28425
      13.000
      NaN
      S
      True
      23.0
    
    
      807
      808
      0
      3
      Pettersson, Miss. Ellen Natalia
      female
      18.000000
      0
      0
      347087
      7.775
      NaN
      S
      True
      18.0
    
    
      816
      817
      0
      3
      Heininen, Miss. Wendla Maria
      female
      23.000000
      0
      0
      STON/O2. 3101290
      7.925
      NaN
      S
      True
      23.0
    
    
      823
      824
      1
      3
      Moor, Mrs. (Beila)
      female
      27.000000
      0
      1
      392096
      12.475
      E121
      S
      True
      27.0



In [80]:

    
trn_corpus.index.names









    Out[80]:





FrozenList([None])



In [81]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1046 non-null float64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB



In [82]:

    
#df["Fare"].unique() #contain nan from tst_corpus



In [83]:

    
classmeans_df



In [84]:

    
list_passenger_id_having_Fare_zero_df = list(df[df["AgeUsingMeanTitle"].isnull()]["PassengerId"])
#list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0].index) #because we did set_index to df
print(len(list_passenger_id_having_Fare_zero_df))

#so apply acts on dataframes, either row-wise or column-wise, axis=1 means rows
df["Fare"] = df[['Fare', 'Pclass']].apply(lambda x: classmeans_df.xs(x['Pclass'])["Fare"]
                                                        if x['Fare'] is np.nan else x['Fare'], axis=1 )#if x['Fare'] == 0.0 else x['Fare'], axis=1 )

#trn_corpus[trn_corpus["PassengerId"].apply(lambda x: x in list_passenger_id_having_Fare_zero)]
df[df.index.isin(list_passenger_id_having_Fare_zero_df)]









    



263






    Out[84]:







  
    
      
      PassengerId
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
    
  
  
    
      6
      7
      54.000000
      E46
      S
      51.8625
      McCarthy, Mr. Timothy J
      0
      1
      male
      0
      0
      17463
      54.00
    
    
      18
      19
      31.000000
      NaN
      S
      18.0000
      Vander Planke, Mrs. Julius (Emelia Maria Vande...
      0
      3
      female
      1
      0
      345763
      31.00
    
    
      20
      21
      35.000000
      NaN
      S
      26.0000
      Fynney, Mr. Joseph J
      0
      2
      male
      0
      0
      239865
      35.00
    
    
      27
      28
      19.000000
      C23 C25 C27
      S
      263.0000
      Fortune, Mr. Charles Alexander
      2
      1
      male
      3
      0
      19950
      19.00
    
    
      29
      30
      29.881138
      NaN
      S
      7.8958
      Todoroff, Mr. Lalio
      0
      3
      male
      0
      0
      349216
      NaN
    
    
      30
      31
      40.000000
      NaN
      C
      27.7208
      Uruchurtu, Don. Manuel E
      0
      1
      male
      0
      0
      PC 17601
      40.00
    
    
      32
      33
      29.881138
      NaN
      Q
      7.7500
      Glynn, Miss. Mary Agatha
      0
      3
      female
      0
      1
      335677
      NaN
    
    
      33
      34
      66.000000
      NaN
      S
      10.5000
      Wheadon, Mr. Edward H
      0
      2
      male
      0
      0
      C.A. 24579
      66.00
    
    
      37
      38
      21.000000
      NaN
      S
      8.0500
      Cann, Mr. Ernest Charles
      0
      3
      male
      0
      0
      A./5. 2152
      21.00
    
    
      43
      44
      3.000000
      NaN
      C
      41.5792
      Laroche, Miss. Simonne Marie Anne Andree
      2
      2
      female
      1
      1
      SC/Paris 2123
      3.00
    
    
      46
      47
      29.881138
      NaN
      Q
      15.5000
      Lennon, Mr. Denis
      0
      3
      male
      1
      0
      370371
      NaN
    
    
      47
      48
      29.881138
      NaN
      Q
      7.7500
      O'Driscoll, Miss. Bridget
      0
      3
      female
      0
      1
      14311
      NaN
    
    
      48
      49
      29.881138
      NaN
      C
      21.6792
      Samaan, Mr. Youssef
      0
      3
      male
      2
      0
      2662
      NaN
    
    
      49
      50
      18.000000
      NaN
      S
      17.8000
      Arnold-Franchi, Mrs. Josef (Josefine Franchi)
      0
      3
      female
      1
      0
      349237
      18.00
    
    
      56
      57
      21.000000
      NaN
      S
      10.5000
      Rugg, Miss. Emily
      0
      2
      female
      0
      1
      C.A. 31026
      21.00
    
    
      65
      66
      29.881138
      NaN
      C
      15.2458
      Moubarek, Master. Gerios
      1
      3
      male
      1
      1
      2661
      NaN
    
    
      66
      67
      29.000000
      F33
      S
      10.5000
      Nye, Mrs. (Elizabeth Ramell)
      0
      2
      female
      0
      1
      C.A. 29395
      29.00
    
    
      77
      78
      29.881138
      NaN
      S
      8.0500
      Moutal, Mr. Rahamin Haim
      0
      3
      male
      0
      0
      374746
      NaN
    
    
      78
      79
      0.830000
      NaN
      S
      29.0000
      Caldwell, Master. Alden Gates
      2
      2
      male
      0
      1
      248738
      0.83
    
    
      83
      84
      28.000000
      NaN
      S
      47.1000
      Carrau, Mr. Francisco M
      0
      1
      male
      0
      0
      113059
      28.00
    
    
      88
      89
      23.000000
      C23 C25 C27
      S
      263.0000
      Fortune, Miss. Mabel Helen
      2
      1
      female
      3
      1
      19950
      23.00
    
    
      96
      97
      71.000000
      A5
      C
      34.6542
      Goldschmidt, Mr. George B
      0
      1
      male
      0
      0
      PC 17754
      71.00
    
    
      102
      103
      21.000000
      D26
      S
      77.2875
      White, Mr. Richard Frasar
      1
      1
      male
      0
      0
      35281
      21.00
    
    
      108
      109
      38.000000
      NaN
      S
      7.8958
      Rekic, Mr. Tido
      0
      3
      male
      0
      0
      349249
      38.00
    
    
      110
      111
      47.000000
      C110
      S
      52.0000
      Porter, Mr. Walter Chamberlain
      0
      1
      male
      0
      0
      110465
      47.00
    
    
      122
      123
      32.500000
      NaN
      C
      30.0708
      Nasser, Mr. Nicholas
      0
      2
      male
      1
      0
      237736
      32.50
    
    
      127
      128
      24.000000
      NaN
      S
      7.1417
      Madsen, Mr. Fridtjof Arne
      0
      3
      male
      0
      1
      C 17369
      24.00
    
    
      129
      130
      45.000000
      NaN
      S
      6.9750
      Ekstrom, Mr. Johan
      0
      3
      male
      0
      0
      347061
      45.00
    
    
      141
      142
      22.000000
      NaN
      S
      7.7500
      Nysten, Miss. Anna Sofia
      0
      3
      female
      0
      1
      347081
      22.00
    
    
      155
      156
      51.000000
      NaN
      C
      61.3792
      Williams, Mr. Charles Duane
      1
      1
      male
      0
      0
      PC 17597
      51.00
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      1159
      1160
      29.881138
      NaN
      S
      8.0500
      Howard, Miss. May Elizabeth
      0
      3
      female
      0
      1
      A. 2. 39186
      NaN
    
    
      1160
      1161
      17.000000
      NaN
      S
      8.6625
      Pokrnic, Mr. Mate
      0
      3
      male
      0
      0
      315095
      17.00
    
    
      1163
      1164
      26.000000
      C89
      C
      136.7792
      Clark, Mrs. Walter Miller (Virginia McDowell)
      0
      1
      female
      1
      1
      13508
      26.00
    
    
      1165
      1166
      29.881138
      NaN
      C
      7.2250
      Saade, Mr. Jean Nassr
      0
      3
      male
      0
      0
      2676
      NaN
    
    
      1166
      1167
      20.000000
      NaN
      S
      26.0000
      Bryhl, Miss. Dagmar Jenny Ingeborg
      0
      2
      female
      1
      1
      236853
      20.00
    
    
      1174
      1175
      9.000000
      NaN
      C
      15.2458
      Touma, Miss. Maria Youssef
      1
      3
      female
      1
      1
      2650
      9.00
    
    
      1178
      1179
      24.000000
      B45
      S
      82.2667
      Snyder, Mr. John Pillsbury
      0
      1
      male
      1
      0
      21228
      24.00
    
    
      1180
      1181
      29.881138
      NaN
      S
      8.0500
      Ford, Mr. Arthur
      0
      3
      male
      0
      0
      A/5 1478
      NaN
    
    
      1181
      1182
      29.881138
      NaN
      S
      39.6000
      Rheims, Mr. George Alexander Lucien
      0
      1
      male
      0
      0
      PC 17607
      NaN
    
    
      1182
      1183
      30.000000
      NaN
      Q
      6.9500
      Daly, Miss. Margaret Marcella Maggie""
      0
      3
      female
      0
      1
      382650
      30.00
    
    
      1184
      1185
      53.000000
      A34
      S
      81.8583
      Dodge, Dr. Washington
      1
      1
      male
      1
      0
      33638
      53.00
    
    
      1189
      1190
      30.000000
      NaN
      S
      45.5000
      Loring, Mr. Joseph Holland
      0
      1
      male
      0
      0
      113801
      30.00
    
    
      1193
      1194
      43.000000
      NaN
      S
      21.0000
      Phillips, Mr. Escott Robert
      1
      2
      male
      0
      0
      S.O./P.P. 2
      43.00
    
    
      1196
      1197
      64.000000
      B26
      S
      26.5500
      Crosby, Mrs. Edward Gifford (Catherine Elizabe...
      1
      1
      female
      1
      1
      112901
      64.00
    
    
      1204
      1205
      37.000000
      NaN
      Q
      7.7500
      Carr, Miss. Jeannie
      0
      3
      female
      0
      1
      368364
      37.00
    
    
      1224
      1225
      19.000000
      NaN
      C
      15.7417
      Nakid, Mrs. Said (Waika Mary" Mowad)"
      1
      3
      female
      1
      1
      2653
      19.00
    
    
      1231
      1232
      18.000000
      NaN
      S
      10.5000
      Fillbrook, Mr. Joseph Charles
      0
      2
      male
      0
      0
      C.A. 15185
      18.00
    
    
      1234
      1235
      58.000000
      B51 B53 B55
      C
      512.3292
      Cardeza, Mrs. James Warburton Martinez (Charlo...
      1
      1
      female
      0
      1
      PC 17755
      58.00
    
    
      1236
      1237
      16.000000
      NaN
      S
      7.6500
      Abelseth, Miss. Karen Marie
      0
      3
      female
      0
      1
      348125
      16.00
    
    
      1249
      1250
      29.881138
      NaN
      Q
      7.7500
      O'Keefe, Mr. Patrick
      0
      3
      male
      0
      0
      368402
      NaN
    
    
      1250
      1251
      30.000000
      NaN
      S
      15.5500
      Lindell, Mrs. Edvard Bengtsson (Elin Gerda Per...
      0
      3
      female
      1
      1
      349910
      30.00
    
    
      1257
      1258
      29.881138
      NaN
      C
      14.4583
      Caram, Mr. Joseph
      0
      3
      male
      1
      0
      2689
      NaN
    
    
      1258
      1259
      22.000000
      NaN
      S
      39.6875
      Riihivouri, Miss. Susanna Juhantytar Sanni""
      0
      3
      female
      0
      1
      3101295
      22.00
    
    
      1272
      1273
      26.000000
      NaN
      Q
      7.8792
      Foley, Mr. Joseph
      0
      3
      male
      0
      0
      330910
      26.00
    
    
      1274
      1275
      19.000000
      NaN
      S
      16.1000
      McNamee, Mrs. Neal (Eileen O'Leary)
      0
      3
      female
      1
      1
      376566
      19.00
    
    
      1276
      1277
      24.000000
      NaN
      S
      65.0000
      Herman, Miss. Kate
      2
      2
      female
      1
      1
      220845
      24.00
    
    
      1300
      1301
      3.000000
      NaN
      S
      13.7750
      Peacock, Miss. Treasteall
      1
      3
      female
      1
      1
      SOTON/O.Q. 3101315
      3.00
    
    
      1302
      1303
      37.000000
      C78
      Q
      90.0000
      Minahan, Mrs. William Edward (Lillian E Thorpe)
      0
      1
      female
      1
      1
      19928
      37.00
    
    
      1305
      1306
      39.000000
      C105
      C
      108.9000
      Oliva y Ocana, Dona. Fermina
      0
      1
      female
      0
      1
      PC 17758
      39.00
    
    
      1308
      1309
      29.881138
      NaN
      C
      22.3583
      Peter, Master. Michael J
      1
      3
      male
      1
      0
      2668
      NaN
    
  

262 rows × 13 columns



In [85]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1046 non-null float64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB

Step 5. Outlier Detection and Treatment

Feature Engineering

Variable Transformation & Variable/Feature Creation

Step 1. Variable Transformation



In [ ]:

Step 2. Variable/Feature Creation

Ref: https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/

Titles

First up the Name column is currently not being used, but we can at least extract the title from the name. There are quite a few titles going around, but I want to reduce them all to Mrs, Miss, Mr and Master. To do this we’ll need a function that searches for substrings. Thankfully the library ‘string’ has just what we need.



In [86]:

    
def substrings_in_string(big_string, substrings):
    if big_string is np.nan:
        return np.nan
    #end if
    
    for substring in substrings:
        if big_string.find(substring) != -1:
            return substring
        #end if
    #end for
    
    print(big_string)
    return np.nan
#end def
 
#replacing all titles with mr, mrs, miss, master
def replace_titles(x):
    title=x['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 0 #'Mr'
    elif title in ['Countess', 'Mme']:
        return 1 #'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 2 #'Miss'
    elif title == 'Dr':
        if x['Sex'] == 'Male':
            return 0 #'Mr'
        else:
            return 1 #'Mrs'
    else:
        return 3 #title
    #end if
#end def

title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                    'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                    'Don', 'Jonkheer']

df['Title'] = df['Name'].map(lambda x: substrings_in_string(x, title_list))
    
df['Title'] = df.apply(replace_titles, axis=1)



In [87]:

    
df.head()









    Out[87]:







  
    
      
      PassengerId
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
      Title
    
  
  
    
      0
      1
      22.0
      NaN
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      A/5 21171
      22.0
      3
    
    
      1
      2
      38.0
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      PC 17599
      38.0
      3
    
    
      2
      3
      26.0
      NaN
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      STON/O2. 3101282
      26.0
      3
    
    
      3
      4
      35.0
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      113803
      35.0
      3
    
    
      4
      5
      35.0
      NaN
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      373450
      35.0
      3

Column "Age" - Missing Value - Using Mean for each Title

Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.



In [88]:

    
trn_corpus[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()









    Out[88]:







  
    
      
      AgeUsingMeanTitle
    
    
      Sex
      
    
  
  
    
      female
      27.915709
    
    
      male
      30.726645



In [89]:

    
df[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()









    Out[89]:







  
    
      
      AgeUsingMeanTitle
    
    
      Sex
      
    
  
  
    
      female
      28.687088
    
    
      male
      30.585228



In [90]:

    
#Method 2 - Using pivot table
mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index

mean_title_trn_corpus









    Out[90]:







  
    
      
      AgeUsingMeanTitle
    
    
      Sex
      
    
  
  
    
      female
      27.915709
    
    
      male
      30.726645



In [91]:

    
#Method 2 - Using pivot table
mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index

mean_title_df









    Out[91]:







  
    
      
      AgeUsingMeanTitle
    
    
      Sex
      
    
  
  
    
      female
      28.687088
    
    
      male
      30.585228



In [92]:

    
mean_title_df.xs("male")["AgeUsingMeanTitle"]









    Out[92]:





30.585227963525838



In [93]:

    
#list(df["AgeUsingMeanTitle"].unique())



In [94]:

    
list_passenger_id_having_Age_nan = list(df[df["AgeUsingMeanTitle"].isnull()]["PassengerId"])

#list_passenger_id_having_Age_nan



In [95]:

    
df["AgeUsingMeanTitle"].fillna(df.groupby("Sex")["AgeUsingMeanTitle"].transform("mean"), inplace=True)

df[df["PassengerId"].apply(lambda x: x in list_passenger_id_having_Age_nan)]









    Out[95]:







  
    
      
      PassengerId
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
      Title
    
  
  
    
      5
      6
      29.881138
      NaN
      Q
      8.4583
      Moran, Mr. James
      0
      3
      male
      0
      0
      330877
      30.585228
      3
    
    
      17
      18
      29.881138
      NaN
      S
      13.0000
      Williams, Mr. Charles Eugene
      0
      2
      male
      0
      1
      244373
      30.585228
      3
    
    
      19
      20
      29.881138
      NaN
      C
      7.2250
      Masselmani, Mrs. Fatima
      0
      3
      female
      0
      1
      2649
      28.687088
      3
    
    
      26
      27
      29.881138
      NaN
      C
      7.2250
      Emir, Mr. Farred Chehab
      0
      3
      male
      0
      0
      2631
      30.585228
      3
    
    
      28
      29
      29.881138
      NaN
      Q
      7.8792
      O'Dwyer, Miss. Ellen "Nellie"
      0
      3
      female
      0
      1
      330959
      28.687088
      3
    
    
      29
      30
      29.881138
      NaN
      S
      7.8958
      Todoroff, Mr. Lalio
      0
      3
      male
      0
      0
      349216
      30.585228
      3
    
    
      31
      32
      29.881138
      B78
      C
      146.5208
      Spencer, Mrs. William Augustus (Marie Eugenie)
      0
      1
      female
      1
      1
      PC 17569
      28.687088
      3
    
    
      32
      33
      29.881138
      NaN
      Q
      7.7500
      Glynn, Miss. Mary Agatha
      0
      3
      female
      0
      1
      335677
      28.687088
      3
    
    
      36
      37
      29.881138
      NaN
      C
      7.2292
      Mamee, Mr. Hanna
      0
      3
      male
      0
      1
      2677
      30.585228
      3
    
    
      42
      43
      29.881138
      NaN
      C
      7.8958
      Kraeff, Mr. Theodor
      0
      3
      male
      0
      0
      349253
      30.585228
      3
    
    
      45
      46
      29.881138
      NaN
      S
      8.0500
      Rogers, Mr. William John
      0
      3
      male
      0
      0
      S.C./A.4. 23567
      30.585228
      3
    
    
      46
      47
      29.881138
      NaN
      Q
      15.5000
      Lennon, Mr. Denis
      0
      3
      male
      1
      0
      370371
      30.585228
      3
    
    
      47
      48
      29.881138
      NaN
      Q
      7.7500
      O'Driscoll, Miss. Bridget
      0
      3
      female
      0
      1
      14311
      28.687088
      3
    
    
      48
      49
      29.881138
      NaN
      C
      21.6792
      Samaan, Mr. Youssef
      0
      3
      male
      2
      0
      2662
      30.585228
      3
    
    
      55
      56
      29.881138
      C52
      S
      35.5000
      Woolner, Mr. Hugh
      0
      1
      male
      0
      1
      19947
      30.585228
      3
    
    
      64
      65
      29.881138
      NaN
      C
      27.7208
      Stewart, Mr. Albert A
      0
      1
      male
      0
      0
      PC 17605
      30.585228
      3
    
    
      65
      66
      29.881138
      NaN
      C
      15.2458
      Moubarek, Master. Gerios
      1
      3
      male
      1
      1
      2661
      30.585228
      3
    
    
      76
      77
      29.881138
      NaN
      S
      7.8958
      Staneff, Mr. Ivan
      0
      3
      male
      0
      0
      349208
      30.585228
      3
    
    
      77
      78
      29.881138
      NaN
      S
      8.0500
      Moutal, Mr. Rahamin Haim
      0
      3
      male
      0
      0
      374746
      30.585228
      3
    
    
      82
      83
      29.881138
      NaN
      Q
      7.7875
      McDermott, Miss. Brigdet Delia
      0
      3
      female
      0
      1
      330932
      28.687088
      3
    
    
      87
      88
      29.881138
      NaN
      S
      8.0500
      Slocovski, Mr. Selman Francis
      0
      3
      male
      0
      0
      SOTON/OQ 392086
      30.585228
      3
    
    
      95
      96
      29.881138
      NaN
      S
      8.0500
      Shorney, Mr. Charles Joseph
      0
      3
      male
      0
      0
      374910
      30.585228
      3
    
    
      101
      102
      29.881138
      NaN
      S
      7.8958
      Petroff, Mr. Pastcho ("Pentcho")
      0
      3
      male
      0
      0
      349215
      30.585228
      3
    
    
      107
      108
      29.881138
      NaN
      S
      7.7750
      Moss, Mr. Albert Johan
      0
      3
      male
      0
      1
      312991
      30.585228
      3
    
    
      109
      110
      29.881138
      NaN
      Q
      24.1500
      Moran, Miss. Bertha
      0
      3
      female
      1
      1
      371110
      28.687088
      3
    
    
      121
      122
      29.881138
      NaN
      S
      8.0500
      Moore, Mr. Leonard Charles
      0
      3
      male
      0
      0
      A4. 54510
      30.585228
      3
    
    
      126
      127
      29.881138
      NaN
      Q
      7.7500
      McMahon, Mr. Martin
      0
      3
      male
      0
      0
      370372
      30.585228
      3
    
    
      128
      129
      29.881138
      F E69
      C
      22.3583
      Peter, Miss. Anna
      1
      3
      female
      1
      1
      2668
      28.687088
      3
    
    
      140
      141
      29.881138
      NaN
      C
      15.2458
      Boulos, Mrs. Joseph (Sultana)
      2
      3
      female
      0
      0
      2678
      28.687088
      3
    
    
      154
      155
      29.881138
      NaN
      S
      7.3125
      Olsen, Mr. Ole Martin
      0
      3
      male
      0
      0
      Fa 265302
      30.585228
      3
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      1159
      1160
      29.881138
      NaN
      S
      8.0500
      Howard, Miss. May Elizabeth
      0
      3
      female
      0
      1
      A. 2. 39186
      28.687088
      3
    
    
      1162
      1163
      29.881138
      NaN
      Q
      7.7500
      Fox, Mr. Patrick
      0
      3
      male
      0
      0
      368573
      30.585228
      3
    
    
      1164
      1165
      29.881138
      NaN
      Q
      15.5000
      Lennon, Miss. Mary
      0
      3
      female
      1
      1
      370371
      28.687088
      3
    
    
      1165
      1166
      29.881138
      NaN
      C
      7.2250
      Saade, Mr. Jean Nassr
      0
      3
      male
      0
      0
      2676
      30.585228
      3
    
    
      1173
      1174
      29.881138
      NaN
      Q
      7.7500
      Fleming, Miss. Honora
      0
      3
      female
      0
      1
      364859
      28.687088
      3
    
    
      1177
      1178
      29.881138
      NaN
      S
      7.2500
      Franklin, Mr. Charles (Charles Fardon)
      0
      3
      male
      0
      0
      SOTON/O.Q. 3101314
      30.585228
      3
    
    
      1179
      1180
      29.881138
      F E46
      C
      7.2292
      Mardirosian, Mr. Sarkis
      0
      3
      male
      0
      0
      2655
      30.585228
      3
    
    
      1180
      1181
      29.881138
      NaN
      S
      8.0500
      Ford, Mr. Arthur
      0
      3
      male
      0
      0
      A/5 1478
      30.585228
      3
    
    
      1181
      1182
      29.881138
      NaN
      S
      39.6000
      Rheims, Mr. George Alexander Lucien
      0
      1
      male
      0
      0
      PC 17607
      30.585228
      3
    
    
      1183
      1184
      29.881138
      NaN
      C
      7.2292
      Nasr, Mr. Mustafa
      0
      3
      male
      0
      0
      2652
      30.585228
      3
    
    
      1188
      1189
      29.881138
      NaN
      C
      21.6792
      Samaan, Mr. Hanna
      0
      3
      male
      2
      0
      2662
      30.585228
      3
    
    
      1192
      1193
      29.881138
      D
      C
      15.0458
      Malachard, Mr. Noel
      0
      2
      male
      0
      0
      237735
      30.585228
      3
    
    
      1195
      1196
      29.881138
      NaN
      Q
      7.7500
      McCarthy, Miss. Catherine Katie""
      0
      3
      female
      0
      1
      383123
      28.687088
      3
    
    
      1203
      1204
      29.881138
      NaN
      S
      7.5750
      Sadowitz, Mr. Harry
      0
      3
      male
      0
      0
      LP 1588
      30.585228
      3
    
    
      1223
      1224
      29.881138
      NaN
      C
      7.2250
      Thomas, Mr. Tannous
      0
      3
      male
      0
      0
      2684
      30.585228
      3
    
    
      1230
      1231
      29.881138
      NaN
      C
      7.2292
      Betros, Master. Seman
      0
      3
      male
      0
      0
      2622
      30.585228
      3
    
    
      1233
      1234
      29.881138
      NaN
      S
      69.5500
      Sage, Mr. John George
      9
      3
      male
      1
      0
      CA. 2343
      30.585228
      3
    
    
      1235
      1236
      29.881138
      NaN
      S
      14.5000
      van Billiard, Master. James William
      1
      3
      male
      1
      0
      A/5. 851
      30.585228
      3
    
    
      1248
      1249
      29.881138
      NaN
      S
      7.8792
      Lockyer, Mr. Edward
      0
      3
      male
      0
      0
      1222
      30.585228
      3
    
    
      1249
      1250
      29.881138
      NaN
      Q
      7.7500
      O'Keefe, Mr. Patrick
      0
      3
      male
      0
      0
      368402
      30.585228
      3
    
    
      1256
      1257
      29.881138
      NaN
      S
      69.5500
      Sage, Mrs. John (Annie Bullen)
      9
      3
      female
      1
      1
      CA. 2343
      28.687088
      3
    
    
      1257
      1258
      29.881138
      NaN
      C
      14.4583
      Caram, Mr. Joseph
      0
      3
      male
      1
      0
      2689
      30.585228
      3
    
    
      1271
      1272
      29.881138
      NaN
      Q
      7.7500
      O'Connor, Mr. Patrick
      0
      3
      male
      0
      0
      366713
      30.585228
      3
    
    
      1273
      1274
      29.881138
      NaN
      S
      14.5000
      Risien, Mrs. Samuel (Emma)
      0
      3
      female
      0
      1
      364498
      28.687088
      3
    
    
      1275
      1276
      29.881138
      NaN
      S
      12.8750
      Wheeler, Mr. Edwin Frederick""
      0
      2
      male
      0
      0
      SC/PARIS 2159
      30.585228
      3
    
    
      1299
      1300
      29.881138
      NaN
      Q
      7.7208
      Riordan, Miss. Johanna Hannah""
      0
      3
      female
      0
      1
      334915
      28.687088
      3
    
    
      1301
      1302
      29.881138
      NaN
      Q
      7.7500
      Naughton, Miss. Hannah
      0
      3
      female
      0
      1
      365237
      28.687088
      3
    
    
      1304
      1305
      29.881138
      NaN
      S
      8.0500
      Spector, Mr. Woolf
      0
      3
      male
      0
      0
      A.5. 3236
      30.585228
      3
    
    
      1307
      1308
      29.881138
      NaN
      S
      8.0500
      Ware, Mr. Frederick
      0
      3
      male
      0
      0
      359309
      30.585228
      3
    
    
      1308
      1309
      29.881138
      NaN
      C
      22.3583
      Peter, Master. Michael J
      1
      3
      male
      1
      0
      2668
      30.585228
      3
    
  

263 rows × 14 columns



In [ ]:

Cabin

This is going be very similar, we have a ‘Cabin’ column not doing much, only 1st class passengers have cabins, the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, and so we’re going to extract these just like the titles.



In [96]:

    
#df["Cabin"].unique()



In [97]:

    
df["Cabin"].nunique()









    Out[97]:





186



In [98]:

    
#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'UNK']
df['Deck1']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))

#df.head()



In [99]:

    
#Task: How to get the Deck from Cabin
#Method 2
def get_deck_from_cabin(strCabin):
    if strCabin is np.nan:
        return np.nan
    #end if
    
    return strCabin[0]
#end def

df["Deck2"] = df["Cabin"].apply(get_deck_from_cabin)

#df.head()

Question: Columns of Deck and Deck2 are the same?



In [100]:

    
print(df["Deck1"].unique())
print(df["Deck1"].nunique())









    



[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
8



In [101]:

    
print(df["Deck2"].unique())
print(df["Deck2"].nunique())









    



[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
8



In [102]:

    
df[df["Deck1"].fillna("UNK") != df["Deck2"].fillna("UNK")]









    Out[102]:







  
    
      
      PassengerId
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      AgeUsingMeanTitle
      Title
      Deck1
      Deck2
    
  
  
    
      128
      129
      29.881138
      F E69
      C
      22.3583
      Peter, Miss. Anna
      1
      3
      female
      1
      1
      2668
      28.687088
      3
      E
      F
    
    
      1179
      1180
      29.881138
      F E46
      C
      7.2292
      Mardirosian, Mr. Sarkis
      0
      3
      male
      0
      0
      2655
      30.585228
      3
      E
      F
    
    
      1212
      1213
      25.000000
      F E57
      C
      7.2292
      Krekorian, Mr. Neshan
      0
      3
      male
      0
      0
      2654
      25.000000
      3
      E
      F

Comment: We will use the values in column "Deck2".

Family Size

One thing you can do to create new features is linear combinations of features. In a model like linear regression this should be unnecessary, but for a decision tree may find it hard to model such relationships. Reading on the forums at Kaggle, some people have considered the size of a person’s family, the sum of their ‘SibSp’ and ‘Parch’ attributes. Perhaps people traveling alone did better? Or on the other hand perhaps if you had a family, you might have risked your life looking for them, or even giving up a space up to them in a lifeboat. Let’s throw that into the mix.



In [103]:

    
#Creating new family_size column
df['FamilySize']=df['SibSp']+df['Parch']

#df.head()

AgeClass

This is an interaction term, since age and class are both numbers we can just multiply them.



In [104]:

    
df['AgeClass']=df['AgeUsingMeanTitle']*df['Pclass']

#df.head()

Adding Male column

This is the “sex” variable in the data set from kaggle. I’ve just changed male/female to 1/0.



In [105]:

    
sex = {'male':1, 'female':0}
df["Male"] = df['Sex'].map(sex)

#df.head()

SexClass

This is an interaction term, since age and class are both numbers we can just multiply them.



In [106]:

    
df['SexClass']=df['Male']*df['Pclass']

#df.head()

Fare per Person

Here we divide the fare by the number of family members traveling together, I’m not exactly sure what this represents, but it’s easy enough to add in.



In [107]:

    
df['FarePerPerson']=df['Fare']/(df['FamilySize']+1)

#df.head()

AgeSquared

Here we use "combined_age" squared.

"combined_age" – this is the age of the passenger, with missing values replaced by the median age for each Title. So if the age was missing for Mr Smith, then I’ve used the median age for all passengers with Title “Mr”.



In [108]:

    
df["AgeSquared"]=df["AgeUsingMeanTitle"]**2

#df.head()

AgeClassSquared

Here we use "AgeClass" squared.



In [109]:

    
df["AgeClassSquared"]=df['AgeClass']**2

#df.head()

Creating Dummy Variables



In [110]:

    
df.head()









    Out[110]:







  
    
      
      PassengerId
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      ...
      Title
      Deck1
      Deck2
      FamilySize
      AgeClass
      Male
      SexClass
      FarePerPerson
      AgeSquared
      AgeClassSquared
    
  
  
    
      0
      1
      22.0
      NaN
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      ...
      3
      NaN
      NaN
      1
      66.0
      1
      3
      3.62500
      484.0
      4356.0
    
    
      1
      2
      38.0
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      ...
      3
      C
      C
      1
      38.0
      0
      0
      35.64165
      1444.0
      1444.0
    
    
      2
      3
      26.0
      NaN
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      ...
      3
      NaN
      NaN
      0
      78.0
      0
      0
      7.92500
      676.0
      6084.0
    
    
      3
      4
      35.0
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      ...
      3
      C
      C
      1
      35.0
      0
      0
      26.55000
      1225.0
      1225.0
    
    
      4
      5
      35.0
      NaN
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      ...
      3
      NaN
      NaN
      0
      105.0
      1
      3
      8.05000
      1225.0
      11025.0
    
  

5 rows × 23 columns



In [ ]:

The predictor variables in the model are:

Ref: http://gertlowitz.blogspot.fr/2013/06/where-am-i-up-to-with-titanic-competion.html



In [111]:

    
df.describe()









    Out[111]:







  
    
      
      PassengerId
      Age
      Fare
      Parch
      Pclass
      SibSp
      Survived
      AgeUsingMeanTitle
      Title
      FamilySize
      AgeClass
      Male
      SexClass
      FarePerPerson
      AgeSquared
      AgeClassSquared
    
  
  
    
      count
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
    
    
      mean
      655.000000
      29.881138
      33.270043
      0.385027
      2.294882
      0.498854
      0.377387
      29.909496
      2.941176
      0.883881
      64.692851
      0.644003
      1.527884
      20.502540
      1060.582027
      5193.522645
    
    
      std
      378.020061
      12.883193
      51.747063
      0.865560
      0.837836
      1.041658
      0.484918
      12.889182
      0.391491
      1.583639
      31.766784
      0.478997
      1.309876
      35.765156
      888.665904
      4866.021451
    
    
      min
      1.000000
      0.170000
      0.000000
      0.000000
      1.000000
      0.000000
      0.000000
      0.170000
      0.000000
      0.000000
      0.510000
      0.000000
      0.000000
      0.000000
      0.028900
      0.260100
    
    
      25%
      328.000000
      22.000000
      7.895800
      0.000000
      2.000000
      0.000000
      0.000000
      22.000000
      3.000000
      0.000000
      42.000000
      0.000000
      0.000000
      7.452767
      484.000000
      1764.000000
    
    
      50%
      655.000000
      29.881138
      14.454200
      0.000000
      3.000000
      0.000000
      0.000000
      30.000000
      3.000000
      0.000000
      62.000000
      1.000000
      2.000000
      8.458300
      900.000000
      3844.000000
    
    
      75%
      982.000000
      35.000000
      31.275000
      0.000000
      3.000000
      1.000000
      1.000000
      35.000000
      3.000000
      1.000000
      90.000000
      1.000000
      3.000000
      24.150000
      1225.000000
      8100.000000
    
    
      max
      1309.000000
      80.000000
      512.329200
      9.000000
      3.000000
      8.000000
      1.000000
      80.000000
      3.000000
      10.000000
      222.000000
      1.000000
      3.000000
      512.329200
      6400.000000
      49284.000000



In [112]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1309 non-null float64
Title                1309 non-null int64
Deck1                295 non-null object
Deck2                295 non-null object
FamilySize           1309 non-null int64
AgeClass             1309 non-null float64
Male                 1309 non-null int64
SexClass             1309 non-null int64
FarePerPerson        1309 non-null float64
AgeSquared           1309 non-null float64
AgeClassSquared      1309 non-null float64
dtypes: float64(7), int64(9), object(7)
memory usage: 235.3+ KB

Male – this is the “sex” variable in the data set from kaggle. I’ve just changed male/female to 1/0.
Pclass - no change from the pclass variable in the kaggle data set.
Fare – no change from the fare variable in the kaggle dataset
FarePerPerson – I have calculated the number of people travelling together (sibsp + parch + 1) and divided the fare variable by that number
Title – extracted the Title of each passenger from their name. I used Excel to do this and other data manipulation.
AgeUsingMeanTitle – this is the age of the passenger, with missing values replaced by the median age for each Title. So if the age was missing for Mr Smith, then I’ve used the median age for all passengers with Title “Mr”.
AgeClass – multiplied “Combined Age” by pclass
SexClass – multiplied sex ( 1 or 2) by pclass.
FamilySize – sibsp + parch
AgeSquared – combined_age squared
AgeClassSquared – age_class squared



In [123]:

    
df_train_test = df[["PassengerId","Male", "Pclass","Fare","FarePerPerson","Title",
            "AgeUsingMeanTitle","AgeClass","SexClass","FamilySize","AgeSquared","AgeClassSquared","Survived"]]



In [124]:

    
df_train_test.describe()









    Out[124]:







  
    
      
      PassengerId
      Male
      Pclass
      Fare
      FarePerPerson
      Title
      AgeUsingMeanTitle
      AgeClass
      SexClass
      FamilySize
      AgeSquared
      AgeClassSquared
      Survived
    
  
  
    
      count
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
      1309.000000
    
    
      mean
      655.000000
      0.644003
      2.294882
      33.270043
      20.502540
      2.941176
      29.909496
      64.692851
      1.527884
      0.883881
      1060.582027
      5193.522645
      0.377387
    
    
      std
      378.020061
      0.478997
      0.837836
      51.747063
      35.765156
      0.391491
      12.889182
      31.766784
      1.309876
      1.583639
      888.665904
      4866.021451
      0.484918
    
    
      min
      1.000000
      0.000000
      1.000000
      0.000000
      0.000000
      0.000000
      0.170000
      0.510000
      0.000000
      0.000000
      0.028900
      0.260100
      0.000000
    
    
      25%
      328.000000
      0.000000
      2.000000
      7.895800
      7.452767
      3.000000
      22.000000
      42.000000
      0.000000
      0.000000
      484.000000
      1764.000000
      0.000000
    
    
      50%
      655.000000
      1.000000
      3.000000
      14.454200
      8.458300
      3.000000
      30.000000
      62.000000
      2.000000
      0.000000
      900.000000
      3844.000000
      0.000000
    
    
      75%
      982.000000
      1.000000
      3.000000
      31.275000
      24.150000
      3.000000
      35.000000
      90.000000
      3.000000
      1.000000
      1225.000000
      8100.000000
      1.000000
    
    
      max
      1309.000000
      1.000000
      3.000000
      512.329200
      512.329200
      3.000000
      80.000000
      222.000000
      3.000000
      10.000000
      6400.000000
      49284.000000
      1.000000



In [125]:

    
df_train_test.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Male                 1309 non-null int64
Pclass               1309 non-null int64
Fare                 1309 non-null float64
FarePerPerson        1309 non-null float64
Title                1309 non-null int64
AgeUsingMeanTitle    1309 non-null float64
AgeClass             1309 non-null float64
SexClass             1309 non-null int64
FamilySize           1309 non-null int64
AgeSquared           1309 non-null float64
AgeClassSquared      1309 non-null float64
Survived             1309 non-null int64
dtypes: float64(6), int64(7)
memory usage: 133.0 KB



In [126]:

    
df_train_test.head()









    Out[126]:







  
    
      
      PassengerId
      Male
      Pclass
      Fare
      FarePerPerson
      Title
      AgeUsingMeanTitle
      AgeClass
      SexClass
      FamilySize
      AgeSquared
      AgeClassSquared
      Survived
    
  
  
    
      0
      1
      1
      3
      7.2500
      3.62500
      3
      22.0
      66.0
      3
      1
      484.0
      4356.0
      0
    
    
      1
      2
      0
      1
      71.2833
      35.64165
      3
      38.0
      38.0
      0
      1
      1444.0
      1444.0
      1
    
    
      2
      3
      0
      3
      7.9250
      7.92500
      3
      26.0
      78.0
      0
      0
      676.0
      6084.0
      1
    
    
      3
      4
      0
      1
      53.1000
      26.55000
      3
      35.0
      35.0
      0
      1
      1225.0
      1225.0
      1
    
    
      4
      5
      1
      3
      8.0500
      8.05000
      3
      35.0
      105.0
      3
      0
      1225.0
      11025.0
      0



In [117]:

    
print("Num of rows in Training corpus: ", trn_corpus_size) 
print("Num of rows in Testing corpus: ", tst_corpus_size)









    



Num of rows in Training corpus:  891
Num of rows in Testing corpus:  418



In [127]:

    
df_train_test.columns









    Out[127]:





Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')



In [128]:

    
len(df_train_test.columns)









    Out[128]:





13



In [ ]:



In [ ]:



In [116]:

    
#Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#from sklearn.model_selection import train_test_split #Split arrays or matrices into random train and test subsets



In [132]:

    
trn_corpus_after_preprocessing = df_train_test.iloc[:trn_corpus_size - 1,:].copy()

#trn_corpus_after_preprocessing
print(len(trn_corpus_after_preprocessing["AgeUsingMeanTitle"]))



In [133]:

    
trn_corpus_after_preprocessing.columns









    Out[133]:





Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')



In [134]:

    
tst_corpus_after_preprocessing = df_train_test.iloc[trn_corpus_size:,:].copy()

#tst_corpus_after_preprocessing



In [136]:

    
trn_corpus_after_preprocessing.to_csv("output/trn_corpus_after_preprocessing.csv", index=False, header=True)
tst_corpus_after_preprocessing.to_csv("output/tst_corpus_after_preprocessing.csv", index=False, header=True)



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	Fare
Pclass
1	94.280297
2	22.202104
3	12.459678

	Fare
Pclass
1	87.508992
2	21.179196
3	13.284126

	Fare
Pclass
1	87.508992
2	21.179196
3	13.284126

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

	Pclass	Age	SibSp	Parch	Fare
count	418.000000	332.000000	418.000000	418.000000	417.000000
mean	2.265550	30.272590	0.447368	0.392344	35.627188
std	0.841838	14.181209	0.896760	0.981429	55.907576
min	1.000000	0.170000	0.000000	0.000000	0.000000
25%	1.000000	21.000000	0.000000	0.000000	7.895800
50%	3.000000	27.000000	0.000000	0.000000	14.454200
75%	3.000000	39.000000	1.000000	0.000000	31.500000
max	3.000000	76.000000	8.000000	9.000000	512.329200

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

	Age	Fare	Parch	Pclass	SibSp	Survived
count	1046.000000	1308.000000	1309.000000	1309.000000	1309.000000	1309.000000
mean	29.881138	33.295479	0.385027	2.294882	0.498854	0.377387
std	14.413493	51.758668	0.865560	0.837836	1.041658	0.484918
min	0.170000	0.000000	0.000000	1.000000	0.000000	0.000000
25%	21.000000	7.895800	0.000000	2.000000	0.000000	0.000000
50%	28.000000	14.454200	0.000000	3.000000	0.000000	0.000000
75%	39.000000	31.275000	0.000000	3.000000	1.000000	1.000000
max	80.000000	512.329200	9.000000	3.000000	8.000000	1.000000

	Age	Fare
PassengerId
180	36.0	0.0000
264	40.0	0.0000
272	25.0	0.0000
278	NaN	0.0000
303	19.0	0.0000
379	20.0	4.0125
414	NaN	0.0000
467	NaN	0.0000
482	NaN	0.0000
598	49.0	0.0000
634	NaN	0.0000
675	NaN	0.0000
733	NaN	0.0000
807	39.0	0.0000
816	NaN	0.0000
823	38.0	0.0000

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	survival	AgeUsingMeanTitle
180	181	0	3	Sage, Miss. Constance Gladys	female	29.699118	8	2	CA. 2343	69.550	NaN	S	True	NaN
264	265	0	3	Henry, Miss. Delia	female	29.699118	0	0	382649	7.750	NaN	Q	True	NaN
272	273	1	2	Mellinger, Mrs. (Elizabeth Anne Maidment)	female	41.000000	0	1	250644	19.500	NaN	S	True	41.0
278	279	0	3	Rice, Master. Eric	male	7.000000	4	1	382652	29.125	NaN	Q	True	7.0
303	304	1	2	Keane, Miss. Nora A	female	29.699118	0	0	226593	12.350	E101	Q	True	NaN
414	415	1	3	Sundman, Mr. Johan Julian	male	44.000000	0	0	STON/O 2. 3101269	7.925	NaN	S	True	44.0
467	468	0	1	Smart, Mr. John Montgomery	male	56.000000	0	0	113792	26.550	NaN	S	True	56.0
482	483	0	3	Rouse, Mr. Richard Henry	male	50.000000	0	0	A/5 3594	8.050	NaN	S	True	50.0
598	599	0	3	Boulos, Mr. Hanna	male	29.699118	0	0	2664	7.225	NaN	C	True	NaN
634	635	0	3	Skoog, Miss. Mabel	female	9.000000	3	2	347088	27.900	NaN	S	True	9.0
675	676	0	3	Edvardsson, Mr. Gustaf Hjalmar	male	18.000000	0	0	349912	7.775	NaN	S	True	18.0
733	734	0	2	Berriman, Mr. William John	male	23.000000	0	0	28425	13.000	NaN	S	True	23.0
807	808	0	3	Pettersson, Miss. Ellen Natalia	female	18.000000	0	0	347087	7.775	NaN	S	True	18.0
816	817	0	3	Heininen, Miss. Wendla Maria	female	23.000000	0	0	STON/O2. 3101290	7.925	NaN	S	True	23.0
823	824	1	3	Moor, Mrs. (Beila)	female	27.000000	0	1	392096	12.475	E121	S	True	27.0

	PassengerId	Age	Cabin	Embarked	Fare	Name	Parch	Pclass	Sex	SibSp	Survived	Ticket	AgeUsingMeanTitle
6	7	54.000000	E46	S	51.8625	McCarthy, Mr. Timothy J	0	1	male	0	0	17463	54.00
18	19	31.000000	NaN	S	18.0000	Vander Planke, Mrs. Julius (Emelia Maria Vande...	0	3	female	1	0	345763	31.00
20	21	35.000000	NaN	S	26.0000	Fynney, Mr. Joseph J	0	2	male	0	0	239865	35.00
27	28	19.000000	C23 C25 C27	S	263.0000	Fortune, Mr. Charles Alexander	2	1	male	3	0	19950	19.00
29	30	29.881138	NaN	S	7.8958	Todoroff, Mr. Lalio	0	3	male	0	0	349216	NaN
30	31	40.000000	NaN	C	27.7208	Uruchurtu, Don. Manuel E	0	1	male	0	0	PC 17601	40.00
32	33	29.881138	NaN	Q	7.7500	Glynn, Miss. Mary Agatha	0	3	female	0	1	335677	NaN
33	34	66.000000	NaN	S	10.5000	Wheadon, Mr. Edward H	0	2	male	0	0	C.A. 24579	66.00
37	38	21.000000	NaN	S	8.0500	Cann, Mr. Ernest Charles	0	3	male	0	0	A./5. 2152	21.00
43	44	3.000000	NaN	C	41.5792	Laroche, Miss. Simonne Marie Anne Andree	2	2	female	1	1	SC/Paris 2123	3.00
46	47	29.881138	NaN	Q	15.5000	Lennon, Mr. Denis	0	3	male	1	0	370371	NaN
47	48	29.881138	NaN	Q	7.7500	O'Driscoll, Miss. Bridget	0	3	female	0	1	14311	NaN
48	49	29.881138	NaN	C	21.6792	Samaan, Mr. Youssef	0	3	male	2	0	2662	NaN
49	50	18.000000	NaN	S	17.8000	Arnold-Franchi, Mrs. Josef (Josefine Franchi)	0	3	female	1	0	349237	18.00
56	57	21.000000	NaN	S	10.5000	Rugg, Miss. Emily	0	2	female	0	1	C.A. 31026	21.00
65	66	29.881138	NaN	C	15.2458	Moubarek, Master. Gerios	1	3	male	1	1	2661	NaN
66	67	29.000000	F33	S	10.5000	Nye, Mrs. (Elizabeth Ramell)	0	2	female	0	1	C.A. 29395	29.00
77	78	29.881138	NaN	S	8.0500	Moutal, Mr. Rahamin Haim	0	3	male	0	0	374746	NaN
78	79	0.830000	NaN	S	29.0000	Caldwell, Master. Alden Gates	2	2	male	0	1	248738	0.83
83	84	28.000000	NaN	S	47.1000	Carrau, Mr. Francisco M	0	1	male	0	0	113059	28.00
88	89	23.000000	C23 C25 C27	S	263.0000	Fortune, Miss. Mabel Helen	2	1	female	3	1	19950	23.00
96	97	71.000000	A5	C	34.6542	Goldschmidt, Mr. George B	0	1	male	0	0	PC 17754	71.00
102	103	21.000000	D26	S	77.2875	White, Mr. Richard Frasar	1	1	male	0	0	35281	21.00
108	109	38.000000	NaN	S	7.8958	Rekic, Mr. Tido	0	3	male	0	0	349249	38.00
110	111	47.000000	C110	S	52.0000	Porter, Mr. Walter Chamberlain	0	1	male	0	0	110465	47.00
122	123	32.500000	NaN	C	30.0708	Nasser, Mr. Nicholas	0	2	male	1	0	237736	32.50
127	128	24.000000	NaN	S	7.1417	Madsen, Mr. Fridtjof Arne	0	3	male	0	1	C 17369	24.00
129	130	45.000000	NaN	S	6.9750	Ekstrom, Mr. Johan	0	3	male	0	0	347061	45.00
141	142	22.000000	NaN	S	7.7500	Nysten, Miss. Anna Sofia	0	3	female	0	1	347081	22.00
155	156	51.000000	NaN	C	61.3792	Williams, Mr. Charles Duane	1	1	male	0	0	PC 17597	51.00
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1159	1160	29.881138	NaN	S	8.0500	Howard, Miss. May Elizabeth	0	3	female	0	1	A. 2. 39186	NaN
1160	1161	17.000000	NaN	S	8.6625	Pokrnic, Mr. Mate	0	3	male	0	0	315095	17.00
1163	1164	26.000000	C89	C	136.7792	Clark, Mrs. Walter Miller (Virginia McDowell)	0	1	female	1	1	13508	26.00
1165	1166	29.881138	NaN	C	7.2250	Saade, Mr. Jean Nassr	0	3	male	0	0	2676	NaN
1166	1167	20.000000	NaN	S	26.0000	Bryhl, Miss. Dagmar Jenny Ingeborg	0	2	female	1	1	236853	20.00
1174	1175	9.000000	NaN	C	15.2458	Touma, Miss. Maria Youssef	1	3	female	1	1	2650	9.00
1178	1179	24.000000	B45	S	82.2667	Snyder, Mr. John Pillsbury	0	1	male	1	0	21228	24.00
1180	1181	29.881138	NaN	S	8.0500	Ford, Mr. Arthur	0	3	male	0	0	A/5 1478	NaN
1181	1182	29.881138	NaN	S	39.6000	Rheims, Mr. George Alexander Lucien	0	1	male	0	0	PC 17607	NaN
1182	1183	30.000000	NaN	Q	6.9500	Daly, Miss. Margaret Marcella Maggie""	0	3	female	0	1	382650	30.00
1184	1185	53.000000	A34	S	81.8583	Dodge, Dr. Washington	1	1	male	1	0	33638	53.00
1189	1190	30.000000	NaN	S	45.5000	Loring, Mr. Joseph Holland	0	1	male	0	0	113801	30.00
1193	1194	43.000000	NaN	S	21.0000	Phillips, Mr. Escott Robert	1	2	male	0	0	S.O./P.P. 2	43.00
1196	1197	64.000000	B26	S	26.5500	Crosby, Mrs. Edward Gifford (Catherine Elizabe...	1	1	female	1	1	112901	64.00
1204	1205	37.000000	NaN	Q	7.7500	Carr, Miss. Jeannie	0	3	female	0	1	368364	37.00
1224	1225	19.000000	NaN	C	15.7417	Nakid, Mrs. Said (Waika Mary" Mowad)"	1	3	female	1	1	2653	19.00
1231	1232	18.000000	NaN	S	10.5000	Fillbrook, Mr. Joseph Charles	0	2	male	0	0	C.A. 15185	18.00
1234	1235	58.000000	B51 B53 B55	C	512.3292	Cardeza, Mrs. James Warburton Martinez (Charlo...	1	1	female	0	1	PC 17755	58.00
1236	1237	16.000000	NaN	S	7.6500	Abelseth, Miss. Karen Marie	0	3	female	0	1	348125	16.00
1249	1250	29.881138	NaN	Q	7.7500	O'Keefe, Mr. Patrick	0	3	male	0	0	368402	NaN
1250	1251	30.000000	NaN	S	15.5500	Lindell, Mrs. Edvard Bengtsson (Elin Gerda Per...	0	3	female	1	1	349910	30.00
1257	1258	29.881138	NaN	C	14.4583	Caram, Mr. Joseph	0	3	male	1	0	2689	NaN
1258	1259	22.000000	NaN	S	39.6875	Riihivouri, Miss. Susanna Juhantytar Sanni""	0	3	female	0	1	3101295	22.00
1272	1273	26.000000	NaN	Q	7.8792	Foley, Mr. Joseph	0	3	male	0	0	330910	26.00
1274	1275	19.000000	NaN	S	16.1000	McNamee, Mrs. Neal (Eileen O'Leary)	0	3	female	1	1	376566	19.00
1276	1277	24.000000	NaN	S	65.0000	Herman, Miss. Kate	2	2	female	1	1	220845	24.00
1300	1301	3.000000	NaN	S	13.7750	Peacock, Miss. Treasteall	1	3	female	1	1	SOTON/O.Q. 3101315	3.00
1302	1303	37.000000	C78	Q	90.0000	Minahan, Mrs. William Edward (Lillian E Thorpe)	0	1	female	1	1	19928	37.00
1305	1306	39.000000	C105	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1	female	0	1	PC 17758	39.00
1308	1309	29.881138	NaN	C	22.3583	Peter, Master. Michael J	1	3	male	1	0	2668	NaN

	PassengerId	Age	Cabin	Embarked	Fare	Name	Parch	Pclass	Sex	SibSp	Survived	Ticket	AgeUsingMeanTitle	Title
5	6	29.881138	NaN	Q	8.4583	Moran, Mr. James	0	3	male	0	0	330877	30.585228	3
17	18	29.881138	NaN	S	13.0000	Williams, Mr. Charles Eugene	0	2	male	0	1	244373	30.585228	3
19	20	29.881138	NaN	C	7.2250	Masselmani, Mrs. Fatima	0	3	female	0	1	2649	28.687088	3
26	27	29.881138	NaN	C	7.2250	Emir, Mr. Farred Chehab	0	3	male	0	0	2631	30.585228	3
28	29	29.881138	NaN	Q	7.8792	O'Dwyer, Miss. Ellen "Nellie"	0	3	female	0	1	330959	28.687088	3
29	30	29.881138	NaN	S	7.8958	Todoroff, Mr. Lalio	0	3	male	0	0	349216	30.585228	3
31	32	29.881138	B78	C	146.5208	Spencer, Mrs. William Augustus (Marie Eugenie)	0	1	female	1	1	PC 17569	28.687088	3
32	33	29.881138	NaN	Q	7.7500	Glynn, Miss. Mary Agatha	0	3	female	0	1	335677	28.687088	3
36	37	29.881138	NaN	C	7.2292	Mamee, Mr. Hanna	0	3	male	0	1	2677	30.585228	3
42	43	29.881138	NaN	C	7.8958	Kraeff, Mr. Theodor	0	3	male	0	0	349253	30.585228	3
45	46	29.881138	NaN	S	8.0500	Rogers, Mr. William John	0	3	male	0	0	S.C./A.4. 23567	30.585228	3
46	47	29.881138	NaN	Q	15.5000	Lennon, Mr. Denis	0	3	male	1	0	370371	30.585228	3
47	48	29.881138	NaN	Q	7.7500	O'Driscoll, Miss. Bridget	0	3	female	0	1	14311	28.687088	3
48	49	29.881138	NaN	C	21.6792	Samaan, Mr. Youssef	0	3	male	2	0	2662	30.585228	3
55	56	29.881138	C52	S	35.5000	Woolner, Mr. Hugh	0	1	male	0	1	19947	30.585228	3
64	65	29.881138	NaN	C	27.7208	Stewart, Mr. Albert A	0	1	male	0	0	PC 17605	30.585228	3
65	66	29.881138	NaN	C	15.2458	Moubarek, Master. Gerios	1	3	male	1	1	2661	30.585228	3
76	77	29.881138	NaN	S	7.8958	Staneff, Mr. Ivan	0	3	male	0	0	349208	30.585228	3
77	78	29.881138	NaN	S	8.0500	Moutal, Mr. Rahamin Haim	0	3	male	0	0	374746	30.585228	3
82	83	29.881138	NaN	Q	7.7875	McDermott, Miss. Brigdet Delia	0	3	female	0	1	330932	28.687088	3
87	88	29.881138	NaN	S	8.0500	Slocovski, Mr. Selman Francis	0	3	male	0	0	SOTON/OQ 392086	30.585228	3
95	96	29.881138	NaN	S	8.0500	Shorney, Mr. Charles Joseph	0	3	male	0	0	374910	30.585228	3
101	102	29.881138	NaN	S	7.8958	Petroff, Mr. Pastcho ("Pentcho")	0	3	male	0	0	349215	30.585228	3
107	108	29.881138	NaN	S	7.7750	Moss, Mr. Albert Johan	0	3	male	0	1	312991	30.585228	3
109	110	29.881138	NaN	Q	24.1500	Moran, Miss. Bertha	0	3	female	1	1	371110	28.687088	3
121	122	29.881138	NaN	S	8.0500	Moore, Mr. Leonard Charles	0	3	male	0	0	A4. 54510	30.585228	3
126	127	29.881138	NaN	Q	7.7500	McMahon, Mr. Martin	0	3	male	0	0	370372	30.585228	3
128	129	29.881138	F E69	C	22.3583	Peter, Miss. Anna	1	3	female	1	1	2668	28.687088	3
140	141	29.881138	NaN	C	15.2458	Boulos, Mrs. Joseph (Sultana)	2	3	female	0	0	2678	28.687088	3
154	155	29.881138	NaN	S	7.3125	Olsen, Mr. Ole Martin	0	3	male	0	0	Fa 265302	30.585228	3
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1159	1160	29.881138	NaN	S	8.0500	Howard, Miss. May Elizabeth	0	3	female	0	1	A. 2. 39186	28.687088	3
1162	1163	29.881138	NaN	Q	7.7500	Fox, Mr. Patrick	0	3	male	0	0	368573	30.585228	3
1164	1165	29.881138	NaN	Q	15.5000	Lennon, Miss. Mary	0	3	female	1	1	370371	28.687088	3
1165	1166	29.881138	NaN	C	7.2250	Saade, Mr. Jean Nassr	0	3	male	0	0	2676	30.585228	3
1173	1174	29.881138	NaN	Q	7.7500	Fleming, Miss. Honora	0	3	female	0	1	364859	28.687088	3
1177	1178	29.881138	NaN	S	7.2500	Franklin, Mr. Charles (Charles Fardon)	0	3	male	0	0	SOTON/O.Q. 3101314	30.585228	3
1179	1180	29.881138	F E46	C	7.2292	Mardirosian, Mr. Sarkis	0	3	male	0	0	2655	30.585228	3
1180	1181	29.881138	NaN	S	8.0500	Ford, Mr. Arthur	0	3	male	0	0	A/5 1478	30.585228	3
1181	1182	29.881138	NaN	S	39.6000	Rheims, Mr. George Alexander Lucien	0	1	male	0	0	PC 17607	30.585228	3
1183	1184	29.881138	NaN	C	7.2292	Nasr, Mr. Mustafa	0	3	male	0	0	2652	30.585228	3
1188	1189	29.881138	NaN	C	21.6792	Samaan, Mr. Hanna	0	3	male	2	0	2662	30.585228	3
1192	1193	29.881138	D	C	15.0458	Malachard, Mr. Noel	0	2	male	0	0	237735	30.585228	3
1195	1196	29.881138	NaN	Q	7.7500	McCarthy, Miss. Catherine Katie""	0	3	female	0	1	383123	28.687088	3
1203	1204	29.881138	NaN	S	7.5750	Sadowitz, Mr. Harry	0	3	male	0	0	LP 1588	30.585228	3
1223	1224	29.881138	NaN	C	7.2250	Thomas, Mr. Tannous	0	3	male	0	0	2684	30.585228	3
1230	1231	29.881138	NaN	C	7.2292	Betros, Master. Seman	0	3	male	0	0	2622	30.585228	3
1233	1234	29.881138	NaN	S	69.5500	Sage, Mr. John George	9	3	male	1	0	CA. 2343	30.585228	3
1235	1236	29.881138	NaN	S	14.5000	van Billiard, Master. James William	1	3	male	1	0	A/5. 851	30.585228	3
1248	1249	29.881138	NaN	S	7.8792	Lockyer, Mr. Edward	0	3	male	0	0	1222	30.585228	3
1249	1250	29.881138	NaN	Q	7.7500	O'Keefe, Mr. Patrick	0	3	male	0	0	368402	30.585228	3
1256	1257	29.881138	NaN	S	69.5500	Sage, Mrs. John (Annie Bullen)	9	3	female	1	1	CA. 2343	28.687088	3
1257	1258	29.881138	NaN	C	14.4583	Caram, Mr. Joseph	0	3	male	1	0	2689	30.585228	3
1271	1272	29.881138	NaN	Q	7.7500	O'Connor, Mr. Patrick	0	3	male	0	0	366713	30.585228	3
1273	1274	29.881138	NaN	S	14.5000	Risien, Mrs. Samuel (Emma)	0	3	female	0	1	364498	28.687088	3
1275	1276	29.881138	NaN	S	12.8750	Wheeler, Mr. Edwin Frederick""	0	2	male	0	0	SC/PARIS 2159	30.585228	3
1299	1300	29.881138	NaN	Q	7.7208	Riordan, Miss. Johanna Hannah""	0	3	female	0	1	334915	28.687088	3
1301	1302	29.881138	NaN	Q	7.7500	Naughton, Miss. Hannah	0	3	female	0	1	365237	28.687088	3
1304	1305	29.881138	NaN	S	8.0500	Spector, Mr. Woolf	0	3	male	0	0	A.5. 3236	30.585228	3
1307	1308	29.881138	NaN	S	8.0500	Ware, Mr. Frederick	0	3	male	0	0	359309	30.585228	3
1308	1309	29.881138	NaN	C	22.3583	Peter, Master. Michael J	1	3	male	1	0	2668	30.585228	3

	PassengerId	Male	Pclass	Fare	FarePerPerson	Title	AgeUsingMeanTitle	AgeClass	SexClass	FamilySize	AgeSquared	AgeClassSquared	Survived
0	1	1	3	7.2500	3.62500	3	22.0	66.0	3	1	484.0	4356.0	0
1	2	0	1	71.2833	35.64165	3	38.0	38.0	0	1	1444.0	1444.0	1
2	3	0	3	7.9250	7.92500	3	26.0	78.0	0	0	676.0	6084.0	1
3	4	0	1	53.1000	26.55000	3	35.0	35.0	0	1	1225.0	1225.0	1
4	5	1	3	8.0500	8.05000	3	35.0	105.0	3	0	1225.0	11025.0	0