Titanic: Machine Learning from Disaster - Data Wrangling

Homepage: https://github.com/tien-le/kaggle-titanic

Unbelievable ... a score of 1.000 has been achieved. How did they do this?

Just curious, how did they cheat the score? ANS: Maybe because the passengers' actual fates are publicly available, e.g. at https://www.encyclopedia-titanica.org/titanic-victims/

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

  • Binary classification
  • Python and R basics

References

https://www.kaggle.com/c/titanic

https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/

https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/

https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic

Data overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Data Exploration

Five Steps: Variable Identification, Uni-variate Analysis, Bi-variate Analysis, Missing Values Imputation, Outlier Treatment

Step 1. Variable Identification

  • Identify Predictor (input) variables + Target (output) variables
  • Identify the data type and category of variables

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import random

In [4]:
trn_corpus = pd.read_csv("data/train.csv")

#889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S --> containing NaN
trn_corpus.set_index("PassengerId", inplace=True)
trn_corpus.info()
trn_corpus.describe()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
Out[4]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [5]:
trn_corpus.head()


Out[5]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [6]:
tst_corpus = pd.read_csv("data/test.csv")

tst_corpus.set_index("PassengerId", inplace=True)
tst_corpus.info()
tst_corpus.describe()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
Out[6]:
Pclass Age SibSp Parch Fare
count 418.000000 332.000000 418.000000 418.000000 417.000000
mean 2.265550 30.272590 0.447368 0.392344 35.627188
std 0.841838 14.181209 0.896760 0.981429 55.907576
min 1.000000 0.170000 0.000000 0.000000 0.000000
25% 1.000000 21.000000 0.000000 0.000000 7.895800
50% 3.000000 27.000000 0.000000 0.000000 14.454200
75% 3.000000 39.000000 1.000000 0.000000 31.500000
max 3.000000 76.000000 8.000000 9.000000 512.329200

In [7]:
tst_corpus.head()


Out[7]:
Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

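Adding Column "Survived" from file "gender_submission.csv"

Note: gender_submission.csv is the sample submission provided by Kaggle. It predicts that every female passenger survived and every male passenger died, so the "Survived" values added to the test set here are baseline predictions, not ground truth.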


In [8]:
expected_labels = pd.read_csv("data/gender_submission.csv")

expected_labels.set_index("PassengerId", inplace=True)
expected_labels.info()
expected_labels.describe()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 1 columns):
Survived    418 non-null int64
dtypes: int64(1)
memory usage: 6.5 KB
Out[8]:
Survived
count 418.000000
mean 0.363636
std 0.481622
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000

In [9]:
expected_labels.head()


Out[9]:
Survived
PassengerId
892 0
893 1
894 0
895 0
896 1

In [10]:
trn_corpus.index.names


Out[10]:
FrozenList(['PassengerId'])

In [11]:
expected_labels.index.names


Out[11]:
FrozenList(['PassengerId'])

In [12]:
#pd.merge(tst_corpus, expected_labels, how="inner", on="PassengerId")

tst_corpus_having_expected_label = pd.concat([tst_corpus, expected_labels], axis=1, join='inner')
tst_corpus_having_expected_label.head()


Out[12]:
Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
PassengerId
892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0
893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S 1
894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0
895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 1

In [13]:
print("Columns name: ", trn_corpus.columns)
print("Num of columns: ", len(trn_corpus.columns))
print("Num of rows: ", len(trn_corpus.index)) #trn_corpus.shape[0]

trn_corpus_size = len(trn_corpus.index)


Columns name:  Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Num of columns:  11
Num of rows:  891

In [14]:
print("Columns name: ", tst_corpus.columns)
print("Num of columns: ", len(tst_corpus.columns))
print("Num of rows: ", len(tst_corpus.index)) #tst_corpus.shape[0]

tst_corpus_size = len(tst_corpus.index)


Columns name:  Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')
Num of columns:  10
Num of rows:  418

Overview of Data using visualization


In [15]:
#sns.pairplot(trn_corpus.dropna())

In [16]:
#sns.pairplot(tst_corpus.dropna())

Concatenating trn_corpus and tst_corpus_having_expected_label using append


In [17]:
df = trn_corpus.append(tst_corpus_having_expected_label)

print("Columns name: ", df.columns)
print("Num of columns: ", len(df.columns))
print("Num of rows: ", len(df.index)) #trn_corpus.shape[0]

print("Sum of trn_corpus_size and tst_corpus_size: ", trn_corpus_size + tst_corpus_size)


Columns name:  Index(['Age', 'Cabin', 'Embarked', 'Fare', 'Name', 'Parch', 'Pclass', 'Sex',
       'SibSp', 'Survived', 'Ticket'],
      dtype='object')
Num of columns:  11
Num of rows:  1309
Sum of trn_corpus_size and tst_corpus_size:  1309
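
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A minimal sketch of the same concatenation with pd.concat follows. The two small frames (trn_small, tst_small) are hypothetical stand-ins built from a few rows shown earlier, not the full data.

```python
import pandas as pd

# Stand-in frames with a few values copied from the outputs above (not the full data).
trn_small = pd.DataFrame(
    {"Survived": [0, 1], "Pclass": [3, 1], "Fare": [7.25, 71.2833]},
    index=pd.Index([1, 2], name="PassengerId"),
)
tst_small = pd.DataFrame(
    {"Pclass": [3, 3], "Fare": [7.8292, 7.0], "Survived": [0, 1]},
    index=pd.Index([892, 893], name="PassengerId"),
)

# pd.concat stacks the rows and aligns the columns by name, not by position.
df_small = pd.concat([trn_small, tst_small])

print(df_small.shape)
print(list(df_small.columns))
```

With the real frames this would read pd.concat([trn_corpus, tst_corpus_having_expected_label]). The column order of the result may differ from the alphabetical order printed above, which comes from the older append behaviour.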

In [18]:
df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1046 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Fare        1308 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    1309 non-null int64
Ticket      1309 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
Out[18]:
Age Fare Parch Pclass SibSp Survived
count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 1309.000000
mean 29.881138 33.295479 0.385027 2.294882 0.498854 0.377387
std 14.413493 51.758668 0.865560 0.837836 1.041658 0.484918
min 0.170000 0.000000 0.000000 1.000000 0.000000 0.000000
25% 21.000000 7.895800 0.000000 2.000000 0.000000 0.000000
50% 28.000000 14.454200 0.000000 3.000000 0.000000 0.000000
75% 39.000000 31.275000 0.000000 3.000000 1.000000 1.000000
max 80.000000 512.329200 9.000000 3.000000 8.000000 1.000000

Answer for Step 1:

1. Predictor (input) Variables and Data type

  • PassengerId 891 non-null int64
  • Pclass 891 non-null int64
  • Name 891 non-null object
  • Sex 891 non-null object
  • Age 714 non-null float64
  • SibSp 891 non-null int64
  • Parch 891 non-null int64
  • Ticket 891 non-null object
  • Fare 891 non-null float64
  • Cabin 204 non-null object
  • Embarked 889 non-null object

2. Target (output) Variables and Data Type

  • Survived 891 non-null int64

3. Category of Variables

  • Continuous variables
    • PassengerId 891 non-null int64 #primary key
    • Age 714 non-null float64
    • Fare 891 non-null float64
  • Categorical variables
    • Name 891 non-null object
    • Sex 891 non-null object
    • Ticket 891 non-null object
    • Cabin 204 non-null object
    • Embarked 889 non-null object # embarked -- Port of Embarkation -- C = Cherbourg, Q = Queenstown, S = Southampton
    • SibSp 891 non-null int64 # # of siblings / spouses aboard the Titanic -- [1 0 3 4 2 5 8] ; 7 items
    • Parch 891 non-null int64 # # of parents / children aboard the Titanic -- [0 1 2 5 3 4 6] ; 7 items
    • Survived 891 non-null int64 #survival -- Survival -- 0 = No, 1 = Yes
    • Pclass 891 non-null int64 #pclass -- Ticket class -- 1 = 1st, 2 = 2nd, 3 = 3rd
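
The grouping above was done by hand. As a rough sketch, a similar split can be derived from the dtypes and the number of distinct values. The frame `sample` is a stand-in built from the first five training rows shown earlier, and the cut-off MAX_LEVELS is a hypothetical value chosen for this tiny sample, not a rule from the notebook.

```python
import pandas as pd

# Stand-in for trn_corpus: the first five rows from the head() output above.
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    }
)

MAX_LEVELS = 3  # hypothetical cut-off for "few distinct values"

numeric_cols = sample.select_dtypes(include="number").columns

# Non-numeric columns, and numeric columns with few distinct values,
# are treated as categorical; everything else as continuous.
categorical = [
    col for col in sample.columns
    if col not in numeric_cols or sample[col].nunique() <= MAX_LEVELS
]
continuous = [col for col in sample.columns if col not in categorical]

print("Categorical:", categorical)
print("Continuous:", continuous)
```

On the full data a larger cut-off would be needed, and the result should still be checked by eye, since an integer key such as PassengerId would otherwise be counted as continuous.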

Verify the unique values of each variable


In [19]:
#df.head()

In [20]:
#print("PassengerId:", df["PassengerId"].unique(), ";", df["PassengerId"].nunique(), "items")
print("Survived:", df["Survived"].unique(), ";", df["Survived"].nunique(), "items")
print("Pclass:", df["Pclass"].unique(), ";", df["Pclass"].nunique(), "Pclass")
#print("Name:", df["Name"].unique(), ";", df["Name"].nunique(), "items")
print("Sex:", df["Sex"].unique(), ";", df["Sex"].nunique(), "items")
#print("Age:", df["Age"].unique(), ";", df["Age"].nunique(), "items")
print("SibSp:", df["SibSp"].unique(), ";", df["SibSp"].nunique(), "items")
print("Parch:", df["Parch"].unique(), ";", df["Parch"].nunique(), "items")
#print("Ticket:", df["Ticket"].unique(), ";", df["Ticket"].nunique(), "items") # 681 items
#print("Fare:", df["Fare"].unique(), ";", df["Fare"].nunique(), "items") # 248 items
#print("Cabin:", df["Cabin"].unique(), ";", df["Cabin"].nunique(), "items") # 147 items
print("Embarked:", df["Embarked"].unique(), ";", df["Embarked"].nunique(), "items")


Survived: [0 1] ; 2 items
Pclass: [3 1 2] ; 3 Pclass
Sex: ['male' 'female'] ; 2 items
SibSp: [1 0 3 4 2 5 8] ; 7 items
Parch: [0 1 2 5 3 4 6 9] ; 8 items
Embarked: ['S' 'C' 'Q' nan] ; 3 items

In [21]:
trn_corpus.describe()


Out[21]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

We use read_csv because the data is stored as comma-separated values (CSV). Pandas automatically took the column names from the header row and inferred the data types. For large data sets it is recommended that you specify the data types manually.
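
As a small illustration of specifying data types manually, the sketch below reads two rows in the train.csv layout from an in-memory string. The particular types in column_types (int8, category) are assumptions chosen for illustration, not settings used elsewhere in this notebook.

```python
import io
import pandas as pd

# Two rows in the train.csv layout, taken from the head() output above.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

# Explicit dtypes replace type inference for the listed columns;
# columns that are not listed are still inferred.
column_types = {
    "Survived": "int8",
    "Pclass": "int8",
    "Sex": "category",
    "Age": "float64",
    "Fare": "float64",
    "Embarked": "category",
}

sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId", dtype=column_types)
print(sample.dtypes)
```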

Notice that the Age, Cabin and Embarked columns have null values. Also, we apparently have some free-loaders, because the minimum fare is 0. We might think that these are babies, so let’s check that:


In [22]:
trn_corpus[['Age','Fare']][trn_corpus.Fare < 5]


Out[22]:
Age Fare
PassengerId
180 36.0 0.0000
264 40.0 0.0000
272 25.0 0.0000
278 NaN 0.0000
303 19.0 0.0000
379 20.0 4.0125
414 NaN 0.0000
467 NaN 0.0000
482 NaN 0.0000
598 49.0 0.0000
634 NaN 0.0000
675 NaN 0.0000
733 NaN 0.0000
807 39.0 0.0000
816 NaN 0.0000
823 38.0 0.0000

These guys are surely old enough to know better! Notice, however, that the fares jump from 0 straight to about 4, so something is going on here. Most likely the zeros are errors, so the plan is to replace them with the mean fare for their class, and to do the same for null values.


In [23]:
df.nunique()


Out[23]:
Age           98
Cabin        186
Embarked       3
Fare         281
Name        1307
Parch          8
Pclass         3
Sex            2
SibSp          7
Survived       2
Ticket       929
dtype: int64

In [24]:
df["Fare"] = df["Fare"].fillna(0.0)  # assign back instead of calling fillna(inplace=True) on a column selection

In [25]:
df[df["Fare"].isnull()]


Out[25]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket
PassengerId

In [26]:
#first we set those fares of 0 to nan ==> Not used
#trn_corpus.Fare = trn_corpus.Fare.map(lambda x: np.nan if x==0 else x)
#df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)

In [27]:
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus


Out[27]:
Fare
Pclass
1 84.154687
2 20.662183
3 13.675550
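
The class means above are computed but not yet applied. A minimal sketch of the replacement described earlier (zero or missing fares replaced by the mean fare of the passenger's class) could look like this. The frame `fares` is a hypothetical stand-in, and its zero and missing entries are placed there purely for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in rows; the zero and the missing fare are illustrative placements.
fares = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3, 3],
        "Fare": [71.2833, 53.1, 0.0, 7.25, 8.05, np.nan],
    }
)

# Treat a zero fare as missing so that it does not pull the class mean down.
fare_clean = fares["Fare"].mask(fares["Fare"] == 0)

# Mean fare per class (NaN values are skipped), repeated for every row of that class.
class_mean = fare_clean.groupby(fares["Pclass"]).transform("mean")

# Keep known fares, fill the gaps with the class mean.
fares["Fare"] = fare_clean.fillna(class_mean)
print(fares)
```

This sketch takes the means from the same frame it fills. The notebook computes classmeans_trn_corpus from the training set only, which would be the safer source when filling test rows.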

In [28]:
df.nunique()


Out[28]:
Age           98
Cabin        186
Embarked       3
Fare         281
Name        1307
Parch          8
Pclass         3
Sex            2
SibSp          7
Survived       2
Ticket       929
dtype: int64

In [29]:
trn_corpus.nunique()


Out[29]:
Survived      2
Pclass        3
Name        891
Sex           2
Age          88
SibSp         7
Parch         7
Ticket      681
Fare        248
Cabin       147
Embarked      3
dtype: int64

In [30]:
#df.head()

In [31]:
#trn_corpus.head()

In [32]:
classmeans_trn_corpus.query('Pclass == 3')


Out[32]:
Fare
Pclass
3 13.67555

In [33]:
classmeans_trn_corpus.xs(3)["Fare"]


Out[33]:
13.675550101832997

In [34]:
classmeans_trn_corpus.query('Pclass == 3')


Out[34]:
Fare
Pclass
3 13.67555

Step 2. Uni-variate Analysis

In this step, we explore the variables one by one. The approach depends on the variable type: Continuous or Categorical.

Continuous Variables

--> Understanding the central tendency and spread of the variables.

  • Central Tendency: mean, mode, median, min, max
  • Measure of Dispersion: range, Quartile, IQR (Interquartile Range), Variance, Standard Deviation, Skewness, Kurtosis
  • Visualization Methods: Histogram, Box Plot
  • Continuous variables
    • Age 714 non-null float64
    • Fare 891 non-null float64
  • Central Tendency: mean, mode, median, min, max

In [35]:
print("Central Tendency - for Age")
trn_corpus["Age"].describe()


Central Tendency - for Age
Out[35]:
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [36]:
trn_corpus_Age_dropna = trn_corpus["Age"].dropna()

In [37]:
#Ref: https://docs.python.org/3/library/statistics.html
import statistics

corpus_stat = trn_corpus_Age_dropna.copy()
    
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Age")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)

print("Mode (most common value) of discrete data = ", statistics.mode(corpus_stat))  # use the NaN-free copy

print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Age_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.

print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat)) 
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))

print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Age"].mean())
print("Max = ", trn_corpus["Age"].max())
print("Min = ", trn_corpus["Age"].min())
print("Count = ", trn_corpus["Age"].count())


====================================
====================================
Averages and measures of central location - Age
These functions calculate an average or typical value from a population or sample.
------------------------------------
Mode (most common value) of discrete data =  24.0
Arithmetic mean (“average”) of data =  29.6991176471
Median (middle value) of data =  28.0
Median, or 50th percentile, of grouped data =  28.3
Low median of data =  28.0
High median of data =  28.0
------------------------------------
Arithmetic mean (“average”) of data =  29.6991176471
Max =  80.0
Min =  0.42
Count =  714

In [38]:
print("Central Tendency - for Fare")
trn_corpus["Fare"].describe()


Central Tendency - for Fare
Out[38]:
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [39]:
trn_corpus_Fare_dropna = trn_corpus["Fare"].dropna()

In [40]:
#Ref: https://docs.python.org/3/library/statistics.html
import statistics

corpus_stat = trn_corpus_Fare_dropna.copy()
    
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Fare")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)

print("Mode (most common value) of discrete data = ", statistics.mode(trn_corpus["Fare"]))

print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Fare_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.

print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat))
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))

print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Fare"].mean())
print("Max = ", trn_corpus["Fare"].max())
print("Min = ", trn_corpus["Fare"].min())
print("Count = ", trn_corpus["Fare"].count())


====================================
====================================
Averages and measures of central location - Fare
These functions calculate an average or typical value from a population or sample.
------------------------------------
Mode (most common value) of discrete data =  8.05
Arithmetic mean (“average”) of data =  32.2042079686
Median (middle value) of data =  14.4542
Median, or 50th percentile, of grouped data =  14.7399142857
Low median of data =  14.4542
High median of data =  14.4542
------------------------------------
Arithmetic mean (“average”) of data =  32.2042079686
Max =  512.3292
Min =  0.0
Count =  891
  • Measure of Dispersion: range, Quartile, IQR (Interquartile Range), Variance, Standard Deviation, Skewness, Kurtosis

Ref: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/generic.py#L5665-L5968

    For numeric data, the result's index will include ``count``,
    ``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
    upper percentiles. By default the lower percentile is ``25`` and the
    upper percentile is ``75``. The ``50`` percentile is the
    same as the median.

In [41]:
print("=" * 36)
print("=" * 36)

corpus_stat = trn_corpus_Age_dropna.copy()

print("Measures of spread - Age")
print("""These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.""")

print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))

print("-" * 36)
corpus_stat = trn_corpus["Age"].copy()

print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
q1, q2, q3 = corpus_stat.quantile([0.25, 0.50, 0.75])  # quantile() skips NaN, like describe()
print("Quartile 25%, 50%, 75% = ", q1, q2, q3)
print(corpus_stat.describe()[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", q3 - q1)
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())

print("Skewness = ", corpus_stat.skew()) 
print("Kurtosis = ", corpus_stat.kurtosis())


====================================
====================================
Measures of spread - Age
These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.
------------------------------------
Population standard deviation of data =  14.516321150817316
Population variance of data =  210.723579754
Sample standard deviation of data =  14.526497332334042
Sample variance of data =  211.019124746
------------------------------------
Range = max - min =  79.58
Quartile 25%, 50%, 75% =  20.125 28.0 38.0
25%    20.125
50%    28.000
75%    38.000
Name: Age, dtype: float64
IQR (Interquartile Range) = Q3-Q1 =  17.875
Variance =  211.019124746
Standard Deviation =  14.5264973323
Skewness =  0.389107782301
Kurtosis =  0.178274153642

Comments:


In [42]:
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f12913630>

In [43]:
trn_corpus["Age"].describe()


Out[43]:
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [44]:
print("=" * 36)
print("=" * 36)

corpus_stat = trn_corpus_Fare_dropna.copy()

print("Measures of spread - Fare")
print("""These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.""")

print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))

print("-" * 36)
corpus_stat = trn_corpus["Fare"].copy()

print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
q1, q2, q3 = corpus_stat.quantile([0.25, 0.50, 0.75])  # quantile() skips NaN, like describe()
print("Quartile 25%, 50%, 75% = ", q1, q2, q3)
print(corpus_stat.describe()[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", q3 - q1)
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())

print("Skewness = ", corpus_stat.skew()) 
print("Kurtosis = ", corpus_stat.kurtosis())


====================================
====================================
Measures of spread - Fare
These functions calculate a measure of how much the population or sample tends to deviate 
      from the typical or average values.
------------------------------------
Population standard deviation of data =  49.66553444477411
Population variance of data =  2466.66531169
Sample standard deviation of data =  49.6934285971809
Sample variance of data =  2469.43684574
------------------------------------
Range = max - min =  512.3292
Quartile 25%, 50%, 75% =  7.9104 14.4542 31.0
25%     7.9104
50%    14.4542
75%    31.0000
Name: Fare, dtype: float64
IQR (Interquartile Range) = Q3-Q1 =  23.0896
Variance =  2469.43684574
Standard Deviation =  49.6934285972
Skewness =  4.78731651967
Kurtosis =  33.3981408809

Comments:


In [45]:
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0fe290b8>

In [46]:
trn_corpus["Fare"].describe()


Out[46]:
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
  • Visualization Methods: Histogram, Box Plot

In [47]:
trn_corpus_Age_dropna.head()


Out[47]:
PassengerId
1    22.0
2    38.0
3    26.0
4    35.0
5    35.0
Name: Age, dtype: float64

In [48]:
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0fe357f0>

In [49]:
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Age_dropna, fit=norm, kde=False)



In [50]:
#ax = sns.distplot(trn_corpus_Age_dropna, vertical=True, color="y")

In [51]:
ax = sns.distplot(trn_corpus_Age_dropna, rug=True, rug_kws={"color": "g"},
                  kde_kws={"color": "k", "lw": 3, "label": "KDE"},
                  hist_kws={"histtype": "step", "linewidth": 3,
                  "alpha": 1, "color": "g"})



In [52]:
sns.boxplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0daaccc0>

For boxplots, the assumption when using a hue variable is that it is nested within the x or y variable. This means that by default, the boxes for different levels of hue will be offset, as you can see above. If your hue variable is not nested, you can set the dodge parameter to False to disable the offsetting. Ref: http://seaborn.pydata.org/tutorial/categorical.html


In [53]:
# Helper hue column; it is True for every row because Survived only takes the values 0 and 1
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Age", hue="survival", data=trn_corpus, dodge=False);



In [54]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)

In [55]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True)

In [56]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True, inner="stick", palette="Set3");

In [57]:
#sns.violinplot(x="Survived", y="Age", data=trn_corpus, inner=None)
#sns.swarmplot(x="Survived", y="Age", data=trn_corpus, color="w", alpha=.5);

In [58]:
ax = sns.distplot(trn_corpus_Fare_dropna, rug=True, rug_kws={"color": "g"},
                  kde_kws={"color": "k", "lw": 3, "label": "KDE"},
                  hist_kws={"histtype": "step", "linewidth": 3,
                  "alpha": 1, "color": "g"})



In [59]:
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)


Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0cd91b00>

In [60]:
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Fare_dropna, fit=norm, kde=False)



In [61]:
sns.boxplot(x="Survived", y="Fare", hue="Sex", data=trn_corpus)


Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0dc64f60>

In [62]:
# Same helper hue column as before (always True)
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Fare", hue="survival", data=trn_corpus, dodge=False);


Categorical Variables


In [63]:
sns.countplot(x = "Sex", data = trn_corpus)


Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0dba8240>

In [64]:
sns.barplot(x = "Sex", y = "Survived", data = trn_corpus, estimator=np.std)


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f0c77ac88>

Step 3. Bi-variate Analysis

Continuous & Continuous

Categorical & Categorical

Categorical & Continuous
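
The three pairings above are left without code in this notebook. As a hedged sketch of one common tool for each pairing (a correlation, a contingency table, and a per-group summary), using a stand-in frame built from the first five training rows shown earlier:

```python
import pandas as pd

# Stand-in for trn_corpus: the first five rows from the head() output above.
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    }
)

# Continuous & continuous: Pearson correlation between two numeric columns.
age_fare_corr = sample["Age"].corr(sample["Fare"])

# Categorical & categorical: contingency table of counts.
sex_by_survival = pd.crosstab(sample["Sex"], sample["Survived"])

# Categorical & continuous: mean of the continuous variable per category.
fare_by_class = sample.groupby("Pclass")["Fare"].mean()

print(age_fare_corr)
print(sex_by_survival)
print(fare_by_class)
```

Five rows are far too few to draw conclusions from; the same calls applied to trn_corpus would give the actual picture.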

Step 4. Missing/Special Value Treatment

Missing Value Treatment

Column "Age" - Missing Value

Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.


In [65]:
# Keep a copy of the raw Age column so that it can later be filled with the mean age of each Title (once Title exists)
trn_corpus["AgeUsingMeanTitle"] = trn_corpus["Age"] 

meanAge_trn_corpus = np.mean(trn_corpus["Age"])
trn_corpus["Age"] = trn_corpus["Age"].fillna(meanAge_trn_corpus)

trn_corpus.head()


Out[65]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked survival AgeUsingMeanTitle
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S True 22.0
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C True 38.0
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S True 26.0
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S True 35.0
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S True 35.0

In [66]:
# Keep a copy of the raw Age column so that it can later be filled with the mean age of each Title (once Title exists)
df["AgeUsingMeanTitle"] = df["Age"] 

meanAge_df = np.mean(df["Age"])
df["Age"] = df["Age"].fillna(meanAge_df)

df.head()


Out[66]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle
PassengerId
1 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 A/5 21171 22.0
2 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 PC 17599 38.0
3 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 STON/O2. 3101282 26.0
4 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 113803 35.0
5 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 373450 35.0

Column "Cabin" - Missing Value

For the cabin, since the majority of values are missing, the missingness may itself be informative, so one option is to set these to ‘Unknown’. The fill is left commented out below because a later step checks for NaN directly.


In [67]:
#trn_corpus["Cabin"] = trn_corpus["Cabin"].fillna('Unknown') # because we will check Nan in the next step
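
A minimal sketch of the idea described above, on a small made-up frame rather than on trn_corpus. The column names CabinKnown and CabinFilled are hypothetical and do not appear elsewhere in this notebook; writing to new columns keeps the original NaN values available for the later deck extraction.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column; the values are illustrative only.
cabin_demo = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", np.nan]})

# Keep the missingness as an explicit 0/1 indicator (hypothetical column name).
cabin_demo["CabinKnown"] = cabin_demo["Cabin"].notna().astype(int)

# Fill a copy with the placeholder label, leaving the original column untouched.
cabin_demo["CabinFilled"] = cabin_demo["Cabin"].fillna("Unknown")

print(cabin_demo)
```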

Column "Embarked" - Missing Value

We fill the missing values of Embarked with the most frequent value (the mode) of the column.


In [68]:
trn_corpus["Embarked"].describe()


Out[68]:
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [69]:
trn_corpus["Embarked"].describe()["top"]


Out[69]:
'S'

In [70]:
df["Embarked"].describe()["top"]


Out[70]:
'S'

In [71]:
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].describe()["top"])

df.head()


Out[71]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle
PassengerId
1 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 A/5 21171 22.0
2 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 PC 17599 38.0
3 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 STON/O2. 3101282 26.0
4 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 113803 35.0
5 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 373450 35.0

In [72]:
df["Embarked"].unique()


Out[72]:
array(['S', 'C', 'Q'], dtype=object)
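
The same mode-based fill can also be written with Series.mode() instead of describe()["top"]. The following is only a sketch on a made-up series, not on the notebook's df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column; the values are illustrative only.
embarked_demo = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# mode() ignores NaN and returns a Series (ties are possible), so take the first entry.
most_common_port = embarked_demo.mode()[0]

# Fill the missing entries with the most frequent port.
embarked_filled = embarked_demo.fillna(most_common_port)

print(most_common_port)
print(embarked_filled.tolist())
```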

Special Value Treatment --> Ex: Fare = 0.0


In [73]:
#note that a lambda is just a function we define on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus


Out[73]:
Fare
Pclass
1 84.154687
2 20.662183
3 13.675550

In [74]:
#note that a lambda is just a function we define on the fly
#calculate the mean fare for each class
classmeans_tst_corpus = tst_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_tst_corpus


Out[74]:
Fare
Pclass
1 94.280297
2 22.202104
3 12.459678

In [75]:
#note that a lambda is just a function we define on the fly
#calculate the mean fare for each class
classmeans_df = df.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_df


Out[75]:
Fare
Pclass
1 87.508992
2 21.179196
3 13.284126

In [76]:
classmeans_trn_corpus.xs(3)["Fare"]


Out[76]:
13.675550101832997

In [77]:
#Ref: https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/

#Remove Primary key (index)
trn_corpus.reset_index(inplace=True)
tst_corpus.reset_index(inplace=True)
df.reset_index(inplace=True)

In [78]:
trn_corpus.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId          891 non-null int64
Survived             891 non-null int64
Pclass               891 non-null int64
Name                 891 non-null object
Sex                  891 non-null object
Age                  891 non-null float64
SibSp                891 non-null int64
Parch                891 non-null int64
Ticket               891 non-null object
Fare                 891 non-null float64
Cabin                204 non-null object
Embarked             889 non-null object
survival             891 non-null bool
AgeUsingMeanTitle    714 non-null float64
dtypes: bool(1), float64(3), int64(5), object(5)
memory usage: 91.4+ KB

In [79]:
list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0]["PassengerId"])
#list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0].index) #because we did set_index to df
print(list_passenger_id_having_Fare_zero)

#so apply acts on dataframes, either row-wise or column-wise, axis=1 means rows
trn_corpus["Fare"] = trn_corpus[['Fare', 'Pclass']].apply(lambda x: classmeans_trn_corpus.xs(x['Pclass'])["Fare"]
                                                        if x['Fare']==0.0 else x['Fare'], axis=1 )

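#NOTE: the list above holds PassengerId values, but after reset_index() the index is 0-based,
#so index.isin(...) shows the neighbouring rows (index 180 is PassengerId 181), not the imputed ones.
#To display the imputed rows, filter on the column instead:
#trn_corpus[trn_corpus["PassengerId"].isin(list_passenger_id_having_Fare_zero)]
trn_corpus[trn_corpus.index.isin(list_passenger_id_having_Fare_zero)]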


[180, 264, 272, 278, 303, 414, 467, 482, 598, 634, 675, 733, 807, 816, 823]
Out[79]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked survival AgeUsingMeanTitle
180 181 0 3 Sage, Miss. Constance Gladys female 29.699118 8 2 CA. 2343 69.550 NaN S True NaN
264 265 0 3 Henry, Miss. Delia female 29.699118 0 0 382649 7.750 NaN Q True NaN
272 273 1 2 Mellinger, Mrs. (Elizabeth Anne Maidment) female 41.000000 0 1 250644 19.500 NaN S True 41.0
278 279 0 3 Rice, Master. Eric male 7.000000 4 1 382652 29.125 NaN Q True 7.0
303 304 1 2 Keane, Miss. Nora A female 29.699118 0 0 226593 12.350 E101 Q True NaN
414 415 1 3 Sundman, Mr. Johan Julian male 44.000000 0 0 STON/O 2. 3101269 7.925 NaN S True 44.0
467 468 0 1 Smart, Mr. John Montgomery male 56.000000 0 0 113792 26.550 NaN S True 56.0
482 483 0 3 Rouse, Mr. Richard Henry male 50.000000 0 0 A/5 3594 8.050 NaN S True 50.0
598 599 0 3 Boulos, Mr. Hanna male 29.699118 0 0 2664 7.225 NaN C True NaN
634 635 0 3 Skoog, Miss. Mabel female 9.000000 3 2 347088 27.900 NaN S True 9.0
675 676 0 3 Edvardsson, Mr. Gustaf Hjalmar male 18.000000 0 0 349912 7.775 NaN S True 18.0
733 734 0 2 Berriman, Mr. William John male 23.000000 0 0 28425 13.000 NaN S True 23.0
807 808 0 3 Pettersson, Miss. Ellen Natalia female 18.000000 0 0 347087 7.775 NaN S True 18.0
816 817 0 3 Heininen, Miss. Wendla Maria female 23.000000 0 0 STON/O2. 3101290 7.925 NaN S True 23.0
823 824 1 3 Moor, Mrs. (Beila) female 27.000000 0 1 392096 12.475 E121 S True 27.0
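
As a hedged alternative to the row-wise apply above, the zero fares can be replaced in a vectorized way with groupby(...).transform("mean"). This sketch runs on a small made-up frame; the rows are then selected through the PassengerId column rather than through the positional index.

```python
import pandas as pd

# Toy stand-in for trn_corpus after reset_index(); the values are illustrative only.
fare_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 0.0, 8.0, 0.0, 10.0],
})

# Mean fare of each class, broadcast back to every row. As with the pivot table
# above, the zero fares are still included when the mean is computed.
class_mean_fare = fare_demo.groupby("Pclass")["Fare"].transform("mean")

# Remember which passengers had a zero fare before overwriting it.
zero_fare = fare_demo["Fare"] == 0.0
zero_fare_ids = fare_demo.loc[zero_fare, "PassengerId"].tolist()

# Keep the fare where it is non-zero, otherwise use the class mean.
fare_demo["Fare"] = fare_demo["Fare"].where(~zero_fare, class_mean_fare)

# Select by the PassengerId column, not by the 0-based index.
imputed_rows = fare_demo[fare_demo["PassengerId"].isin(zero_fare_ids)]
print(imputed_rows)
```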

In [80]:
trn_corpus.index.names


Out[80]:
FrozenList([None])

In [81]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1046 non-null float64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB

In [82]:
#df["Fare"].unique() #contain nan from tst_corpus

In [83]:
classmeans_df


Out[83]:
Fare
Pclass
1 87.508992
2 21.179196
3 13.284126

In [84]:
#NOTE: despite its name, this list holds the PassengerIds whose original age is missing (AgeUsingMeanTitle is NaN)
list_passenger_id_having_Fare_zero_df = list(df[df["AgeUsingMeanTitle"].isnull()]["PassengerId"])
print(len(list_passenger_id_having_Fare_zero_df))

#replace a missing Fare with the mean fare of the passenger's class;
#np.isnan is used because an identity test ("is np.nan") does not reliably detect NaN values
df["Fare"] = df[['Fare', 'Pclass']].apply(lambda x: classmeans_df.xs(x['Pclass'])["Fare"]
                                                        if np.isnan(x['Fare']) else x['Fare'], axis=1 )

#NOTE: index.isin(...) compares the 0-based index with PassengerId values, so the rows shown are shifted by one
df[df.index.isin(list_passenger_id_having_Fare_zero_df)]


263
Out[84]:
PassengerId Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle
6 7 54.000000 E46 S 51.8625 McCarthy, Mr. Timothy J 0 1 male 0 0 17463 54.00
18 19 31.000000 NaN S 18.0000 Vander Planke, Mrs. Julius (Emelia Maria Vande... 0 3 female 1 0 345763 31.00
20 21 35.000000 NaN S 26.0000 Fynney, Mr. Joseph J 0 2 male 0 0 239865 35.00
27 28 19.000000 C23 C25 C27 S 263.0000 Fortune, Mr. Charles Alexander 2 1 male 3 0 19950 19.00
29 30 29.881138 NaN S 7.8958 Todoroff, Mr. Lalio 0 3 male 0 0 349216 NaN
30 31 40.000000 NaN C 27.7208 Uruchurtu, Don. Manuel E 0 1 male 0 0 PC 17601 40.00
32 33 29.881138 NaN Q 7.7500 Glynn, Miss. Mary Agatha 0 3 female 0 1 335677 NaN
33 34 66.000000 NaN S 10.5000 Wheadon, Mr. Edward H 0 2 male 0 0 C.A. 24579 66.00
37 38 21.000000 NaN S 8.0500 Cann, Mr. Ernest Charles 0 3 male 0 0 A./5. 2152 21.00
43 44 3.000000 NaN C 41.5792 Laroche, Miss. Simonne Marie Anne Andree 2 2 female 1 1 SC/Paris 2123 3.00
46 47 29.881138 NaN Q 15.5000 Lennon, Mr. Denis 0 3 male 1 0 370371 NaN
47 48 29.881138 NaN Q 7.7500 O'Driscoll, Miss. Bridget 0 3 female 0 1 14311 NaN
48 49 29.881138 NaN C 21.6792 Samaan, Mr. Youssef 0 3 male 2 0 2662 NaN
49 50 18.000000 NaN S 17.8000 Arnold-Franchi, Mrs. Josef (Josefine Franchi) 0 3 female 1 0 349237 18.00
56 57 21.000000 NaN S 10.5000 Rugg, Miss. Emily 0 2 female 0 1 C.A. 31026 21.00
65 66 29.881138 NaN C 15.2458 Moubarek, Master. Gerios 1 3 male 1 1 2661 NaN
66 67 29.000000 F33 S 10.5000 Nye, Mrs. (Elizabeth Ramell) 0 2 female 0 1 C.A. 29395 29.00
77 78 29.881138 NaN S 8.0500 Moutal, Mr. Rahamin Haim 0 3 male 0 0 374746 NaN
78 79 0.830000 NaN S 29.0000 Caldwell, Master. Alden Gates 2 2 male 0 1 248738 0.83
83 84 28.000000 NaN S 47.1000 Carrau, Mr. Francisco M 0 1 male 0 0 113059 28.00
88 89 23.000000 C23 C25 C27 S 263.0000 Fortune, Miss. Mabel Helen 2 1 female 3 1 19950 23.00
96 97 71.000000 A5 C 34.6542 Goldschmidt, Mr. George B 0 1 male 0 0 PC 17754 71.00
102 103 21.000000 D26 S 77.2875 White, Mr. Richard Frasar 1 1 male 0 0 35281 21.00
108 109 38.000000 NaN S 7.8958 Rekic, Mr. Tido 0 3 male 0 0 349249 38.00
110 111 47.000000 C110 S 52.0000 Porter, Mr. Walter Chamberlain 0 1 male 0 0 110465 47.00
122 123 32.500000 NaN C 30.0708 Nasser, Mr. Nicholas 0 2 male 1 0 237736 32.50
127 128 24.000000 NaN S 7.1417 Madsen, Mr. Fridtjof Arne 0 3 male 0 1 C 17369 24.00
129 130 45.000000 NaN S 6.9750 Ekstrom, Mr. Johan 0 3 male 0 0 347061 45.00
141 142 22.000000 NaN S 7.7500 Nysten, Miss. Anna Sofia 0 3 female 0 1 347081 22.00
155 156 51.000000 NaN C 61.3792 Williams, Mr. Charles Duane 1 1 male 0 0 PC 17597 51.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1159 1160 29.881138 NaN S 8.0500 Howard, Miss. May Elizabeth 0 3 female 0 1 A. 2. 39186 NaN
1160 1161 17.000000 NaN S 8.6625 Pokrnic, Mr. Mate 0 3 male 0 0 315095 17.00
1163 1164 26.000000 C89 C 136.7792 Clark, Mrs. Walter Miller (Virginia McDowell) 0 1 female 1 1 13508 26.00
1165 1166 29.881138 NaN C 7.2250 Saade, Mr. Jean Nassr 0 3 male 0 0 2676 NaN
1166 1167 20.000000 NaN S 26.0000 Bryhl, Miss. Dagmar Jenny Ingeborg 0 2 female 1 1 236853 20.00
1174 1175 9.000000 NaN C 15.2458 Touma, Miss. Maria Youssef 1 3 female 1 1 2650 9.00
1178 1179 24.000000 B45 S 82.2667 Snyder, Mr. John Pillsbury 0 1 male 1 0 21228 24.00
1180 1181 29.881138 NaN S 8.0500 Ford, Mr. Arthur 0 3 male 0 0 A/5 1478 NaN
1181 1182 29.881138 NaN S 39.6000 Rheims, Mr. George Alexander Lucien 0 1 male 0 0 PC 17607 NaN
1182 1183 30.000000 NaN Q 6.9500 Daly, Miss. Margaret Marcella Maggie"" 0 3 female 0 1 382650 30.00
1184 1185 53.000000 A34 S 81.8583 Dodge, Dr. Washington 1 1 male 1 0 33638 53.00
1189 1190 30.000000 NaN S 45.5000 Loring, Mr. Joseph Holland 0 1 male 0 0 113801 30.00
1193 1194 43.000000 NaN S 21.0000 Phillips, Mr. Escott Robert 1 2 male 0 0 S.O./P.P. 2 43.00
1196 1197 64.000000 B26 S 26.5500 Crosby, Mrs. Edward Gifford (Catherine Elizabe... 1 1 female 1 1 112901 64.00
1204 1205 37.000000 NaN Q 7.7500 Carr, Miss. Jeannie 0 3 female 0 1 368364 37.00
1224 1225 19.000000 NaN C 15.7417 Nakid, Mrs. Said (Waika Mary" Mowad)" 1 3 female 1 1 2653 19.00
1231 1232 18.000000 NaN S 10.5000 Fillbrook, Mr. Joseph Charles 0 2 male 0 0 C.A. 15185 18.00
1234 1235 58.000000 B51 B53 B55 C 512.3292 Cardeza, Mrs. James Warburton Martinez (Charlo... 1 1 female 0 1 PC 17755 58.00
1236 1237 16.000000 NaN S 7.6500 Abelseth, Miss. Karen Marie 0 3 female 0 1 348125 16.00
1249 1250 29.881138 NaN Q 7.7500 O'Keefe, Mr. Patrick 0 3 male 0 0 368402 NaN
1250 1251 30.000000 NaN S 15.5500 Lindell, Mrs. Edvard Bengtsson (Elin Gerda Per... 0 3 female 1 1 349910 30.00
1257 1258 29.881138 NaN C 14.4583 Caram, Mr. Joseph 0 3 male 1 0 2689 NaN
1258 1259 22.000000 NaN S 39.6875 Riihivouri, Miss. Susanna Juhantytar Sanni"" 0 3 female 0 1 3101295 22.00
1272 1273 26.000000 NaN Q 7.8792 Foley, Mr. Joseph 0 3 male 0 0 330910 26.00
1274 1275 19.000000 NaN S 16.1000 McNamee, Mrs. Neal (Eileen O'Leary) 0 3 female 1 1 376566 19.00
1276 1277 24.000000 NaN S 65.0000 Herman, Miss. Kate 2 2 female 1 1 220845 24.00
1300 1301 3.000000 NaN S 13.7750 Peacock, Miss. Treasteall 1 3 female 1 1 SOTON/O.Q. 3101315 3.00
1302 1303 37.000000 C78 Q 90.0000 Minahan, Mrs. William Edward (Lillian E Thorpe) 0 1 female 1 1 19928 37.00
1305 1306 39.000000 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1 female 0 1 PC 17758 39.00
1308 1309 29.881138 NaN C 22.3583 Peter, Master. Michael J 1 3 male 1 0 2668 NaN

262 rows × 13 columns


In [85]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1046 non-null float64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB

Step 5. Outlier Detection and Treatment

Feature Engineering

Variable Transformation & Variable/Feature Creation

Step 1. Variable Transformation


Step 2. Variable/Feature Creation

Titles

First up, the Name column is currently not being used, but we can at least extract the title from the name. There are quite a few titles going around, and the aim is to reduce them to Mrs, Miss, Mr and Master. To do this we need a function that searches for substrings; Python’s built-in str.find does what we need. Note that replace_titles below returns integer codes, and every title it does not explicitly remap (including Mr, Mrs, Miss and Master) receives the code 3.


In [86]:
def substrings_in_string(big_string, substrings):
    if big_string is np.nan:
        return np.nan
    #end if
    
    for substring in substrings:
        if big_string.find(substring) != -1:
            return substring
        #end if
    #end for
    
    print(big_string)
    return np.nan
#end def
 
#replacing all titles with mr, mrs, miss, master
def replace_titles(x):
    title=x['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 0 #'Mr'
    elif title in ['Countess', 'Mme']:
        return 1 #'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 2 #'Miss'
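    elif title == 'Dr':
        #NOTE: Sex is stored in lowercase ('male'/'female'), so this comparison with 'Male'
        #is never true and every 'Dr' receives code 1; compare with 'male' to split by sex
        if x['Sex'] == 'Male':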
            return 0 #'Mr'
        else:
            return 1 #'Mrs'
    else:
        return 3 #title
    #end if
#end def

title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                    'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                    'Don', 'Jonkheer']

df['Title'] = df['Name'].map(lambda x: substrings_in_string(x, title_list))
    
df['Title'] = df.apply(replace_titles, axis=1)

In [87]:
df.head()


Out[87]:
PassengerId Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle Title
0 1 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 A/5 21171 22.0 3
1 2 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 PC 17599 38.0 3
2 3 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 STON/O2. 3101282 26.0 3
3 4 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 113803 35.0 3
4 5 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 373450 35.0 3
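
The output above shows Title equal to 3 for every displayed row, because the four common titles all share one code. A sketch of a variant that keeps Mr, Mrs, Miss and Master as separate labels is given below. It is not the notebook's method: the regular expression, the helper simplify_title and the fallback rule (rare titles mapped by Sex) are assumptions made for illustration, and the names are taken from rows shown elsewhere in this notebook.

```python
import pandas as pd

# A few names copied from the outputs in this notebook.
names = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
        "Rice, Master. Eric",
        "Uruchurtu, Don. Manuel E",
    ],
    "Sex": ["male", "female", "female", "male", "male"],
})

# The title is the text between the comma and the first period.
names["RawTitle"] = names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

def simplify_title(row):
    """Hypothetical helper: reduce a raw title to Mr, Mrs, Miss or Master."""
    title = row["RawTitle"]
    if title in ("Mr", "Mrs", "Miss", "Master"):
        return title
    if title in ("Mme", "Countess"):
        return "Mrs"
    if title in ("Mlle", "Ms"):
        return "Miss"
    # Assumption: any other rare title falls back on the passenger's sex.
    return "Mr" if row["Sex"] == "male" else "Mrs"

names["Title"] = names.apply(simplify_title, axis=1)
print(names[["Name", "RawTitle", "Title"]])
```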

Column "Age" - Missing Value - Using a Group Mean

Now we fill the missing ages in AgeUsingMeanTitle with a group mean instead of the overall mean. Note that, despite the column name, the code below groups by Sex rather than by Title. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.


In [88]:
trn_corpus[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()


Out[88]:
AgeUsingMeanTitle
Sex
female 27.915709
male 30.726645

In [89]:
df[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()


Out[89]:
AgeUsingMeanTitle
Sex
female 28.687088
male 30.585228

In [90]:
#Method 2 - Using pivot table
mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index

mean_title_trn_corpus


Out[90]:
AgeUsingMeanTitle
Sex
female 27.915709
male 30.726645

In [91]:
#Method 2 - Using pivot table
mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index

mean_title_df


Out[91]:
AgeUsingMeanTitle
Sex
female 28.687088
male 30.585228

In [92]:
mean_title_df.xs("male")["AgeUsingMeanTitle"]


Out[92]:
30.585227963525838

In [93]:
#list(df["AgeUsingMeanTitle"].unique())

In [94]:
list_passenger_id_having_Age_nan = list(df[df["AgeUsingMeanTitle"].isnull()]["PassengerId"])

#list_passenger_id_having_Age_nan

In [95]:
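#fill the remaining missing ages with the mean age of the passenger's Sex group;
#assigning the result back avoids relying on inplace=True on a column selection
df["AgeUsingMeanTitle"] = df["AgeUsingMeanTitle"].fillna(df.groupby("Sex")["AgeUsingMeanTitle"].transform("mean"))

df[df["PassengerId"].isin(list_passenger_id_having_Age_nan)]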


Out[95]:
PassengerId Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle Title
5 6 29.881138 NaN Q 8.4583 Moran, Mr. James 0 3 male 0 0 330877 30.585228 3
17 18 29.881138 NaN S 13.0000 Williams, Mr. Charles Eugene 0 2 male 0 1 244373 30.585228 3
19 20 29.881138 NaN C 7.2250 Masselmani, Mrs. Fatima 0 3 female 0 1 2649 28.687088 3
26 27 29.881138 NaN C 7.2250 Emir, Mr. Farred Chehab 0 3 male 0 0 2631 30.585228 3
28 29 29.881138 NaN Q 7.8792 O'Dwyer, Miss. Ellen "Nellie" 0 3 female 0 1 330959 28.687088 3
29 30 29.881138 NaN S 7.8958 Todoroff, Mr. Lalio 0 3 male 0 0 349216 30.585228 3
31 32 29.881138 B78 C 146.5208 Spencer, Mrs. William Augustus (Marie Eugenie) 0 1 female 1 1 PC 17569 28.687088 3
32 33 29.881138 NaN Q 7.7500 Glynn, Miss. Mary Agatha 0 3 female 0 1 335677 28.687088 3
36 37 29.881138 NaN C 7.2292 Mamee, Mr. Hanna 0 3 male 0 1 2677 30.585228 3
42 43 29.881138 NaN C 7.8958 Kraeff, Mr. Theodor 0 3 male 0 0 349253 30.585228 3
45 46 29.881138 NaN S 8.0500 Rogers, Mr. William John 0 3 male 0 0 S.C./A.4. 23567 30.585228 3
46 47 29.881138 NaN Q 15.5000 Lennon, Mr. Denis 0 3 male 1 0 370371 30.585228 3
47 48 29.881138 NaN Q 7.7500 O'Driscoll, Miss. Bridget 0 3 female 0 1 14311 28.687088 3
48 49 29.881138 NaN C 21.6792 Samaan, Mr. Youssef 0 3 male 2 0 2662 30.585228 3
55 56 29.881138 C52 S 35.5000 Woolner, Mr. Hugh 0 1 male 0 1 19947 30.585228 3
64 65 29.881138 NaN C 27.7208 Stewart, Mr. Albert A 0 1 male 0 0 PC 17605 30.585228 3
65 66 29.881138 NaN C 15.2458 Moubarek, Master. Gerios 1 3 male 1 1 2661 30.585228 3
76 77 29.881138 NaN S 7.8958 Staneff, Mr. Ivan 0 3 male 0 0 349208 30.585228 3
77 78 29.881138 NaN S 8.0500 Moutal, Mr. Rahamin Haim 0 3 male 0 0 374746 30.585228 3
82 83 29.881138 NaN Q 7.7875 McDermott, Miss. Brigdet Delia 0 3 female 0 1 330932 28.687088 3
87 88 29.881138 NaN S 8.0500 Slocovski, Mr. Selman Francis 0 3 male 0 0 SOTON/OQ 392086 30.585228 3
95 96 29.881138 NaN S 8.0500 Shorney, Mr. Charles Joseph 0 3 male 0 0 374910 30.585228 3
101 102 29.881138 NaN S 7.8958 Petroff, Mr. Pastcho ("Pentcho") 0 3 male 0 0 349215 30.585228 3
107 108 29.881138 NaN S 7.7750 Moss, Mr. Albert Johan 0 3 male 0 1 312991 30.585228 3
109 110 29.881138 NaN Q 24.1500 Moran, Miss. Bertha 0 3 female 1 1 371110 28.687088 3
121 122 29.881138 NaN S 8.0500 Moore, Mr. Leonard Charles 0 3 male 0 0 A4. 54510 30.585228 3
126 127 29.881138 NaN Q 7.7500 McMahon, Mr. Martin 0 3 male 0 0 370372 30.585228 3
128 129 29.881138 F E69 C 22.3583 Peter, Miss. Anna 1 3 female 1 1 2668 28.687088 3
140 141 29.881138 NaN C 15.2458 Boulos, Mrs. Joseph (Sultana) 2 3 female 0 0 2678 28.687088 3
154 155 29.881138 NaN S 7.3125 Olsen, Mr. Ole Martin 0 3 male 0 0 Fa 265302 30.585228 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1159 1160 29.881138 NaN S 8.0500 Howard, Miss. May Elizabeth 0 3 female 0 1 A. 2. 39186 28.687088 3
1162 1163 29.881138 NaN Q 7.7500 Fox, Mr. Patrick 0 3 male 0 0 368573 30.585228 3
1164 1165 29.881138 NaN Q 15.5000 Lennon, Miss. Mary 0 3 female 1 1 370371 28.687088 3
1165 1166 29.881138 NaN C 7.2250 Saade, Mr. Jean Nassr 0 3 male 0 0 2676 30.585228 3
1173 1174 29.881138 NaN Q 7.7500 Fleming, Miss. Honora 0 3 female 0 1 364859 28.687088 3
1177 1178 29.881138 NaN S 7.2500 Franklin, Mr. Charles (Charles Fardon) 0 3 male 0 0 SOTON/O.Q. 3101314 30.585228 3
1179 1180 29.881138 F E46 C 7.2292 Mardirosian, Mr. Sarkis 0 3 male 0 0 2655 30.585228 3
1180 1181 29.881138 NaN S 8.0500 Ford, Mr. Arthur 0 3 male 0 0 A/5 1478 30.585228 3
1181 1182 29.881138 NaN S 39.6000 Rheims, Mr. George Alexander Lucien 0 1 male 0 0 PC 17607 30.585228 3
1183 1184 29.881138 NaN C 7.2292 Nasr, Mr. Mustafa 0 3 male 0 0 2652 30.585228 3
1188 1189 29.881138 NaN C 21.6792 Samaan, Mr. Hanna 0 3 male 2 0 2662 30.585228 3
1192 1193 29.881138 D C 15.0458 Malachard, Mr. Noel 0 2 male 0 0 237735 30.585228 3
1195 1196 29.881138 NaN Q 7.7500 McCarthy, Miss. Catherine Katie"" 0 3 female 0 1 383123 28.687088 3
1203 1204 29.881138 NaN S 7.5750 Sadowitz, Mr. Harry 0 3 male 0 0 LP 1588 30.585228 3
1223 1224 29.881138 NaN C 7.2250 Thomas, Mr. Tannous 0 3 male 0 0 2684 30.585228 3
1230 1231 29.881138 NaN C 7.2292 Betros, Master. Seman 0 3 male 0 0 2622 30.585228 3
1233 1234 29.881138 NaN S 69.5500 Sage, Mr. John George 9 3 male 1 0 CA. 2343 30.585228 3
1235 1236 29.881138 NaN S 14.5000 van Billiard, Master. James William 1 3 male 1 0 A/5. 851 30.585228 3
1248 1249 29.881138 NaN S 7.8792 Lockyer, Mr. Edward 0 3 male 0 0 1222 30.585228 3
1249 1250 29.881138 NaN Q 7.7500 O'Keefe, Mr. Patrick 0 3 male 0 0 368402 30.585228 3
1256 1257 29.881138 NaN S 69.5500 Sage, Mrs. John (Annie Bullen) 9 3 female 1 1 CA. 2343 28.687088 3
1257 1258 29.881138 NaN C 14.4583 Caram, Mr. Joseph 0 3 male 1 0 2689 30.585228 3
1271 1272 29.881138 NaN Q 7.7500 O'Connor, Mr. Patrick 0 3 male 0 0 366713 30.585228 3
1273 1274 29.881138 NaN S 14.5000 Risien, Mrs. Samuel (Emma) 0 3 female 0 1 364498 28.687088 3
1275 1276 29.881138 NaN S 12.8750 Wheeler, Mr. Edwin Frederick"" 0 2 male 0 0 SC/PARIS 2159 30.585228 3
1299 1300 29.881138 NaN Q 7.7208 Riordan, Miss. Johanna Hannah"" 0 3 female 0 1 334915 28.687088 3
1301 1302 29.881138 NaN Q 7.7500 Naughton, Miss. Hannah 0 3 female 0 1 365237 28.687088 3
1304 1305 29.881138 NaN S 8.0500 Spector, Mr. Woolf 0 3 male 0 0 A.5. 3236 30.585228 3
1307 1308 29.881138 NaN S 8.0500 Ware, Mr. Frederick 0 3 male 0 0 359309 30.585228 3
1308 1309 29.881138 NaN C 22.3583 Peter, Master. Michael J 1 3 male 1 0 2668 30.585228 3

263 rows × 14 columns
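
The section heading mentions a mean per Title, while the cell above groups by Sex. A sketch of the per-Title version is shown below on a made-up frame. It assumes a Title column holding string labels, which differs from the integer codes produced earlier; the overall-mean fallback for titles with no known age is also an assumption.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative only.
age_demo = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan, np.nan],
})

# Mean age of each title, broadcast back to every row of that title.
title_mean_age = age_demo.groupby("Title")["Age"].transform("mean")

# Fill missing ages with the mean of the passenger's title.
age_demo["AgeFilled"] = age_demo["Age"].fillna(title_mean_age)

# A title with no known age at all ("Miss" here) is still NaN, so fall back
# on the overall mean of the observed ages.
age_demo["AgeFilled"] = age_demo["AgeFilled"].fillna(age_demo["Age"].mean())

print(age_demo)
```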


Cabin

This is going to be very similar. We have a ‘Cabin’ column that is not doing much: most values are missing (only 295 of 1309 rows have one). Cabins are recorded mainly for 1st class passengers, although a few 3rd class passengers have one too. A cabin number looks like ‘C123’. The letter refers to the deck, so we’re going to extract it just like the titles.


In [96]:
#df["Cabin"].unique()

In [97]:
df["Cabin"].nunique()


Out[97]:
186

In [98]:
#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'UNK']
df['Deck1']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))

#df.head()

In [99]:
#Task: How to get the Deck from Cabin
#Method 2
def get_deck_from_cabin(strCabin):
    if strCabin is np.nan:
        return np.nan
    #end if
    
    return strCabin[0]
#end def

df["Deck2"] = df["Cabin"].apply(get_deck_from_cabin)

#df.head()

Question: Are the columns Deck1 and Deck2 the same?


In [100]:
print(df["Deck1"].unique())
print(df["Deck1"].nunique())


[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
8

In [101]:
print(df["Deck2"].unique())
print(df["Deck2"].nunique())


[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
8

In [102]:
df[df["Deck1"].fillna("UNK") != df["Deck2"].fillna("UNK")]


Out[102]:
PassengerId Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket AgeUsingMeanTitle Title Deck1 Deck2
128 129 29.881138 F E69 C 22.3583 Peter, Miss. Anna 1 3 female 1 1 2668 28.687088 3 E F
1179 1180 29.881138 F E46 C 7.2292 Mardirosian, Mr. Sarkis 0 3 male 0 0 2655 30.585228 3 E F
1212 1213 25.000000 F E57 C 7.2292 Krekorian, Mr. Neshan 0 3 male 0 0 2654 25.000000 3 E F

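Comment: The two columns differ only for cabins such as "F E69". The substring search used for Deck1 tests the letters in the order of cabin_list, so it finds 'E' before 'F', while Deck2 takes the first character of the string. We will use the values in column "Deck2".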
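
The difference between the two approaches can be reproduced without pandas. The helper names below are hypothetical; they mirror the logic of substrings_in_string and get_deck_from_cabin for non-missing cabin strings only.

```python
# Same search order as the first eight entries of cabin_list above.
DECKS = ("A", "B", "C", "D", "E", "F", "T", "G")

def first_matching_deck(cabin, decks=DECKS):
    """Return the first deck letter from `decks` that occurs anywhere in `cabin`."""
    for deck in decks:
        if deck in cabin:
            return deck
    return None

def leading_deck(cabin):
    """Return the first character of the cabin string."""
    return cabin[0]

for cabin in ["C85", "C23 C25 C27", "F E69"]:
    print(cabin, first_matching_deck(cabin), leading_deck(cabin))
```

For "F E69" the search returns 'E' because 'E' is tested before 'F', whereas the first character is 'F'.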

Family Size

One way to create new features is to take linear combinations of existing features. In a model like linear regression this should be unnecessary, but a decision tree may find it hard to model such relationships. Reading the forums at Kaggle, some people have considered the size of a person’s family, i.e. the sum of their ‘SibSp’ and ‘Parch’ attributes. Perhaps people traveling alone did better? On the other hand, if you had a family, you might have risked your life looking for them, or even given up a space in a lifeboat to them. Let’s throw that into the mix.


In [103]:
#Creating new family_size column
df['FamilySize']=df['SibSp']+df['Parch']

#df.head()
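
One way to probe the question raised above is to compare the survival rate across family sizes. The sketch below uses made-up rows, so its numbers say nothing about the real passengers; on the notebook's data the same groupby would be restricted to the training rows, where Survived is known.

```python
import pandas as pd

# Toy data; the values are illustrative only.
family_demo = pd.DataFrame({
    "SibSp": [1, 0, 0, 3, 0, 1],
    "Parch": [0, 0, 0, 2, 0, 1],
    "Survived": [0, 1, 0, 0, 1, 1],
})

# Same definition as above: siblings/spouses plus parents/children.
family_demo["FamilySize"] = family_demo["SibSp"] + family_demo["Parch"]

# Share of survivors for each family size (the mean of a 0/1 column is a rate).
survival_by_size = family_demo.groupby("FamilySize")["Survived"].mean()
print(survival_by_size)
```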

AgeClass

This is an interaction term: since age and class are both numbers, we can simply multiply them.


In [104]:
df['AgeClass']=df['AgeUsingMeanTitle']*df['Pclass']

#df.head()

Adding Male column

This is the “sex” variable in the data set from kaggle. I’ve just changed male/female to 1/0.


In [105]:
sex = {'male':1, 'female':0}
df["Male"] = df['Sex'].map(sex)

#df.head()

SexClass

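This is an interaction term: since Male (0 or 1) and Pclass are both numbers, we can simply multiply them. The result is 0 for every female passenger and equal to Pclass for every male passenger.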


In [106]:
df['SexClass']=df['Male']*df['Pclass']

#df.head()

Fare per Person

Here we divide the fare by the number of family members traveling together, counting the passenger themselves (FamilySize + 1). I’m not exactly sure what this represents, but it’s easy enough to add in.


In [107]:
df['FarePerPerson']=df['Fare']/(df['FamilySize']+1)

#df.head()
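
As a quick worked example, the first three passengers shown in the outputs of this notebook have fares 7.25, 71.2833 and 7.925 with SibSp values 1, 1, 0 and Parch values 0, 0, 0. The sketch below recomputes FarePerPerson for just these rows.

```python
import pandas as pd

# Values copied from the first three rows of df shown in this notebook.
fare_person_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

fare_person_demo["FamilySize"] = fare_person_demo["SibSp"] + fare_person_demo["Parch"]

# The +1 counts the passenger themselves, so the divisor is never zero.
fare_person_demo["FarePerPerson"] = fare_person_demo["Fare"] / (fare_person_demo["FamilySize"] + 1)

print(fare_person_demo)
```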

AgeSquared

Here we square "AgeUsingMeanTitle".

"AgeUsingMeanTitle" is the age of the passenger, with missing values replaced by a group mean. In this notebook the groups are defined by Sex, so a passenger with a missing age receives the mean age of all passengers of the same sex.


In [108]:
df["AgeSquared"]=df["AgeUsingMeanTitle"]**2

#df.head()

AgeClassSquared

Here we use "AgeClass" squared.


In [109]:
df["AgeClassSquared"]=df['AgeClass']**2

#df.head()

Creating Dummy Variables


In [110]:
df.head()


Out[110]:
PassengerId Age Cabin Embarked Fare Name Parch Pclass Sex SibSp ... Title Deck1 Deck2 FamilySize AgeClass Male SexClass FarePerPerson AgeSquared AgeClassSquared
0 1 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 ... 3 NaN NaN 1 66.0 1 3 3.62500 484.0 4356.0
1 2 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 ... 3 C C 1 38.0 0 0 35.64165 1444.0 1444.0
2 3 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 ... 3 NaN NaN 0 78.0 0 0 7.92500 676.0 6084.0
3 4 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 ... 3 C C 1 35.0 0 0 26.55000 1225.0 1225.0
4 5 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 3 male 0 ... 3 NaN NaN 0 105.0 1 3 8.05000 1225.0 11025.0

5 rows × 23 columns



In [111]:
df.describe()


Out[111]:
PassengerId Age Fare Parch Pclass SibSp Survived AgeUsingMeanTitle Title FamilySize AgeClass Male SexClass FarePerPerson AgeSquared AgeClassSquared
count 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000
mean 655.000000 29.881138 33.270043 0.385027 2.294882 0.498854 0.377387 29.909496 2.941176 0.883881 64.692851 0.644003 1.527884 20.502540 1060.582027 5193.522645
std 378.020061 12.883193 51.747063 0.865560 0.837836 1.041658 0.484918 12.889182 0.391491 1.583639 31.766784 0.478997 1.309876 35.765156 888.665904 4866.021451
min 1.000000 0.170000 0.000000 0.000000 1.000000 0.000000 0.000000 0.170000 0.000000 0.000000 0.510000 0.000000 0.000000 0.000000 0.028900 0.260100
25% 328.000000 22.000000 7.895800 0.000000 2.000000 0.000000 0.000000 22.000000 3.000000 0.000000 42.000000 0.000000 0.000000 7.452767 484.000000 1764.000000
50% 655.000000 29.881138 14.454200 0.000000 3.000000 0.000000 0.000000 30.000000 3.000000 0.000000 62.000000 1.000000 2.000000 8.458300 900.000000 3844.000000
75% 982.000000 35.000000 31.275000 0.000000 3.000000 1.000000 1.000000 35.000000 3.000000 1.000000 90.000000 1.000000 3.000000 24.150000 1225.000000 8100.000000
max 1309.000000 80.000000 512.329200 9.000000 3.000000 8.000000 1.000000 80.000000 3.000000 10.000000 222.000000 1.000000 3.000000 512.329200 6400.000000 49284.000000

In [112]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
PassengerId          1309 non-null int64
Age                  1309 non-null float64
Cabin                295 non-null object
Embarked             1309 non-null object
Fare                 1309 non-null float64
Name                 1309 non-null object
Parch                1309 non-null int64
Pclass               1309 non-null int64
Sex                  1309 non-null object
SibSp                1309 non-null int64
Survived             1309 non-null int64
Ticket               1309 non-null object
AgeUsingMeanTitle    1309 non-null float64
Title                1309 non-null int64
Deck1                295 non-null object
Deck2                295 non-null object
FamilySize           1309 non-null int64
AgeClass             1309 non-null float64
Male                 1309 non-null int64
SexClass             1309 non-null int64
FarePerPerson        1309 non-null float64
AgeSquared           1309 non-null float64
AgeClassSquared      1309 non-null float64
dtypes: float64(7), int64(9), object(7)
memory usage: 235.3+ KB

The features kept for modelling are:

  • Male - the “Sex” variable from the Kaggle data set, with male/female recoded as 1/0.
  • Pclass - unchanged from the Pclass variable in the Kaggle data set.
  • Fare - unchanged from the Fare variable in the Kaggle data set.
  • FarePerPerson - Fare divided by the number of people travelling together (SibSp + Parch + 1).
  • Title - the title extracted from each passenger's Name, encoded as an integer code by replace_titles.
  • AgeUsingMeanTitle - the age of the passenger, with missing values replaced by the mean age of passengers of the same Sex (the code above groups by Sex, not by Title).
  • AgeClass - AgeUsingMeanTitle multiplied by Pclass.
  • SexClass - Male (0 or 1) multiplied by Pclass.
  • FamilySize - SibSp + Parch.
  • AgeSquared - AgeUsingMeanTitle squared.
  • AgeClassSquared - AgeClass squared.


In [123]:
df_train_test = df[["PassengerId","Male", "Pclass","Fare","FarePerPerson","Title",
            "AgeUsingMeanTitle","AgeClass","SexClass","FamilySize","AgeSquared","AgeClassSquared","Survived"]]

In [124]:
df_train_test.describe()


Out[124]:
PassengerId Male Pclass Fare FarePerPerson Title AgeUsingMeanTitle AgeClass SexClass FamilySize AgeSquared AgeClassSquared Survived
count 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000
mean 655.000000 0.644003 2.294882 33.270043 20.502540 2.941176 29.909496 64.692851 1.527884 0.883881 1060.582027 5193.522645 0.377387
std 378.020061 0.478997 0.837836 51.747063 35.765156 0.391491 12.889182 31.766784 1.309876 1.583639 888.665904 4866.021451 0.484918
min 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.170000 0.510000 0.000000 0.000000 0.028900 0.260100 0.000000
25% 328.000000 0.000000 2.000000 7.895800 7.452767 3.000000 22.000000 42.000000 0.000000 0.000000 484.000000 1764.000000 0.000000
50% 655.000000 1.000000 3.000000 14.454200 8.458300 3.000000 30.000000 62.000000 2.000000 0.000000 900.000000 3844.000000 0.000000
75% 982.000000 1.000000 3.000000 31.275000 24.150000 3.000000 35.000000 90.000000 3.000000 1.000000 1225.000000 8100.000000 1.000000
max 1309.000000 1.000000 3.000000 512.329200 512.329200 3.000000 80.000000 222.000000 3.000000 10.000000 6400.000000 49284.000000 1.000000

In [125]:
df_train_test.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId          1309 non-null int64
Male                 1309 non-null int64
Pclass               1309 non-null int64
Fare                 1309 non-null float64
FarePerPerson        1309 non-null float64
Title                1309 non-null int64
AgeUsingMeanTitle    1309 non-null float64
AgeClass             1309 non-null float64
SexClass             1309 non-null int64
FamilySize           1309 non-null int64
AgeSquared           1309 non-null float64
AgeClassSquared      1309 non-null float64
Survived             1309 non-null int64
dtypes: float64(6), int64(7)
memory usage: 133.0 KB

In [126]:
df_train_test.head()


Out[126]:
PassengerId Male Pclass Fare FarePerPerson Title AgeUsingMeanTitle AgeClass SexClass FamilySize AgeSquared AgeClassSquared Survived
0 1 1 3 7.2500 3.62500 3 22.0 66.0 3 1 484.0 4356.0 0
1 2 0 1 71.2833 35.64165 3 38.0 38.0 0 1 1444.0 1444.0 1
2 3 0 3 7.9250 7.92500 3 26.0 78.0 0 0 676.0 6084.0 1
3 4 0 1 53.1000 26.55000 3 35.0 35.0 0 1 1225.0 1225.0 1
4 5 1 3 8.0500 8.05000 3 35.0 105.0 3 0 1225.0 11025.0 0

In [117]:
print("Num of rows in Training corpus: ", trn_corpus_size) 
print("Num of rows in Testing corpus: ", tst_corpus_size)


Num of rows in Training corpus:  891
Num of rows in Testing corpus:  418

In [127]:
df_train_test.columns


Out[127]:
Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')

In [128]:
len(df_train_test.columns)


Out[128]:
13

In [116]:
#Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#from sklearn.model_selection import train_test_split #Split arrays or matrices into random train and test subsets

In [132]:
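# iloc's end bound is exclusive, so [:trn_corpus_size] keeps all 891 training rows;
# slicing to trn_corpus_size - 1 would silently drop the last training passenger.
trn_corpus_after_preprocessing = df_train_test.iloc[:trn_corpus_size, :].copy()

#trn_corpus_after_preprocessing
print(len(trn_corpus_after_preprocessing["AgeUsingMeanTitle"]))


891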

In [133]:
trn_corpus_after_preprocessing.columns


Out[133]:
Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')

In [134]:
tst_corpus_after_preprocessing = df_train_test.iloc[trn_corpus_size:,:].copy()

#tst_corpus_after_preprocessing

In [136]:
trn_corpus_after_preprocessing.to_csv("output/trn_corpus_after_preprocessing.csv", index=False, header=True)
tst_corpus_after_preprocessing.to_csv("output/tst_corpus_after_preprocessing.csv", index=False, header=True)
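
A minimal sketch of splitting a stacked train+test frame back into its two parts by row count, with a sanity check that no row is lost. The `split_combined` helper and the toy frame are hypothetical and not part of the notebook. It relies on `iloc` slices having an exclusive end bound, so `iloc[:n]` returns exactly `n` rows.

```python
import pandas as pd

def split_combined(df, n_train):
    """Hypothetical helper: split a frame stacked as train rows followed by test rows."""
    train = df.iloc[:n_train].copy()   # rows 0 .. n_train - 1 (end bound is exclusive)
    test = df.iloc[n_train:].copy()    # every remaining row
    # Guard against off-by-one slicing: every row must land in exactly one part.
    assert len(train) + len(test) == len(df)
    return train, test

# Toy stand-in for the combined frame: 7 "training" rows followed by 3 "test" rows.
combined = pd.DataFrame({"PassengerId": range(1, 11), "Survived": [0, 1] * 5})
train_part, test_part = split_combined(combined, n_train=7)
print(len(train_part), len(test_part))
```

With the notebook's own sizes, the same check would expect 891 training rows and 418 test rows out of 1309.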
