Homepage: https://github.com/tien-le/kaggle-titanic
Unbelievable ... some submissions achieve a score of 1.000. How did they do that?
Just curious: how did they game the score? Possible answer: the full passenger outcomes are publicly available at https://www.encyclopedia-titanica.org/titanic-victims/
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
https://www.kaggle.com/c/titanic
https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/
https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
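For each test passenger the prediction ends up as one row of a two-column submission file. A minimal sketch of that shape (the row values here are made up; the real test set has 418 passengers, with PassengerId starting at 892):

```python
import pandas as pd

# Hypothetical predictions for three test passengers -- illustrative only.
submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                           "Survived": [0, 1, 0]})
print(submission.shape)  # (3, 2)
```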
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable | Notes |
---|---|
pclass | A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower |
age | Age is fractional if less than 1. If the age is estimated, it is in the form xx.5 |
sibsp | The dataset defines family relations this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored) |
parch | The dataset defines family relations this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them. |
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random
In [4]:
trn_corpus = pd.read_csv("data/train.csv")
#889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S --> containing NaN
trn_corpus.set_index("PassengerId", inplace=True)
trn_corpus.info()
trn_corpus.describe()
Out[4]:
In [5]:
trn_corpus.head()
Out[5]:
In [6]:
tst_corpus = pd.read_csv("data/test.csv")
tst_corpus.set_index("PassengerId", inplace=True)
tst_corpus.info()
tst_corpus.describe()
Out[6]:
In [7]:
tst_corpus.head()
Out[7]:
In [8]:
expected_labels = pd.read_csv("data/gender_submission.csv")
expected_labels.set_index("PassengerId", inplace=True)
expected_labels.info()
expected_labels.describe()
Out[8]:
In [9]:
expected_labels.head()
Out[9]:
In [10]:
trn_corpus.index.names
Out[10]:
In [11]:
expected_labels.index.names
Out[11]:
In [12]:
#pd.merge(tst_corpus, expected_labels, how="inner", on="PassengerId")
tst_corpus_having_expected_label = pd.concat([tst_corpus, expected_labels], axis=1, join='inner')
tst_corpus_having_expected_label.head()
Out[12]:
In [13]:
print("Column names: ", trn_corpus.columns)
print("Num of columns: ", len(trn_corpus.columns))
print("Num of rows: ", len(trn_corpus.index)) #trn_corpus.shape[0]
trn_corpus_size = len(trn_corpus.index)
In [14]:
print("Column names: ", tst_corpus.columns)
print("Num of columns: ", len(tst_corpus.columns))
print("Num of rows: ", len(tst_corpus.index)) #tst_corpus.shape[0]
tst_corpus_size = len(tst_corpus.index)
In [15]:
#sns.pairplot(trn_corpus.dropna())
In [16]:
#sns.pairplot(tst_corpus.dropna())
In [17]:
df = pd.concat([trn_corpus, tst_corpus_having_expected_label])  # DataFrame.append is deprecated; concat is equivalent here
print("Column names: ", df.columns)
print("Num of columns: ", len(df.columns))
print("Num of rows: ", len(df.index)) #trn_corpus.shape[0]
print("Sum of trn_corpus_size and tst_corpus_size: ", trn_corpus_size + tst_corpus_size)
In [18]:
df.info()
df.describe()
Out[18]:
1. Predictor (input) variables and data type
2. Target (output) variable and data type
3. Category of variables
Categorical variables:
Embarked 889 non-null object # Port of Embarkation -- C = Cherbourg, Q = Queenstown, S = Southampton
SibSp 891 non-null int64 # number of siblings / spouses aboard the Titanic -- [1 0 3 4 2 5 8]; 7 items
Verify the unique values in each variable.
In [19]:
#df.head()
In [20]:
#print("PassengerId:", df["PassengerId"].unique(), ";", df["PassengerId"].nunique(), "items")
print("Survived:", df["Survived"].unique(), ";", df["Survived"].nunique(), "items")
print("Pclass:", df["Pclass"].unique(), ";", df["Pclass"].nunique(), "items")
#print("Name:", df["Name"].unique(), ";", df["Name"].nunique(), "items")
print("Sex:", df["Sex"].unique(), ";", df["Sex"].nunique(), "items")
#print("Age:", df["Age"].unique(), ";", df["Age"].nunique(), "items")
print("SibSp:", df["SibSp"].unique(), ";", df["SibSp"].nunique(), "items")
print("Parch:", df["Parch"].unique(), ";", df["Parch"].nunique(), "items")
#print("Ticket:", df["Ticket"].unique(), ";", df["Ticket"].nunique(), "items") # 681 items
#print("Fare:", df["Fare"].unique(), ";", df["Fare"].nunique(), "items") # 248 items
#print("Cabin:", df["Cabin"].unique(), ";", df["Cabin"].nunique(), "items") # 147 items
print("Embarked:", df["Embarked"].unique(), ";", df["Embarked"].nunique(), "items")
In [21]:
trn_corpus.describe()
Out[21]:
We use read_csv since the data is in comma-separated-values form. Pandas automatically took the column names from the header and inferred the data types. For large data sets it is recommended that you specify the data types manually.
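A hedged sketch of that recommendation, using a tiny in-memory CSV standing in for data/train.csv (the dtype mapping pins the types instead of letting pandas infer them):

```python
import io
import pandas as pd

# Two made-up rows with the Titanic column names from the data dictionary above.
csv_text = "PassengerId,Pclass,Sex,Fare\n1,3,male,7.25\n2,1,female,71.2833\n"
dtypes = {"PassengerId": "int64", "Pclass": "int64",
          "Sex": "object", "Fare": "float64"}
small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small.dtypes.to_dict())
```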
Notice that the Age, Cabin and Embarked columns have null values. Also, we apparently have some free-loaders, because the minimum fare is 0. We might think these are babies, so let's check that:
In [22]:
trn_corpus[['Age','Fare']][trn_corpus.Fare < 5]
Out[22]:
These passengers are surely old enough to know better! But notice there is a jump from a fare of 0 straight to 4, so something is going on here; most likely the zeros are errors. Let's replace them with the mean fare for their class, and do the same for null values.
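The replacement just described can also be sketched in one pass with `groupby`/`transform` on a toy frame (the notebook below does it with a pivot table and `apply` instead):

```python
import numpy as np
import pandas as pd

# Toy frame: one zero fare in class 3 that should become the class-3 mean (9.0).
toy = pd.DataFrame({"Pclass": [1, 3, 3, 3],
                    "Fare":   [70.0, 0.0, 8.0, 10.0]})
toy["Fare"] = toy["Fare"].replace(0.0, np.nan)                # treat 0 as missing
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")  # NaN ignored in the mean
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy["Fare"].tolist())  # [70.0, 9.0, 8.0, 10.0]
```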
In [23]:
df.nunique()
Out[23]:
In [24]:
df["Fare"].fillna(0.0, inplace = True)
In [25]:
df[df["Fare"].isnull()]
Out[25]:
In [26]:
#first we set those fares of 0 to nan ==> Not used
#trn_corpus.Fare = trn_corpus.Fare.map(lambda x: np.nan if x==0 else x)
#df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)
In [27]:
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus
Out[27]:
In [28]:
df.nunique()
Out[28]:
In [29]:
trn_corpus.nunique()
Out[29]:
In [30]:
#df.head()
In [31]:
#trn_corpus.head()
In [32]:
classmeans_trn_corpus.query('Pclass == 3')
Out[32]:
In [33]:
classmeans_trn_corpus.xs(3)["Fare"]
Out[33]:
In [34]:
classmeans_trn_corpus.query('Pclass == 3')
Out[34]:
Continuous Variables
--> Understanding the central tendency and spread of the variables.
In [35]:
print("Central Tendency - for Age")
trn_corpus["Age"].describe()
Out[35]:
In [36]:
trn_corpus_Age_dropna = trn_corpus["Age"].dropna()
In [37]:
#Ref: https://docs.python.org/3/library/statistics.html
import statistics
corpus_stat = trn_corpus_Age_dropna.copy()
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Age")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)
print("Mode (most common value) of discrete data = ", statistics.mode(trn_corpus["Age"]))
print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Age_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.
print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat))
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))
print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Age"].mean())
print("Max = ", trn_corpus["Age"].max())
print("Min = ", trn_corpus["Age"].min())
print("Count = ", trn_corpus["Age"].count())
In [38]:
print("Central Tendency - for Fare")
trn_corpus["Fare"].describe()
Out[38]:
In [39]:
trn_corpus_Fare_dropna = trn_corpus["Fare"].dropna()
In [40]:
#Ref: https://docs.python.org/3/library/statistics.html
import statistics
corpus_stat = trn_corpus_Fare_dropna.copy()
print("=" * 36)
print("=" * 36)
print("Averages and measures of central location - Fare")
print("These functions calculate an average or typical value from a population or sample.")
print("-" * 36)
print("Mode (most common value) of discrete data = ", statistics.mode(trn_corpus["Fare"]))
print("Arithmetic mean (“average”) of data = ", statistics.mean(corpus_stat))
#print("Harmonic mean of data = ", statistics.harmonic_mean(trn_corpus_Age_dropna))
#StatisticsError is raised if data is empty, or any element is less than zero. New in version 3.6.
print("Median (middle value) of data = ", statistics.median(corpus_stat))
print("Median, or 50th percentile, of grouped data = ", statistics.median_grouped(corpus_stat))
print("Low median of data = ", statistics.median_low(corpus_stat))
print("High median of data = ", statistics.median_high(corpus_stat))
print("-" * 36)
#Method 2 - Using DataFrame
print("Arithmetic mean (“average”) of data = ", trn_corpus["Fare"].mean())
print("Max = ", trn_corpus["Fare"].max())
print("Min = ", trn_corpus["Fare"].min())
print("Count = ", trn_corpus["Fare"].count())
Ref: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/generic.py#L5665-L5968
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.
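A quick illustration of that docstring on a toy series, including the non-default `percentiles` argument:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
stats = s.describe(percentiles=[0.1, 0.9])  # the 50% row is always included
print(stats[["10%", "50%", "90%"]])
```

With linear interpolation this gives 10% = 1.9, 50% = 5.5 (the median), 90% = 9.1.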
In [41]:
print("=" * 36)
print("=" * 36)
corpus_stat = trn_corpus_Age_dropna.copy()
print("Measures of spread - Age")
print("""These functions calculate a measure of how much the population or sample tends to deviate
from the typical or average values.""")
print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))
print("-" * 36)
corpus_stat = trn_corpus["Age"].copy()
print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
desc = corpus_stat.describe()
print("Quartile 25%, 50%, 75% = ", desc['25%'], desc['50%'], desc['75%'])
print(desc[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", desc['75%'] - desc['25%'])
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())
print("Skewness = ", corpus_stat.skew())
print("Kurtosis = ", corpus_stat.kurtosis())
Comments:
In [42]:
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)
Out[42]:
In [43]:
trn_corpus["Age"].describe()
Out[43]:
In [44]:
print("=" * 36)
print("=" * 36)
corpus_stat = trn_corpus_Fare_dropna.copy()
print("Measures of spread - Fare")
print("""These functions calculate a measure of how much the population or sample tends to deviate
from the typical or average values.""")
print("-" * 36)
print("Population standard deviation of data = ", statistics.pstdev(corpus_stat))
print("Population variance of data = ", statistics.pvariance(corpus_stat))
print("Sample standard deviation of data = ", statistics.stdev(corpus_stat))
print("Sample variance of data = ", statistics.variance(corpus_stat))
print("-" * 36)
corpus_stat = trn_corpus["Fare"].copy()
print("Range = max - min = ", corpus_stat.max() - corpus_stat.min())
desc = corpus_stat.describe()
print("Quartile 25%, 50%, 75% = ", desc['25%'], desc['50%'], desc['75%'])
print(desc[['25%','50%','75%']])
print("IQR (Interquartile Range) = Q3-Q1 = ", desc['75%'] - desc['25%'])
print("Variance = ", corpus_stat.var())
print("Standard Deviation = ", corpus_stat.std())
print("Skewness = ", corpus_stat.skew())
print("Kurtosis = ", corpus_stat.kurtosis())
Comments:
In [45]:
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)
Out[45]:
In [46]:
trn_corpus["Fare"].describe()
Out[46]:
In [47]:
trn_corpus_Age_dropna.head()
Out[47]:
In [48]:
sns.distplot(trn_corpus_Age_dropna, rug=True, hist=True)
Out[48]:
In [49]:
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Age_dropna, fit=norm, kde=False)
In [50]:
#ax = sns.distplot(trn_corpus_Age_dropna, vertical=True, color="y")
In [51]:
ax = sns.distplot(trn_corpus_Age_dropna, rug=True, rug_kws={"color": "g"},
kde_kws={"color": "k", "lw": 3, "label": "KDE"},
hist_kws={"histtype": "step", "linewidth": 3,
"alpha": 1, "color": "g"})
In [52]:
sns.boxplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)
Out[52]:
For boxplots, the assumption when using a hue variable is that it is nested within the x or y variable. This means that by default, the boxes for different levels of hue will be offset, as you can see above. If your hue variable is not nested, you can set the dodge parameter to disable offsetting: Ref: http://seaborn.pydata.org/tutorial/categorical.html
In [53]:
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Age", hue="survival", data=trn_corpus, dodge=False);
In [54]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus)
In [55]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True)
In [56]:
#sns.violinplot(x="Survived", y="Age", hue="Sex", data=trn_corpus, split=True, inner="stick", palette="Set3");
In [57]:
#sns.violinplot(x="Survived", y="Age", data=trn_corpus, inner=None)
#sns.swarmplot(x="Survived", y="Age", data=trn_corpus, color="w", alpha=.5);
In [58]:
ax = sns.distplot(trn_corpus_Fare_dropna, rug=True, rug_kws={"color": "g"},
kde_kws={"color": "k", "lw": 3, "label": "KDE"},
hist_kws={"histtype": "step", "linewidth": 3,
"alpha": 1, "color": "g"})
In [59]:
sns.distplot(trn_corpus_Fare_dropna, rug=True, hist=True)
Out[59]:
In [60]:
#Plot the distribution with a histogram and maximum likelihood gaussian distribution fit
from scipy.stats import norm
ax = sns.distplot(trn_corpus_Fare_dropna, fit=norm, kde=False)
In [61]:
sns.boxplot(x="Survived", y="Fare", hue="Sex", data=trn_corpus)
Out[61]:
In [62]:
trn_corpus["survival"] = trn_corpus["Survived"].isin([0, 1])
sns.boxplot(x="Survived", y="Fare", hue="survival", data=trn_corpus, dodge=False);
Categorical Variables
In [63]:
sns.countplot(x = "Sex", data = trn_corpus)
Out[63]:
In [64]:
sns.barplot(x = "Sex", y = "Survived", data = trn_corpus, estimator=np.std)
Out[64]:
Column "Age" - Missing Value
Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.
In [65]:
# Duplicate one column Age in order to Fillna with meanAge of each Title (After having Title)
trn_corpus["AgeUsingMeanTitle"] = trn_corpus["Age"]
meanAge_trn_corpus = np.mean(trn_corpus["Age"])
trn_corpus["Age"] = trn_corpus["Age"].fillna(meanAge_trn_corpus)
trn_corpus.head()
Out[65]:
In [66]:
# Duplicate one column Age in order to Fillna with meanAge of each Title (After having Title)
df["AgeUsingMeanTitle"] = df["Age"]
meanAge_df = np.mean(df["Age"])
df["Age"] = df["Age"].fillna(meanAge_df)
df.head()
Out[66]:
Column "Cabin" - Missing Value
Now for the cabin, since the majority of values are missing, it might be best to treat that as a piece of information itself, so we’ll set these to be ‘Unknown’.
In [67]:
#trn_corpus["Cabin"] = trn_corpus["Cabin"].fillna('Unknown') # because we will check Nan in the next step
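Had we kept the placeholder approach, the fill itself is a one-liner; a sketch on a toy series:

```python
import numpy as np
import pandas as pd

# Toy Cabin column with two missing entries.
cabin = pd.Series(["C85", np.nan, "E46", np.nan])
cabin_filled = cabin.fillna("Unknown")
print(cabin_filled.tolist())  # ['C85', 'Unknown', 'E46', 'Unknown']
```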
Column "Embarked" - Missing Value
We fill NaN values in the Embarked column with its most frequent value (the mode).
In [68]:
trn_corpus["Embarked"].describe()
Out[68]:
In [69]:
trn_corpus["Embarked"].describe()["top"]
Out[69]:
In [70]:
df["Embarked"].describe()["top"]
Out[70]:
In [71]:
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].describe()["top"])
df.head()
Out[71]:
In [72]:
df["Embarked"].unique()
Out[72]:
In [73]:
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_trn_corpus = trn_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_trn_corpus
Out[73]:
In [74]:
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_tst_corpus = tst_corpus.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_tst_corpus
Out[74]:
In [75]:
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
classmeans_df = df.pivot_table('Fare', index = 'Pclass', aggfunc = "mean") #np.mean
classmeans_df
Out[75]:
In [76]:
classmeans_trn_corpus.xs(3)["Fare"]
Out[76]:
In [77]:
#Ref: https://triangleinequality.wordpress.com/2013/05/19/machine-learning-with-python-first-steps-munging/
#Remove Primary key (index)
trn_corpus.reset_index(inplace=True)
tst_corpus.reset_index(inplace=True)
df.reset_index(inplace=True)
In [78]:
trn_corpus.info()
In [79]:
list_passenger_id_having_Fare_zero = list(trn_corpus[trn_corpus["Fare"] == 0.0]["PassengerId"])
print(list_passenger_id_having_Fare_zero)
#apply acts on dataframes, either row-wise or column-wise; axis=1 means rows
trn_corpus["Fare"] = trn_corpus[['Fare', 'Pclass']].apply(lambda x: classmeans_trn_corpus.xs(x['Pclass'])["Fare"]
                                                          if x['Fare'] == 0.0 else x['Fare'], axis=1)
trn_corpus[trn_corpus["PassengerId"].isin(list_passenger_id_having_Fare_zero)]  # index was reset above, so match on the column
Out[79]:
In [80]:
trn_corpus.index.names
Out[80]:
In [81]:
df.info()
In [82]:
#df["Fare"].unique() #contain nan from tst_corpus
In [83]:
classmeans_df
Out[83]:
In [84]:
list_passenger_id_having_Fare_zero_df = list(df[df["Fare"] == 0.0]["PassengerId"])
print(len(list_passenger_id_having_Fare_zero_df))
#apply acts on dataframes, either row-wise or column-wise; axis=1 means rows
#Fare NaN was filled with 0.0 in In [24], so testing == 0.0 catches both the original
#zero fares and the formerly missing one; `x['Fare'] is np.nan` is not a reliable NaN test
df["Fare"] = df[['Fare', 'Pclass']].apply(lambda x: classmeans_df.xs(x['Pclass'])["Fare"]
                                          if x['Fare'] == 0.0 else x['Fare'], axis=1)
df[df["PassengerId"].isin(list_passenger_id_having_Fare_zero_df)]
Out[84]:
In [85]:
df.info()
First up the Name column is currently not being used, but we can at least extract the title from the name. There are quite a few titles going around, but I want to reduce them all to Mrs, Miss, Mr and Master. To do this we’ll need a function that searches for substrings. Thankfully the library ‘string’ has just what we need.
In [86]:
def substrings_in_string(big_string, substrings):
    if big_string is np.nan:
        return np.nan
    #end if
    for substring in substrings:
        if big_string.find(substring) != -1:
            return substring
        #end if
    #end for
    print(big_string)
    return np.nan
#end def

#replacing all titles with mr, mrs, miss, master
def replace_titles(x):
    title = x['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 0 #'Mr'
    elif title in ['Countess', 'Mme']:
        return 1 #'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 2 #'Miss'
    elif title == 'Dr':
        if x['Sex'] == 'male':  # Sex values are lowercase in this dataset
            return 0 #'Mr'
        else:
            return 1 #'Mrs'
    else:
        return 3 #title
    #end if
#end def

title_list = ['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
              'Dr', 'Ms', 'Mlle', 'Col', 'Capt', 'Mme', 'Countess',
              'Don', 'Jonkheer']
df['Title'] = df['Name'].map(lambda x: substrings_in_string(x, title_list))
df['Title'] = df.apply(replace_titles, axis=1)
In [87]:
df.head()
Out[87]:
Column "Age" - Missing Value - Using Mean for each Title
Now let's refine the age imputation: rather than the overall mean, replace missing values with a group mean (the cells below group by Sex).
In [88]:
trn_corpus[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()
Out[88]:
In [89]:
df[["Sex", "AgeUsingMeanTitle"]].groupby("Sex").mean()
Out[89]:
In [90]:
#Method 2 - Using pivot table
mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_trn_corpus = trn_corpus.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index
mean_title_trn_corpus
Out[90]:
In [91]:
#Method 2 - Using pivot table
mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = "Sex", aggfunc= "mean") #np.mean
#mean_title_df = df.pivot_table("AgeUsingMeanTitle", index = ["Sex"], aggfunc= "mean").reset_index() #if having index
mean_title_df
Out[91]:
In [92]:
mean_title_df.xs("male")["AgeUsingMeanTitle"]
Out[92]:
In [93]:
#list(df["AgeUsingMeanTitle"].unique())
In [94]:
list_passenger_id_having_Age_nan = list(df[df["AgeUsingMeanTitle"].isnull()]["PassengerId"])
#list_passenger_id_having_Age_nan
In [95]:
df["AgeUsingMeanTitle"].fillna(df.groupby("Sex")["AgeUsingMeanTitle"].transform("mean"), inplace=True)
df[df["PassengerId"].apply(lambda x: x in list_passenger_id_having_Age_nan)]
Out[95]:
In [96]:
#df["Cabin"].unique()
In [97]:
df["Cabin"].nunique()
Out[97]:
In [98]:
#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'UNK']
df['Deck1']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))
#df.head()
In [99]:
#Task: How to get the Deck from Cabin
#Method 2
def get_deck_from_cabin(strCabin):
    if strCabin is np.nan:
        return np.nan
    #end if
    return strCabin[0]  # the deck is the first letter of the cabin number
#end def
df["Deck2"] = df["Cabin"].apply(get_deck_from_cabin)
#df.head()
In [100]:
print(df["Deck1"].unique())
print(df["Deck1"].nunique())
In [101]:
print(df["Deck2"].unique())
print(df["Deck2"].nunique())
In [102]:
df[df["Deck1"].fillna("UNK") != df["Deck2"].fillna("UNK")]
Out[102]:
Comment: We will use the values in column "Deck2".
One thing you can do to create new features is to take linear combinations of existing ones. In a model like linear regression this should be unnecessary, but a decision tree may find it hard to model such relationships. Reading the forums at Kaggle, some people have considered the size of a person's family, the sum of their 'SibSp' and 'Parch' attributes. Perhaps people travelling alone did better? Or, on the other hand, perhaps if you had a family you might have risked your life looking for them, or even given up a space in a lifeboat for them. Let's throw that into the mix.
In [103]:
#Creating new family_size column
df['FamilySize']=df['SibSp']+df['Parch']
#df.head()
In [104]:
df['AgeClass']=df['AgeUsingMeanTitle']*df['Pclass']
#df.head()
In [105]:
sex = {'male':1, 'female':0}
df["Male"] = df['Sex'].map(sex)
#df.head()
In [106]:
df['SexClass']=df['Male']*df['Pclass']
#df.head()
In [107]:
df['FarePerPerson']=df['Fare']/(df['FamilySize']+1)
#df.head()
In [108]:
df["AgeSquared"]=df["AgeUsingMeanTitle"]**2
#df.head()
In [109]:
df["AgeClassSquared"]=df['AgeClass']**2
#df.head()
Creating Dummy Variables
In [110]:
df.head()
Out[110]:
Ref: http://gertlowitz.blogspot.fr/2013/06/where-am-i-up-to-with-titanic-competion.html
In [111]:
df.describe()
Out[111]:
In [112]:
df.info()
AgeSquared – AgeUsingMeanTitle squared
AgeClassSquared – AgeClass squared
In [123]:
df_train_test = df[["PassengerId","Male", "Pclass","Fare","FarePerPerson","Title",
"AgeUsingMeanTitle","AgeClass","SexClass","FamilySize","AgeSquared","AgeClassSquared","Survived"]]
In [124]:
df_train_test.describe()
Out[124]:
In [125]:
df_train_test.info()
In [126]:
df_train_test.head()
Out[126]:
In [117]:
print("Num of rows in Training corpus: ", trn_corpus_size)
print("Num of rows in Testing corpus: ", tst_corpus_size)
In [127]:
df_train_test.columns
Out[127]:
In [128]:
len(df_train_test.columns)
Out[128]:
In [116]:
#Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#from sklearn.model_selection import train_test_split #Split arrays or matrices into random train and test subsets
In [132]:
trn_corpus_after_preprocessing = df_train_test.iloc[:trn_corpus_size, :].copy()  # the end of an iloc slice is exclusive, so no -1
#trn_corpus_after_preprocessing
print(len(trn_corpus_after_preprocessing["AgeUsingMeanTitle"]))
In [133]:
trn_corpus_after_preprocessing.columns
Out[133]:
In [134]:
tst_corpus_after_preprocessing = df_train_test.iloc[trn_corpus_size:,:].copy()
#tst_corpus_after_preprocessing
In [136]:
trn_corpus_after_preprocessing.to_csv("output/trn_corpus_after_preprocessing.csv", index=False, header=True)
tst_corpus_after_preprocessing.to_csv("output/tst_corpus_after_preprocessing.csv", index=False, header=True)