Titanic: Machine Learning

Exploring Data


In [46]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# TODO (done): don't hard-code the path.
# Either store the path in a variable or keep the input files in the notebook's folder.
# The second approach is used here.

train_df = pd.read_csv('train.csv')  # training data as a pandas DataFrame
test_df  = pd.read_csv('test.csv')   # test data as a pandas DataFrame

full_df = [train_df, test_df]        # plain Python list holding both DataFrames (train, then test)
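
The other option mentioned in the TODO, keeping the folder in a variable, could look roughly like the sketch below. `DATA_DIR` and `load_titanic` are hypothetical names introduced only for illustration; the notebook itself reads the files from the current folder.

```python
from pathlib import Path

import pandas as pd

# DATA_DIR is a hypothetical name; point it at the folder holding the Kaggle CSVs.
DATA_DIR = Path('.')


def load_titanic(data_dir=DATA_DIR):
    """Read train.csv and test.csv from data_dir and return both DataFrames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / 'train.csv')
    test = pd.read_csv(data_dir / 'test.csv')
    return train, test
```

With the files in the notebook's folder, `train_df, test_df = load_titanic()` would give the same frames as the cell above.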

In [47]:
# first 10 records
train_df.head(10)


Out[47]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

In [48]:
# last 10 records
train_df.tail(10)


Out[48]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

In [49]:
# <Raghu> Since we are just starting, let's go with the mean for now.
# The example linked below instead uses a formula based on the mean and std dev of the training data for Age.
# https://www.kaggle.io/svf/560373/fcf6c03312081da830b2ab2cb26b4a1a/__results__.html#6.-Age
# We can try that later to improve how well the model learns and predicts. </Raghu>

print(train_df['Age'].value_counts(dropna=False))  # How should we handle these? Drop NaNs? Replace with mean?


NaN       177
 24.00     30
 22.00     27
 18.00     26
 28.00     25
 19.00     25
 30.00     25
 21.00     24
 25.00     23
 36.00     22
 29.00     20
 32.00     18
 26.00     18
 35.00     18
 27.00     18
 16.00     17
 31.00     17
 34.00     15
 23.00     15
 33.00     15
 20.00     15
 39.00     14
 17.00     13
 42.00     13
 40.00     13
 45.00     12
 38.00     11
 50.00     10
 2.00      10
 4.00      10
         ... 
 28.50      2
 63.00      2
 0.83       2
 30.50      2
 70.00      2
 57.00      2
 0.75       2
 13.00      2
 59.00      2
 10.00      2
 64.00      2
 40.50      2
 45.50      2
 32.50      2
 20.50      1
 24.50      1
 0.67       1
 70.50      1
 0.92       1
 74.00      1
 34.50      1
 14.50      1
 80.00      1
 12.00      1
 53.00      1
 36.50      1
 55.50      1
 66.00      1
 23.50      1
 0.42       1
Name: Age, Length: 89, dtype: int64

In [50]:
train_df.describe()  # About 38% of passengers survived; mean age is 29.70.


Out[50]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [51]:
train_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In [52]:
train_df['Pclass'].plot(kind = 'hist', rot=0, logx=True, logy=True) 
# Majority are 3rd class, a smaller portion 1st, an even smaller portion 2nd


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c229c22d0>
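
A histogram with log axes makes three class counts hard to read off. As a cross-check, the counts can be tabulated directly; the sketch below uses a small made-up Series in place of `train_df['Pclass']`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df['Pclass']; not the real data.
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 3])

counts = pclass.value_counts().sort_index()                 # passengers per class
shares = pclass.value_counts(normalize=True).sort_index()   # same, as fractions of the total

print(counts)
print(shares)
```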

In [53]:
train_df.plot(kind='scatter', x='Age', y='Fare') #Outliers?


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c22350f10>

In [54]:
sns.lmplot(x='Age', y='Fare', data=train_df, hue='Pclass') 
#Probably not outliers/errors in data, likely just very expensive tickets since they are 1st class passengers?


Out[54]:
<seaborn.axisgrid.FacetGrid at 0x7f2c22cff110>

In [55]:
sns.residplot(x='Age', y='Fare', data=train_df, dropna=True)


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c22163510>

In [56]:
sns.boxplot(x='Pclass', y='Fare', data=train_df)


Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c220b8890>
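
The points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. A minimal sketch of that rule on made-up fares (not the real column) shows how the same values could be flagged numerically, assuming the default whisker length:

```python
import pandas as pd

# Toy fares for illustration; not taken from train_df.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33])

q1, q3 = fares.quantile([0.25, 0.75])   # first and third quartiles
iqr = q3 - q1                           # interquartile range
upper_fence = q3 + 1.5 * iqr            # values above this are drawn as outlier points
flagged = fares[fares > upper_fence]

print(flagged)
```

Applied per class (for example after a `groupby('Pclass')`), this would help separate genuinely unusual fares from fares that are merely high because the ticket is 1st class.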

Cleaning data

Replacing NaN values of Age


In [57]:
# <Rachel> I went to the link provided above for this calculation, but was confused by the warning message.
# I found a similar method on GitHub that shows the distribution before and after the random values are generated. </Rachel>

fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

# plot original Age values (drop null values and convert to int)
train_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

# get average, std and number of NaN values
average_age = train_df["Age"].mean()
std_age = train_df["Age"].std()
count_nan_age = train_df["Age"].isnull().sum()

# generate random integers in [mean - std, mean + std); bounds are cast to int explicitly, upper bound is excluded
rand_age = np.random.randint(int(average_age - std_age), int(average_age + std_age), size=count_nan_age)

# fill NaN values in Age column with random values generated
age_slice = train_df["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age

# plot imputed Age values
age_slice.astype(int).hist(bins=70, ax=axis2)


Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c21f7d350>
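
Because the fill values are random, re-running the cell gives a different Age column each time. One way to make the step repeatable is to wrap it in a function with a seeded generator. The sketch below is an assumption-laden variant, not the notebook's code: `impute_age` and `seed` are hypothetical names, and it uses `np.random.default_rng` rather than `np.random.randint`.

```python
import numpy as np
import pandas as pd


def impute_age(ages, seed=0):
    """Fill NaNs with random integers drawn from [mean - std, mean + std)."""
    rng = np.random.default_rng(seed)          # seeded, so results are repeatable
    low = int(ages.mean() - ages.std())
    high = int(ages.mean() + ages.std())
    filled = ages.copy()                       # leave the caller's Series untouched
    missing = filled.isnull()
    filled[missing] = rng.integers(low, high, size=missing.sum())
    return filled


# Toy ages for illustration; not taken from train_df.
toy_ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 54.0, 2.0])
filled_ages = impute_age(toy_ages)
print(filled_ages)
```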

In [58]:
# Distribution looks good - replace Age vector in original data with new values
train_df["Age"] = age_slice

# Show number of missing Age values
train_df["Age"].isnull().sum()


Out[58]:
0

Replacing NaN values of Embarked


In [59]:
# Fill missing values with most common value
train_df['Embarked'] = train_df['Embarked'].fillna('S')

# Show number of missing values
train_df['Embarked'].isnull().sum()


Out[59]:
0
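
The cell above hard-codes 'S' as the most common port. A small sketch of how that value could be derived from the data instead, shown on a made-up Series rather than the real column:

```python
import pandas as pd

# Toy stand-in for train_df['Embarked']; not the real data.
embarked = pd.Series(['S', 'C', None, 'S', 'Q', 'S', None])

most_common = embarked.mode()[0]               # mode() skips missing values by default
embarked_filled = embarked.fillna(most_common)

print(most_common)
print(embarked_filled.isnull().sum())
```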

Replacing missing Fare values


In [60]:
train_df['Fare'] = train_df['Fare'].apply(lambda x: x if x > 0 else np.nan) # Replaced zeros with NaNs (pd.np is gone from recent pandas)
train_df['Fare'].isnull().sum()                                             # Checked to make sure they are now recognized as null


Out[60]:
15

In [61]:
m = train_df.groupby('Pclass')['Fare'].mean() # Calculated mean Fare for each class (selecting Fare first avoids averaging text columns)
m


Out[61]:
Pclass
1    86.148874
2    21.358661
3    13.787875
Name: Fare, dtype: float64

In [62]:
train_df['Fare'] = train_df.apply(lambda row: m[row['Pclass']]    # Replaced NaNs for Fare with the mean value for each class
                                       if pd.isnull(row['Fare'])
                                       else row['Fare'],
                           axis=1) 
train_df['Fare'].isnull().sum()                                   # Checked to make sure there are no longer missing values


Out[62]:
0
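
The three Fare cells (zeros to NaN, per-class mean, row-wise fill) can also be written without `apply`. The following is a sketch on a made-up frame, assuming the same rule as above: non-positive fares count as missing and are replaced by the mean fare of the passenger's class.

```python
import pandas as pd

# Toy frame for illustration; not the real data.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare':   [80.0, 0.0, 100.0, 8.0, 0.0, 10.0],
})

# where() keeps values where the condition holds and puts NaN elsewhere.
toy['Fare'] = toy['Fare'].where(toy['Fare'] > 0)

# transform('mean') returns the class mean for every row, aligned to the original index.
class_mean = toy.groupby('Pclass')['Fare'].transform('mean')
toy['Fare'] = toy['Fare'].fillna(class_mean)

print(toy)
```

`transform` is used instead of a lookup table so that the fill values line up with the rows directly and no row-wise lambda is needed.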

Convert Embarked & Sex to numerical categories


In [63]:
# Transform Embarked (S = 0, C = 1, Q = 2)
train_df['Embarked'] = train_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Transform Sex (male = 0, female = 1)
train_df['Sex'] = train_df['Sex'].map({'female': 1, 'male': 0}).astype(int)
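
One caveat with `map`: any label missing from the dictionary becomes NaN, and the `.astype(int)` that follows then raises an error. A quick check before casting, sketched here on made-up values with a deliberate typo, can show which labels would be affected:

```python
import pandas as pd

# Toy values; 'Male' is a deliberate typo to trigger the problem.
sex = pd.Series(['male', 'female', 'female', 'Male'])

codes = sex.map({'male': 0, 'female': 1})   # labels not in the dict become NaN
unmapped = sex[codes.isnull()].unique()     # original labels that failed to map

print(unmapped)
```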

Specify variables for prediction model


In [64]:
# With the target variable "Survived", I would recommend starting with Sex, Age, Pclass and Fare as our predictors...
# Let me know what you think!

# <Raghu> Yes, let's start with these. We have to engineer the data (fill NaNs) and clean it
# to reduce overfitting. </Raghu>
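
A minimal sketch of how the proposed predictors and target could be split out, assuming the column names above. The frame is made up for illustration, and `predictors`, `X` and `y` are hypothetical names, not ones the notebook defines.

```python
import pandas as pd

# Toy frame with already-cleaned columns; not the real data.
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass':   [3, 1, 3],
    'Sex':      [0, 1, 1],
    'Age':      [22.0, 38.0, 26.0],
    'Fare':     [7.25, 71.28, 7.93],
    'Name':     ['A', 'B', 'C'],
})

predictors = ['Sex', 'Age', 'Pclass', 'Fare']
X = toy[predictors]            # feature matrix
y = toy['Survived']            # target vector

# Most models cannot handle NaNs, so confirm the cleaning steps left none behind.
has_missing = X.isnull().any().any()
print(X.shape, has_missing)
```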