In the previous section, we ended up with a smaller set of predictions because we chose to throw away rows with missing values. In this section we build on that approach by filling in the missing data with an educated guess. We will only provide detailed descriptions of the new concepts introduced.
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../data/train.csv')
In [2]:
# drop the columns we will not use as features
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
As in the previous section, we review the data types and non-null counts.
In [3]:
df.info()
There are a number of ways we could fill in the NaN values of the column Age. For simplicity, we'll use the mean of the column's existing values. We'll revisit whether the median would be a better choice in a later section.
In [4]:
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
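As a quick aside, it's worth comparing the two candidate fill values, since a large gap between them indicates a skewed distribution, where the median is the more robust choice. A minimal sketch, run against a fresh copy of the raw column because df['Age'] has already been filled at this point:

# compare candidate fill values on the raw data; a large gap between
# mean and median suggests skew, where the median is more robust
raw_age = pd.read_csv('../data/train.csv')['Age']
print('mean:  ', raw_age.mean())
print('median:', raw_age.median())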
Taking the average does not make sense for the column Embarked, as it is a categorical value. Instead, we replace the NaN values with the mode, or most frequently occurring value.
In [5]:
# pandas' built-in mode() returns the most frequent value(s) as a Series
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)
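As a quick check that the replacement took effect, counting the remaining NaN values in the column should give zero:

# count remaining NaNs in Embarked; 0 confirms the fill worked
df['Embarked'].isnull().sum()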
In [6]:
# encode the categorical columns as integers for scikit-learn
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
df['Port'] = df['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)
df = df.drop(['Sex', 'Embarked'], axis=1)

# move the label column Survived to the front, ahead of PassengerId
cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]
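A quick look at the result confirms that Survived now leads, followed by PassengerId and then the feature columns:

# confirm the new column order: label first, then id, then features
df.columns.tolist()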
We now review details of our training data.
In [7]:
df.info()
Hence we have preserved all the rows of our data set, and proceed to create a numerical array for scikit-learn.
In [8]:
train_data = df.values
In [9]:
from sklearn.ensemble import RandomForestClassifier

# column 0 is the label (Survived); column 1 (PassengerId) is skipped
model = RandomForestClassifier(n_estimators=100)
model = model.fit(train_data[:, 2:], train_data[:, 0])
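Before predicting on the test set, it can be reassuring to estimate how well the model generalises. A minimal sketch using cross-validation, assuming a recent scikit-learn where cross_val_score lives in sklearn.model_selection:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy on the training data;
# a rough estimate of generalisation, not a leaderboard score
scores = cross_val_score(model, train_data[:, 2:], train_data[:, 0], cv=5)
print(scores.mean())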
In [10]:
df_test = pd.read_csv('../data/test.csv')
We now review what needs to be cleaned in the test data.
In [11]:
df_test.info()
In [12]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
As per our previous approach, we fill in the NaN values in the column Age with the mean computed from the training data.
In [13]:
df_test['Age'] = df_test['Age'].fillna(age_mean)
For the column Fare, however, it makes sense to fill in the NaN values with the mean fare of the corresponding Pclass, or passenger class.
In [14]:
# mean fare per passenger class, as a Series indexed by Pclass
fare_means = df.pivot_table(values='Fare', index='Pclass', aggfunc='mean')['Fare']
In [15]:
fare_means
Out[15]:
Here we created a pivot table of the mean of the column Fare for each Pclass, which we will use to fill in our NaN values.
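Incidentally, the same Series can be obtained with a groupby, which is arguably more direct for a single aggregation:

# equivalent to the pivot table above: mean Fare per Pclass
df.groupby('Pclass')['Fare'].mean()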
In [16]:
df_test['Fare'] = df_test[['Fare', 'Pclass']].apply(lambda x:
fare_means[x['Pclass']] if pd.isnull(x['Fare'])
else x['Fare'], axis=1)
This is one of the more complicated lines of code we'll encounter, so let's unpack this.
First, we look at each of the pairs (Fare, Pclass) (i.e. lambda x). From this pair, we check if the Fare part is NaN (i.e. if pd.isnull(x['Fare'])). If Fare is NaN, we look at the Pclass value of that pair (i.e. x['Pclass']), and replace the NaN value with the mean fare of that class (i.e. fare_means[x['Pclass']]). If Fare is not NaN, then we keep it the same (i.e. else x['Fare']).
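The same replacement can be written without apply, by mapping each passenger's Pclass to its mean fare and letting fillna use the aligned result; a sketch of the equivalent one-liner:

# vectorised equivalent: map Pclass -> mean fare, then fill only the NaNs
df_test['Fare'] = df_test['Fare'].fillna(df_test['Pclass'].map(fare_means))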
In [17]:
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
# the integer cast matches the training set; the test set has no
# missing Embarked values, so the cast is safe here too
df_test['Port'] = df_test['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)
df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

# column 0 is PassengerId; the features start at column 1
output = model.predict(test_data[:, 1:])
In [18]:
# pair each PassengerId with its predicted Survived value
result = np.c_[test_data[:, 0].astype(int), output.astype(int)]

df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-1.csv', index=False)
We can now confirm that our submission has 418 predictions, one for each row of the test data, and proceed to make our first leaderboard entry.
In [19]:
df_result.shape
Out[19]:
(418, 2)
Congratulations on making your first Kaggle submission!!