Based on Crboerts.
Before attempting to train a model, we need to examine the data. We divide this work into the following sections:
We use the following tools:
When you want to know more about a function, you can use the built-in documentation by placing the cursor on the function and pressing Shift + Tab.
Tab will auto-complete properties, functions and methods.
Shift-Tab will display documentation for the property, function or method under the caret.
Control-Enter will run the currently selected cell.
Alt-Enter will run the current cell and start a new cell below it.
In [132]:
# pandas
import pandas as pd
from pandas import Series,DataFrame
# Used for pretty print DataFrames
from IPython.display import display
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.mlab as mlab
from scipy import stats
from scipy.stats import norm
%matplotlib inline
# preprocessing
from sklearn.preprocessing import StandardScaler
# machine learning
# Cross-validation utilities are in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# Fix warnings
import warnings
warnings.filterwarnings('ignore')
In [38]:
# Set a seed to ensure that we get repeatable results
np.random.seed(1)
# Load the data
df_train = pd.read_csv('/data/Titanic_training_data.csv')
What kind of data have we loaded? Which features are numerical and which are categorical? Do we have a lot of missing data, and if so, which features are missing data?
.info() is a very useful function that gives an overview of the data and, to some extent, helps us answer some of these questions.
In [61]:
print(df_train.info())
Most of the features are complete (not missing any data), as they have as many non-null values as there are data points, but we are missing 2 values in Embarked, some values in Age, and a lot of values in Cabin.
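To see exactly how many values are missing in each column, one quick way (a small sketch using standard pandas calls) is to count the null entries directly:
In [ ]:
# Number of missing values per column
print(df_train.isnull().sum())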
Pandas has already assigned a data type to each feature when it loaded the data. About half of the features are numbers while the rest are objects (i.e. strings). This doesn't mean that half are numerical features, as some categorical features may be ordinal (have a natural order) and are therefore represented by numbers, e.g. Pclass.
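If we want the assigned data types at a glance, or want to split the columns into numerical and object (string) columns, a short sketch like the following can help:
In [ ]:
# Data type pandas assigned to each column
print(df_train.dtypes)
# Columns treated as numbers vs. as objects (strings)
print(df_train.select_dtypes(include=[np.number]).columns.tolist())
print(df_train.select_dtypes(include=['object']).columns.tolist())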
In [62]:
display(df_train.head(10))
Just by looking at the first 10 examples, we can already start to filter out features that will most likely not be useful for our use case: PassengerId, Name and Ticket. We will also remove the Cabin feature, as it is missing for too many examples.
Another thing to note is that Sex and Embarked are strings, which will not work very well with our algorithms later on, so we need to keep this in mind.
In [63]:
df_dropped = df_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
df_dropped.describe()
Out[63]:
From the .describe() call, we get a table with summary statistics for the numerical columns in the data frame. This table gives us a lot of information about the distribution of the data.
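.describe() only summarises the numerical columns by default; if we also want an overview of the categorical (object) columns, we can ask for them explicitly (a small sketch):
In [ ]:
# Count, number of unique values, most frequent value and its frequency
# for the object (string) columns
display(df_dropped.describe(include=['O']))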
In [64]:
for col in ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']:
    display(df_dropped[[col, 'Survived']].groupby([col], as_index=False).mean().sort_values(by='Survived', ascending=False))
From these tables, we can see how the average survival rate differs between the categories of each feature.
Next, we look at the numerical features. This is done by splitting the data into two groups depending on survival, and then drawing a histogram of each numerical feature for each group. The histogram tells us how often a certain value occurs in our data set, so the main thing we want to look for is whether a feature shows a clear difference between the two histograms.
In [65]:
# Plot a normalized histogram of each numerical feature, split by survival
for col in ['Age', 'Fare']:
    plot = sns.FacetGrid(df_dropped, col='Survived')
    plot.map(plt.hist, col, bins=20, density=True)
From these plots, we see that age didn't make that much of a difference, except for infants, who had a much higher relative survival rate. Fare shows a somewhat clearer difference between the two groups, where paying a higher fare is associated with a higher survival rate.
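To put a rough number on the claim about infants, we can compare the survival rate of very young children with the overall survival rate; a minimal sketch, where the age cutoff of 5 years is an arbitrary choice:
In [ ]:
# Survival rate for children younger than 5 vs. the overall survival rate
print(df_dropped[df_dropped['Age'] < 5]['Survived'].mean())
print(df_dropped['Survived'].mean())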
In [93]:
df_wrangling = df_dropped.drop(['Fare'], axis=1)
Now we'll try to fix the problem of missing values in some features. This can be done either by filling in the values in some way or by throwing away every example that has a missing value. In our case, Embarked is missing 2 values, while Age is missing considerably more. Because Embarked is only missing 2 values, we'll fill it using the most common value. As for the Age feature, we suspect that age differs depending on the gender and class of the passenger, so we will calculate the average age in each such group and use that to fill in the missing Age values. To make things easier for the algorithms, we also convert the categorical string values of Sex and Embarked into ordinal numerical values.
In [94]:
print(df_wrangling['Embarked'].value_counts())
df_fill = df_wrangling
df_fill['Embarked'] = df_fill['Embarked'].fillna('S')
# Convert into numerical values
df_fill['Sex'] = df_fill['Sex'].map( {'female':1, 'male':0}).astype(int)
df_fill['Embarked'] = df_fill['Embarked'].map( {'S':0, 'C': 1, 'Q': 2}).astype(int)
In [102]:
age_mean = np.zeros((2, 3))
df_filled = df_fill
# Calculate the average age of each (Sex, Pclass) group;
# note that Pclass takes the values 1-3, hence the j + 1
for i in range(0, 2):
    for j in range(0, 3):
        age_mean[i, j] = df_filled[(df_filled["Sex"] == i) & (df_filled["Pclass"] == j + 1)]["Age"].dropna().mean()
# Fill in the calculated average age for each data point that is missing Age
for i in range(0, 2):
    for j in range(0, 3):
        df_filled.loc[(df_filled['Age'].isnull()) & (df_filled['Sex'] == i) & (df_filled['Pclass'] == j + 1), 'Age'] = age_mean[i, j]
# Reformat into int data type
df_filled['Age'] = df_filled['Age'].astype(int)
df_filled.head(10)
Out[102]:
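Before moving on, it is worth verifying that the filling worked; a quick check that should now report zero missing values for every remaining column:
In [ ]:
# Remaining missing values per column (expected to be zero everywhere)
print(df_filled.isnull().sum())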
After having filled the missing values, we will now see if we can construct new features from the existing ones that may help when we train our model. Just as with everything else, a big part of the process is trial and error, so the features we create may not necessarily improve our model.
We will show you one way of creating new features, but there are many other ways to do this, limited only by your creativity. In this case, we will combine the features SibSp and Parch into a new feature, FamSize, because for both of them lower values correlated with a higher survival rate. We will threshold this new feature so that passengers with a combined family size lower than 4 get the value 0 and the rest get the value 1.
In [128]:
df_feat = df_filled.drop(['SibSp', 'Parch'], axis=1)
temp = df_filled['SibSp'] + df_filled['Parch']
temp[temp <= 3] = 0
temp[temp > 3] = 1
df_feat['FamSize'] = temp
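As a quick check of whether the new feature behaves as expected, we can look at the average survival rate per FamSize group, just as we did for the other categorical features earlier (a small sketch):
In [ ]:
# Average survival rate for small (0) vs. large (1) families
display(df_feat[['FamSize', 'Survived']].groupby(['FamSize'], as_index=False).mean())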
In [129]:
# User defined function to plot confusion matrices
def plot_confusion_matrix(cm, nbr_data, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, ["Died", "Survived"], rotation=45)
    plt.yticks(tick_marks, ["Died", "Survived"])
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [136]:
y = df_feat["Survived"].values
X = df_feat.drop(['Survived'], axis=1).values
# Reproducibility
rng = np.random.RandomState(1234)
nbr_folds = 10
#classifier = RandomForestClassifier(n_estimators=100, random_state=rng)
classifier = LogisticRegression()
results = []
predictions = []
confusion_matrices = []
# KFold from sklearn.model_selection takes the number of splits;
# shuffle=True is needed for random_state to have an effect
kf = KFold(n_splits=nbr_folds, shuffle=True, random_state=rng)
for train, test in kf.split(X):
    # Train on the training folds and evaluate on the held-out fold
    classifier.fit(X[train], y[train])
    y_pred = classifier.predict(X[test])
    results.append(np.mean(y_pred == y[test]))
    predictions.append(y_pred == y[test])
    # Confusion matrix normalized per true class
    cm = confusion_matrix(y[test], y_pred)
    cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
    confusion_matrices.append(cm)
# Average the confusion matrices over all folds
cms = np.zeros(confusion_matrices[0].shape)
for cm in confusion_matrices:
    cms = cms + cm / len(confusion_matrices)
# Plot the accuracy of each fold and print the mean accuracy
plt.plot(np.arange(1, nbr_folds + 1), np.array(results))
print(np.array(results).mean())
plt.figure()
plot_confusion_matrix(cms, y.size)
print(cms)
We're finally at the end of this example, but that doesn't mean that we're done with the data set and the machine learning work. Although we have a model that can predict survival, the model validation shows that it is still some way from perfect. We have only just started analysing and processing the data, and we have chosen only a few features to use when training our logistic regression model. From here, there are several paths that we can take to improve the performance of our model. We could, for example, try a different classifier (such as the random forest we left commented out), engineer additional features from the columns we dropped (Name, Cabin, Fare), or tune the model's hyperparameters.
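As one concrete example of such a next step, here is a minimal sketch of how we could swap in the random forest that was left commented out above and compare its cross-validated accuracy with the logistic regression baseline; cross_val_score is imported here for convenience, and the exact scores will depend on the data and the random seed:
In [ ]:
from sklearn.model_selection import cross_val_score
# Compare the logistic regression baseline with a random forest,
# using the same number of folds as above
for name, clf in [('Logistic regression', LogisticRegression()),
                  ('Random forest', RandomForestClassifier(n_estimators=100, random_state=1234))]:
    scores = cross_val_score(clf, X, y, cv=nbr_folds)
    print(name, scores.mean())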