This notebook is a companion to the book Data Science Solutions. The notebook walks us through a typical workflow for solving data science competitions at sites like Kaggle.
There are several excellent notebooks that study data science competition entries. However, many skip some of the explanation of how the solution is developed, as these notebooks are written by experts for experts. The objective of this notebook is to follow a step-by-step workflow, explaining each step and the rationale for every decision we take during solution development.
The competition solution workflow goes through seven stages described in the Data Science Solutions book's sample chapter online here.
The workflow indicates the general sequence of how each stage may follow the other. However, there are use cases with exceptions.
Competition sites like Kaggle define the problem to solve or questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The question or problem definition for the Titanic Survival competition is described here at Kaggle.
Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, for a test dataset that does not contain the survival information, whether each passenger in the test dataset survived or not?
We may also want to develop some early understanding about the domain of our problem. This is described on the Kaggle competition description page here. Here are the highlights to note.
The data science solutions workflow solves for seven major goals.
Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.
Correlating. One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features. A short sketch of this goal appears after this list.
Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalents. So for instance, converting text categorical values to numeric values.
Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.
Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
Charting. How to select the right visualization plots and charts depending on the nature of the data and the solution goals. A good start is to read the Tableau paper on Which chart or graph is right for you?
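As a quick illustration of the correlating goal, here is a minimal sketch that lists the pairwise correlation of each numeric feature with the Survived goal. It assumes the train_df DataFrame as read in the cells below.
# Pearson correlation of each numeric feature with the Survived goal
# (assumes a pandas version that drops non-numeric columns automatically)
train_df.corr()['Survived'].sort_values(ascending=False)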
In [371]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
In [372]:
# read titanic training & test csv files as a pandas DataFrame
train_df = pd.read_csv('data/titanic-kaggle/train.csv')
test_df = pd.read_csv('data/titanic-kaggle/test.csv')
Pandas also helps describe the datasets, answering the following questions early in our project.
Which features are available in the dataset?
We note the feature names so that we can directly manipulate or analyze them. These feature names are described on the Kaggle data page here.
In [373]:
print(train_df.columns.values)
Which features are categorical?
These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.
Which features are numerical?
These values change from sample to sample. Within numerical features, are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.
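As an aside, a minimal sketch that answers both questions programmatically using pandas dtype information; the previews below answer the same questions by inspection.
# Split the columns by dtype: numeric features vs. object (string) features
print(train_df.select_dtypes(include=[np.number]).columns.values)
print(train_df.select_dtypes(exclude=[np.number]).columns.values)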
In [374]:
# preview the data
train_df.head()
Out[374]:
Which features are mixed data types?
Numerical and alphanumeric data within the same feature. These are candidates for the correcting goal.
Which features may contain errors or typos?
This is harder to review for a large dataset; however, reviewing a few samples from a smaller dataset may just tell us outright which features may require correcting.
In [375]:
train_df.tail()
Out[375]:
Which features contain blank, null or empty values?
These will require completing.
What are the data types for various features?
This helps us during the converting goal.
In [376]:
train_df.info()
print('_'*40)
test_df.info()
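As a compact alternative sketch, we can tally the missing values per feature directly, complementing the info() summaries above.
# Null count per feature, largest first
train_df.isnull().sum().sort_values(ascending=False)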
What is the distribution of numerical feature values across the samples?
This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.
In [377]:
train_df.describe(percentiles=[.25, .5, .75])
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# Sibling distribution `[.65, .7]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
Out[377]:
What is the distribution of categorical features?
In [378]:
train_df.describe(include=['O'])
Out[378]:
We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.
Completing.
Correcting.
Creating.
Correlating.
We may also add to our assumptions based on the problem description noted earlier.
Classifying.
Now we can start confirming some of our assumptions using visualizations for analyzing the data.
Let us start by understanding correlations between numerical features and our solution goal (Survived).
A histogram chart is useful for analyzing continuous numerical variables like Age, where banding or ranges will help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (Did infants have a better survival rate?).
Note that the y-axis in histogram visualizations represents the count of samples or passengers, while the x-axis shows the binned Age values.
Observations.
Decisions.
This simple analysis confirms our assumptions as decisions for subsequent workflow stages.
In [379]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Out[379]:
We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.
Observations.
Decisions.
In [380]:
grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
Now we can correlate categorical features with our solution goal.
Observations.
Decisions.
In [381]:
grid = sns.FacetGrid(train_df, col='Embarked')
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Out[381]:
We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).
Observations.
Decisions.
In [382]:
grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()
Out[382]:
We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.
This is a good starting goal to execute. By dropping features we are dealing with fewer data points, which speeds up our notebook and eases the analysis.
Based on our assumptions and decisions we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.
Note that where applicable we perform operations on both training and testing datasets together to stay consistent.
In [383]:
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
We want to analyze if the Name feature can be engineered to extract titles and test the correlation between titles and survival, before dropping the Name and PassengerId features.
In the following code we extract the Title feature using regular expressions. The RegEx pattern (\w+\.)
matches the first word which ends with a dot character within the Name feature. The expand=False
flag returns a Series rather than a DataFrame.
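As a small aside, the pattern can be verified on a couple of sample names from the training dataset.
# Illustration only: the pattern captures the first dot-terminated word in each name
names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
names.str.extract(r'(\w+\.)', expand=False)  # a Series with values 'Mr.' and 'Miss.'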
Observations.
When we plot Title, Age, and Survived, we note the following observations.
Decision.
In [384]:
train_df['Title'] = train_df.Name.str.extract(r'(\w+\.)', expand=False)
sns.barplot(hue="Survived", x="Age", y="Title", data=train_df, ci=None)
Out[384]:
Let us extract the Title feature for the test dataset as well.
Then we can safely drop the Name feature from the training and testing datasets and the PassengerId feature from the training dataset.
In [385]:
test_df['Title'] = test_df.Name.str.extract(r'(\w+\.)', expand=False)
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
test_df.describe(include=['O'])
Out[385]:
Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.
Let us start by converting Sex feature to a new feature called Gender where female=1 and male=0.
In [386]:
train_df['Gender'] = train_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.loc[:, ['Gender', 'Sex']].head()
Out[386]:
We do this both for training and test datasets.
In [387]:
test_df['Gender'] = test_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
test_df.loc[:, ['Gender', 'Sex']].head()
Out[387]:
We can now drop the Sex feature from our datasets.
In [388]:
train_df = train_df.drop(['Sex'], axis=1)
test_df = test_df.drop(['Sex'], axis=1)
train_df.head()
Out[388]:
Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.
We can consider three methods to complete a numerical continuous feature.
A simple way is to generate random numbers between the mean minus the standard deviation and the mean plus the standard deviation.
A more accurate way of guessing missing values is to use other correlated features. In our case we note a correlation among Age, Gender, and Pclass. Guess Age values using the median values for Age across sets of Pclass and Gender feature combinations. So, the median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...
Combine methods 1 and 2. So instead of guessing age values based on the median, use random numbers drawn between the mean plus or minus the standard deviation, based on sets of Pclass and Gender combinations.
Methods 1 and 3 will introduce random noise into our models, and the results from multiple executions might vary. We will prefer method 2. A groupby-based sketch of method 2 follows this list.
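As an aside, method 2 can also be expressed with a pandas groupby, which computes the same per-group medians as the explicit loops below. This is a sketch using the Gender and Pclass features created above.
# Median Age for each Gender x Pclass combination; equivalent to the guess_ages loops below
train_df.groupby(['Gender', 'Pclass'])['Age'].median()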
In [389]:
grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
Let us start by preparing an empty array to contain guessed Age values based on Pclass x Gender combinations.
In [390]:
guess_ages = np.zeros((2,3))
guess_ages
Out[390]:
Now we iterate over Gender (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six combinations.
Note that we also tried creating the AgeFill feature using method 3, and realized during the model stage that the correlation coefficient of AgeFill differs only marginally between methods 2 and 3 (see the comments below), so we retain the deterministic method 2.
In [391]:
for i in range(0, 2):
    for j in range(0, 3):
        guess_df = train_df[(train_df['Gender'] == i) & \
                            (train_df['Pclass'] == j+1)]['Age'].dropna()

        # Method 3 (commented out): correlation of AgeFill is -0.014850
        # age_mean = guess_df.mean()
        # age_std = guess_df.std()
        # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

        # Method 2: correlation of AgeFill is -0.011304
        age_guess = guess_df.median()

        # Round the guessed age to the nearest 0.5
        guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5

guess_ages
Out[391]:
In [392]:
train_df['AgeFill'] = train_df['Age']

for i in range(0, 2):
    for j in range(0, 3):
        # Fill missing Age values with the guessed value for this Gender x Pclass group
        train_df.loc[ (train_df.Age.isnull()) & (train_df.Gender == i) & (train_df.Pclass == j+1),\
                'AgeFill'] = guess_ages[i,j]

train_df[train_df['Age'].isnull()][['Gender','Pclass','Age','AgeFill']].head(10)
Out[392]:
We repeat the feature completing goal for the test dataset.
In [393]:
guess_ages = np.zeros((2,3))

for i in range(0, 2):
    for j in range(0, 3):
        guess_df = test_df[(test_df['Gender'] == i) & \
                           (test_df['Pclass'] == j+1)]['Age'].dropna()

        # Method 3 (commented out): correlation of AgeFill is -0.014850
        # age_mean = guess_df.mean()
        # age_std = guess_df.std()
        # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

        # Method 2: correlation of AgeFill is -0.011304
        age_guess = guess_df.median()

        # Round the guessed age to the nearest 0.5
        guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5

test_df['AgeFill'] = test_df['Age']

for i in range(0, 2):
    for j in range(0, 3):
        # Fill missing Age values with the guessed value for this Gender x Pclass group
        test_df.loc[ (test_df.Age.isnull()) & (test_df.Gender == i) & (test_df.Pclass == j+1),\
                'AgeFill'] = guess_ages[i,j]

test_df[test_df['Age'].isnull()][['Gender','Pclass','Age','AgeFill']].head(10)
Out[393]:
We can now drop the Age feature from our datasets.
In [394]:
train_df = train_df.drop(['Age'], axis=1)
test_df = test_df.drop(['Age'], axis=1)
train_df.head()
Out[394]:
We can create a new feature, FamilySize, which combines Parch and SibSp. This would enable us to drop Parch and SibSp from our datasets.
Note that we commented out this code as we realized during the model stage that the combined feature reduces the confidence score of our model instead of improving it. The correlation score of the separate Parch feature is also better than that of the combined FamilySize feature.
In [395]:
# Logistic Regression Score is 0.81032547699214363
# Parch correlation is -0.065878 and SibSp correlation is -0.370618
# Decision: Retain Parch and SibSp as separate features
# Logistic Regression Score is 0.80808080808080807
# FamilySize correlation is -0.233974
# train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch']
# test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']
# train_df.loc[:, ['Parch', 'SibSp', 'FamilySize']].head(10)
In [396]:
# train_df = train_df.drop(['Parch', 'SibSp'], axis=1)
# test_df = test_df.drop(['Parch', 'SibSp'], axis=1)
# train_df.head()
We can also create an artificial feature combining Pclass and AgeFill.
In [397]:
test_df['Age*Class'] = test_df.AgeFill * test_df.Pclass
train_df['Age*Class'] = train_df.AgeFill * train_df.Pclass
train_df.loc[:, ['Age*Class', 'AgeFill', 'Pclass']].head(10)
Out[397]:
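The Embarked feature has missing values in the training dataset. We complete these with the most frequent port of embarkation, which we first look up using the mode.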
In [398]:
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
Out[398]:
In [399]:
train_df['EmbarkedFill'] = train_df['Embarked']
train_df.loc[train_df['Embarked'].isnull(), 'EmbarkedFill'] = freq_port
train_df[train_df['Embarked'].isnull()][['Embarked','EmbarkedFill']].head(10)
Out[399]:
We can now drop the Embarked feature from our datasets.
In [400]:
test_df['EmbarkedFill'] = test_df['Embarked']
train_df = train_df.drop(['Embarked'], axis=1)
test_df = test_df.drop(['Embarked'], axis=1)
train_df.head()
Out[400]:
In [401]:
Ports = list(enumerate(np.unique(train_df['EmbarkedFill'])))
Ports_dict = { name : i for i, name in Ports }
train_df['Port'] = train_df.EmbarkedFill.map( lambda x: Ports_dict[x]).astype(int)
Ports = list(enumerate(np.unique(test_df['EmbarkedFill'])))
Ports_dict = { name : i for i, name in Ports }
test_df['Port'] = test_df.EmbarkedFill.map( lambda x: Ports_dict[x]).astype(int)
train_df[['EmbarkedFill', 'Port']].head(10)
Out[401]:
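Note that we built the enumeration separately for the training and test datasets. The two mappings agree only because both datasets contain the same three port values and np.unique returns them in sorted order. As an aside, a more defensive sketch (not used in this notebook) builds the dictionary once from the training dataset and reuses it.
# Build the port-to-integer mapping once and apply it to both datasets,
# so a given port always receives the same code
ports_dict = {name: i for i, name in enumerate(np.unique(train_df['EmbarkedFill']))}
train_df['Port'] = train_df['EmbarkedFill'].map(ports_dict).astype(int)
test_df['Port'] = test_df['EmbarkedFill'].map(ports_dict).astype(int)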
Similarly we can convert the Title feature to a numeric enumeration, TitleBand; the titles loosely band passengers into age groups.
In [402]:
Titles = list(enumerate(np.unique(train_df['Title'])))
Titles_dict = { name : i for i, name in Titles }
train_df['TitleBand'] = train_df.Title.map( lambda x: Titles_dict[x]).astype(int)
Titles = list(enumerate(np.unique(test_df['Title'])))
Titles_dict = { name : i for i, name in Titles }
test_df['TitleBand'] = test_df.Title.map( lambda x: Titles_dict[x]).astype(int)
train_df[['Title', 'TitleBand']].head(10)
Out[402]:
Now we can safely drop the EmbarkedFill and Title features. With this we now have a dataset that only contains numerical values, a requirement for the model stage in our workflow.
In [403]:
train_df = train_df.drop(['EmbarkedFill', 'Title'], axis=1)
test_df = test_df.drop(['EmbarkedFill', 'Title'], axis=1)
train_df.head()
Out[403]:
We can now complete the Fare feature for the single missing value in the test dataset, using the median value for this feature. We do this in a single line of code.
Note that we are not creating an intermediate new feature or doing any further correlation analysis to guess the missing value, as we are replacing only a single value. The completion goal achieves the desired requirement for the model algorithm to operate on non-null values.
We may also want to round off the fare to two decimals, as it represents currency.
In [404]:
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
train_df['Fare'] = train_df['Fare'].round(2)
test_df['Fare'] = test_df['Fare'].round(2)
test_df.head(10)
Out[404]:
Now we are ready to train a model and predict the required solution. There are 60+ predictive modeling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem. We want to identify the relationship between the output (Survived or not) and other variables or features (Gender, Age, Port...). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria, Supervised Learning plus Classification and Regression, we can narrow down our choice of models to a few. These include Logistic Regression, KNN or k-Nearest Neighbors, Support Vector Machines, Naive Bayes classifier, Perceptron, Stochastic Gradient Descent, Decision Tree, and Random Forest.
In [405]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Out[405]:
Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.
Note the confidence score generated by the model based on our training dataset.
In [406]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[406]:
We can use Logistic Regression to validate our assumptions and decisions for the feature creating and completing goals. This can be done by calculating the coefficient the model assigns to each feature as it relates to Survived; positive coefficients increase the log-odds of survival and negative coefficients decrease it.
In [407]:
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
Out[407]:
Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference Wikipedia.
Note that the model generates a confidence score which is higher than the Logistic Regression model.
In [408]:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Out[408]:
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference Wikipedia.
KNN confidence score is better than Logistic Regression but worse than SVM.
In [409]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
Out[409]:
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference Wikipedia.
The model generated confidence score is the lowest among the models evaluated so far.
In [410]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
Out[410]:
The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference Wikipedia.
In [411]:
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
Out[411]:
In [412]:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
Out[412]:
In [413]:
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
Out[413]:
This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.
The model confidence score is the highest among models evaluated so far.
In [414]:
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[414]:
The next model, Random Forests, is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia.
The model confidence score is the highest among models evaluated so far. We decide to use this model's output (Y_pred) for creating our competition submission of results.
In [415]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[415]:
In [416]:
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Descent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
Out[416]:
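Note that all of the scores above are computed on the training dataset, so models that can memorize training samples, such as Decision Tree and Random Forest, will rank near the top even if they generalize less well. As an aside, a minimal sketch using k-fold cross-validation (assuming scikit-learn's model_selection module is available) gives a less optimistic estimate.
from sklearn.model_selection import cross_val_score

# Mean accuracy across 5 folds of the training data; a fairer basis
# for comparing models than scoring on the data they were fit on
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
cv_scores.mean()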
In [417]:
submission = pd.DataFrame({
"PassengerId": test_df["PassengerId"],
"Survived": Y_pred
})
submission.to_csv('data/titanic-kaggle/submission.csv', index=False)
Our submission to the competition site Kaggle ranks 3,883 out of 6,082 competition entries. This result is indicative while the competition is running and only accounts for part of the submission dataset. Not bad for our first attempt. Any suggestions to improve our score are most welcome.