This notebook explores a dataset containing information about the passengers of the Titanic. The dataset can be downloaded from Kaggle.
How to follow along:
git clone https://github.com/Dataweekends/odsc_intro_to_data_science
cd odsc_intro_to_data_science
ipython notebook
We start by importing the necessary libraries:
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the CSV file into memory using Pandas
In [ ]:
df = pd.read_csv('titanic-train.csv')
What's the content of df?
In [ ]:
df.head(3)
Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)
In [ ]:
df.info()
Is Pclass a continuous or a discrete attribute?
In [ ]:
df['Pclass'].value_counts()
What about these: SibSp and Parch?
In [ ]:
df['SibSp'].value_counts()
In [ ]:
df['Parch'].value_counts()
And what about these: Ticket, Fare, Cabin, and Embarked?
In [ ]:
df[['Ticket', 'Fare', 'Cabin']].head(3)
In [ ]:
df['Embarked'].value_counts()
Ah, yes... Survival!
In [ ]:
df['Survived'].value_counts()
Check if any values are missing
In [ ]:
df.info()
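df.info() lists the non-null count for each column; a more direct way to tally the missing values per column is a one-liner. A small sketch, using only the pandas functions already imported above:
In [ ]:
# Count missing (NaN) values in each column, most incomplete first
df.isnull().sum().sort_values(ascending=False)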
From the output we can group the attributes:
1) Survived: the target we want to predict
2) PassengerId: an identifier
3) Pclass, Sex, Embarked: categorical
4) Age, SibSp, Parch, Fare: numerical
5) Name, Ticket, Cabin: text
Age is only available for 714 passengers
Cabin is only available for 204 passengers
Embarked is missing for 2 passengers
Plot the distribution of Age
In [ ]:
# Histogram of Age, with the median marked by a vertical red line
df['Age'].plot(kind='hist', figsize=(10, 6))
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
median_age = df['Age'].median()
plt.axvline(median_age, color='r')
median_age
Impute the missing values for Age using the median age
In [ ]:
# Replace missing ages with the median; assigning back to the column
# avoids pandas' chained-assignment pitfalls with inplace=True
df['Age'] = df['Age'].fillna(median_age)
df.info()
Check the influence of Age on Survival
In [ ]:
# Overlaid histograms of Age for survivors (green) and non-survivors (red)
df[df['Survived'] == 1]['Age'].plot(kind='hist', bins=10, range=(0, 100), figsize=(10, 6), alpha=0.3, color='g')
df[df['Survived'] == 0]['Age'].plot(kind='hist', bins=10, range=(0, 100), figsize=(10, 6), alpha=0.3, color='r')
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
plt.legend(['Survived', 'Dead'])
plt.show()
Check the influence of Sex on Survival
In [ ]:
# Count passengers by gender and survival outcome;
# values='PassengerId' is only used to count rows per group
survival_by_gender = df.pivot_table(index='Sex', columns='Survived', values='PassengerId', aggfunc='count')
survival_by_gender
In [ ]:
survival_by_gender.plot(kind='bar', stacked=True)
plt.show()
Check the influence of Pclass on Survival
In [ ]:
# Count passengers by ticket class and survival outcome
survival_by_Pclass = df.pivot_table(index='Pclass', columns='Survived', values='PassengerId', aggfunc='count')
survival_by_Pclass
In [ ]:
survival_by_Pclass.plot(kind='bar', stacked=True)
plt.show()
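Raw counts can mislead when the classes have different sizes; normalizing each row turns the counts into survival rates. A minimal sketch (the rate table is an addition, not part of the original notebook):
In [ ]:
# Divide each row by its total to get the survival rate per class
survival_by_Pclass.div(survival_by_Pclass.sum(axis=1), axis=0)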
OK, so Age, Sex, and Pclass seem to have some influence on the survival rate.
Let's build a simple model to test that.
Define a new feature called "Male" that is 1 if Sex = 'male' and 0 otherwise
In [ ]:
df['Male'] = df['Sex'].map({'male': 1, 'female': 0})
df[['Sex', 'Male']].head()
Define the simplest model as a benchmark
The simplest model predicts 0 for everybody, i.e. no survival.
How good is it?
In [ ]:
actual_dead = len(df[df['Survived'] == 0])
total_passengers = len(df)
ratio_of_dead = actual_dead / total_passengers
print("If I predict everybody dies, I'm correct %0.1f%% of the time" % (100 * ratio_of_dead))
df['Survived'].value_counts()
We need to do better than that
Define features (X) and target (y) variables
In [ ]:
X = df[['Male', 'Pclass', 'Age']]
y = df['Survived']
Initialize a decision tree model
In [ ]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
model
Split the features and the target into Train and Test subsets, with an 80/20 ratio.
In [ ]:
# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
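As a quick sanity check (added for illustration), the shapes should reflect the 80/20 ratio:
In [ ]:
# Expect roughly 80% of the rows in train and 20% in test
X_train.shape, X_test.shape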
Train the model
In [ ]:
model.fit(X_train, y_train)
Calculate the model score
In [ ]:
my_score = model.score(X_test, y_test)
print("Classification Score: %0.2f" % my_score)
Print the confusion matrix for the decision tree model
In [ ]:
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
print("\n=======confusion matrix==========")
print(confusion_matrix(y_test, y_pred))
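The confusion matrix can be summarized into per-class precision and recall; a short sketch with scikit-learn's classification_report (added for illustration):
In [ ]:
# Precision, recall and F1 for each class (0 = dead, 1 = survived)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))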
Now you have a basic pipeline. How can you improve the score? Try:
- adding new features: could you add a feature for family? Could you use Embarked or others as dummies? Check the get_dummies function here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html (see the sketch after this list)
- changing the parameters of the model: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- changing the model itself: check examples here: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
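As a starting point for the first idea, here is a hedged sketch of a family-size feature (the name FamilySize is my choice, not from the notebook) and dummy variables for Embarked via get_dummies:
In [ ]:
# FamilySize: siblings/spouses + parents/children + the passenger themselves
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# One-hot encode the port of embarkation (Embarked_C, Embarked_Q, Embarked_S)
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
X = pd.concat([df[['Male', 'Pclass', 'Age', 'FamilySize']], embarked_dummies], axis=1)
X.head()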
Let's have a small competition....