Introduction

This is the second installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate logistic regression using the Titanic competition.

Outline

  1. Import and examine the data
  2. Construct a logistic model to predict survival
  3. Optimize model parameters by maximum likelihood
  4. Evaluate model results
  5. Submit results to the Kaggle competition

Import Necessary Modules


In [1188]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import statsmodels.api as sm
import code.Linear_Regression_Funcs as LRF
import code.Logistic_Regression_Funcs as LGF

In [1189]:
reload(LGF)   # pick up any edits to code/Logistic_Regression_Funcs.py (Python 2 built-in reload)


Out[1189]:
<module 'code.Logistic_Regression_Funcs' from 'code/Logistic_Regression_Funcs.pyc'>

1. Read Titanic Data


In [1190]:
train = pd.read_csv("./data/titanic/train.csv", index_col="PassengerId")
train.head()


Out[1190]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

In [1191]:
# Fill embarkation location NaN with a string
#train.Embarked = train.Embarked.fillna('nan')

# Create name category from titles in the name column
#train = LGF.nametitles(train)

Some Exploratory Analysis using Pandas


In [1192]:
#temp = pd.crosstab([train.Pclass, train.Sex],train.Survived.astype(bool))
#temp

In [1193]:
#sb.set(style="white")
#sb.factorplot('Pclass','Survived','Sex',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Pclass',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Sex',data=train,palette="muted")
#fg = sb.FacetGrid(train,hue="Pclass",aspect=3,palette="muted")
#fg.map(sb.kdeplot,"Age",bw=4,shade=True,legend=True)
#fg.set(xlim=(0,80))

In [1194]:
## Transform categorical variables into numeric indicators (For examination only)
#temp = LGF.cat2indicator(train, ['Embarked','Pclass','Sex'])  # Embarkation, Class, Sex
#
## Examine data grouped by survival
#temp.groupby(temp.Survived).describe()

2. Construct a logistic regression model to predict survival

Variables to include in the model (a hypothetical sketch of how such a design matrix might be assembled follows this list):
  1. Categorical variables (Pclass, Sex, embarkation location, Title)
  2. Continuous variables (Age, Fare, # Parents/Children, # Siblings/Spouses)
  3. Interaction terms
    • Pclass * Sex
    • Embarkation * Sex
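
The design matrix itself is assembled by LGF.make_matrix, which lives in code/Logistic_Regression_Funcs.py and is not reproduced in this notebook. Below is a rough, hypothetical sketch of what such a helper might do with pandas; the indicator and interaction column names mirror those that appear in the regression summary in section 4, and the real function also derives a title indicator such as Title_Master from the Name column.

import pandas as pd
import statsmodels.api as sm

def make_matrix_sketch(df):
    # Hypothetical stand-in for LGF.make_matrix; details may differ.
    # Indicator (0/1) columns for the categorical predictors.
    dummies = pd.get_dummies(df[['Embarked', 'Pclass', 'Sex']].astype(str)).astype(int)
    # Drop one reference level per variable to avoid perfect collinearity
    # with the intercept (here Embarked_C, Pclass_1, and Sex_female).
    dummies = dummies.drop(['Embarked_C', 'Pclass_1', 'Sex_female'], axis=1)
    # Interaction terms: Pclass x Sex and Embarked x Sex indicators.
    for col in ['Pclass_2', 'Pclass_3', 'Embarked_Q', 'Embarked_S']:
        dummies[col + '_*_Sex_male'] = dummies[col] * dummies['Sex_male']
    # Prepend a constant column so the model includes an intercept.
    return sm.add_constant(dummies)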

In [1195]:
y = train['Survived']

In [1196]:
# X is an [m x n] matrix.
#    m = number of observations
#    n = number of predictors
X = LGF.make_matrix(train)

3. Optimize Model Parameters via Logistic regression


In [1214]:
results = sm.Logit(y,X).fit(maxiter=1000,method='bfgs')


Optimization terminated successfully.
         Current function value: 0.411809
         Iterations: 108
         Function evaluations: 109
         Gradient evaluations: 109
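
For reference, there is no closed-form solution for the logistic regression coefficients, so sm.Logit fits them by maximum likelihood: BFGS iteratively minimizes the average negative log-likelihood

$$-\tfrac{1}{m}\,\ell(\theta) = -\tfrac{1}{m}\sum_{i=1}^{m}\Bigl[y_i\log\sigma(x_i^{\top}\theta) + (1-y_i)\log\bigl(1-\sigma(x_i^{\top}\theta)\bigr)\Bigr], \qquad \sigma(z)=\frac{1}{1+e^{-z}},$$

where m = 891 passengers and sigma is the logistic (sigmoid) function. The "current function value" of 0.411809 printed above is this quantity at the optimum: multiplied by the 891 observations it gives about 366.9, matching the log-likelihood of -366.92 reported in the summary below.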

4. Evaluate Model Results


In [1198]:
results.summary()


Out[1198]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 891
Model: Logit Df Residuals: 879
Method: MLE Df Model: 11
Date: Thu, 04 Dec 2014 Pseudo R-squ.: 0.3816
Time: 13:37:23 Log-Likelihood: -366.92
converged: True LL-Null: -593.33
LLR p-value: 3.632e-90
coef std err z P>|z| [95.0% Conf. Int.]
const 4.1085 0.683 6.019 0.000 2.771 5.446
Embarked_Q 0.3835 0.566 0.678 0.498 -0.725 1.493
Embarked_S -1.1151 0.452 -2.469 0.014 -2.000 -0.230
Embarked_nan 2.9745 24.439 0.122 0.903 -44.924 50.873
Sex_male -4.3837 0.730 -6.003 0.000 -5.815 -2.952
Pclass_2 -0.6279 0.733 -0.856 0.392 -2.065 0.809
Pclass_3 -3.5018 0.621 -5.635 0.000 -4.720 -2.284
Title_Master 2.4168 0.365 6.624 0.000 1.702 3.132
Pclass_2_*_Sex_male -0.6755 0.815 -0.829 0.407 -2.272 0.921
Pclass_3_*_Sex_male 2.0651 0.678 3.048 0.002 0.737 3.393
Embarked_Q_*_Sex_male -1.7974 0.899 -1.998 0.046 -3.560 -0.034
Embarked_S_*_Sex_male 0.6288 0.532 1.181 0.238 -0.415 1.672

In [1212]:
ypredict = results.predict(X)
ypredict = np.round(ypredict)
print "score on training data = ",LGF.score(y,ypredict)


score on training data =  0.822671156004
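
LGF.score is also defined in code/Logistic_Regression_Funcs.py and not shown here. A minimal sketch of a plain classification-accuracy score, assuming y and ypredict are equal-length arrays (or Series) of 0/1 labels:

import numpy as np

def score_sketch(y_true, y_pred):
    # Fraction of passengers whose predicted label matches the observed label.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))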

5. Submit the results to the Kaggle competition


In [1200]:
# Read the test data
test = pd.read_csv("./data/titanic/test.csv",index_col="PassengerId")

In [1201]:
# Construct test model matrix
Xtest = LGF.make_matrix(test,matchcols=X.columns[1:])
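
The matchcols argument is not documented here, but its purpose is to force the test design matrix to contain exactly the same predictor columns, in the same order, as the training matrix (indicator levels that never occur in the test set are zero-filled, and levels unseen in training are dropped). A minimal, hypothetical sketch of that alignment with pandas:

def match_columns(X_raw, train_columns):
    # Hypothetical helper: reindex onto the training columns, zero-filling
    # any indicator column that does not appear in the raw test matrix.
    return X_raw.reindex(columns=train_columns, fill_value=0)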

In [1202]:
# Calculate predictions by applying model parameters to test model matrix
Ypredict = pd.DataFrame(results.predict(Xtest),index=Xtest.index)
Ypredict = np.round(Ypredict)
Ypredict.columns = ['Survived']
Ypredict = Ypredict.astype(int)
Ypredict.to_csv('./predictions/Logistic_Regression_Prediction.csv',sep=',')
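
Because Ypredict keeps PassengerId as its index, the written file starts with a PassengerId,Survived header followed by one integer 0/1 row per test passenger, which is the format the Titanic competition expects. A quick sanity check (assuming the write above succeeded):

with open('./predictions/Logistic_Regression_Prediction.csv') as f:
    print(f.readline().strip())   # expected: PassengerId,Survived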

This submission scored 0.77512, placing 1332nd out of 2075 submissions, the same score as the "My First Random Forest" benchmark.