Introduction

This is the second installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate logistic regression using the Titanic competition.

Outline

  1. Import and examine the data
  2. Construct a logistic model to predict survival
  3. Optimize model parameters by maximum likelihood
  4. Evaluate model results
  5. Submit results to the Kaggle competition

Import Necessary Modules


In [1188]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import statsmodels.api as sm
import code.Linear_Regression_Funcs as LRF
import code.Logistic_Regression_Funcs as LGF

In [1189]:
reload(LGF)   # pick up any edits to code/Logistic_Regression_Funcs.py (Python 2 built-in reload)


Out[1189]:
<module 'code.Logistic_Regression_Funcs' from 'code/Logistic_Regression_Funcs.pyc'>

1. Read Titanic Data


In [1190]:
train = pd.read_csv("./data/titanic/train.csv", index_col="PassengerId")
train.head()


Out[1190]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

In [1191]:
# Fill embarkation location NaN with a string
#train.Embarked = train.Embarked.fillna('nan')

# Create name category from titles in the name column
#train = LGF.nametitles(train)

Some Exploratory Analysis using Pandas


In [1192]:
#temp = pd.crosstab([train.Pclass, train.Sex],train.Survived.astype(bool))
#temp

In [1193]:
#sb.set(style="white")
#sb.factorplot('Pclass','Survived','Sex',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Pclass',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Sex',data=train,palette="muted")
#fg = sb.FacetGrid(train,hue="Pclass",aspect=3,palette="muted")
#fg.map(sb.kdeplot,"Age",bw=4,shade=True,legend=True)
#fg.set(xlim=(0,80))

In [1194]:
## Transform categorical variables into numeric indicators (For examination only)
#temp = LGF.cat2indicator(train, ['Embarked','Pclass','Sex'])  # Embarkation, Class, Sex
#
## Examine data grouped by survival
#temp.groupby(temp.Survived).describe()

2. Construct a logistic regression model to predict survival

Variables to include in the model (a hypothetical sketch of how such a design matrix might be assembled follows this list):
  1. Categorical variables (Pclass, Sex, embarkation location, Title)
  2. Continuous variables (Age, Fare, # Parents/Children, # Siblings/Spouses)
  3. Interaction terms
    • Pclass * Sex
    • Embarkation * Sex
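
The design matrix itself is assembled by LGF.make_matrix, which lives in code/Logistic_Regression_Funcs.py and is not reproduced in this notebook. Below is a rough, hypothetical sketch of what such a helper might do with pandas; the indicator and interaction column names mirror those that appear in the regression summary in section 4, and the real function also derives a title indicator such as Title_Master from the Name column.

import pandas as pd
import statsmodels.api as sm

def make_matrix_sketch(df):
    # Hypothetical stand-in for LGF.make_matrix; details may differ.
    # Indicator (0/1) columns for the categorical predictors.
    dummies = pd.get_dummies(df[['Embarked', 'Pclass', 'Sex']].astype(str)).astype(int)
    # Drop one reference level per variable to avoid perfect collinearity
    # with the intercept (here Embarked_C, Pclass_1, and Sex_female).
    dummies = dummies.drop(['Embarked_C', 'Pclass_1', 'Sex_female'], axis=1)
    # Interaction terms: Pclass x Sex and Embarked x Sex indicators.
    for col in ['Pclass_2', 'Pclass_3', 'Embarked_Q', 'Embarked_S']:
        dummies[col + '_*_Sex_male'] = dummies[col] * dummies['Sex_male']
    # Prepend a constant column so the model includes an intercept.
    return sm.add_constant(dummies)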

In [1195]:
y = train['Survived']

In [1196]:
# X is an [m x n] matrix.
#    m = number of observations
#    n = number of predictors
X = LGF.make_matrix(train)

3. Optimize Model Parameters via Logistic regression


In [1214]:
results = sm.Logit(y,X).fit(maxiter=1000,method='bfgs')


Optimization terminated successfully.
         Current function value: 0.411809
         Iterations: 108
         Function evaluations: 109
         Gradient evaluations: 109
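
For reference, there is no closed-form solution for the logistic regression coefficients, so sm.Logit fits them by maximum likelihood: BFGS iteratively minimizes the average negative log-likelihood

$$-\tfrac{1}{m}\,\ell(\theta) = -\tfrac{1}{m}\sum_{i=1}^{m}\Bigl[y_i\log\sigma(x_i^{\top}\theta) + (1-y_i)\log\bigl(1-\sigma(x_i^{\top}\theta)\bigr)\Bigr], \qquad \sigma(z)=\frac{1}{1+e^{-z}},$$

where m = 891 passengers and sigma is the logistic (sigmoid) function. The "current function value" of 0.411809 printed above is this quantity at the optimum: multiplied by the 891 observations it gives about 366.9, matching the log-likelihood of -366.92 reported in the summary below.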

4. Evaluate Model Results


In [1198]:
results.summary()


Out[1198]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 891
Model: Logit Df Residuals: 879
Method: MLE Df Model: 11
Date: Thu, 04 Dec 2014 Pseudo R-squ.: 0.3816
Time: 13:37:23 Log-Likelihood: -366.92
converged: True LL-Null: -593.33
LLR p-value: 3.632e-90
coef std err z P>|z| [95.0% Conf. Int.]
const 4.1085 0.683 6.019 0.000 2.771 5.446
Embarked_Q 0.3835 0.566 0.678 0.498 -0.725 1.493
Embarked_S -1.1151 0.452 -2.469 0.014 -2.000 -0.230
Embarked_nan 2.9745 24.439 0.122 0.903 -44.924 50.873
Sex_male -4.3837 0.730 -6.003 0.000 -5.815 -2.952
Pclass_2 -0.6279 0.733 -0.856 0.392 -2.065 0.809
Pclass_3 -3.5018 0.621 -5.635 0.000 -4.720 -2.284
Title_Master 2.4168 0.365 6.624 0.000 1.702 3.132
Pclass_2_*_Sex_male -0.6755 0.815 -0.829 0.407 -2.272 0.921
Pclass_3_*_Sex_male 2.0651 0.678 3.048 0.002 0.737 3.393
Embarked_Q_*_Sex_male -1.7974 0.899 -1.998 0.046 -3.560 -0.034
Embarked_S_*_Sex_male 0.6288 0.532 1.181 0.238 -0.415 1.672

In [1212]:
ypredict = results.predict(X)
ypredict = np.round(ypredict)
print "score on training data = ",LGF.score(y,ypredict)


score on training data =  0.822671156004
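
LGF.score is also defined in code/Logistic_Regression_Funcs.py and not shown here. A minimal sketch of a plain classification-accuracy score, assuming y and ypredict are equal-length arrays (or Series) of 0/1 labels:

import numpy as np

def score_sketch(y_true, y_pred):
    # Fraction of passengers whose predicted label matches the observed label.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))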

5. Submit the results to the Kaggle competition


In [1200]:
# Read the test data
test = pd.read_csv("./data/titanic/test.csv",index_col="PassengerId")

In [1201]:
# Construct test model matrix
Xtest = LGF.make_matrix(test,matchcols=X.columns[1:])
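
The matchcols argument is not documented here, but its purpose is to force the test design matrix to contain exactly the same predictor columns, in the same order, as the training matrix (indicator levels that never occur in the test set are zero-filled, and levels unseen in training are dropped). A minimal, hypothetical sketch of that alignment with pandas:

def match_columns(X_raw, train_columns):
    # Hypothetical helper: reindex onto the training columns, zero-filling
    # any indicator column that does not appear in the raw test matrix.
    return X_raw.reindex(columns=train_columns, fill_value=0)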

In [1202]:
# Calculate predictions by applying model parameters to test model matrix
Ypredict = pd.DataFrame(results.predict(Xtest),index=Xtest.index)
Ypredict = np.round(Ypredict)
Ypredict.columns = ['Survived']
Ypredict = Ypredict.astype(int)
Ypredict.to_csv('./predictions/Logistic_Regression_Prediction.csv',sep=',')
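
Because Ypredict keeps PassengerId as its index, the written file starts with a PassengerId,Survived header followed by one integer 0/1 row per test passenger, which is the format the Titanic competition expects. A quick sanity check (assuming the write above succeeded):

with open('./predictions/Logistic_Regression_Prediction.csv') as f:
    print(f.readline().strip())   # expected: PassengerId,Survived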

This submission scored 0.77512, placing 1332nd out of 2075 submissions, the same score as the "My First Random Forest" benchmark.