P2 - Titanic Data

Questions

What are the most important factors related to the probability of survival?

Load Data



In [1]:

    
%pylab inline
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import statsmodels.api as sm

# read the data and inspect
titanic = pd.read_csv('titanic-data.csv')
print(titanic.info())
titanic.head()









    



Populating the interactive namespace from numpy and matplotlib
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
None






    Out[1]:






  
    
      
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500	NaN	S



In [2]:

    
# drop those columns we are not interested in.
titanic.drop(["Name", "Ticket", "Cabin", "Embarked", "Fare"], axis=1, inplace=True)

Name and Embarked are dropped from the dataset because a passenger's name and port of embarkation shouldn't have any meaningful correlation with their chance of surviving. Arguably, the port of embarkation might give some indication of a passenger's socio-economic background. However, I prefer to use "Pclass", because the special notes in the data source specifically describe it as a proxy for socio-economic status.

I dropped Fare in favor of Pclass for the same reason.

Cabin is an interesting feature. It is plausible that the exact location of a passenger when the accident happened had a significant impact on the chance of survival. According to Wikipedia, the Titanic hit an iceberg at 11:40 pm. Given that it was already late at night, it is likely that many people were asleep in their cabins. However, only 204 of the 891 entries have a non-NaN value for this feature. Cabin is hence removed from the dataset as well.

Ticket numbers seem to be fairly random and are not formatted consistently across the dataset. They don't seem to contain any useful information, so I have dropped Ticket from the dataset as well.

Other features like socio-economic class (Pclass), gender (Sex), Age, and the number of family members onboard (Parch and SibSp) are all reasonable factors that could influence the chance of survival.
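The missing-value argument for Cabin can be checked directly by computing the share of missing entries per column. The sketch below uses a small hypothetical frame named toy rather than the real data; on the real dataset the same expression applied to titanic should report roughly 77% missing for Cabin (204 of 891 entries are non-null).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real titanic DataFrame.
toy = pd.DataFrame({
    "Age":   [22.0, 38.0, np.nan, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Sex":   ["male", "female", "female", "female", "male"],
})

# Share of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column with a high share of missing values, like Cabin here, is a candidate for removal; where to draw the line is a judgment call.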

Data Audit



In [3]:

    
# helpers ----------------------------
def value_in_range(series, lower, upper):
    # pd.to_numeric raises if any value cannot be parsed as a number
    pd.to_numeric(series, errors="raise")
    return lower <= series.min() and series.max() <= upper
# ------------------------------------

# Sex can only be male or female
assert titanic["Sex"].isin(["male", "female"]).all()

# Survived should be either 0 or 1
assert titanic["Survived"].isin([0, 1]).all()

# Pclass should be 1, 2, or 3
assert titanic["Pclass"].isin([1, 2, 3]).all()

# Age should be sensible, say between 0 and 100 (missing values are skipped)
age_series = titanic["Age"].dropna()
assert value_in_range(age_series, 0, 100)

Data Exploration

Visualisation

We first use the pointplot to explore the effect of "Sex", "Pclass", "SibSp", and "Parch" on survival. The pointplot works well for categorical variables and for numerical variables that take only a few distinct values. It shows the mean estimate and a confidence interval for each possible value of the variable.

For age, we will compare the kdeplot of those who survived with that of those who didn't. If the two density estimates are similar, age has little correlation with the chance of survival. If the density for survivors is higher than the density for non-survivors over a particular age range, the chance of survival is higher for that age group.
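To make concrete what each point and vertical bar in a pointplot summarises, the following sketch computes a per-group survival rate and an approximate 99% confidence interval by hand. The frame toy is hypothetical, and the interval uses a normal approximation; seaborn estimates its interval by bootstrapping, so its numbers will differ somewhat.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis uses the titanic DataFrame.
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male",
                 "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column is the survival rate of the group.
summary = toy.groupby("Sex")["Survived"].agg(["mean", "count"])

# Normal-approximation standard error of a proportion: sqrt(p * (1 - p) / n).
summary["se"] = np.sqrt(summary["mean"] * (1 - summary["mean"]) / summary["count"])

z_99 = 2.576  # two-sided 99% critical value of the standard normal
summary["ci_low"] = summary["mean"] - z_99 * summary["se"]
summary["ci_high"] = summary["mean"] + z_99 * summary["se"]
print(summary)
```

With only four passengers per group the intervals are very wide, which is the same effect that makes the bars grow for rarely observed values of SibSp and Parch.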



In [4]:

    
# using seaborn's pointplot to visually explore the relationship between each categorical variable and survival
# (newer seaborn releases renamed PairGrid's `size` argument to `height`)
sns.set_style("whitegrid")
g = sns.PairGrid(titanic, x_vars=["Sex", "Pclass", "SibSp", "Parch"], y_vars=["Survived"], size=4)
g.map(sns.pointplot, ci=99)
g.axes[0,0].set_ylabel("survival rate")
g.fig.suptitle("Point Plots")









    Out[4]:





<matplotlib.text.Text at 0x117a99250>

Sex

The steep slope and short vertical bars (representing the 99% confidence interval of the mean) in the first visualisation suggest that "Sex" is strongly correlated with the chance of survival. Females had a much higher chance of survival than males.

Pclass

Similar to "Sex", the second visualisation shows that people in a higher social class were more likely to survive the disaster.

Family members onboard (sibling, spouse, parents and children)

Overall, the pattern seems to be that passengers with a few family members aboard had the highest chance of survival. Passengers with no family members or with many family members aboard had a lower chance of survival.

The confidence intervals are relatively wide at higher counts, suggesting that the explanatory power of these two variables might be relatively weak.



In [5]:

    
# plotting the kernel density estimate and boxplot of age, grouped by survival
figure = plt.figure()

ax_top = figure.add_subplot(211)
ax_top.set_xlim(0, 85)
ax_top.set_xlabel("Age")
ax_top.set_ylabel("Density")
ax_top.set_title("Kernel Density Estimate for Age grouped by survival")

ax_bottom = figure.add_subplot(212)
ax_bottom.set_xlim(0, 85)
ax_bottom.set_title("Boxplot for Age distribution grouped by survival")

survivors = titanic[titanic["Survived"] == 1]
non_survivors = titanic[titanic["Survived"] == 0]

# cut=1 draws each curve one bandwidth past the extreme data points
_ = sns.kdeplot(survivors["Age"].dropna(),
                label="survived == True",
                cut=1, shade=True,
                ax=ax_top)

_ = sns.kdeplot(non_survivors["Age"].dropna(),
                label="survived == False",
                cut=1, shade=True,
                ax=ax_top)

_ = sns.boxplot(x="Age",
                y="Survived",
                data=titanic.dropna(subset=["Age"]),
                orient="h",
                ax=ax_bottom)

plt.tight_layout()

Age

Overall, the kernel density estimates look quite similar for those who survived and those who didn't. The same holds for the boxplots. The median age of both groups is the same and the differences in the quartiles are relatively small.

One obvious misalignment of the two density plots occurs where the passenger is a child (roughly under 14). The hump at the far left of the density curve for the survivors suggests that children had a higher chance of survival.

On the flip side, young adults appear less likely to have survived than other age groups.
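The visual impression from the density curves can be cross-checked numerically by binning age and comparing survival rates per bin. This is only a sketch: the frame toy is hypothetical, and the bin edges and labels are assumptions (14 mirrors the tentative cut-off for children mentioned above).

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use titanic.dropna(subset=["Age"]).
toy = pd.DataFrame({
    "Age":      [2, 8, 13, 19, 24, 27, 33, 41, 58, 70],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
})

# Assumed bin edges; each bin includes its left edge and excludes its right edge.
bins = [0, 14, 30, 50, 100]
labels = ["child", "young adult", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Survival rate and number of passengers in each age group.
rate_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].agg(["mean", "count"])
print(rate_by_group)
```

Reporting the count next to the rate matters, because a rate computed from a handful of passengers is not very informative.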

Logit Regression

Because our dependent variable (Survived) is binary, taking the value 0 or 1, we use logit regression to study the influence of the various factors on the probability of survival.

We will drop any data entry that contains an NA value in any of the features we are interested in.

We will also encode female as 0 and male as 1, because the regression requires a numerical representation.
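As background for reading the regression output, the sketch below shows how a logit model turns a linear predictor into a probability, and how an average marginal effect (the dy/dx reported with "At: overall") can be computed by hand. The intercept and coefficient are made-up numbers for illustration only, not the fitted values from this notebook.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, NOT the fitted values from this notebook.
intercept = 1.0
beta_age = -0.04

ages = np.array([5.0, 25.0, 45.0, 65.0])
p = logistic(intercept + beta_age * ages)

# For a logit model, dP/dx = beta * p * (1 - p) at each observation;
# averaging over the observations gives the average marginal effect.
marginal_effects = beta_age * p * (1.0 - p)
average_marginal_effect = marginal_effects.mean()

print(p)
print(average_marginal_effect)
```

Because p * (1 - p) never exceeds 0.25, the marginal effect is always smaller in magnitude than the raw coefficient, which is why marginal effects are easier to read as changes in probability.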



In [6]:

    
# Drop data points that contain NA in any feature. The copy makes the column
# assignment below act on an independent DataFrame rather than a slice of titanic.
titanic_dropna = titanic.dropna(subset=["Survived", "Age", "Sex", "Pclass", "SibSp", "Parch"]).copy()

# convert "Sex" to a numeric representation (female = 0, male = 1), which is required for the regression.
titanic_dropna["Sex"] = titanic_dropna["Sex"].map({"female": 0, "male": 1})

dep = titanic_dropna["Survived"]
indep = titanic_dropna[["Sex", "Pclass", "SibSp", "Parch", "Age"]]
# report average marginal effects rather than raw logit coefficients
print(sm.Logit(dep, sm.add_constant(indep)).fit().get_margeff().summary())









    



Optimization terminated successfully.
         Current function value: 0.445814
         Iterations 6
        Logit Marginal Effects       
=====================================
Dep. Variable:               Survived
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Sex           -0.3766      0.017    -21.753      0.000        -0.411    -0.343
Pclass        -0.1879      0.016    -11.698      0.000        -0.219    -0.156
SibSp         -0.0521      0.018     -2.935      0.003        -0.087    -0.017
Parch         -0.0053      0.017     -0.311      0.756        -0.039     0.028
Age           -0.0063      0.001     -5.844      0.000        -0.008    -0.004
==============================================================================






    




Sex

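"Sex" has the largest marginal effect (by absolute value) among all variables. With male encoded as 1, the estimate of -0.3766 means that a male passenger's probability of survival is, on average, about 38 percentage points lower than that of an otherwise identical female. The z-score for "Sex" is -21.75 and the p-value is reported as 0.000, confirming our intuition from the earlier visualisation that this relationship is statistically significant.

Pclass

"Pclass" (social class) has the second largest marginal effect by absolute value: each step down in class (from 1 to 2, or from 2 to 3) is associated with a drop of about 19 percentage points in the probability of survival. This is again in line with our intuition from the visualisation. With a z-score of -11.70 and a p-value reported as 0.000, the relationship between social class and the chance of survival is statistically significant.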

SibSp

Each additional sibling or spouse onboard lowers a passenger's probability of survival by about 5.21 percentage points relative to an otherwise identical passenger. While this is not easily seen from our visualisation, SibSp has a z-score of -2.94 and a p-value of 0.003, meaning it is statistically significant at the 1% significance level.

Parch

Our regression shows that the number of parents and children onboard does not have a significant connection to the chance of survival (z-score of -0.31, p-value of 0.756).

Age

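Age is also a statistically significant factor: each additional year of age is associated with a drop of about 0.63 percentage points in the probability of survival (z-score of -5.84). The negative sign is consistent with our observation from the visualisation that children were more likely to survive than adults.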

Conclusion

From our analysis, we conclude that sex, socio-economic class, age, and the number of siblings and spouses onboard are the most important factors associated with the chance of survival.

Limitation

Our visualisation studies each feature's relationship with the chance of survival separately. It does not explore any joint effects of these features on survival.
The logit regression, on the other hand, does consider the features jointly. However, it assumes that each feature has a linear, additive effect on the log-odds of survival. In reality, the relationship could be much more complicated. For example, it could be the case that children's survival had little to do with gender, whereas for adults gender was the most prominent factor in determining survival.
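One common way to relax the additivity assumption is to add an interaction column to the design matrix. The sketch below only shows how such a column could be built; the frame toy, the column names IsChild and Sex_x_IsChild, and the age cut-off of 14 are all assumptions for illustration, and no model is fitted here.

```python
import pandas as pd

# Hypothetical toy data using the same encoding as the regression (male = 1, female = 0).
toy = pd.DataFrame({
    "Sex": [1, 0, 1, 0],
    "Age": [6.0, 9.0, 30.0, 42.0],
})

# Assumed cut-off: passengers under 14 are treated as children.
toy["IsChild"] = (toy["Age"] < 14).astype(int)

# Interaction column: non-zero only for male children, so its coefficient
# would capture how the effect of Sex differs for children.
toy["Sex_x_IsChild"] = toy["Sex"] * toy["IsChild"]
print(toy)
```

Including IsChild and Sex_x_IsChild among the independent variables would let the estimated effect of Sex differ between children and adults. Whether that improves the model would need to be checked on the real data.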


