P2 - Titanic Data


Questions

What are the most important factors related to the probability of survival?


Load Data


In [1]:
%pylab inline
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import statsmodels.api as sm

# read the data and inspect
titanic = pd.read_csv('titanic-data.csv')
print titanic.info()
titanic.head()


Populating the interactive namespace from numpy and matplotlib
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
None
Out[1]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500   NaN    S

In [2]:
# drop those columns we are not interested in.
titanic.drop(["Name", "Ticket", "Cabin", "Embarked", "Fare"], axis=1, inplace=True)

Name and Embarked are dropped from the dataset because a passenger's name and port of embarkation should not have any meaningful correlation with the chance of surviving. Arguably, the port of embarkation might give some indication of a passenger's socio-economic background. However, I prefer to use "Pclass", because the special notes in the data source specifically describe it as a proxy for socio-economic status.

I dropped Fare in favor of Pclass for the same reason.

Cabin is an interesting feature. It is plausible that a passenger's exact location when the accident happened had a significant impact on the chance of survival. According to Wikipedia, the Titanic hit an iceberg at 11:40 pm. Given that it was already late at night, many people were likely asleep in their cabins. However, only 204 of the 891 entries have a non-NaN value for this feature, so Cabin is removed from the dataset as well.

Ticket numbers seem fairly random and are not formatted consistently across the dataset. They do not seem to contain any useful information, so I have dropped Ticket from the dataset as well.

Other features such as socio-economic class (Pclass), gender (Sex), Age, and the number of family members onboard (Parch and SibSp) are all reasonable factors that could influence the chance of survival.
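
The argument for dropping Cabin rests on how much of the column is missing (204 of 891 non-null in the info() output above, so roughly 77% missing). A minimal sketch of how that share can be computed per column is given here. The frame `toy` and its values are hypothetical and are not taken from the Titanic data.

```python
import pandas as pd

# Toy frame (hypothetical values, not the Titanic data) with gaps in two columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Pclass": [3, 1, 3, 1],
})

# isnull() marks missing cells; the column-wise mean of that boolean mask
# is the fraction of missing values in each column.
missing_share = toy.isnull().mean()
print(missing_share.sort_values(ascending=False))
```

Sorting the result puts the sparsest columns first, which makes candidates for removal easy to spot.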


Data Audit


In [3]:
# helpers ----------------------------
def value_in_range(series, lower, upper):
    # pd.to_numeric raises if any value cannot be parsed as a number
    numeric = pd.to_numeric(series, errors="raise")
    return lower <= numeric.min() and numeric.max() <= upper
# ------------------------------------

# Sex can either be male or female
assert titanic["Sex"].isin(["male", "female"]).all() == True

# Survived 
assert titanic["Survived"].isin([0,1]).all() == True

# Pclass should be either 1, 2, or 3
assert titanic["Pclass"].isin([1, 2, 3]).all() == True

# Age should be sensible, say between 0 and 100 (missing ages are skipped)
AgeSeries = titanic["Age"].dropna()
assert value_in_range(AgeSeries, 0, 100) == True


Data Exploration

Visualisation

We first use a pointplot to explore the effect of "Sex", "Pclass", "SibSp", and "Parch" on survival. The pointplot works well for categorical variables and for numerical variables with a small number of distinct values. It shows the mean estimate and a confidence interval for each value of the variable.

For age, we compare the kernel density estimates (kdeplot) for those who survived and those who did not. If the two curves are similar, age has little correlation with the chance of survival. If the survivors' density is higher than the non-survivors' density over a particular age range, the chance of survival is higher for that age group.
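
To make the "mean estimate and confidence interval" concrete, here is a rough sketch of a percentile bootstrap for the mean of a 0/1 survival flag. seaborn's pointplot computes its intervals by bootstrapping, but the helper `bootstrap_mean_ci` is a hypothetical illustration of the idea, not seaborn's implementation, and the flags are made-up values.

```python
import random

def bootstrap_mean_ci(values, level=0.99, n_boot=2000, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Resample n observations with replacement and record their mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(resample) / float(n))
    boot_means.sort()
    alpha = (1.0 - level) / 2.0
    lower = boot_means[int(alpha * n_boot)]
    upper = boot_means[int((1.0 - alpha) * n_boot) - 1]
    return sum(values) / float(n), lower, upper

# Hypothetical 0/1 survival flags for one group (not taken from the data set).
survived_flags = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, lower, upper = bootstrap_mean_ci(survived_flags)
print("rate=%.2f, 99%% CI=(%.2f, %.2f)" % (mean, lower, upper))
```

With only 20 observations the interval is wide, which is the same effect that produces long vertical bars for sparsely populated categories in a pointplot.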


In [4]:
# using seaborn's pointplot to visually explore the relationship between survival and individual categorical variables
sns.set_style("whitegrid")
g = sns.PairGrid(titanic, x_vars=["Sex", "Pclass", "SibSp", "Parch"], y_vars=["Survived"], size=4)
g.map(sns.pointplot, ci=99)
g.axes[0,0].set_ylabel("survival rate")
g.fig.suptitle("Point Plots")


Out[4]:
<matplotlib.text.Text at 0x117a99250>

Sex

The steep slope and short vertical bars (representing the 99% confidence interval of the mean) in the first visualisation suggest that "Sex" is strongly correlated with the chance of survival. Females had a much higher chance of survival than males.

Social class (Pclass)

Similar to "Sex", the second visualisation shows that passengers in a higher social class were more likely to survive the disaster.

Family members onboard (siblings, spouses, parents and children)

Overall, passengers with a few family members aboard seem to have the highest chance of survival. Passengers with no family members or with many family members aboard have a lower chance of survival.

The confidence intervals are relatively large at higher values of SibSp and Parch, suggesting that the explanatory power of those two variables might be relatively weak.


In [5]:
# plotting the kernel density estimate for age
figure = plt.figure()

ax_top = figure.add_subplot(211)
ax_top.set_xlim(0,85)
ax_top.set_xlabel("Age")
ax_top.set_ylabel("Proportion of Population")
ax_top.set_title("Kernel Density Estimate for Age grouped by survival")

ax_bottom = figure.add_subplot(212)
ax_bottom.set_xlim(0,85)
ax_bottom.set_title("Boxplot for Age distribution grouped by survival")

x = titanic[titanic["Survived"] == 1] 
y = titanic[titanic["Survived"] == 0]

_ = sns.kdeplot(x["Age"].dropna(),
                label="survived == True", 
                cut= True, shade=True, 
                ax=ax_top)

_ = sns.kdeplot(y["Age"].dropna(), 
                label="survived == False", 
                cut=True, shade=True, 
                ax=ax_top)

_ = sns.boxplot(x="Age",
                y="Survived",
                data=titanic.dropna(subset = ["Age"]),
                orient="h",
                ax=ax_bottom)

plt.tight_layout()


Age

Overall, the kernel density estimates look quite similar for those who survived and those who didn't. The same holds for the boxplots: the median age is the same for both groups and the differences in quartiles are relatively small.

One obvious misalignment of the two density plots occurs for children ( < 14? ). The hump at the far left of the density curve for the survivors suggests children had a higher chance of survival.

On the flip side, young adults were less likely to survive compared to other age groups.
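
One way to put numbers on these visual impressions is to bin age into bands and compare the survival rate per band. The sketch below runs on a made-up frame `toy`. The band edges and labels are illustrative assumptions, with 14 mirroring the tentative child cutoff mentioned above.

```python
import pandas as pd

# Toy passengers (hypothetical values, not the Titanic data).
toy = pd.DataFrame({
    "Age":      [4, 9, 13, 19, 24, 28, 33, 47, 58, 71],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
})

# Band edges are an illustrative choice; right=False makes each band [low, high).
bins = [0, 14, 30, 50, 100]
labels = ["child", "young adult", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the survival rate; count shows how many rows back it.
summary = toy.groupby("AgeBand", observed=True)["Survived"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the rate helps avoid over-reading bands that contain only a handful of passengers.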

Logit Regression

Because our dependent variable (Survived) is binary (0 or 1), we use logit regression to study the influence of various factors on the probability of survival.

We drop any data entry that contains an NA value in any of the features we are interested in.

We also encode female as 0 and male as 1, because the regression requires a numerical representation.


In [6]:
# Drop data points that contain NA in any feature; copy so the encoding below
# does not write to a slice of the original frame (avoids SettingWithCopyWarning).
titanic_dropna = titanic.dropna(subset=["Survived", "Age", "Sex", "Pclass", "SibSp", "Parch"]).copy()

# convert "Sex" to a numeric representation, which is required for regressions.
titanic_dropna["Sex"] = titanic_dropna["Sex"].map({"female": 0, "male": 1})

dep = titanic_dropna["Survived"]
indep = titanic_dropna[["Sex", "Pclass", "SibSp", "Parch", "Age"]]
print(sm.Logit(dep, sm.add_constant(indep)).fit().get_margeff().summary())


Optimization terminated successfully.
         Current function value: 0.445814
         Iterations 6
        Logit Marginal Effects       
=====================================
Dep. Variable:               Survived
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Sex           -0.3766      0.017    -21.753      0.000        -0.411    -0.343
Pclass        -0.1879      0.016    -11.698      0.000        -0.219    -0.156
SibSp         -0.0521      0.018     -2.935      0.003        -0.087    -0.017
Parch         -0.0053      0.017     -0.311      0.756        -0.039     0.028
Age           -0.0063      0.001     -5.844      0.000        -0.008    -0.004
==============================================================================

Sex

"Sex" has the largest marginal effect (by absolute value) among all variables. On average, being male lowers the predicted probability of survival by about 37.7 percentage points compared with an otherwise identical female. The z-score for "Sex" is -21.75 and the p-value is reported as 0.000, confirming our intuition from the earlier visualisation that this relationship is statistically significant.

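Social class (Pclass)

"Pclass" has the second largest marginal effect by absolute value: each step down in class (for example, from 1st to 2nd) lowers the predicted probability of survival by about 18.8 percentage points on average. This is again in line with our intuition from the visualisation. With a z-score of -11.70 and a p-value reported as 0.000, the relationship between social class and the chance of survival is statistically significant.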

SibSp

Each additional sibling or spouse onboard lowers a passenger's predicted probability of survival by about 5.2 percentage points, all else being equal. While this is not easily seen in our visualisation, SibSp has a z-score of -2.935 and a p-value of 0.003, meaning it is statistically significant at the 1% significance level.

Parch

Our regression shows that the number of parents and children onboard (Parch) does not have a statistically significant connection to the chance of survival (p-value of 0.756).

Age

Age is also a statistically significant factor. The negative marginal effect is consistent with our observation from the visualisation that children were more likely to survive than adults.
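
The dy/dx column in the regression output reports average marginal effects (Method: dydx, At: overall). For a logit model, the slope of the predicted probability p with respect to a feature is coef * p * (1 - p), and the "overall" effect averages that slope over all observations. The sketch below illustrates the arithmetic only: the coefficient and the fitted log-odds are hypothetical numbers, not values from the model above. Note also that this slope-based calculation treats every feature, including the 0/1 "Sex" dummy, as continuous.

```python
import math

def logistic(z):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def average_marginal_effect(coef, linear_predictors):
    """Average of dp/dx = coef * p * (1 - p) over all observations."""
    slopes = []
    for z in linear_predictors:
        p = logistic(z)
        slopes.append(coef * p * (1.0 - p))
    return sum(slopes) / len(slopes)

# Hypothetical coefficient and fitted log-odds, chosen only for illustration.
coef_age = -0.04
fitted_log_odds = [-2.0, -0.5, 0.0, 0.8, 1.5]
ame = average_marginal_effect(coef_age, fitted_log_odds)
print("average marginal effect: %.4f" % ame)
```

Because p * (1 - p) never exceeds 0.25, an average marginal effect is always at most a quarter of the underlying logit coefficient in absolute value.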

Conclusion

From our analysis, we conclude that sex, socio-economic class, age, and the number of siblings and spouses onboard are the factors most strongly associated with the chance of survival.

Limitation

  • Our visualisations study each feature's relationship with the chance of survival in isolation. They do not explore any joint effects of these features on survival.

  • The logit regression, on the other hand, does consider these features jointly. However, it assumes that each feature has a linear, additive effect on the log-odds of survival. In reality, the relationship could be much more complicated. For example, it could be that children's survival had little to do with gender, whereas for adults gender was the most prominent factor in determining survival.
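
A cross-tabulation is one inexpensive way to look for the kind of joint effect described in the example above before adding interaction terms to a regression. The following sketch uses a made-up frame `toy`. The `AgeGroup` column and its cutoff of 14 are assumptions for illustration, and the toy values are deliberately constructed so that gender matters only for adults.

```python
import pandas as pd

# Toy passengers (hypothetical values, not the Titanic data).
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "female", "male", "male", "female"],
    "Age":      [6, 30, 8, 40, 45, 25, 3, 10],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 1],
})

# Illustrative cutoff (an assumption): passengers under 14 count as children.
toy["AgeGroup"] = toy["Age"].apply(lambda age: "child" if age < 14 else "adult")

# Survival rate for every Sex x AgeGroup combination.
joint = toy.pivot_table(index="Sex", columns="AgeGroup", values="Survived", aggfunc="mean")
print(joint)
```

If the gap between the male and female rows differs sharply between the two columns, that is a hint that an additive model without an interaction term may be missing part of the picture.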

