The RMS Titanic is a well-known passenger ship that sank in 1912. The ship's passenger list is a popular dataset of demographic information.
While most models predict the likelihood of surviving the sinking given demographic information, we will instead use the dataset to demonstrate how to work with mixed datatypes and combine likelihoods.
Let's set up our environment and load our data.
In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from microscopes.models import dd as dirichlet_discrete
from microscopes.models import bb as beta_bernoulli
from microscopes.models import gp as gamma_poisson
from microscopes.models import nich as normal_inverse_chisquared
from sklearn.preprocessing import LabelEncoder as Encoder
from microscopes.mixture.definition import model_definition
from microscopes.common.rng import rng
from microscopes.common.recarray.dataview import numpy_dataview
from microscopes.mixture import model, runner, query
from microscopes.kernels import parallel
from microscopes.common.query import groups, zmatrix_heuristic_block_ordering, zmatrix_reorder
sns.set_context('talk')
%matplotlib inline
In [20]:
ti = sns.load_dataset('titanic')
Some of the columns in the data are redundant, while others have missing values. Let's ignore columns with lots of missing data and focus on those with key information. We'll also drop rows with missing values.
In [21]:
unique_cols = ['survived','pclass','sex','age','sibsp','parch','embark_town','fare']
to_cluster = ti[unique_cols].dropna()
print '%d passengers in dataset' % len(to_cluster.index)
For this example, we'll be using a Dirichlet Process Mixture Model to cluster the passengers.
In this mixture model, each column has its own distribution over values conditioned on the cluster assignment. We specify the model by selecting which likelihood we'd like to use for each column.
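Schematically, the generative process is (a standard DPMM sketch, not library-specific notation; $F_j$ is the likelihood chosen for column $j$, $H_j$ its conjugate prior, and $\alpha$ the concentration parameter of the Dirichlet process):
$$z_1, \dots, z_n \sim \mathrm{CRP}(\alpha)$$
$$\theta_{k,j} \sim H_j \quad \text{for each cluster } k \text{ and column } j$$
$$x_{i,j} \mid z_i, \theta \sim F_j(\theta_{z_i,j})$$
Because each column gets its own likelihood $F_j$, mixed datatypes can live in one model: within a cluster, the per-column likelihoods simply multiply.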
In [22]:
to_cluster.head()
Out[22]:
For modeling, we must encode all categorical data as integer values. To do this, we'll use scikit-learn's LabelEncoder.
In [23]:
en1 = Encoder()
to_cluster['sex'] = en1.fit_transform(to_cluster['sex'])
en2 = Encoder()
to_cluster['embark_town'] = en2.fit_transform(to_cluster['embark_town'])
to_cluster.head()
Out[23]:
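Each encoder remembers its mapping, so we can recover the original labels whenever we need to interpret results later (a quick illustration; LabelEncoder sorts classes alphabetically):
en2.classes_                        # array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)
en2.inverse_transform([0, 1, 2])    # maps encoded integers back to the town names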
To decide which likelihood model to use for each column, let's get some basic statistics on our data.
In [24]:
to_cluster.describe()
Out[24]:
The above data table gives us a good idea of what kinds of likelihoods to use for each column:
- survived, which indicates survival of the sinking, is a binary column, which we will model with a beta-bernoulli distribution
- pclass, the passenger's class on the ship, ranges from 1 to 3. We'll model this ordinal column with a gamma-poisson distribution
- sex is another binary column we can model as beta-bernoulli
- age is real valued, so we model this column as normally distributed
- sibsp, the number of siblings/spouses on the ship, ranges from 0 to 5. Since this data is integer valued and of limited range, we'll also model it as gamma-poisson
- parch, the number of parents/children on the ship, has similar characteristics to sibsp, so it'll also be modeled as gamma-poisson
- embark_town indicates where the passenger boarded the ship. Since this is categorical data, we'll model it with a dirichlet-discrete distribution
- fare is also real valued, so we can model it as normal. However, the data is long tailed, so we'll take the log of fare instead
We have two kinds of normal distributions available, the normal inverse-wishart and the normal inverse-chi-squared. To determine which model we should use, let's find the correlation between age and log fare.
In [25]:
to_cluster['fare'] = np.log(to_cluster['fare'])
to_cluster[['age','fare']].corr()
Out[25]:
Since the correlation between age and fare is near 0, we can model these columns independently, each with its own normal inverse-chi-squared distribution.
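For reference, the normal inverse-chi-squared model places a conjugate prior on each column's mean and variance separately (standard textbook notation; the hyperparameters $\mu_0, \kappa_0, \nu_0, \sigma_0^2$ are the usual ones, not necessarily the library's parameter names):
$$\sigma^2 \sim \text{Scaled-Inv-}\chi^2(\nu_0, \sigma_0^2), \qquad \mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/\kappa_0), \qquad x_i \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2)$$
Had age and fare been strongly correlated, a joint model over both columns with a normal inverse-wishart prior would have been the better choice.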
We'll now define our model using the model_definition function and run a Gibbs sampler for 30000 iterations to learn the latent clusters.
Note that for the dirichlet-discrete likelihood we have to specify the number of categories in the data, in this case 3.
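We can double-check that count against the encoder we fit earlier (an optional sanity check, not part of the original pipeline):
len(en2.classes_)  # 3: Cherbourg, Queenstown, Southampton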
In [26]:
nchains = 5
iters = 30000
splits = 7
defn = model_definition(to_cluster.shape[0], [
    beta_bernoulli,             # survived: binary
    gamma_poisson,              # pclass: ordinal, 1-3
    beta_bernoulli,             # sex: binary
    normal_inverse_chisquared,  # age: real valued
    gamma_poisson,              # sibsp: small integer counts
    gamma_poisson,              # parch: small integer counts
    dirichlet_discrete(3),      # embark_town: 3 categories
    normal_inverse_chisquared,  # fare: log-transformed, real valued
])
prng = rng()
data_rec = np.array([(list(to_cluster.ix[c]),) for c in to_cluster.index], dtype=[('', np.float32, to_cluster.shape[1])])
view = numpy_dataview(data_rec)
latents = [model.initialize(defn, view, prng) for _ in xrange(nchains)]
runners = [runner.runner(defn, view, latent, kernel_config=['assign']) for latent in latents]
for i, rnr in enumerate(runners):
    print 'starting runner %d at %d iterations' % (i, iters)
    rnr.run(r=prng, niters=iters)
    print 'runner %d done' % i
infers = [r.get_latent() for r in runners]
To assess how stable these clusterings are across chains, we compute a z-matrix: entry (i, j) is the fraction of the inferred latent states in which passengers i and j are assigned to the same cluster. Reordering its rows and columns with a heuristic block ordering makes the cluster structure visible.
In [27]:
zmat = query.zmatrix(infers)
zmat = zmatrix_reorder(zmat, zmatrix_heuristic_block_ordering(zmat))
f, ax = plt.subplots(figsize=(12, 9))
labels = ['%d' % i if i % (zmat.shape[0]/splits) == 0 else '' for i in xrange(zmat.shape[0])]
sns.heatmap(zmat, linewidths=0, square=True, ax=ax, xticklabels=labels, yticklabels=labels)
plt.title('Z-Matrix of Titanic Dataset\nwith %d Chains at %d Iterations' % (nchains, iters))
plt.xlabel('people (sorted)')
plt.ylabel('people (sorted)')
Out[27]:
In [28]:
assignments = infers[0].assignments()
In [29]:
clusters = list(set(assignments))
K = len(clusters)
'%d clusters inferred' % K
Out[29]:
In [30]:
to_cluster['cluster'] = assignments
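As a quick sanity check (an optional aside), we can count how many passengers land in each cluster:
to_cluster.groupby('cluster').size()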
Now that we have our cluster assignments, let's plot the posterior distributions over each column in the data.
Since survived, pclass, sex, sibsp, parch, and embark_town have few observed values, we can plot these with bar graphs.
In [31]:
for c in ['survived', 'pclass', 'sex', 'sibsp', 'parch', 'embark_town']:
    dist = to_cluster.groupby(['cluster', c])['age'].count() / to_cluster.groupby('cluster')['age'].count()
    dist.unstack().plot(kind='bar', figsize=(12,6))
    plt.title('distribution over %s among clusters' % c)
    plt.ylabel('probability')
    plt.xticks(range(K), range(K))
Each of the clusters has its own distribution over values, as shown by the plots.
Let's now plot the columns we've modeled as normal inverse-chi-squared, age and fare, with a kernel density estimate.
Recall that we've transformed fare into $\ln(\text{fare})$.
In [32]:
for c in ['age', 'fare']:
    plt.figure(figsize=(12,9))
    for k in clusters:
        sns.kdeplot(to_cluster.loc[to_cluster['cluster'] == k, c], legend=False)
    label = 'log fare' if c == 'fare' else c
    plt.title('KDE of %s among titanic passenger clusters' % label)
    plt.ylabel('probability')
    plt.legend(range(K))
    plt.xlabel(label)