Modeling with Mixed Datatypes using the Titanic Dataset


The RMS Titanic is a well-known passenger ship that sank in 1912. The ship's passenger list is a popular dataset of demographic information.

While the dataset is most often used to predict a passenger's likelihood of surviving the sinking from demographic information, we will instead use it to demonstrate how to work with mixed datatypes and combine likelihoods.

Let's set up our environment and load our data.


In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from microscopes.models import dd as dirichlet_discrete
from microscopes.models import bb as beta_bernoulli
from microscopes.models import gp as gamma_poisson
from microscopes.models import nich as normal_inverse_chisquared
from sklearn.preprocessing import LabelEncoder as Encoder
from microscopes.mixture.definition import model_definition
from microscopes.common.rng import rng
from microscopes.common.recarray.dataview import numpy_dataview
from microscopes.mixture import model, runner, query
from microscopes.kernels import parallel
from microscopes.common.query import groups, zmatrix_heuristic_block_ordering, zmatrix_reorder
sns.set_context('talk')


%matplotlib inline

In [20]:
ti = sns.load_dataset('titanic')

Some of the columns in the data are redundant, while others have missing values. Let's ignore the columns with lots of missing data and focus on those with key information. We'll also drop rows with missing data.


In [21]:
unique_cols = ['survived','pclass','sex','age','sibsp','parch','embark_town','fare']
to_cluster = ti[unique_cols].dropna()
print('%d passengers in dataset' % len(to_cluster.index))


712 passengers in dataset

For this example, we'll be using a Dirichlet Process Mixture Model to cluster the passengers.

In this mixture model, each column has its own distribution over values conditioned on the cluster assignment. We specify that distribution by selecting which likelihood we'd like to use for each column.
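
To make the generative story concrete, here is a minimal sketch in plain numpy (not the microscopes API): cluster assignments come from a Chinese restaurant process, and each cluster draws its own parameter for every column. The alpha value and the two example likelihoods here are illustrative assumptions, not the priors microscopes actually uses.

def crp_mixture_sketch(n, alpha=1.0, seed=0):
    # illustrative only -- microscopes implements this machinery internally
    rs = np.random.RandomState(seed)
    assignments, counts = [], []
    for i in range(n):
        # seat row i at existing cluster k w.p. counts[k]/(i + alpha),
        # or at a brand-new cluster w.p. alpha/(i + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        z = rs.choice(len(probs), p=probs)
        if z == len(counts):
            counts.append(0)
        counts[z] += 1
        assignments.append(z)
    # each cluster draws its own parameter per column, e.g. a survival
    # probability (bernoulli) and a mean age (normal)
    params = [(rs.beta(1, 1), rs.normal(30, 10)) for _ in counts]
    rows = [(rs.rand() < params[z][0], rs.normal(params[z][1], 5.0))
            for z in assignments]
    return assignments, rows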


In [22]:
to_cluster.head()


Out[22]:
survived pclass sex age sibsp parch embark_town fare
0 0 3 male 22 1 0 Southampton 7.2500
1 1 1 female 38 1 0 Cherbourg 71.2833
2 1 3 female 26 0 0 Southampton 7.9250
3 1 1 female 35 1 0 Southampton 53.1000
4 0 3 male 35 0 0 Southampton 8.0500

For modeling, we must encode all categorical data as integer values. To do this, we'll use scikit-learn's LabelEncoder.


In [23]:
en1 = Encoder()
to_cluster['sex'] = en1.fit_transform(to_cluster['sex'])
en2 = Encoder()
to_cluster['embark_town'] = en2.fit_transform(to_cluster['embark_town'])
to_cluster.head()


Out[23]:
survived pclass sex age sibsp parch embark_town fare
0 0 3 1 22 1 0 2 7.2500
1 1 1 0 38 1 0 0 71.2833
2 1 3 0 26 0 0 2 7.9250
3 1 1 0 35 1 0 2 53.1000
4 0 3 1 35 0 0 2 8.0500
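
The encoders remember the original labels, so we can always map the integer codes back to town names when interpreting clusters later. For example:

print(en2.classes_)                      # town names, indexed by their integer code
print(en2.inverse_transform([0, 1, 2]))  # integer codes back to town names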

To decide which likelihood model to use for each column, let's get some basic statistics on our data.


In [24]:
to_cluster.describe()


Out[24]:
survived pclass sex age sibsp parch embark_town fare
count 712.000000 712.000000 712.000000 712.000000 712.000000 712.000000 712.000000 712.000000
mean 0.404494 2.240169 0.636236 29.642093 0.514045 0.432584 1.595506 34.567251
std 0.491139 0.836854 0.481420 14.492933 0.930692 0.854181 0.779038 52.938648
min 0.000000 1.000000 0.000000 0.420000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 0.000000 20.000000 0.000000 0.000000 2.000000 8.050000
50% 0.000000 2.000000 1.000000 28.000000 0.000000 0.000000 2.000000 15.645850
75% 1.000000 3.000000 1.000000 38.000000 1.000000 1.000000 2.000000 33.000000
max 1.000000 3.000000 1.000000 80.000000 5.000000 6.000000 2.000000 512.329200

The table above gives us a good idea of which likelihood to use for each column:

  • survived, which indicates whether the passenger survived the sinking, is a binary column, which we'll model with a beta-bernoulli distribution
  • pclass, the passenger's class on the ship, ranges from 1 to 3. We'll model this ordinal column with a gamma-poisson distribution
  • sex is another binary column we can model as beta-bernoulli
  • age is real valued, so we'll model this column as normally distributed
  • sibsp, the number of siblings/spouses aboard, ranges from 0 to 5. Since this data is integer valued with a limited range, we'll also model it as gamma-poisson
  • parch, the number of parents/children aboard, has similar characteristics to sibsp, so it'll also be modeled as gamma-poisson
  • embark_town indicates where the passenger boarded the ship. Since this is categorical data, we'll model it with a dirichlet-discrete distribution
  • fare is also real valued, so we could model it as normal. However, the data is long tailed, so we'll take the log of fare instead (a quick check below confirms the skew)
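
Before transforming fare, we can quickly confirm the long right tail. This check isn't part of the original analysis, but pandas' skew() makes it a one-liner; a value well above 1 indicates strong right skew.

print('fare skew: %.2f' % to_cluster['fare'].skew())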

We have two normal likelihood models available: the normal inverse-Wishart, which models real-valued columns jointly along with their correlations, and the normal inverse-chi-squared, which treats each column independently. To determine which we should use, let's find the correlation between age and log fare.


In [25]:
# note: describe() above shows a minimum fare of 0, and np.log(0) is -inf;
# in practice you may want to drop or clip zero fares before this transform
to_cluster['fare'] = np.log(to_cluster['fare'])

to_cluster[['age','fare']].corr()


Out[25]:
age fare
age 1.000000 0.130462
fare 0.130462 1.000000

Since the correlation between age and log fare is near zero, we can model each of these columns independently with a normal inverse-chi-squared distribution.

We'll now define our model using the model_definition function and run a Gibbs sampler for 30,000 iterations to learn the latent clusters.

Note that for the dirichlet-discrete likelihood we must specify the number of categories in the data, in this case 3.
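
As a quick sanity check (not part of the original analysis), we can confirm that the encoded embark_town column really has the 3 categories we promise to dirichlet_discrete:

assert to_cluster['embark_town'].nunique() == 3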


In [26]:
nchains = 5
iters = 30000
splits = 7

defn = model_definition(
    to_cluster.shape[0],
    [beta_bernoulli,             # survived
     gamma_poisson,              # pclass
     beta_bernoulli,             # sex
     normal_inverse_chisquared,  # age
     gamma_poisson,              # sibsp
     gamma_poisson,              # parch
     dirichlet_discrete(3),      # embark_town
     normal_inverse_chisquared]) # fare (log-transformed)

prng = rng()

# pack the dataframe into a numpy recarray for microscopes
data_rec = np.array([(list(to_cluster.loc[c]),) for c in to_cluster.index],
                    dtype=[('', np.float32, to_cluster.shape[1])])
view = numpy_dataview(data_rec)
latents = [model.initialize(defn, view, prng) for _ in range(nchains)]
runners = [runner.runner(defn, view, latent, kernel_config=['assign']) for latent in latents]

for i, rnr in enumerate(runners):
    print('starting runner %d at %d iterations' % (i, iters))
    rnr.run(r=prng, niters=iters)
    print('runner %d done' % i)

infers = [r.get_latent() for r in runners]


starting runner 0 at 30000 iterations
runner 0 done
starting runner 1 at 30000 iterations
runner 1 done
starting runner 2 at 30000 iterations
runner 2 done
starting runner 3 at 30000 iterations
runner 3 done
starting runner 4 at 30000 iterations
runner 4 done
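
Before visualizing, it helps to know what the z-matrix represents: entry (i, j) is the fraction of posterior samples in which rows i and j are assigned to the same cluster. A minimal sketch of that idea, assuming this co-assignment convention (illustrative only; we use the library's query.zmatrix below):

def zmatrix_sketch(assignment_samples):
    # assignment_samples: list of per-row cluster-label arrays, one per sample
    n = len(assignment_samples[0])
    z = np.zeros((n, n))
    for a in assignment_samples:
        a = np.asarray(a)
        z += (a[:, None] == a[None, :])  # 1 where rows i and j share a cluster
    return z / len(assignment_samples)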

In [27]:
zmat = query.zmatrix(infers)
zmat = zmatrix_reorder(zmat, zmatrix_heuristic_block_ordering(zmat))
f, ax = plt.subplots(figsize=(12, 9))
labels = ['%d' % i if i % (zmat.shape[0] // splits) == 0 else '' for i in range(zmat.shape[0])]
sns.heatmap(zmat, linewidths=0, square=True, ax=ax, xticklabels=labels, yticklabels=labels)
plt.title('Z-Matrix of Titanic Dataset\nwith %d Chains at %d Iterations' % (nchains, iters))
plt.xlabel('people (sorted)')
plt.ylabel('people (sorted)')


Out[27]:
<matplotlib.text.Text at 0x11fbe4a10>

In [28]:
# use the cluster assignments from the first chain
assignments = infers[0].assignments()

In [29]:
clusters = list(set(assignments))
K = len(clusters)

'%d clusters inferred' % K


Out[29]:
'7 clusters inferred'

In [30]:
to_cluster['cluster'] = assignments
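
Before plotting, an optional quick look at how the 712 passengers are spread across the 7 inferred clusters:

print(to_cluster['cluster'].value_counts().sort_index())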

Now that we have our cluster assignments, let's plot the posterior distributions over each column in the data.

Since survived, pclass, sex, sibsp, parch, and embark_town take on only a few distinct values, we can plot their distributions with bar graphs.


In [31]:
for c in ['survived', 'pclass', 'sex', 'sibsp', 'parch', 'embark_town']:
    # empirical distribution over values of c within each cluster
    dist = to_cluster.groupby(['cluster', c])['age'].count() / to_cluster.groupby('cluster')['age'].count()
    dist.unstack().plot(kind='bar', figsize=(12,6))
    plt.title('distribution over %s among clusters' % c)
    plt.ylabel('probability')
    plt.xticks(range(K), range(K))


As the plots show, each cluster has its own distribution over values.

Let's now plot the columns we've modeled as normal inverse-chi-squared, age and fare, using a kernel density estimate.

Recall that we've transformed fare into $\ln(\text{fare})$.


In [32]:
for c in ['age', 'fare']:
    plt.figure(figsize=(12, 9))
    for k in clusters:
        # one KDE per cluster for this column
        sns.kdeplot(to_cluster.loc[to_cluster['cluster'] == k, c], legend=False)
    if c == 'fare':
        c = 'log fare'
    plt.title('KDE of %s among Titanic passenger clusters' % c)
    plt.ylabel('density')
    plt.legend(range(K))
    plt.xlabel(c)