In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context('talk')
sns.set_style('darkgrid')
First, let's load the passenger list of the Titanic
In [2]:
titanic = sns.load_dataset("titanic")
In [3]:
titanic.head(n=10)
Out[3]:
One of the categorical variables in this dataset is embark_town
Let's plot the number of passengers departing from each town
In [4]:
# size() counts every passenger per town; count() on 'age' would skip passengers with a missing age
ax = titanic.groupby('embark_town').size().plot(kind='bar')
plt.xticks(rotation=0)
plt.xlabel('Departure Town')
plt.ylabel('Passengers')
plt.title('Number of Passengers by Town of Departure')
Out[4]:
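The counting method matters when a column has missing values. As a small sketch on a made-up frame (the `toy` DataFrame and its values are illustrative and not taken from the Titanic data), `count()` on a column skips rows where that column is null, while `size()` counts every row in the group:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Cherbourg passenger has no recorded age
toy = pd.DataFrame({
    'embark_town': ['Cherbourg', 'Cherbourg', 'Southampton'],
    'age': [22.0, np.nan, 35.0],
})

# count() ignores rows whose 'age' is missing
non_null_ages = toy.groupby('embark_town')['age'].count()

# size() counts every row in each group, whether or not 'age' is missing
passengers = toy.groupby('embark_town').size()

print(non_null_ages)
print(passengers)
```

If the two results differ for a town, some passengers from that town have no recorded age.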
Let's look at another example: the cars93 dataset
In [5]:
cars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Cars93.csv', index_col=0)
In [6]:
cars.head()
Out[6]:
In [7]:
cars.loc[1]
Out[7]:
This dataset has multiple categorical variables
Based on the description of the cars93 dataset, we'll consider Manufacturer and DriveTrain to be categorical variables
Let's plot Manufacturer and DriveTrain
In [8]:
cars.groupby('Manufacturer')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Manufacturer')
Out[8]:
In [9]:
cars.groupby('DriveTrain')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Drive Train')
Out[9]:
If our categorical data has labels, we need to convert them to integer IDs
In [10]:
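def col_2_ids(df, col):
    # Give each distinct label an integer id, in sorted label order
    ids = df[col].drop_duplicates().sort_values().reset_index(drop=True)
    ids.index.name = '%s_ids' % col
    ids = ids.reset_index()
    # Attach the id column by joining on the label, then drop the label column
    df = pd.merge(df, ids, how='left', on=col)
    del df[col]
    return df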
In [11]:
cat_columns = ['Manufacturer', 'DriveTrain']
for c in cat_columns:
    print(c)
    cars = col_2_ids(cars, c)
In [12]:
cars[['%s_ids' % c for c in cat_columns]].head()
Out[12]:
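As a cross-check, pandas can produce the same kind of integer codes directly. The sketch below uses a made-up `labels` Series (not the Cars93 data) and assumes, as `col_2_ids` does, that the IDs should follow the sorted order of the distinct labels:

```python
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series(['Front', 'Rear', '4WD', 'Front'])

# Converting to a categorical sorts the distinct labels,
# so the integer codes follow sorted label order
ids = labels.astype('category').cat.codes

print(ids.tolist())
```

Here '4WD' sorts first, so it gets ID 0, followed by 'Front' and 'Rear'.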
Just as we model binary data with the beta Bernoulli distribution, we can model categorical data with the Dirichlet discrete distribution
The beta Bernoulli distribution allows us to learn the underlying probability, $\theta$, of the binary random variable, $x$
$$P(x=1) = \theta$$
$$P(x=0) = 1-\theta$$
The Dirichlet discrete distribution extends the beta Bernoulli distribution to the case in which $x$ can assume more than two states
$$\forall i \in \{0, 1, \ldots, n\} \hspace{2mm} P(x = i) = \theta_i$$
$$\sum_{i=0}^n \theta_i = 1$$
Again, the Dirichlet discrete distribution takes advantage of the fact that the Dirichlet distribution and the discrete distribution are conjugate. Note that the discrete distribution is sometimes called the categorical distribution or, more loosely, the multinomial distribution (strictly, it is a multinomial with a single trial).
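Conjugacy means the posterior has the same form as the prior: given a Dirichlet prior with parameters $\alpha_i$ and observed counts $c_i$ for each category, the posterior over $\theta$ is Dirichlet with parameters $\alpha_i + c_i$. A minimal NumPy sketch of that update, assuming a symmetric prior with every $\alpha_i = 1$ and made-up observations (this is not how the microscopes library is called):

```python
import numpy as np

# Hypothetical observations of a variable with 3 categories
x = np.array([0, 2, 2, 1, 2, 0])
n_categories = 3

# Symmetric Dirichlet prior (an assumption made for this sketch)
alpha_prior = np.ones(n_categories)

# Number of times each category was observed
counts = np.bincount(x, minlength=n_categories)

# Conjugate update: add the counts to the prior parameters
alpha_post = alpha_prior + counts

# Posterior mean estimate of each theta_i; the estimates sum to 1
theta_mean = alpha_post / alpha_post.sum()

print(theta_mean)
```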
To import the Dirichlet discrete distribution, call:
In [13]:
from microscopes.models import dd as dirichlet_discrete
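Then, depending on the model we want, we import its definition, where model_name stands for the chosen model's module (for example irm, mixture, or lda):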
from microscopes.model_name.definition import model_definition
NOTE: You must specify the number of categories in your Dirichlet discrete distribution
For 5 categories, for example, you must specify the likelihood as:
In [14]:
dd5 = dirichlet_discrete(5)
You can then use the model definition as appropriate for your desired model:
In [15]:
from microscopes.irm.definition import model_definition as irm_definition
from microscopes.mixture.definition import model_definition as mm_definition
from microscopes.lda.definition import model_definition as hdp_definition