If you are interested in the theory section of this post, learning the theory behind Naive Bayes first is necessary to understand this topic. There are two reasons for this:
Latent Dirichlet Allocation is a topic model. A topic model simply allows us to find abstract topics that occur in a collection of documents.
Latent Dirichlet Allocation (LDA in the rest of this post) allows us to group observations into unobserved (latent) groups.
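To make that idea concrete, here is a minimal sketch of topic modeling in Python using the gensim library. This is an aside on my part: the modeling later in this post is done in R, and the toy corpus, the choice of two topics, and the variable names here are all made up for illustration.
In [ ]:
# A hypothetical toy corpus: three tiny "documents", already tokenized
from gensim import corpora, models

texts = [["dog", "cat", "leash", "dog"],
         ["cat", "dog", "bone"],
         ["stock", "market", "bond", "market"]]

dictionary = corpora.Dictionary(texts)                  # map each word to an id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words counts

# Fit LDA with two latent topics; we never label the topics ourselves
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
lda.print_topics()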
In [10]:
# Load the rpy2 extension so that R code can be run in %%R cells
%load_ext rpy2.ipython
In [12]:
%%R
# Assign a value in the R session as a sanity check that the magic works
y = 2
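As a quick check that values can cross the language boundary back into Python, rpy2's %Rpull line magic fetches an R variable into the Python namespace. A minimal sketch:
In [ ]:
# Pull the R variable y into Python (it arrives as an array)
%Rpull y
y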
In [6]:
!pip install -U rpy2
In [5]:
import rpy2
rpy2.__version__
Out[5]:
Scikit-learn does not implement LDA, which is why the modeling in this post is done in R via rpy2. This is based on the paper by McAuley & Leskovec (2013).
A given topic will show up more often in one document than in another, and the Dirichlet distribution lets us model these varying proportions. It is parameterized with the number of categories, $K$, and a vector of concentration parameters, $\boldsymbol{\alpha}$.
It is a distribution over multinomials: each draw is a vector of $K$ non-negative proportions that sums to one.
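To see this concretely, here is a quick sketch of a single Dirichlet draw (the concentration values are arbitrary):
In [ ]:
import numpy as np

# One draw from a Dirichlet with K = 3 categories
draw = np.random.dirichlet((10, 5, 3))
print(draw, draw.sum())  # the components are proportions that sum to 1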
To make the Beta distribution extra confusing, the two parameters, $\alpha$ and $\beta$, are both abstract.
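One way to build intuition for them is simulation. A quick sketch (the values $\alpha = 2$, $\beta = 5$ are arbitrary) shows the sample mean landing near $\alpha / (\alpha + \beta)$:
In [ ]:
import numpy as np

# Draw many samples from Beta(2, 5); the mean should be near 2 / (2 + 5) ~ 0.286
samples = np.random.beta(2, 5, size=100000)
print(samples.mean())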
$$p(\theta \mid X) \propto p(X \mid \theta) \, \underbrace{p(\theta)}_{(1)}$$
Where (1) is the prior probability over the proportion $\theta$; the Beta distribution is the standard choice for this prior.
Proportions come from a finite number of Bernoulli trials, so they are not continuous. Because of this, the Beta distribution is not really appropriate here.
The normal distribution works here because of the Central Limit Theorem (CLT): the proportion of successes over many Bernoulli trials is approximately normally distributed. That is an interesting point.
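A quick sketch of that point (the choices of $n$ and $p$ are arbitrary): simulate many Bernoulli proportions and overlay the normal curve the CLT predicts.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Proportion of successes in n = 100 Bernoulli(p = 0.3) trials, repeated 10,000 times
n, p = 100, 0.3
proportions = np.random.binomial(n, p, size=10000) / n

# The histogram should look approximately Normal(p, sqrt(p * (1 - p) / n))
plt.hist(proportions, bins=30, density=True)
x = np.linspace(0.1, 0.5, 200)
_ = plt.plot(x, stats.norm.pdf(x, loc=p, scale=np.sqrt(p * (1 - p) / n)))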
In [26]:
%matplotlib inline
In [31]:
import numpy as np
import matplotlib.pyplot as plt

# Draw 50 samples from a Dirichlet with concentration parameters (10, 5, 3);
# after the transpose, s[i] holds the i-th proportion across all 50 samples
s = np.random.dirichlet((10, 5, 3), 50).transpose()

# Stack the three proportions as horizontal bars; each bar sums to 1
plt.barh(range(50), s[0])
plt.barh(range(50), s[1], left=s[0], color='g')
plt.barh(range(50), s[2], left=s[0]+s[1], color='r')
_ = plt.title("Lengths of Strings")
In [32]:
[i.mean() for i in s]  # Mean of each of the lengths of the string
Out[32]:
In [33]:
[i / 18.0 for i in [10, 5, 3]]  # Predicted lengths: each alpha over the total, 18 = 10 + 5 + 3
Out[33]:
So essentially the $\boldsymbol{\alpha}$ values give you the approximate ratio for each dimension: the expected value of component $i$ is $\alpha_i / \sum_k \alpha_k$, which is exactly what the cell above computes.