Topic models are a family of algorithms for discovering the structure underlying a collection of documents. Given the enormous masses of documents of all kinds we now have access to, such tools have become a necessity. The main deterministic algorithm in use is Non-negative Matrix Factorization, which we cover here.
Our task is to find two matrices, $\mathbf{P}$ (of size $N \times K$) and $\mathbf{Q}$ (of size $M \times K$), whose product approximates $\mathbf{R}$ (here an $N \times M$ term-document matrix): $$\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}$$
This way, each row of $\mathbf{P}$ represents the strength of association between a term and the latent features (topics). Similarly, each row of $\mathbf{Q}$ represents the strength of association between a document and the topics.
To estimate the value that relates a term $t_i$ to a document $d_j$, we take the dot product of the two vectors corresponding to $t_i$ and $d_j$: $$ \hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{K}p_{ik}q_{kj} $$
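As a quick illustration (a hypothetical toy example, not part of the pipeline below), predicting one entry of $\hat{\mathbf{R}}$ is just the dot product of the corresponding rows of $\mathbf{P}$ and $\mathbf{Q}$:

import numpy

# hypothetical factors: 4 terms, 3 documents, K = 2 topics
P = numpy.array([[0.9, 0.1],   # term 0 loads mostly on topic 0
                 [0.8, 0.2],
                 [0.1, 0.9],
                 [0.2, 0.8]])
Q = numpy.array([[1.0, 0.0],   # document 0 is mostly about topic 0
                 [0.5, 0.5],
                 [0.0, 1.0]])

i, j = 0, 0
r_hat_ij = numpy.dot(P[i, :], Q[j, :])  # estimated strength of term i in document j
print(r_hat_ij)                         # 0.9
R_hat = numpy.dot(P, Q.T)               # full reconstruction, shape (4, 3)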
Now we need a way to obtain $\mathbf{P}$ and $\mathbf{Q}$. One approach is to: i) initialize both matrices with some values; ii) compute the 'difference' between their product and $\mathbf{R}$; iii) iteratively try to minimize this difference.
This method is commonly called gradient descent. The difference from step ii (the error between the estimated and the actual value) can be computed as follows: $$e_{ij}^2=(r_{ij}-\hat{r}_{ij})^2 = \left(r_{ij} - \sum_{k=1}^{K}{p_{ik}q_{kj}}\right)^2$$
We use the squared error because the actual value can be either above or below the estimate.
To minimize the error, we need to know in which direction to change the values of $p_{ik}$ and $q_{kj}$; that is, we need the gradient at the current values.
So we differentiate the equation above with respect to each variable separately:
$$\frac{\partial}{\partial p_{ik}}e^2_{ij} = -2(r_{ij}-\hat{r}_{ij})(q_{kj}) = -2e_{ij}q_{kj}$$
$$\frac{\partial}{\partial q_{kj}}e^2_{ij} = -2(r_{ij}-\hat{r}_{ij})(p_{ik}) = -2e_{ij}p_{ik}$$
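As a sanity check on these derivatives (a hypothetical snippet, not part of the original notebook), we can compare the analytic gradient with a central finite-difference approximation for a single entry:

import numpy

# hypothetical single entry r_ij and current factor rows p_i, q_j (K = 2)
r_ij = 4.0
p_i = numpy.array([0.5, 1.0])
q_j = numpy.array([1.5, 0.5])

def squared_error(p, q):
    # e_ij^2 = (r_ij - sum_k p_ik * q_kj)^2
    return (r_ij - numpy.dot(p, q)) ** 2

e_ij = r_ij - numpy.dot(p_i, q_j)
analytic = -2 * e_ij * q_j  # d(e_ij^2)/dp_ik = -2 * e_ij * q_kj
h = 1e-6
numeric = numpy.array([
    (squared_error(p_i + h * numpy.eye(2)[k], q_j) -
     squared_error(p_i - h * numpy.eye(2)[k], q_j)) / (2 * h)
    for k in range(2)])
print(numpy.allclose(analytic, numeric))  # True: the two gradients agree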
With this result we can formulate update rules for $p_{ik}$ and $q_{kj}$:
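Taking a small step of size $\alpha$ against each gradient gives:
$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}}e^2_{ij} = p_{ik} + 2\alpha\, e_{ij} q_{kj}$$
$$q'_{kj} = q_{kj} - \alpha \frac{\partial}{\partial q_{kj}}e^2_{ij} = q_{kj} + 2\alpha\, e_{ij} p_{ik}$$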
Here $\alpha$ is a constant that determines the rate of approach to the minimum. It is advisable to choose a small value: if it is too large, we risk oscillating around the minimum.
As a result, we obtain $\mathbf{P}$, a matrix of basis vectors whose columns are our topics, and $\mathbf{Q}$, which shows the affiliation of each document with the topics.
The algorithm presented is the basic way to factorize a matrix. There are many ways to make it more elaborate (and, in theory, better).
A common extension is to introduce a regularization factor to avoid overfitting. We can add a parameter $\beta$ and modify the squared error as follows: $$e_{ij}^2 = \left(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}}\right)^2 + \frac{\beta}{2} \sum_{k=1}^K{\left(p_{ik}^2 + q_{kj}^2\right)}$$
This new parameter $\beta$ controls the magnitudes of the factor vectors, so that $\mathbf{P}$ and $\mathbf{Q}$ give good approximations of $\mathbf{R}$ without having to contain large numbers. In practice, we set $\beta$ to around 0.02.
The new update rule then becomes:
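Consistent with the implementation below, the regularized updates are:
$$p'_{ik} = p_{ik} + \alpha\left(2\, e_{ij} q_{kj} - \beta\, p_{ik}\right)$$
$$q'_{kj} = q_{kj} + \alpha\left(2\, e_{ij} p_{ik} - \beta\, q_{kj}\right)$$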
In [48]:
import numpy
###############################################################################
def matrix_factorization(R, P, Q, K, steps=500, alpha=0.0002, beta=0.02):
    """
    @INPUT:
        R     : the matrix to be factorized, dim. N x M
        P     : an initial matrix of dim. N x K
        Q     : an initial matrix of dim. M x K
        K     : the number of latent features (here, topics)
        steps : the maximum number of iterations; stop after that
        alpha : the learning rate
        beta  : the regularization parameter
    @OUTPUT:
        the final matrices P and Q
    """
    Q = Q.T
    # NOTE: optimized MapReduce implementations exist
    for step in xrange(steps):
        # gradient step over every observed (non-zero) entry of R
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:], Q[:,j])
                    for k in xrange(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = numpy.dot(P, Q)
        # total regularized squared error, used as the stopping criterion
        e = 0
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:], Q[:,j]), 2)
                    for k in xrange(K):
                        e = e + (beta/2) * (pow(P[i][k], 2) + pow(Q[k][j], 2))
        if e < 0.001:
            break
    return P, Q.T
Shall we test it? ...
In [49]:
R = [
[5,3,0,1],
[4,0,0,1],
[1,1,0,5],
[1,0,0,4],
[0,1,5,4],
]
R = numpy.array(R)
N = len(R)
M = len(R[0])
K = 3
P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)
nP, nQ = matrix_factorization(R, P, Q, K)
In [50]:
nR = numpy.dot(nP, nQ.T)
print nR
In [68]:
d0 = """Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films."""
d1 = """Python is a genus of
nonvenomous pythons found in Africa and Asia. Currently, 7 species are
recognised. A member of this genus, P. reticulatus, is among the longest
snakes known. This snake is also very scary and I never want to see one up close.
Which is scariest, pythons, sharks or ninjas?"""
d2 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum". It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made."""
#d3 = """I am a truly small text"""
d3 = """The fossil record of snakes is relatively poor because snake skeletons
are typically small and fragile making fossilization uncommon. Fossils readily
identifiable as snakes (though often retaining hind limbs) first appear in the
fossil record during the Cretaceous period. The earliest known true snake
fossils (members of the crown group Serpentes) come from the marine simoliophiids,
the oldest of which is the Late Cretaceous (Cenomanian age) Haasiophis terrasanctus,
dated to between 112 and 94 million years old. Based on comparative anatomy, there
is consensus that snakes descended from lizards. Pythons and boas—primitive
groups among modern snakes—have vestigial hind limbs: tiny, clawed digits known as anal
spurs, which are used to grasp during mating. The families Leptotyphlopidae
and Typhlopidae also possess remnants of the pelvic girdle, appearing as horny projections
when visible."""
d4 = """The potato is a starchy, tuberous crop from the perennial nightshade
Solanum tuberosum L. The word "potato" may refer either to the plant itself
or to the edible tuber. In the Andes, where the species is indigenous, there
are some other closely related cultivated potato species. Potatoes were introduced
outside the Andes region approximately four centuries ago, and have since
become an integral part of much of the world's food supply. It is the world's
fourth-largest food crop, following maize, wheat, and rice."""
corpus1 = [d0,d1,d2,d3,d4]
Reusing the clean function:
In [84]:
def clean(doc):
    #doc = doc.lower()
    doc = doc.replace('.',' ')
    doc = doc.replace('-', ' ')
    doc = doc.replace(',',' ')
    doc = doc.replace('?',' ')
    doc = doc.replace(')',' ')
    doc = doc.replace('(',' ')
    doc = doc.replace(' the ',' ')
    doc = doc.replace(' of ',' ')
    doc = doc.replace(' is ',' ')
    doc = doc.replace(' and ',' ')
    doc = doc.replace(' a ',' ')
    doc = doc.replace(' on ',' ')
    doc = doc.replace(' by ',' ')
    doc = doc.replace(' in ',' ')
    doc = doc.replace(' for ',' ')
    doc = doc.replace(' it ',' ')
    doc = doc.replace(' as ',' ')
    for _ in range(10):
        # collapse runs of double spaces left by the replacements above
        doc = doc.replace('  ', ' ')
    return doc
In [88]:
from sklearn.feature_extraction import stop_words
def clean2(doc):
    for word in doc.split():
        if word in stop_words.ENGLISH_STOP_WORDS:
            doc = doc.replace(' ' + word + ' ', ' ')
    return doc
In [89]:
def make_matrix(corpus, clean_fn):
    matrix = {}
    for idx, doc in enumerate(corpus):
        doc = clean_fn(doc)
        for term in doc.split():
            if not matrix.get(term):
                matrix[term] = [0] * len(corpus)
            matrix[term][idx] = matrix[term][idx] + 1
    return matrix.keys(), matrix.values()
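As a quick illustration (a hypothetical toy corpus, not part of the original notebook), make_matrix returns the vocabulary and a term-by-document count matrix whose rows are terms and whose columns are documents:

toy = ["python snake python", "python revolver"]
names, counts = make_matrix(toy, lambda d: d)  # identity clean_fn, just for the example
print(names)   # e.g. ['python', 'snake', 'revolver'] (order depends on the dict)
print(counts)  # e.g. [[2, 1], [1, 0], [0, 1]]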
In [90]:
def run(corpus, clean_fn):
    feature_names, df_matrix = make_matrix(corpus, clean_fn)
    R = numpy.array(df_matrix)
    N = len(R)
    M = len(R[0])
    K = 3
    n_top_words = 20
    P = numpy.random.rand(N,K)
    Q = numpy.random.rand(M,K)
    print 'P:', P.shape
    print 'Q:', Q.shape
    print 'words:', len(feature_names)
    nP, nQ = matrix_factorization(R, P, Q, K)
    print 'components shape: ', nP.shape
    # each column of nP is a topic; list its strongest terms
    for idx, topic in enumerate(nP.T):
        print '\nTopic', idx
        tmp = topic.argsort()[:-n_top_words - 1:-1]
        for word_idx in tmp:
            print feature_names[word_idx],
        print '\n'
In [91]:
run(corpus1, clean)
In [93]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
corpus2 = dataset.data[:30]
run(corpus2, clean)