Topic Models

Topic models são a suite de algoritmos que permitem descobrir a estrutura subjacente a uma coleção de documentos. Tendo em vista o atual acesso a massas enormes de documentos de todos os tipos, torna-se necessário o uso de ferramentas. Oprincipal algoritmo determinístico usado é Non-negative Matrix Factorization, que veremos aqui.

NMF

Nossa tarefa é achar duas matrizes: $\mathbf{P}$ (de tamanho $|U| \times K$) e $\mathbf{Q}$ (de tamanho $|D| \times K$ matrix) tal que seu produto se aproxime de $\mathbf{R}$: $$\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}$$

Desta forma, cada fileira de $\mathbf{P}$ representa a força de associação entre um termo e as features (tópicos). Similarmente, cada fileira de $\mathbf{Q}$ representa a força de associação entre uma palavra e as features.

Para ver o valor associado a um item $d_j$ por $u_i$ podemos calcular o produto escalar dos vetores correspondentes a $u_i$ e $d_j$: $$ \hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{k}p_{ik}q_{kj} $$

Agora precisamos de uma forma para obter $\mathbf{P}$ e $\mathbf{Q}$. Uma das formas é: i) inicializar ambas as matrizes com alguns valores; ii) calcular a 'diferença' dos seus produtos a $\mathbf{R}$; iii) tentar minimizar esta diferença em iterações

Este método é comumente chamado de gradient descent, a diferença (do passo ii, chamada de erro entre o estimado e o real, pode ser calculada da seguinte forma: $$e_{ij}^2=(r_{ij}-\hat{r}_{ij})^2 = (r_{ij} - \sum_{k=1}^k{p_{ik}q_{kj}})^2$$

Vale considerar o erro quadrático pois o valor real pode estar acima ou abaixo do estimado.

Para minimizar o erro, temos que saber em qual direção precisamos modificar os valores de $p_{ik}$ e $q_{kj}$. Isto é, precisamos saber o gradiente nos valores atuais.

Assim, diferenciamos a equação acima com respeito a ambas as variáveis separadamente:


$$\frac{\partial}{\partial p_{ik}}e^2_{ij} = -2(r_{ij}-\hat{r}_{ij})(q_{kj}) = -2e_{ij}q_{kj}$$


$$\frac{\partial}{\partial q_{kj}}e^2_{ij} = -2(r_{ij}-\hat{r}_{ij})(p_{ik}) = -2e_{ij}p_{ik}$$



Com este resultado podemos formular regras de iteração para $p_{ik}$ e $q_{kj}$

$$p'_{ik} = p_{ik} + \alpha\frac{\partial}{\partial p_{ik}}e^2_{ij} = p_{ik} + 2\alpha e_{ij}q_{kj}$$
$$q'_{kj} = q_{kj} + \alpha\frac{\partial}{\partial q_{kj}}e^2_{ij} = q_{kj} + 2\alpha e_{ij}p_{ik}$$


Alfa aqui é um constante que determina a taxa de aproximção do mínimo. É aconselhável escolher um valor pequeno pois, se for grande demais, arriscamos ficar oscilando em volta do ponto mínimo.



Assim, como resultado, temos $\mathbf{P}$ de vetores-base, cujas colunas são nossos tópicos e $\mathbf{Q}$ que mostra as filiações dos documentos aos tópicos.

Regularização e Outras Estratégias

O algoritmo apresentado é o básico para se fatorizar uma matriz. Há muitos métodos para comlicar (e em tese melhorar).

Uma extensão comum é introduzir um fator de regularização para evitar overfitting. Podemos adicionar um parâmetro $\beta$ e modificar o cálculo do erro quadrático, assim: $$e_{ij}^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2 + \frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}$$

Desta forma o novo parâmetro $\beta$ é usado para controlar as magnitudes dos vetores, de tal forma que $\mathbf{P}$ e $\mathbf{Q}$ possam resultar em boas aproximações de $\mathbf{R}$ sem precisarem conter números grandes. Na prática colocamos $\beta$ em torno de 0.02

A nova regra de iteração fica então:

$$ p'_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}}e_{ij}^2 = p_{ik} + \alpha(2 e_{ij} q_{kj} - \beta p_{ik} ) $$
$$ q'_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}}e_{ij}^2 = q_{kj} + \alpha(2 e_{ij} p_{ik} - \beta q_{kj} ) $$

Implementando


In [48]:
import numpy
###############################################################################


def matrix_factorization(R, P, Q, K, steps=500, alpha=0.0002, beta=0.02):
    """
    @INPUT:
        R     : a matriz a ser fatorizada, dim. N x M
        P     : uma matriz inicial de dim. N x K
        Q     : uma matriz inicial de dim. M x K
        K     : o número de features latentes (neste caso tópicos)
        steps : o número máximo de iterações a maximizar, depois para
        alpha : a taxa de aprendizado
        beta  : parametro de regularização
    @OUTPUT:
        as matrizes finais P and Q
    """
    Q = Q.T 
    
    # OBS: existem implementações MapReduce otimizadas
    for step in xrange(steps):
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                
                if R[i][j] > 0:
                    
                    eij = R[i][j] - numpy.dot(P[i,:], Q[:,j])
                    
                    for k in xrange(K):
                
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        
        eR = numpy.dot(P,Q)
        e = 0
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in xrange(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001:
            break
    return P, Q.T



Vamos testar? ...


In [49]:
R = [
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ]

R = numpy.array(R)

N = len(R)
M = len(R[0])
K = 3

P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)

nP, nQ = matrix_factorization(R, P, Q, K)

In [50]:
nR = numpy.dot(nP, nQ.T)

print nR


[[ 2.77187556  1.83057671  1.4769508   2.80751545]
 [ 2.43195422  1.4715061   1.42761386  2.60784541]
 [ 2.47573409  1.56187622  1.39076158  2.58614644]
 [ 2.02323458  1.28749112  1.12571209  2.10155449]
 [ 2.83665738  1.598853    1.78026805  3.16811296]]

In [68]:
d0 = """Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
 California and Malibu, California. Python was followed by two sequels: Python
 II (2002) and Boa vs. Python (2004), both also made-for-TV films."""

d1 = """Python is a genus of
nonvenomous pythons found in Africa and Asia. Currently, 7 species are
recognised. A member of this genus, P. reticulatus, is among the longest
snakes known. This snake is also very scary and I never want to see one up close. 
Which is scariest, pythons, sharks or ninjas?"""

d2 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum". It was first introduced
in 1955, the same year as Smith &amp; Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made."""

#d3 = """I am a truly small text"""
d3 = """The fossil record of snakes is relatively poor because snake skeletons 
are typically small and fragile making fossilization uncommon. Fossils readily 
identifiable as snakes (though often retaining hind limbs) first appear in the 
fossil record during the Cretaceous period. The earliest known true snake 
fossils (members of the crown group Serpentes) come from the marine simoliophiids, 
the oldest of which is the Late Cretaceous (Cenomanian age) Haasiophis terrasanctus,
dated to between 112 and 94 million years old. Based on comparative anatomy, there 
is consensus that snakes descended from lizards. Pythons and boas—primitive 
groups among modern snakes—have vestigial hind limbs: tiny, clawed digits known as anal 
spurs, which are used to grasp during mating. The families Leptotyphlopidae 
and Typhlopidae also possess remnants of the pelvic girdle, appearing as horny projections 
when visible."""

d4 = """The potato is a starchy, tuberous crop from the perennial nightshade 
Solanum tuberosum L. The word "potato" may refer either to the plant itself 
or to the edible tuber. In the Andes, where the species is indigenous, there
are some other closely related cultivated potato species. Potatoes were introduced
outside the Andes region approximately four centuries ago, and have since 
become an integral part of much of the world's food supply. It is the world's 
fourth-largest food crop, following maize, wheat, and rice."""

corpus1 = [d0,d1,d2,d3,d4]



Reusando a função clean:


In [84]:
def clean(doc):
    #doc = doc.lower()
    doc = doc.replace('.',' ')
    doc = doc.replace('-', ' ')
    doc = doc.replace(',',' ')
    doc = doc.replace('?',' ')
    doc = doc.replace(')',' ')
    doc = doc.replace('(',' ')
    
    doc = doc.replace(' the ',' ')
    doc = doc.replace(' of ',' ')
    doc = doc.replace(' is ',' ')
    doc = doc.replace(' and ',' ')
    doc = doc.replace(' a ',' ')
    doc = doc.replace(' on ',' ')
    doc = doc.replace(' by ',' ')
    doc = doc.replace(' in ',' ')
    doc = doc.replace(' for ',' ')
    doc = doc.replace(' it ',' ')
    doc = doc.replace(' as ',' ')
    
    for _ in range(10):
        doc = doc.replace('  ',' ')
    return doc

In [88]:
from sklearn.feature_extraction import stop_words

def clean2(doc):
    for word in doc.split():
        if word in stop_words.ENGLISH_STOP_WORDS:
            doc = doc.replace(' '+word+' ',' ')
    return doc

In [89]:
def make_matrix(corpus, clean_fn):
    matrix = {}
    for idx,doc in enumerate(corpus):
        
        doc = clean_fn(doc)
        
        for term in doc.split():
            if not matrix.get(term):
                matrix[term] = [0] * len(corpus)
            matrix[term][idx] = matrix[term][idx] + 1
    return matrix.keys(), matrix.values()

In [90]:
def run(corpus, clean_fn):
    feature_names, df_matrix = make_matrix(corpus, clean_fn)

    R = numpy.array(df_matrix)
    N = len(R)
    M = len(R[0])
    K = 3z
    n_top_words = 20

    P = numpy.random.rand(N,K)
    Q = numpy.random.rand(M,K)

    print 'P:',P.shape
    print 'Q:',Q.shape
    print 'palavras', len(feature_names)
    nP, nQ = matrix_factorization(R, P, Q, K)

    print 'components shape: ', nP.shape


    for idx,topic in enumerate(nP.T):
        print '\nTópico', idx

        tmp = topic.argsort()[:-n_top_words - 1:-1]
        for word_idx in tmp:
            print feature_names[word_idx],
        print '\n'

In [91]:
run(corpus1)


P: (283, 3)
Q: (5, 3)
palavras 283
components shape:  (283, 3)

Tópico 0
snakes record Colt's films perennial features II Cretaceous terrasanctus, concerns Magnum". visible. fossilization "Combat 13th. comparative movie families David old. 


Tópico 1
Python Cretaceous hind Andes, following spurs, Hogg, Leptotyphlopidae Dougherty town. close. films), Haasiophis reticulatus, anal Cooper, Renee The Pythons A 


Tópico 2
Python revolver snakes Wheaton, crop, closely Leroy tuberous years food descended member and Street poor Typhlopidae when python, lizards. grasp 


In [93]:
from sklearn.datasets import fetch_20newsgroups


dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))

corpus2 = dataset.data[:30]

run(corpus2, clean1)


P: (4997, 3)
Q: (30, 3)
palavras 4997
components shape:  (4997, 3)

Tópico 0
- The send 3D ray mail Gm : -- Graphics file A graphics message the objects image and server package 


Tópico 1
- Gm key : 3D server send Mormon door etc. I Graphics FTP He package nM session ------ government Torrey, 


Tópico 2
Gm * ? I creation key repeat keys He Rochester consecutive nM vs Mormon ====================== /specmark key. rl@gauss.technion.ac.il Clipper of: 


In [ ]: