Data analysis and machine learning in Python

Notebook author: Jakub Nowacki.

Text data

In this notebook we present an approach to analysing and vectorizing text data, and to machine learning, using scikit-learn and other Python packages. The analysis and vectorization steps are a prelude to machine learning.



In [33]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import os
import glob
import matplotlib as mpl

# Plot parameters
mpl.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (8,6)
mpl.rcParams['font.size'] = 12

Term-Document matrix

The Term-Document (TD) matrix is a matrix whose columns are the unique words and whose rows are the documents; each cell holds the number of times the given word occurs in the given document. For example (after Wikipedia), given two documents:

D1 = "I like databases"
D2 = "I hate databases"

the TD matrix can be written as

	I	like	hate	databases
D1	1	1	0	1
D2	1	0	1	1

Below is the same example implemented in Pandas; here D1 has an extra "Databases" appended, so one of its counts is 2:



In [34]:

    
sample = pd.DataFrame({
    'docs': ['D1', 'D2'],
    'lines': ['I like databases Databases', 'I hate databases']
})
sample









    Out[34]:

	docs	lines
0	D1	I like databases Databases
1	D2	I hate databases



In [35]:

    
sample['words'] = sample.lines.str.strip().str.lower().str.split(r'[\W_]+')
sample









    Out[35]:

	docs	lines	words
0	D1	I like databases Databases	[i, like, databases, databases]
1	D2	I hate databases	[i, hate, databases]



In [36]:

    
rows = list()
for row in sample[['docs', 'words']].iterrows():
    r = row[1]
    for word in r.words:
        rows.append((r.docs, word))

words = pd.DataFrame(rows, columns=['docs', 'word'])
words



In [37]:

    
# Count the (document, word) pairs; pairs that never occur get 0
words.pivot_table(index='docs', 
                  columns='word', 
                  aggfunc='size', 
                  fill_value=0)
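
As a cross-check, the same counts can be obtained without Pandas. The sketch below is only an illustration using the standard library; the names `docs`, `tokens`, `counts` and `td_matrix` are chosen here and are not part of the notebook.

```python
from collections import Counter
import re

docs = {
    'D1': 'I like databases Databases',
    'D2': 'I hate databases',
}

# Tokenize as in the cell above: lowercase, split on non-word characters.
tokens = {name: [t for t in re.split(r'[\W_]+', text.strip().lower()) if t]
          for name, text in docs.items()}

# One Counter per document; a word that is absent counts as 0.
counts = {name: Counter(words) for name, words in tokens.items()}

# Columns of the matrix: the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*tokens.values()))
td_matrix = {name: [counts[name][w] for w in vocabulary] for name in docs}

print(vocabulary)
for name, row in td_matrix.items():
    print(name, row)
```

Each row of `td_matrix` lines up with `vocabulary`, so the 2 in the D1 row belongs to "databases".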

Task

Using Pandas, compute the TD matrix for the books.
What can affect the result?



In [38]:

    
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_vectorizer









    Out[38]:





CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)



In [39]:

    
sample['lines']









    Out[39]:





0    I like databases Databases
1              I hate databases
Name: lines, dtype: object



In [40]:

    
X = count_vectorizer.fit_transform(sample['lines'])
X









    Out[40]:





<2x3 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>



In [41]:

    
print(X)









    



  (0, 0)	2
  (0, 2)	1
  (1, 1)	1
  (1, 0)	1



In [42]:

    
X.toarray()









    Out[42]:





array([[2, 0, 1],
       [1, 1, 0]], dtype=int64)



In [43]:

    
count_vectorizer.get_feature_names()









    Out[43]:





['databases', 'hate', 'like']



In [44]:

    
pd.DataFrame(X.toarray(), 
             columns=count_vectorizer.get_feature_names(), 
             index=sample['docs'])
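
Note that the vocabulary above has no entry for "i". This follows from the default `token_pattern` shown in the `CountVectorizer` repr, which only matches tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module; the helper `tokenize` is a name introduced here for illustration and only approximates what the vectorizer does internally.

```python
import re

# Default token pattern copied from the CountVectorizer repr above.
TOKEN_PATTERN = r'(?u)\b\w\w+\b'

def tokenize(text):
    # Lowercasing mirrors lowercase=True in the repr.
    return re.findall(TOKEN_PATTERN, text.lower())

tokens_d1 = tokenize('I like databases Databases')
tokens_d2 = tokenize('I hate databases')
print(tokens_d1)
print(tokens_d2)
```

Single-character tokens such as "I" never match, which is why the matrix has three columns rather than four.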

TF-IDF

The next technique is Term Frequency–Inverse Document Frequency (TF-IDF), introduced earlier. It is a popular weighting scheme for analysing text data and is used quite often in data mining.

Below is an example TF-IDF computation using the scikit-learn vectorizer.



In [45]:

    
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer









    Out[45]:





TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)



In [46]:

    
X = tfidf_vectorizer.fit_transform(sample['lines'])
X









    Out[46]:





<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>



In [47]:

    
X.toarray()









    Out[47]:





array([[0.81818021, 0.        , 0.57496187],
       [0.57973867, 0.81480247, 0.        ]])



In [48]:

    
tfidf_vectorizer.get_feature_names()









    Out[48]:





['databases', 'hate', 'like']



In [49]:

    
pd.DataFrame(X.toarray(), 
            columns=tfidf_vectorizer.get_feature_names(), 
            index=sample['docs'])

Similarity measures

There are many similarity measures that quantify how far apart two values are. They can be applied to vectors and matrices as well as to words and documents; see this interesting description of the measures with Python implementations.

A standard measure of the similarity of two strings is the Levenshtein distance, for which many implementations exist. The most readily available string similarity, however, is the one in difflib, which is part of the Python standard library. Note that SequenceMatcher.ratio() is not the Levenshtein distance itself but a related score in the range 0 to 1 based on matching blocks of characters.



In [50]:

    
import difflib 

def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

similarity('cat', 'cats')









    Out[50]:





0.8571428571428571



In [51]:

    
similarity('cat', 'dog')









    Out[51]:





0.0



In [52]:

    
sample









    Out[52]:

	docs	lines	words
0	D1	I like databases Databases	[i, like, databases, databases]
1	D2	I hate databases	[i, hate, databases]



In [53]:

    
similarity(sample.at[0, 'lines'], sample.at[1, 'lines'])









    Out[53]:





0.6190476190476191



In [54]:

    
similarity('cat', 'catepillar')









    Out[54]:





0.46153846153846156
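
For comparison with the difflib ratios above, an actual Levenshtein distance can be computed with a short dynamic program. The function `levenshtein` below is a textbook sketch written for this illustration, not a library call. The difflib documentation describes `ratio()` as 2*M/T, with M the number of matched characters and T the total length of both strings.

```python
import difflib

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

distance_cats = levenshtein('cat', 'cats')
distance_dog = levenshtein('cat', 'dog')
ratio_cats = difflib.SequenceMatcher(None, 'cat', 'cats').ratio()

print(distance_cats, distance_dog)
print(ratio_cats, 2 * 3 / (3 + 4))
```

The distance counts edits (lower means more similar), whereas the difflib ratio is a normalised similarity (higher means more similar).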

For text documents, at least two measures are popular:

cosine
Jaccard

The cosine measure is defined as the cosine of the angle between two vectors:

$$ sim(A, B) = \cos(\Theta) = \frac{A \cdot B}{\Vert A \Vert \cdot \Vert B \Vert} $$

The measure therefore needs the documents in vectorized form: we first use one of the vectorizers to obtain a matrix and then compute the measure.



In [55]:

    
X = count_vectorizer.fit_transform(sample['lines'])



In [56]:

    
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X)









    Out[56]:





array([[1.        , 0.63245553],
       [0.63245553, 1.        ]])
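
The off-diagonal value can be checked directly from the formula. The sketch below uses the count vectors produced by `CountVectorizer` earlier in the notebook (columns: databases, hate, like); the helper `cosine` is a name introduced here.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [2, 0, 1]
d2 = [1, 1, 0]
sim = cosine(d1, d2)
print(round(sim, 8))
```

Here the dot product is 2 and the lengths are sqrt(5) and sqrt(2), so the similarity is 2 / sqrt(10).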

The Jaccard measure, in turn, operates on sets and is defined as:

$$ sim(A, B) = \frac{|A \cap B|}{| A \cup B |} $$

For our example this looks as follows:



In [57]:

    
A = sample.words[0]
B = sample.words[1]
A, B









    Out[57]:





(['i', 'like', 'databases', 'databases'], ['i', 'hate', 'databases'])



In [58]:

    
def sim_jaccard(A, B):
    a = set(A)
    b = set(B)
    i = set.intersection(a, b)
    u = set.union(a, b)
    print(i, u)
    return len(i)/len(u)

sim_jaccard(A, B)









    



{'i', 'databases'} {'like', 'databases', 'i', 'hate'}






    Out[58]:





0.5



In [59]:

    
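from sklearn.metrics import jaccard_similarity_score

# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and later
# removed in favour of jaccard_score, so this cell only runs on older versions.
list(set(A))
jaccard_similarity_score(list(set(A)), list(set(B)))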









    Out[59]:





0.6666666666666666
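
This result differs from the 0.5 returned by `sim_jaccard`. A likely explanation, assuming the scikit-learn version used here treats the two lists as multiclass labels and compares them position by position, is that two of the three positions happened to agree. The sketch below illustrates that reading with fixed orderings chosen for the example; `positionwise_agreement` and `jaccard_sets` are hypothetical helper names.

```python
def positionwise_agreement(labels_a, labels_b):
    # Fraction of positions at which two equal-length label lists agree.
    matches = sum(1 for x, y in zip(labels_a, labels_b) if x == y)
    return matches / len(labels_a)

def jaccard_sets(a, b):
    # Set-based Jaccard index: size of intersection over size of union.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# list(set(...)) gives no ordering guarantee, so fixed orderings are used here.
labels_a = ['i', 'databases', 'like']
labels_b = ['i', 'databases', 'hate']

agreement = positionwise_agreement(labels_a, labels_b)
jaccard = jaccard_sets(labels_a, labels_b)
print(agreement, jaccard)
```

With a different ordering of the lists the position-wise value would change, while the set-based value stays at 0.5.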

Transformers and pipelines

Scikit-learn provides a number of conveniences for processing data and building models. In general, we can use three basic elements: transformers (functions that change the data), estimators (models to be trained), and pipelines, which chain transformers and estimators together; more on this can be found in the documentation.
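
The fit/transform contract behind these elements can be sketched in plain Python. The classes `Lowercase`, `VocabularyCounter` and `ToyPipeline` below are toy stand-ins invented for this illustration; they only mimic the method names and are not how scikit-learn is implemented.

```python
class Lowercase:
    """Toy transformer: stateless, so fit has nothing to learn."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class VocabularyCounter:
    """Toy transformer: learns a vocabulary in fit, counts words in transform."""
    def fit(self, X, y=None):
        self.vocabulary_ = sorted({w for text in X for w in text.split()})
        return self

    def transform(self, X):
        return [[text.split().count(w) for w in self.vocabulary_] for text in X]


class ToyPipeline:
    """Feeds the output of each step into the next one."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for step in self.steps:
            X = step.fit(X).transform(X)
        return X

    def transform(self, X):
        # Reuses what the steps learned during fit_transform.
        for step in self.steps:
            X = step.transform(X)
        return X


pipe = ToyPipeline([Lowercase(), VocabularyCounter()])
train = pipe.fit_transform(['I like databases Databases', 'I hate databases'])
new = pipe.transform(['Databases I like'])
print(train)
print(new)
```

The point of the design is that the vocabulary is learned once on the training data and then reused unchanged for new data.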



In [60]:

    
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
tfidf_transformer









    Out[60]:





TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)



In [61]:

    
td = CountVectorizer().fit_transform(sample['lines'])
tfidf_transformer.fit_transform(td)









    Out[61]:





<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>



In [62]:

    
from sklearn.pipeline import Pipeline, make_pipeline
tfidf_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
tfidf_pipeline









    Out[62]:





Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
  ...'tfidftransformer', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))])



In [63]:

    
X = tfidf_pipeline.fit_transform(sample['lines'])
X.toarray()









    Out[63]:





array([[0.81818021, 0.        , 0.57496187],
       [0.57973867, 0.81480247, 0.        ]])



In [64]:

    
pd.DataFrame(X.toarray(), 
             index=sample['docs'], 
             columns=tfidf_pipeline.steps[0][1].get_feature_names())

You can also name the pipeline steps yourself:



In [65]:

    
list(steps.items())









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-65-c74d8b759810> in <module>()
----> 1 list(steps.items())

NameError: name 'steps' is not defined



In [ ]:

    
steps = {
    'count_vect': CountVectorizer(),
    'tfidf_trans': TfidfTransformer()
}

# Pipeline expects the steps as a list of (name, object) tuples
tfidf_pipeline = Pipeline(list(steps.items()))
tfidf_pipeline



In [ ]:

    
tfidf_pipeline.steps



In [ ]:

    
tfidf_pipeline.fit_transform(sample['lines'])



In [ ]:

    
# What is happening here?
steps['count_vect'].get_feature_names()

Task

Use the scikit-learn tools to build a complete pipeline that performs TF-IDF vectorization of the book data.

User-defined functions

Sometimes we need functions beyond the set available in scikit-learn. There are two ways to add them:

FunctionTransformer
A class inheriting from BaseEstimator and TransformerMixin

Currently, FunctionTransformer is the recommended first choice.



In [ ]:

    
import re
import numpy as np

@np.vectorize
def replace_database(linia):
    return re.sub('database', 'DB', linia, flags=re.IGNORECASE)

replace_database(sample['lines'])



In [ ]:

    
from sklearn.preprocessing import FunctionTransformer

replace_func = FunctionTransformer(replace_database, validate=False)
replace_func.fit_transform(sample['lines'])



In [ ]:

    
new_pipeline = make_pipeline(replace_func, TfidfVectorizer())
new_pipeline



In [ ]:

    
X = new_pipeline.fit_transform(sample['lines'])
X



In [ ]:

    
pd.DataFrame(X.toarray(), 
             index=sample['docs'], 
             columns=new_pipeline.steps[1][1].get_feature_names())

The second way to create a user-defined function is to write a class that inherits from BaseEstimator and TransformerMixin, which is in fact how the other transformers are built.



In [ ]:

    
from sklearn.base import BaseEstimator, TransformerMixin

class ReplaceDatabaseTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_from='database', replace_to='DB'):
        self.replace_from = replace_from
        self.replace_to = replace_to

    def fit(self, x, y=None):
        return self

    def _replace_str(self, line):
        return re.sub(self.replace_from, 
                      self.replace_to, 
                      line, 
                      flags=re.IGNORECASE)
    
    def transform(self, data):
        func = np.vectorize(lambda line: self._replace_str(line))
        return func(data)



In [ ]:

    
new_pipeline2 = make_pipeline(ReplaceDatabaseTransformer(), TfidfVectorizer())
new_pipeline2



In [ ]:

    
X = new_pipeline2.fit_transform(sample['lines'])
X



In [ ]:

    
pd.DataFrame(X.toarray(), 
             index=sample['docs'], 
             columns=new_pipeline2.steps[1][1].get_feature_names())

Task

Write a function that replaces every URL it encounters with a constant value that can be passed as an argument.

Classification of text data

In this part of the notebook we classify text data. We use a collection of SMS messages, containing both spam and genuine messages, available in the UCI repository. The code below downloads the data.



In [68]:

    
import os
import urllib.request
import zipfile

data_path = 'data'

os.makedirs(data_path, exist_ok=True)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
file_name = url.split('/')[-1]
dest_file = os.path.join(data_path, file_name) 

data_file = 'SMSSpamCollection'
data_full = os.path.join(data_path, data_file)

urllib.request.urlretrieve(url, dest_file)

with zipfile.ZipFile(dest_file) as zip_file:
    zip_file.extract(data_file, path=data_path)



In [69]:

    
import pandas as pd

sms = pd.read_csv(data_full, 
                  sep='\t', 
                  names=['is_spam', 'text'])
sms.head()









    Out[69]:

	is_spam	text
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...



In [69]:

    
from sklearn.model_selection import train_test_split

train_sms, test_sms = train_test_split(sms, test_size=0.2)

print('Train set:')
print(train_sms.describe())
print()
print('Test set:')
print(test_sms.describe())









    



Train set:
       is_spam                    text
count     4457                    4457
unique       2                    4189
top        ham  Sorry, I'll call later
freq      3860                      24

Test set:
       is_spam                    text
count     1115                    1115
unique       2                    1092
top        ham  Sorry, I'll call later
freq       965                       6

Our task is to train a classifier that decides whether an SMS message is spam. A similar task, with a fairly extensive description, can be found in this post.

Naive Bayes

We start with the Naive Bayes classifier. It uses Bayes' theorem to compute the probability of each class given the observed features, under the assumption that the features are independent random variables (hence "naive"). Consider an example (source):

Below is an example probability calculation using Bayes' rule.

$$P(Yes \mid Sunny) = \frac{P(Sunny \mid Yes) \, P(Yes)}{P(Sunny)}$$

$$P(Sunny \mid Yes) = 3/9 = 0.33$$

$$P(Sunny) = 5/14 = 0.36$$

$$P(Yes) = 9/14 = 0.64$$

$$P(Yes \mid Sunny) = \frac{0.33 \cdot 0.64}{0.36} = 0.60$$
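
The same calculation can be checked with exact fractions, which avoids the rounding of the intermediate values. The variable names below are chosen for this sketch; the counts (3 of 9, 9 of 14, 5 of 14) are the ones used in the formulas above.

```python
from fractions import Fraction

p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(p_yes_given_sunny, float(p_yes_given_sunny))
```

The exact result is 3/5, which agrees with the 0.60 obtained above.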

Below is an example spam detection model using the Naive Bayes method.



In [70]:

    
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer









    Out[70]:





TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)



In [71]:

    
X = vectorizer.fit_transform(train_sms['text'])
X









    Out[71]:





<4457x7743 sparse matrix of type '<class 'numpy.float64'>'
	with 59338 stored elements in Compressed Sparse Row format>



In [72]:

    
from sklearn.naive_bayes import MultinomialNB

spam_detector = MultinomialNB().fit(X, train_sms['is_spam'])
spam_detector









    Out[72]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)



In [78]:

    
i = 17
train_sms.iloc[i, 0], spam_detector.predict(X[i])[0], train_sms.iloc[i, 1]









    Out[78]:





('spam', 'spam', 'FreeMsg>FAV XMAS TONES!Reply REAL')



In [79]:

    
from sklearn.metrics import classification_report, confusion_matrix

X = vectorizer.transform(test_sms['text'])

y_pred = spam_detector.predict(X)
y_true = test_sms['is_spam']

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))









    



[[965   0]
 [ 43 107]]
             precision    recall  f1-score   support

        ham       0.96      1.00      0.98       965
       spam       1.00      0.71      0.83       150

avg / total       0.96      0.96      0.96      1115
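
The per-class figures in the report follow directly from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels, here in the order (ham, spam). The helper `class_metrics` below is a name introduced for this sketch.

```python
# Confusion matrix printed above: rows = true labels, columns = predictions.
cm = [[965, 0],
      [43, 107]]
labels = ['ham', 'spam']

def class_metrics(cm, k):
    true_positive = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum (the "support")
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: class_metrics(cm, k) for k, label in enumerate(labels)}
for label, (p, r, f) in metrics.items():
    print('{:>4}  precision={:.2f}  recall={:.2f}  f1={:.2f}'.format(label, p, r, f))
```

The low spam recall reflects the 43 spam messages that were classified as ham, while no genuine message was flagged as spam.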



In [70]:

    
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline  # used below to build nb_pipe
import pandas as pd


sms = pd.read_csv(data_full, sep='\t', names=['is_spam', 'text'])
train_sms, test_sms = train_test_split(sms, test_size=0.2)

steps = [('tfidf', TfidfVectorizer()), ('cls', MultinomialNB())]
nb_pipe = Pipeline(steps=steps)
nb_pipe.fit(train_sms['text'], train_sms['is_spam'])

y_pred = nb_pipe.predict(test_sms['text'])
y_true = test_sms['is_spam']

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))









    



[[949   0]
 [ 66 100]]
             precision    recall  f1-score   support

        ham       0.93      1.00      0.97       949
       spam       1.00      0.60      0.75       166

avg / total       0.94      0.94      0.93      1115

Task

Build a pipeline of the vectorizer and the model.
Use GridSearchCV to find the best model.
Improve the tokenization to improve the classification quality.
Try other classifiers.

Topic modelling

Topic modelling is the task of extracting the important elements of a text that constitute its topics. A leading Python library for this purpose is gensim.

Latent Dirichlet Allocation

There are many interesting algorithms for this task, but one of the best known for topic extraction is Latent Dirichlet Allocation (LDA).

One of the most effective LDA implementations in Python is the one available in gensim, which also has fairly extensive documentation. Let us therefore use LDA for our task.

For the analysis we use the 20 newsgroups dataset from scikit-learn.



In [80]:

    
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(shuffle=True, random_state=1,
                                remove=('headers', 'footers', 'quotes'))



In [81]:

    
newsgroups.data[:2]









    Out[81]:





["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
 "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"]



In [82]:

    
newsgroups.target









    Out[82]:





array([17,  0, 17, ...,  9,  4,  9])



In [83]:

    
newsgroups.target_names









    Out[83]:





['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We narrow the sample down a little so that training is faster.



In [84]:

    
n_samples = 2000
newsgroups_samples = newsgroups.data[:n_samples]

First we build the corpus for LDA using the gensim tools.



In [85]:

    
import nltk
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel

words = [nltk.regexp_tokenize(d.lower(), r'\w+') for d in newsgroups_samples]

dictionary = corpora.Dictionary(words)
corpus = [dictionary.doc2bow(text) for text in words]









    




The dictionary has the word ID as the key and the word as the value.



In [86]:

    
from itertools import islice

for k, v in islice(dictionary.items(), 10):
    print('{}: {}'.format(k, v))









    



0: well
1: i
2: m
3: not
4: sure
5: about
6: the
7: story
8: nad
9: it

The corpus is a list of lists of tuples, one list per document; each tuple pairs a word ID with the number of times that word occurs in the given document.



In [87]:

    
print(corpus[:2])









    



[[(0, 1), (1, 4), (2, 1), (3, 3), (4, 1), (5, 1), (6, 17), (7, 1), (8, 1), (9, 2), (10, 1), (11, 2), (12, 1), (13, 2), (14, 1), (15, 1), (16, 8), (17, 1), (18, 1), (19, 5), (20, 4), (21, 4), (22, 4), (23, 1), (24, 4), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 4), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 2), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 3), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 2), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1)], [(0, 1), (1, 2), (2, 1), (5, 1), (6, 2), (9, 1), (17, 2), (19, 2), (21, 1), (23, 1), (24, 2), (44, 1), (46, 2), (53, 2), (60, 1), (63, 1), (75, 3), (79, 1), (91, 1), (92, 1), (103, 1), (104, 1), (105, 7), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 2), (114, 1), (115, 1), (116, 2), (117, 1), (118, 1), (119, 2), (120, 1), (121, 1), (122, 1), (123, 1), (124, 3), (125, 2), (126, 1), (127, 3), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1), (146, 1), (147, 1), (148, 1), (149, 1), (150, 1), (151, 2), (152, 1), (153, 1), (154, 1), (155, 1), (156, 1), (157, 1), (158, 1), (159, 1)]]
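
To make the structure concrete, the sketch below builds a comparable dictionary and bag-of-words corpus for the two toy documents used earlier. It is a simplified stand-in written for illustration: `build_dictionary` and this `doc2bow` are hypothetical helpers, and the ID assignment (order of first appearance) is an assumption that need not match gensim's.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [['i', 'like', 'databases', 'databases'],
        ['i', 'hate', 'databases']]

token2id = build_dictionary(docs)
toy_corpus = [doc2bow(doc, token2id) for doc in docs]
print(token2id)
print(toy_corpus)
```

Only words that actually occur in a document appear in its list, which is why this representation is compact for large vocabularies.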

Task

For the first 10 documents, print the words and their counts.

Now let us train the LDA model itself.



In [88]:

    
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, alpha="auto")
model









    Out[88]:





<gensim.models.ldamodel.LdaModel at 0x1a04ff57630>



In [89]:

    
for i in range(10):
    print(model.get_document_topics(corpus[i], minimum_probability=0.1))









    



[(7, 0.99340100552816657)]
[(7, 0.98946292387574297)]
[(1, 0.46998139553451107), (7, 0.52251416682123741)]
[(4, 0.16633537596457032), (7, 0.82828027035858187)]
[(0, 0.6996453734915814), (7, 0.28425805069277971)]
[(9, 0.96061919565863985)]
[(0, 0.99818563354973222)]
[(0, 0.84510135298355327), (4, 0.14878333739966915)]
[(0, 0.22123840595863969), (9, 0.76934711243078313)]
[(9, 0.99313520855566895)]



In [90]:

    
k = 0
model.get_topic_terms(topicid=k)









    Out[90]:





[(6, 0.049394249428051845),
 (75, 0.019961484867415665),
 (24, 0.018927750832629032),
 (1, 0.018355421479303465),
 (46, 0.016457198658624449),
 (32, 0.015065313667161167),
 (53, 0.012715351470155322),
 (21, 0.0094795785434583398),
 (16, 0.0087506787626141138),
 (105, 0.0082980143790526185)]



In [91]:

    
model.print_topics()









    Out[91]:





[(0,
  '0.049*"the" + 0.020*"of" + 0.019*"to" + 0.018*"i" + 0.016*"a" + 0.015*"in" + 0.013*"and" + 0.009*"s" + 0.009*"is" + 0.008*"you"'),
 (1,
  '0.038*"the" + 0.024*"to" + 0.020*"of" + 0.018*"is" + 0.017*"a" + 0.016*"and" + 0.016*"in" + 0.015*"that" + 0.011*"i" + 0.010*"you"'),
 (2,
  '0.019*"and" + 0.018*"to" + 0.016*"a" + 0.015*"the" + 0.014*"i" + 0.013*"for" + 0.010*"in" + 0.010*"it" + 0.009*"is" + 0.009*"that"'),
 (3,
  '0.038*"the" + 0.017*"a" + 0.017*"to" + 0.016*"0" + 0.016*"and" + 0.016*"of" + 0.012*"i" + 0.012*"1" + 0.012*"that" + 0.012*"it"'),
 (4,
  '0.041*"the" + 0.026*"to" + 0.023*"i" + 0.022*"and" + 0.019*"a" + 0.017*"of" + 0.014*"that" + 0.012*"is" + 0.011*"in" + 0.010*"you"'),
 (5,
  '0.040*"the" + 0.030*"to" + 0.024*"and" + 0.019*"a" + 0.014*"is" + 0.013*"of" + 0.013*"that" + 0.011*"in" + 0.010*"i" + 0.009*"for"'),
 (6,
  '0.051*"the" + 0.022*"i" + 0.022*"and" + 0.016*"of" + 0.014*"a" + 0.012*"to" + 0.012*"in" + 0.010*"for" + 0.009*"is" + 0.008*"it"'),
 (7,
  '0.053*"the" + 0.028*"to" + 0.025*"of" + 0.023*"a" + 0.019*"and" + 0.016*"in" + 0.012*"is" + 0.012*"that" + 0.011*"i" + 0.010*"it"'),
 (8,
  '0.037*"the" + 0.026*"of" + 0.018*"to" + 0.013*"that" + 0.013*"and" + 0.012*"i" + 0.011*"a" + 0.011*"in" + 0.008*"on" + 0.008*"s"'),
 (9,
  '0.047*"the" + 0.022*"of" + 0.022*"a" + 0.019*"to" + 0.018*"it" + 0.017*"i" + 0.015*"and" + 0.014*"in" + 0.014*"is" + 0.013*"that"')]

Task

Change the number of topics; did anything change?
See how the topics map to the classes.
Improve the tokenization; you can, for example:
- remove non-words
- remove stop-list words
- convert the words to a common letter case
Try other ways of building the corpus, e.g.:
- TD matrix
- TF-IDF



In [ ]:

    
# Note: gensim.sklearn_api was removed in gensim 4.0; this cell needs gensim 3.x.
from gensim.sklearn_api.ldamodel import LdaTransformer
from gensim.sklearn_api.text2bow import Text2BowTransformer
from sklearn.pipeline import Pipeline

steps = {
    'text2bow': Text2BowTransformer(),
    'lda': LdaTransformer(num_topics=10)
}

p = Pipeline(list(steps.items()))
p



In [ ]:

    
p.fit_transform(newsgroups_samples)



In [ ]:

    
steps['lda'].gensim_model.print_topics()



In [ ]:

    
steps['text2bow'].gensim_model



In [ ]:

    
p.transform(['ala ma kota', 'lala'])

We can also use the method built into scikit-learn; see the documentation.



In [ ]:

    
# Utility function for displaying topics.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
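
The slice `[:-n_top_words - 1:-1]` applied to `argsort()` is compact but easy to misread. The sketch below reproduces it in plain Python on a made-up topic; `top_indices`, the feature names and the weights are all hypothetical values chosen for illustration.

```python
def top_indices(weights, n_top):
    # Stand-in for weights.argsort(): indices ordered by ascending weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the largest weight and stop after n_top items.
    return ascending[:-n_top - 1:-1]

feature_names = ['car', 'engine', 'god', 'space', 'team']
topic = [0.1, 0.7, 0.05, 0.3, 0.2]

top = top_indices(topic, 3)
print(top)
print(' '.join(feature_names[i] for i in top))
```

The result lists the indices of the three largest weights in descending order of weight.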



In [ ]:

    
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1000,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(newsgroups_samples)

lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 10)

Task

Build a pipeline from the processing above.
Try using TF-IDF.
Change the number of topics n_components and see what changes.

Non-negative Matrix Factorization

Another algorithm often used for topic extraction is Non-negative Matrix Factorization (NMF).

The method approximates a non-negative matrix V by the product of two non-negative matrices W and H. As a result, W and H reveal cluster structure in the data in V: W becomes the matrix of centroids, and H indicates the assignment of the individual elements of V to the clusters.

In practice the method is often used as a substitute for PCA, though only for non-negative data, and for topic extraction.
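
To illustrate the idea of approximating V by a product W H, the sketch below runs the classic multiplicative update rules on a tiny made-up count matrix. This is a deliberately simple algorithm chosen for readability and is not necessarily the solver scikit-learn uses; the matrix `V`, the starting values and all helper names are assumptions made for this example.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for v_row, wh_row in zip(V, WH)
               for v, wh in zip(v_row, wh_row)) ** 0.5

def nmf_step(V, W, H, eps=1e-9):
    # H <- H * (W^T V) / (W^T W H), element-wise
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
         for hr, nr, dr in zip(H, num, den)]
    # W <- W * (V H^T) / (W H H^T), element-wise
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
         for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy document-term counts: two documents on one theme, two on another.
V = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]

# Arbitrary fixed positive starting values, so the run is reproducible.
W = [[0.5, 0.4], [0.6, 0.3], [0.2, 0.7], [0.3, 0.8]]
H = [[0.5, 0.4, 0.3, 0.2], [0.1, 0.2, 0.6, 0.7]]

initial_error = frobenius_error(V, W, H)
for _ in range(200):
    W, H = nmf_step(V, W, H)
final_error = frobenius_error(V, W, H)
print(round(initial_error, 3), round(final_error, 3))
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is the property that makes the factors readable as additive parts such as topics.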



In [ ]:

    
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=1000,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(newsgroups_samples)

nmf = NMF(n_components=10, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, 10)

Task

Change the number of topics n_components.
Try a different loss function via beta_loss; see the documentation.
