I was given a task: review the 4 text files and

  1. Compute the most important keywords (a keyword can be 1 to 3 words long).
  2. Choose the top n keywords from the previously generated list. Compare these keywords with all the words occurring in all of the transcripts.
  3. Generate a score (rank) for these top n keywords based on the analysed transcripts.
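
As a rough illustration of step 1, here is a minimal sketch of how candidate keywords of 1 to 3 words could be enumerated and counted. The function name `candidate_keywords` and the toy sentence are my own assumptions for illustration; the notebook itself relies on scikit-learn's `ngram_range` for this.

```python
import re
from collections import Counter


def candidate_keywords(text, max_len=3):
    """Count every sequence of 1..max_len consecutive words in `text`."""
    # Lower-case and keep alphabetic runs only (a simplification).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


# Toy input, made up for this example.
counts = candidate_keywords("Fast food is food served fast")
print(counts.most_common(3))
```

Raw counts like these are only candidates; deciding which of them are "important" is the subject of the rest of the notebook.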


The data is 4 Wikipedia articles, each relating to food: script.txt is an article about food, transcript_1.txt is about fast food, transcript_2.txt is about a restaurant, and transcript_3.txt is about cooking.

What is important

I need to find out what the most important keywords are, and also to define importance. There are two options for what "important" means:

  1. Which words identify each document.
  2. Which words in the script "identify" it the most. This is probably better, as steps 2 and 3 make more sense in this context. E.g. if we find which words predict each document, what would their ranking across the transcripts be? What would be the difference between the script and the transcripts?
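
To make option 2 concrete, the comparison across transcripts can be sketched as a table of how often each script keyword occurs in each transcript. This is only a sketch under my own assumptions: the helper names `count_phrase` and `keyword_counts` are hypothetical, and the two toy transcripts are made up rather than taken from the data.

```python
import re


def count_phrase(words, phrase):
    """Count occurrences of the word list `phrase` inside the word list `words`."""
    n = len(phrase)
    return sum(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def keyword_counts(keywords, documents):
    """Return {document name: {keyword: number of occurrences}}."""
    table = {}
    for name, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        table[name] = {kw: count_phrase(words, kw.split()) for kw in keywords}
    return table


# Toy transcripts, invented for this example.
toy_docs = {
    "transcript_1.txt": "Fast food is fast. Fast food chains sell pizza.",
    "transcript_2.txt": "A restaurant serves food to customers.",
}
table = keyword_counts(["fast food", "food", "restaurant"], toy_docs)
print(table)
```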


I'll try 1, maybe 2 things (if the first doesn't work):

  1. Download some other articles not related to food. Run LDA and see if we can separate the other articles from this one. That is, see if we can find a topic that the script scores highly on and the other documents don't.
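
The separation check described above can be sketched independently of LDA: given per-document topic weights, look for the topic where the script's weight exceeds every other document's by the largest margin. The function name `most_distinctive_topic`, the extra file names, and all the numbers below are assumptions made up for illustration; they are not output from the model fitted in this notebook.

```python
def most_distinctive_topic(doc_topics, target):
    """Return (topic index, margin) for the topic that best separates `target`.

    margin = target's weight on the topic minus the largest weight
    any other document puts on that same topic.
    """
    others = [w for name, w in doc_topics.items() if name != target]
    best_topic, best_margin = None, float("-inf")
    for k, weight in enumerate(doc_topics[target]):
        margin = weight - max(o[k] for o in others)
        if margin > best_margin:
            best_topic, best_margin = k, margin
    return best_topic, best_margin


# Hypothetical topic distributions (each row sums to 1).
toy = {
    "script.txt": [0.70, 0.20, 0.10],
    "maths_article.txt": [0.05, 0.90, 0.05],
    "football_article.txt": [0.10, 0.10, 0.80],
}
topic, margin = most_distinctive_topic(toy, "script.txt")
print(topic, round(margin, 2))
```

A clearly positive margin would suggest the topic is specific to the script; a margin near zero or below would mean LDA did not isolate it.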

In [500]:
%pylab inline
import sklearn
import pandas as pd
import numpy as np
import nltk
import os
import re
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.pipeline import Pipeline

In [501]:
path = './data'
d = {}       # file name -> raw text
corpus = []  # raw texts, in the same order as d
for name in os.listdir(path):
    if name.endswith('.txt'):
        f_path = os.path.join(path, name)
        with open(f_path) as f:
            d[name] = f.read()
            corpus.append(d[name])

Build a tokenizer

In [593]:
class Tokenizer():
    def __init__(self):
        self.lemma = nltk.stem.WordNetLemmatizer()
        self.stem = nltk.stem.SnowballStemmer("english")  # not used below
        self.tokenizer = nltk.RegexpTokenizer(r'\w+')
        self.reg = re.compile(r'\d+')  # matches runs of digits

    def proc_word(self, word):
        # Strip digits, then lemmatize
        word = self.reg.sub('', word)
        word = self.lemma.lemmatize(word)
        return word

    def __call__(self, doc):
        res = [self.proc_word(word) for word in self.tokenizer.tokenize(doc.lower())]
        # Keep only tokens longer than 3 characters
        res = list(filter(lambda x: len(x) > 3, res))
        return res


T = Tokenizer()
T("hello the man is Great")

['hello', 'great']

In [503]:
tf_vectorizer = CountVectorizer(max_df=0.75,min_df=0.25,  stop_words='english',tokenizer=Tokenizer(),ngram_range=(1,1))
tf = tf_vectorizer.fit(corpus)
tfed = tf.transform(corpus)
tf_feature_names = tf_vectorizer.get_feature_names()

In [504]:
lda = LatentDirichletAllocation(n_topics=len(d)*2, max_iter=6, learning_method='online', learning_offset=50.,random_state=0).fit(tfed)

In [505]:
def build_pipeline(num_docs):
    """
    :param num_docs: Number of documents in the corpus, used as the number of LDA topics
    :return: A pipeline that implements CountVectorizer and LDA
    """
    tf_vectorizer = CountVectorizer(
        max_df=0.95,  # Drop words that appear in more than 95% of the documents (e.g. corpus-specific stop words)
        min_df=5,  # Only use words that appear in at least 5 documents
        tokenizer=Tokenizer(),  # Use our custom tokenizer
        ngram_range=(1, 3)  # Use keywords of length 1, 2 or 3
    )
    lda = LatentDirichletAllocation(
        n_topics=num_docs  # Learn one topic per document counted in num_docs
    )
    pipeline = Pipeline([('count_vectorizer', tf_vectorizer), ('lda', lda)])
    return pipeline

In [506]:
pipeline =build_pipeline(len(d))
model = pipeline.fit(corpus)

In [533]:
def display_topics(comps, feature_names, no_top_words):
    for topic_idx, topic in enumerate(comps):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [536]:
no_top_words = 10
display_topics(model.named_steps["lda"].components_, model.named_steps["count_vectorizer"].get_feature_names(), no_top_words)

Topic 0:
      year wa food wine ha restaurant  year
Topic 1:
group c point d f free n rotation theorem step
Topic 2:
sugar fried measure important deep fried different salt deep transformation level
Topic 3:
 s wa new   century cup restaurant th cooking
Topic 4:
group space banach paradox tarski set algebraic banach tarski euclidean decomposition
Topic 5:
vegetable ha vitamin large including obesity industry cooking water mineral
Topic 6:
food fast fast food restaurant cooking used health pizza animal ingredient
Topic 7:
wine grape g state united united state fruit  billion billion university
Topic 8:
pizza b e meat displaystyle  origin usually variety fish
Topic 9:
stand villa end park road villa park holte trinity trinity road wa

In [511]:
def pipeline(doc):
    res = model.transform([doc])
    return res[0]

In [512]:
res = {name:pipeline(doc) for name,doc in d.items()}

In [514]:
D =pd.DataFrame(res)


In [781]:
D[D>0.1].T.plot.bar(figsize=(20,10),colormap='jet',title="Most important topics per document")

[Figure: bar chart, "Most important topics per document"]

In [518]:
normed = (D.T - D.T.mean())/D.T.std()

In [541]:
path ='./data/transcripts'
transcripts ={}

for name in os.listdir(path):
    if name.endswith('.txt'):
        f_path = os.path.join(path,name)
        with open(f_path) as f:
            transcripts[name] = f.read()

In [522]:
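# Store the topic vectors under a new name so the raw texts in `transcripts` stay available
transcript_topics = {name:pipeline(doc) for name,doc in transcripts.items()}

In [523]:
X  =pd.DataFrame(transcript_topics)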

In [767]:
ax = X.T.plot.bar(figsize=(20,10),colormap='jet')

In [769]:
fig =ax.get_figure()

Word score

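To score the words we'll use the z-score of their tf-idf score, where the training corpus is the transcripts and the term frequencies come from the script. Note that the code below ranks words by the negative log of their tf-idf value and does not compute the z-score itself.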
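
A minimal sketch of the z-score idea, written with the standard library only. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` that `TfidfVectorizer` documents for `smooth_idf=True`, skips the l2 normalisation that scikit-learn applies by default, and uses made-up toy tokens. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def smoothed_idf(term, doc_sets):
    """Smoothed inverse document frequency of `term` over a list of token sets."""
    n_docs = len(doc_sets)
    df = sum(term in doc for doc in doc_sets)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_zscores(script_tokens, transcript_token_lists):
    """z-score of each script term's tf-idf; idf is fitted on the transcripts."""
    doc_sets = [set(tokens) for tokens in transcript_token_lists]
    vocabulary = set().union(*doc_sets)
    # Like a fitted vectorizer, ignore script terms unseen in the transcripts.
    tf = Counter(t for t in script_tokens if t in vocabulary)
    tfidf = {t: c * smoothed_idf(t, doc_sets) for t, c in tf.items()}
    values = list(tfidf.values())
    if not values:
        return {}
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {t: 0.0 for t in tfidf}
    return {t: (v - mu) / sigma for t, v in tfidf.items()}


# Toy tokens, invented for this example.
script_tokens = ["food", "food", "pizza", "cooking", "banana"]
transcript_tokens = [["food", "pizza"], ["food", "restaurant"], ["food", "cooking"]]
scores = tfidf_zscores(script_tokens, transcript_tokens)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy case "food" gets the highest z-score because it occurs twice in the script, even though its idf is the lowest; "banana" is dropped because it never appears in the transcripts.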

In [790]:
corpus = []
for key, val in transcripts.items():
    corpus.append(val)  # train the tf-idf model on the transcript texts
tfidf = TfidfVectorizer(tokenizer=Tokenizer(),stop_words='english',ngram_range=(1,4),smooth_idf=True)
tf_model = tfidf.fit(corpus)
script = tf_model.transform([d["script.txt"]])
tf_feature_names = tf_model.get_feature_names()

In [791]:
def display_word_ranks(vector, feature_names, no_top_words):
    # Map each word to -log(tf-idf); no_top_words is currently unused
    ranking_dict = {feature_names[i]: -np.log(vector[i])  # dictionary comprehension: key is the word, value is the score
                    for i in vector.argsort()  # iterate over indices sorted by value
                    if vector[i] > 0}  # only keep non-zero values, i.e. words that appear in the document
    return ranking_dict

z = script.toarray()

In [792]:
ranking_dict =(display_word_ranks(z[0],tf_feature_names,40))

In [795]:
S = pd.Series(ranking_dict)
S = S.sort_values(ascending=False)
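# Drop the most frequent (tied) score before plotting; index[0] is that score, values[0] would be its count
S[S != S.value_counts().index[0]][:20].plot.bar(title="Our top n words scored")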

[Figure: bar chart, "Our top n words scored"]

In [789]:
T =S.sample(30).index
Y = S[T]
Y.sort_values(ascending=False).plot.bar(figsize=(15,5),title="Our top n words scored")

[Figure: bar chart of 30 randomly sampled words, "Our top n words scored"]

