My task is to review the 4 text files and:

  1. Compute the most important key-words (a key-word can be 1 to 3 words long).
  2. Choose the top n key-words from the previously generated list and compare them with all the words occurring in all of the transcripts.
  3. Generate a score (rank) for these top n key-words based on the analysed transcripts.
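
As a rough illustration of what "a key-word can be 1 to 3 words long" means in practice, the sketch below lists every contiguous run of 1 to 3 tokens as a candidate key-word. The helper name `candidate_keywords` is made up for this example and is not used elsewhere in the notebook.

```python
def candidate_keywords(tokens, max_len=3):
    """Return every contiguous run of 1..max_len tokens as a space-joined phrase."""
    phrases = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases


# Toy token list, for illustration only
tokens = ["fast", "food", "restaurant"]
print(candidate_keywords(tokens))
```

This is the same idea that `ngram_range=(1, 3)` expresses in the scikit-learn vectorizers used later on.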

Data

The data is 4 Wikipedia articles, each relating to food:

  - script.txt is an article about food
  - transcript_1.txt is about fast food
  - transcript_2.txt is about a restaurant
  - transcript_3.txt is about cooking

What is important

I need to find out what the most important key-words are, and also to define importance. There are two options for what counts as important:

  1. The words that identify each document.
  2. The words in the script that "identify" it the most. This is probably better, as tasks 2 and 3 make more sense in this context. E.g. if we find the words that predict each document, what would their ranking across the transcripts be? What would the difference between the script and the transcripts be?

Methods

I'll try 1, maybe 2 things (if the first doesn't work)

  1. Download some other articles not related to food. Run LDA and see if we can separate the other articles from this one, that is, whether we can find a topic that the script scores highly on and the other documents don't.
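
One way to make "a topic that the script scores highly on and the other documents don't" concrete is to look, for each topic, at the gap between the script's weight and the highest weight any other document gives it. The sketch below assumes the topic weights are already available as plain lists; the function name, the file names other than script.txt, and the numbers are all invented for illustration.

```python
def most_distinctive_topic(doc_topics, target):
    """Return (topic_index, margin) for the topic where `target` leads the other documents the most.

    doc_topics maps a document name to its list of topic weights.
    """
    others = [weights for name, weights in doc_topics.items() if name != target]
    best_topic, best_margin = None, float("-inf")
    for k, weight in enumerate(doc_topics[target]):
        # Gap between the target and the strongest competing document on topic k
        margin = weight - max(other[k] for other in others)
        if margin > best_margin:
            best_topic, best_margin = k, margin
    return best_topic, best_margin


# Made-up weights: three documents, three topics
toy = {
    "script.txt": [0.1, 0.7, 0.2],
    "maths.txt": [0.8, 0.1, 0.1],
    "football.txt": [0.2, 0.1, 0.7],
}
topic, margin = most_distinctive_topic(toy, "script.txt")
print(topic, round(margin, 3))
```

A large positive margin suggests the topic separates the script from the rest; a margin near zero or below means some other document scores at least as highly on every topic.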

In [500]:
%pylab inline
import sklearn
import pandas as pd
import numpy as np
import nltk
import os
import re
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.pipeline import Pipeline


Populating the interactive namespace from numpy and matplotlib
/usr/local/lib/python3.5/dist-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

In [501]:
path ='./data'
d ={}
corpus = []
for name in os.listdir(path):
    if name.endswith('.txt'):
        f_path = os.path.join(path,name)
        with open(f_path) as f:
            d[name] = f.read()
            corpus+=d[name].split('.')
path+='/transcripts'
for name in os.listdir(path):
    if name.endswith('.txt'):
        f_path = os.path.join(path,name)
        with open(f_path) as f:
            d[name] = f.read()
            corpus+=d[name].split('.')

Build a tokenizer


In [593]:
class Tokenizer():
    def __init__(self):
        self.lemma = nltk.stem.WordNetLemmatizer()
        self.stem = nltk.stem.SnowballStemmer("english")  # created but not used below
        self.tokenizer = nltk.RegexpTokenizer(r'\w+')  # split on runs of word characters
        self.reg = re.compile(r'\d+')  # raw string so that \d is not read as a string escape

    def proc_word(self, word):
        word = self.reg.sub('', word)  # strip digits
        word = self.lemma.lemmatize(word)
        return word

    def __call__(self, doc):
        res = [self.proc_word(word) for word in self.tokenizer.tokenize(doc.lower())]
        res = list(filter(lambda x: len(x) > 3, res))  # keep tokens longer than 3 characters
        return res

            
            

T = Tokenizer()
T("hello the man is Great")


Out[593]:
['hello', 'great']

In [503]:
tf_vectorizer = CountVectorizer(max_df=0.75,min_df=0.25,  stop_words='english',tokenizer=Tokenizer(),ngram_range=(1,1))
tf = tf_vectorizer.fit(corpus)
tfed = tf.transform(corpus)
tf_feature_names = tf_vectorizer.get_feature_names()

In [504]:
lda = LatentDirichletAllocation(n_topics=len(d)*2, max_iter=6, learning_method='online', learning_offset=50.,random_state=0).fit(tfed)

In [505]:
def build_pipeline(num_docs):
    '''
    :param num_docs: Number of topics for LDA to learn (called below with the number of documents)
    :return: A pipeline that chains CountVectorizer and LDA
    '''
    tf_vectorizer = CountVectorizer(
        max_df=0.95,  # Drop terms that appear in more than 95% of the documents (e.g. corpus-specific stop words)
        min_df=5,  # Only use terms that appear in at least 5 documents
        stop_words='english',
        tokenizer=Tokenizer(),  # Use our custom tokenizer
        ngram_range=(1, 3)  # Use key-words of length 1, 2 or 3
    )

    lda = LatentDirichletAllocation(
        n_topics=num_docs,  # learn one topic per document (later scikit-learn releases call this n_components)
        max_iter=6,
        learning_method='online',
        learning_offset=50.,
        random_state=0)
    pipeline = Pipeline([('count_vectorizer', tf_vectorizer), ('lda', lda)])
    
    return pipeline

In [506]:
pipeline =build_pipeline(len(d))
model = pipeline.fit(corpus)

In [507]:
model.named_steps["count_vectorizer"].get_feature_names()


Out[507]:
['',
 ' ',
 '  ',
 '  season',
 '  seat',
 '  year',
 ' b',
 ' bc',
 ' billion',
 ' c',
 ' c ',
 ' country',
 ' displaystyle',
 ' earliest',
 ' edit',
 ' external',
 ' external link',
 ' f',
 ' fa',
 ' food',
 ' ha',
 ' history',
 ' million',
 ' new',
 ' people',
 ' piece',
 ' pizza',
 ' r',
 ' recent',
 ' reference',
 ' reference ',
 ' restaurant',
 ' season',
 ' seat',
 ' stand',
 ' step',
 ' step ',
 ' villa',
 ' villa park',
 ' wa',
 ' wine',
 ' world',
 ' year',
 ' year ago',
 ' year old',
 'abstract',
 'ac',
 'access',
 'according',
 'account',
 'acid',
 'act',
 'activity',
 'adam',
 'added',
 'adding',
 'addition',
 'additive',
 'adult',
 'affect',
 'africa',
 'age',
 'aging',
 'ago',
 'agricultural',
 'agriculture',
 'aid',
 'air',
 'alcohol',
 'alcoholic',
 'alcoholic beverage',
 'algebraic',
 'algebraic topology',
 'allowed',
 'allows',
 'amenable',
 'america',
 'american',
 'analysis',
 'ancient',
 'animal',
 'antoine',
 'application',
 'applying',
 'approach',
 'approximately',
 'approximately ',
 'april',
 'april ',
 'area',
 'area preserving',
 'armenian',
 'aroma',
 'article',
 'asian',
 'aspect',
 'associated',
 'aston',
 'aston hall',
 'aston villa',
 'atlantic',
 'attendance',
 'august',
 'august ',
 'australia',
 'australian',
 'author',
 'available',
 'average',
 'away',
 'axiom',
 'axiom choice',
 'axis',
 'b',
 'b ',
 'b b',
 'b c',
 'bacon',
 'bacteria',
 'baked',
 'baking',
 'ball',
 'banach',
 'banach tarski',
 'banach tarski paradox',
 'bar',
 'based',
 'basic',
 'bc',
 'beef',
 'began',
 'benefit',
 'best',
 'better',
 'beverage',
 'billion',
 'bitter',
 'blended',
 'blumenthal',
 'board',
 'body',
 'boiling',
 'book',
 'bordeaux',
 'bottle',
 'box',
 'bread',
 'brick',
 'brick oven',
 'britain',
 'british',
 'brought',
 'building',
 'built',
 'business',
 'butter',
 'c',
 'c ',
 'c  f',
 'california',
 'called',
 'calorie',
 'canada',
 'cancer',
 'capacity',
 'carbohydrate',
 'carcinogen',
 'case',
 'casual',
 'category',
 'catered',
 'cause',
 'caused',
 'center',
 'central',
 'century',
 'century bc',
 'cereal',
 'certain',
 'chain',
 'change',
 'changed',
 'characteristic',
 'cheese',
 'chef',
 'chemical',
 'chemist',
 'chemistry',
 'chicago',
 'chicken',
 'child',
 'china',
 'chinese',
 'chip',
 'chip shop',
 'chocolate',
 'choice',
 'chosen',
 'city',
 'claim',
 'class',
 'classification',
 'classified',
 'club',
 'cohomology',
 'cold',
 'color',
 'combination',
 'combining',
 'come',
 'commercial',
 'common',
 'commonly',
 'company',
 'complete',
 'completed',
 'complex',
 'component',
 'compound',
 'concentration',
 'concept',
 'concern',
 'condition',
 'congruent',
 'considered',
 'consists',
 'constructed',
 'construction',
 'consume',
 'consumed',
 'consumer',
 'consumption',
 'contain',
 'container',
 'containing',
 'contains',
 'content',
 'content hide',
 'content hide ',
 'continued',
 'continuous',
 'control',
 'cook',
 'cooked',
 'cookery',
 'cooking',
 'cooking ',
 'cooking food',
 'cooking method',
 'copy',
 'corn',
 'cost',
 'countable',
 'countably',
 'country',
 'course',
 'cover',
 'covered',
 'create',
 'created',
 'crust',
 'cuisine',
 'culinary',
 'cultural',
 'culture',
 'cup',
 'current',
 'customer',
 'cut',
 'cycle',
 'd',
 'daily',
 'dairy',
 'dairy product',
 'dark',
 'date',
 'day',
 'december',
 'december ',
 'decomposition',
 'deep',
 'deep fried',
 'deep frying',
 'deficiency',
 'define',
 'defined',
 'definition',
 'degree',
 'demand',
 'depending',
 'derived',
 'design',
 'designed',
 'despite',
 'developed',
 'development',
 'did',
 'diet',
 'dietary',
 'difference',
 'different',
 'different type',
 'dimension',
 'dimensional',
 'dining',
 'directly',
 'director',
 'discipline',
 'discussion',
 'disease',
 'dish',
 'disjoint',
 'displaystyle',
 'displaystyle ',
 'displaystyle b',
 'distance',
 'doe',
 'double',
 'doubling',
 'doug',
 'doug elli',
 'doug elli stand',
 'dough',
 'drink',
 'drinking',
 'drive',
 'dry',
 'dutch',
 'e',
 'earlier',
 'earliest',
 'early',
 'easily',
 'east',
 'easy',
 'eat',
 'eaten',
 'eating',
 'economic',
 'economics',
 'edible',
 'edit',
 'edit main',
 'edit main article',
 'effect',
 'egg',
 'egg white',
 'element',
 'elli',
 'elli stand',
 'end',
 'end stand',
 'energy',
 'english',
 'enhance',
 'entitled',
 'equidecomposable',
 'equipment',
 'equivalence',
 'equivalent',
 'erice',
 'especially',
 'essential',
 'establishment',
 'estimated',
 'et',
 'etymology',
 'euclidean',
 'euclidean motion',
 'euclidean space',
 'europe',
 'european',
 'event',
 'evidence',
 'evolved',
 'example',
 'executive',
 'executive box',
 'exist',
 'existence',
 'exists',
 'expensive',
 'experiment',
 'extended',
 'external',
 'external link',
 'f',
 'fa',
 'facade',
 'facility',
 'fact',
 'factor',
 'family',
 'famous',
 'fan',
 'farmer',
 'fast',
 'fast food',
 'fast food chain',
 'fast food industry',
 'fast food outlet',
 'fast food restaurant',
 'fat',
 'february',
 'february ',
 'fermentation',
 'fermented',
 'ferran',
 'filling',
 'final',
 'finite',
 'finitely',
 'fish',
 'fish chip',
 'fish chip shop',
 'fixed',
 'fixed point',
 'flatbread',
 'flavor',
 'following',
 'follows',
 'food',
 'food ',
 'food aid',
 'food chain',
 'food cooking',
 'food ha',
 'food include',
 'food industry',
 'food market',
 'food outlet',
 'food preparation',
 'food prepared',
 'food price',
 'food processing',
 'food restaurant',
 'food safety',
 'food science',
 'food served',
 'food wa',
 'football',
 'foreign',
 'form',
 'formal',
 'france',
 'franchise',
 'free',
 'free group',
 'free group generator',
 'french',
 'fresh',
 'fried',
 'frozen',
 'fruit',
 'fruit vegetable',
 'fry',
 'frying',
 'fully',
 'fundamental',
 'fundamental group',
 'future',
 'g',
 'game',
 'garlic',
 'gastronomy',
 'gastronomy ',
 'gastronomy wa',
 'general',
 'generally',
 'generated',
 'generator',
 'george',
 'given',
 'global',
 'good',
 'government',
 'grain',
 'grape',
 'grape variety',
 'great',
 'greek',
 'green',
 'grill',
 'grilled',
 'grilled pizza',
 'grilling',
 'ground',
 'group',
 'group f',
 'group generator',
 'growing',
 'grown',
 'growth',
 'guide',
 'h',
 'ha',
 'ha hosted',
 'ha shown',
 'habit',
 'half',
 'hall',
 'hand',
 'harold',
 'harold mcgee',
 'hausdorff',
 'having',
 'health',
 'heart',
 'heat',
 'held',
 'help',
 'hervé',
 'hide',
 'hide ',
 'high',
 'higher',
 'highest',
 'highly',
 'history',
 'holte',
 'holte end',
 'home',
 'homology',
 'homotopy',
 'host',
 'hosted',
 'hot',
 'hotel',
 'hour',
 'house',
 'household',
 'human',
 'idea',
 'illness',
 'impact',
 'important',
 'impossible',
 'improve',
 'include',
 'included',
 'includes',
 'including',
 'increase',
 'increased',
 'increasing',
 'individual',
 'industrial',
 'industry',
 'influence',
 'information',
 'ingredient',
 'initial',
 'inside',
 'instance',
 'institute',
 'institution',
 'intake',
 'intended',
 'interior',
 'international',
 'introduced',
 'invariant',
 'investigation',
 'investment',
 'involved',
 'involves',
 'isbn',
 'isbn ',
 'isbn  ',
 'island',
 'issue',
 'italian',
 'italy',
 'item',
 'j',
 'japanese',
 'john',
 'juice',
 'june',
 'june ',
 'just',
 'k',
 'kind',
 'king',
 'kitchen',
 'knot',
 'known',
 'kurti',
 'la',
 'lane',
 'lane stand',
 'language',
 'large',
 'larger',
 'largest',
 'late',
 'later',
 'latin',
 'law',
 'le',
 'lead',
 'leading',
 'league',
 'led',
 'left',
 'lemon',
 'let',
 'level',
 'life',
 'like',
 'limited',
 'line',
 'linear',
 'link',
 'linked',
 'liquid',
 'list',
 'local',
 'locally',
 'located',
 'location',
 'london',
 'long',
 'low',
 'lower',
 'lower ground',
 'lower tier',
 'm',
 'main',
 'main article',
 'major',
 'make',
 'making',
 'manifold',
 'manufacture',
 'manufacturing',
 'map',
 'market',
 'marketing',
 'mass',
 'match',
 'material',
 'mathematical',
 'mathematics',
 'mcdonald',
 'mcdonald s',
 'mcgee',
 'meal',
 'mean',
 'meant',
 'measure',
 'meat',
 'meat cooked',
 'medium',
 'meeting',
 'menu',
 'metal',
 'method',
 'metre',
 'michelin',
 'microwave',
 'mid',
 'mid ',
 'middle',
 'milk',
 'million',
 'mineral',
 'modern',
 'molecular',
 'molecular gastronomy',
 'molecular gastronomy wa',
 'molecule',
 'month',
 'motion',
 'movement',
 'moving',
 'mozzarella',
 'multiple',
 'mushroom',
 'n',
 'named',
 'naples',
 'nation',
 'national',
 'native',
 'natural',
 'nature',
 'neapolitan',
 'neapolitan pizza',
 'near',
 'nearly',
 'need',
 'needed',
 'neumann',
 'new',
 'new york',
 'new york city',
 'new zealand',
 'nicholas',
 'nicholas kurti',
 'non',
 'north',
 'north stand',
 'note',
 'notion',
 'number',
 'numerous',
 'nutrient',
 'nutrition',
 'nutritional',
 'obesity',
 'obtained',
 'offer',
 'offered',
 'offering',
 'office',
 'oil',
 'old',
 'older',
 'olive',
 'olive oil',
 'onion',
 'open',
 'opened',
 'operation',
 'option',
 'orbit',
 'order',
 'ordinary',
 'organic',
 'organization',
 'origin',
 'original',
 'originally',
 'outlet',
 'outside',
 'outside home',
 'oven',
 'owner',
 'packaged',
 'packaging',
 'pan',
 'paper',
 'paradox',
 'paradoxical',
 'paradoxical decomposition',
 'paris',
 'park',
 'park wa',
 'particular',
 'particularly',
 'partitioned',
 'pay',
 'peanut',
 'people',
 'permission',
 'person',
 'physical',
 'physicist',
 'pie',
 'piece',
 'pierre',
 'pitch',
 'pizza',
 'pizza dough',
 'pizza pizza',
 'pizza wa',
 'pizzeria',
 'place',
 'plan',
 'plane',
 'plant',
 'play',
 'played',
 'point',
 'poisoning',
 'policy',
 'polygon',
 'pool',
 'poor',
 'popular',
 'population',
 'portion',
 'possibility',
 'possible',
 'potato',
 'practice',
 'pre',
 'preparation',
 'prepared',
 'prepared food',
 'present',
 'presentation',
 'presented',
 'preservation',
 'preserve',
 'preserving',
 'press',
 'previously',
 'price',
 'primary',
 'prior',
 'problem',
 'process',
 'processed',
 'processing',
 'produce',
 'produced',
 'producer',
 'product',
 'production',
 'professor',
 'promotes',
 'proof',
 'property',
 'proportion',
 'protein',
 'proved',
 'provide',
 'public',
 'published',
 'purchase',
 'quality',
 'question',
 'quick',
 'quickly',
 'r',
 'railway',
 'railway station',
 'range',
 'rare',
 'rate',
 'raw',
 'reaction',
 'reading',
 'ready',
 'real',
 'reason',
 'received',
 'recent',
 'recently',
 'recipe',
 'recognized',
 'recommended',
 'record',
 'red',
 'red wine',
 'reduce',
 'reducing',
 'reference',
 'reference ',
 'region',
 'regular',
 'related',
 'relatively',
 'released',
 'religious',
 'remaining',
 'replaced',
 'report',
 'require',
 'required',
 'requirement',
 'requires',
 'research',
 'researcher',
 'respect',
 'restaurant',
 'restaurant ',
 'result',
 'review',
 'rice',
 'rich',
 'rinder',
 'rise',
 'rising',
 'risk',
 'road',
 'road stand',
 'robert',
 'role',
 'roman',
 'roof',
 'room',
 'rosé',
 'rosé wine',
 'rotation',
 'round',
 'royal',
 'rugby',
 's',
 's ',
 'safety',
 'said',
 'salad',
 'sale',
 'salt',
 'sandwich',
 'sauce',
 'sausage',
 'sauvignon',
 'saw',
 'scale',
 'school',
 'science',
 'scientific',
 'scientist',
 'seafood',
 'season',
 'seat',
 'seating',
 'second',
 'seed',
 'seen',
 'selection',
 'sell',
 'selling',
 'sense',
 'series',
 'serve',
 'served',
 'service',
 'service restaurant',
 'serving',
 'set',
 'set e',
 'shape',
 'shaped',
 'shop',
 'short',
 'showed',
 'shown',
 'significant',
 'similar',
 'similarly',
 'simple',
 'simplicial',
 'simplicial complex',
 'single',
 'site',
 'size',
 'skin',
 'sl',
 'sl ',
 'sl  r',
 'slice',
 'slightly',
 'small',
 'smaller',
 'smoking',
 'social',
 'sold',
 'solid',
 'soup',
 'source',
 'south',
 'space',
 'special',
 'specie',
 'specific',
 'specifically',
 'spectator',
 'spent',
 'sphere',
 'spice',
 'square',
 'st',
 'stadium',
 'stage',
 'stand',
 'stand villa',
 'stand villa park',
 'stand wa',
 'standard',
 'starch',
 'start',
 'started',
 'state',
 'state ',
 'statement',
 'station',
 'steaming',
 'step',
 'step ',
 'stick',
 'stock',
 'storage',
 'store',
 'street',
 'string',
 'strong',
 'structure',
 'studied',
 'study',
 'style',
 'style cooking',
 'subgroup',
 'subject',
 'subset',
 'substance',
 ...]

In [533]:
def display_topics(comps, feature_names, no_top_words):
    for topic_idx, topic in enumerate(comps):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [536]:
no_top_words = 10
display_topics(model.named_steps["lda"].components_, model.named_steps["count_vectorizer"].get_feature_names(), no_top_words)


Topic 0:
      year wa food wine ha restaurant  year
Topic 1:
group c point d f free n rotation theorem step
Topic 2:
sugar fried measure important deep fried different salt deep transformation level
Topic 3:
 s wa new   century cup restaurant th cooking
Topic 4:
group space banach paradox tarski set algebraic banach tarski euclidean decomposition
Topic 5:
vegetable ha vitamin large including obesity industry cooking water mineral
Topic 6:
food fast fast food restaurant cooking used health pizza animal ingredient
Topic 7:
wine grape g state united united state fruit  billion billion university
Topic 8:
pizza b e meat displaystyle  origin usually variety fish
Topic 9:
stand villa end park road villa park holte trinity trinity road wa
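
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` takes the ascending sort order and walks it backwards from the end, so it yields the indices of the largest weights first. A dependency-free sketch of the same selection, with toy names and numbers that are not taken from the model above:

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


feature_names = ["food", "wine", "pizza", "stand"]  # toy vocabulary
topic_weights = [0.2, 3.1, 0.05, 1.4]               # toy topic-word weights
top_words = [feature_names[i] for i in top_indices(topic_weights, 2)]
print(" ".join(top_words))
```

Ties may be ordered differently from NumPy's `argsort`, which does not matter for reading off the top words.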

In [532]:
model.named_steps["lda"].components_


Out[532]:
array([[  1.02171121e+03,   2.64226770e+02,   7.53039228e+01, ...,
          1.16661028e-01,   4.01260622e+00,   1.24449240e-01],
       [  7.84177765e+00,   1.23120571e-01,   1.19044590e-01, ...,
          1.16751680e-01,   1.14512014e-01,   1.38551672e-01],
       [  8.85075738e-01,   1.21390062e-01,   1.18309658e-01, ...,
          1.21419933e-01,   1.16528792e-01,   1.17654830e-01],
       ..., 
       [  1.34361330e+01,   2.62059933e-01,   1.26800934e-01, ...,
          1.21023038e-01,   1.17561001e-01,   1.16507266e-01],
       [  2.32712906e+01,   1.52936780e-01,   1.16366473e-01, ...,
          1.17736542e-01,   1.43464882e-01,   1.36238403e-01],
       [  1.77645334e+01,   1.23497588e-01,   1.18142382e-01, ...,
          1.14824151e-01,   1.16948395e-01,   1.17024743e-01]])

In [510]:
D["script.txt"].argsort()[::-1]


Out[510]:
9    6
8    5
7    0
6    3
5    7
4    8
3    1
2    9
1    2
0    4
Name: script.txt, dtype: int64

In [511]:
def pipeline(doc):
    res = model.transform([doc])
    return res[0]

In [512]:
res = {name:pipeline(doc) for name,doc in d.items()}

In [513]:
pipeline(d["pizza.txt"])[0]


Out[513]:
0.18827788832241277

In [514]:
D =pd.DataFrame(res)

In [515]:
D.plot.bar(figsize=(20,10))


Out[515]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42ebcb9e48>

In [781]:
D[D>0.1].T.plot.bar(figsize=(20,10),colormap='jet',title="Most important topics per document")


Out[781]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42e62c6208>

In [517]:
sns.heatmap(D.T)


Out[517]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42ea6d2dd8>

In [518]:
normed = (D.T - D.T.mean())/D.T.std()

In [519]:
D


Out[519]:
   algebraic_topology.txt  banch_tarski_paradox.txt  molecular_gatronomy.txt  pizza.txt  script.txt  transcript_1.txt  transcript_2.txt  transcript_3.txt  villa_park.txt  wine.txt
0                0.041485                  0.118863                 0.112368   0.188278    0.117091          0.204092          0.254821          0.113618        0.331657  0.276992
1                0.090005                  0.226947                 0.018855   0.001038    0.011714          0.002059          0.009726          0.023718        0.003170  0.036627
2                0.027939                  0.025188                 0.024058   0.007504    0.007465          0.015594          0.004935          0.014152        0.004839  0.010797
3                0.036773                  0.075233                 0.587501   0.169753    0.047372          0.115457          0.180344          0.056978        0.158684  0.093544
4                0.779749                  0.414800                 0.003518   0.000859    0.002248          0.001556          0.006864          0.000072        0.001542  0.001860
5                0.001889                  0.004490                 0.014883   0.008072    0.134896          0.029700          0.025385          0.129400        0.012831  0.016525
6                0.000139                  0.000038                 0.200718   0.356994    0.629986          0.547824          0.431183          0.629012        0.003098  0.132073
7                0.015868                  0.018606                 0.026288   0.000067    0.025382          0.034750          0.056659          0.013954        0.000041  0.382330
8                0.000139                  0.115797                 0.002411   0.267367    0.016286          0.043101          0.023323          0.012579        0.000041  0.049201
9                0.006013                  0.000038                 0.009402   0.000067    0.007560          0.005866          0.006760          0.006516        0.484098  0.000051

In [520]:
normed[normed>1].T


Out[520]:
   algebraic_topology.txt  banch_tarski_paradox.txt  molecular_gatronomy.txt  pizza.txt  script.txt  transcript_1.txt  transcript_2.txt  transcript_3.txt  villa_park.txt  wine.txt
0                     NaN                       NaN                      NaN        NaN         NaN               NaN               NaN               NaN        1.713374  1.111934
1                     NaN                  2.635183                      NaN        NaN         NaN               NaN               NaN               NaN             NaN       NaN
2                1.570761                  1.255122                 1.125534        NaN         NaN               NaN               NaN               NaN             NaN       NaN
3                     NaN                       NaN                 2.695683        NaN         NaN               NaN               NaN               NaN             NaN       NaN
4                2.482798                  1.106678                      NaN        NaN         NaN               NaN               NaN               NaN             NaN       NaN
5                     NaN                       NaN                      NaN        NaN    1.923811               NaN               NaN          1.814917             NaN       NaN
6                     NaN                       NaN                      NaN        NaN    1.299712               NaN               NaN          1.295957             NaN       NaN
7                     NaN                       NaN                      NaN        NaN         NaN               NaN               NaN               NaN             NaN  2.816598
8                     NaN                       NaN                      NaN   2.581843         NaN               NaN               NaN               NaN             NaN       NaN
9                     NaN                       NaN                      NaN        NaN         NaN               NaN               NaN               NaN        2.845323       NaN
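
The `normed` table standardises each topic across the documents and keeps only values more than one standard deviation above that topic's mean. The sketch below redoes this for topic 9 alone using only the standard library, with the weights copied from the `D` table above. It assumes the sample standard deviation (divisor n - 1), which is the pandas default for `.std()`.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardise a list of numbers using the sample standard deviation (n - 1)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


# Topic 9 weights copied from the D table, in column order
docs = ["algebraic_topology.txt", "banch_tarski_paradox.txt", "molecular_gatronomy.txt",
        "pizza.txt", "script.txt", "transcript_1.txt", "transcript_2.txt",
        "transcript_3.txt", "villa_park.txt", "wine.txt"]
topic_9 = [0.006013, 0.000038, 0.009402, 0.000067, 0.007560,
           0.005866, 0.006760, 0.006516, 0.484098, 0.000051]

# Keep only the documents that stand out on this topic (z-score above 1)
standout = {doc: z for doc, z in zip(docs, zscores(topic_9)) if z > 1}
print(standout)
```

Only villa_park.txt passes the threshold for this topic, in line with row 9 of the table.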

In [541]:
path ='./data/transcripts'
transcripts ={}

for name in os.listdir(path):
    if name.endswith('.txt'):
        f_path = os.path.join(path,name)
        with open(f_path) as f:
            transcripts[name] = f.read()

In [522]:
transcript_topics = {name:pipeline(doc) for name,doc in transcripts.items()}  # keep the raw text in `transcripts` for the tf-idf step further down

In [523]:
X  =pd.DataFrame(transcript_topics)

In [539]:
X['transcript_2.txt'].argmax()


Out[539]:
6

In [767]:
ax = X.T.plot.bar(figsize=(20,10),colormap='jet')



In [769]:
fig =ax.get_figure()
fig.savefig('/tmp/test.png')

Word score

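To score the words we use their tf-idf weights: the vectorizer is fitted on the transcripts (the training corpus) and the term frequencies come from the script. Note that the code below stores the negative log of each weight rather than a z-score, so a larger score corresponds to a smaller tf-idf weight.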
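
For reference, a z-score version of this scoring could look like the sketch below. It is a simplified stand-in rather than the notebook's method: it uses only the standard library, works on pre-tokenised toy data, uses the smoothed idf form ln((1 + n) / (1 + df)) + 1 that scikit-learn documents for `smooth_idf=True`, and skips the L2 normalisation that `TfidfVectorizer` applies by default. The function names and toy documents are invented for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def fit_idf(documents):
    """Smoothed inverse document frequency from tokenised training documents."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_zscores(tokens, idf):
    """tf-idf weight of each known term in `tokens`, standardised to z-scores."""
    counts = Counter(t for t in tokens if t in idf)  # terms unseen in training are dropped
    weights = {term: count * idf[term] for term, count in counts.items()}
    values = list(weights.values())
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero when all weights are equal
    return {term: (w - mu) / sigma for term, w in weights.items()}


# Toy stand-ins for the tokenised transcripts and the tokenised script
transcripts_toy = [["fast", "food", "meal"], ["food", "restaurant"], ["food", "cooking", "meal"]]
script_toy = ["food", "food", "meal", "cooking"]

scores = tfidf_zscores(script_toy, fit_idf(transcripts_toy))
for term, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(z, 3))
```

With this convention a higher z-score means a term carries more tf-idf weight in the script than the script's average term does.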


In [790]:
corpus =[]
for key,val in transcripts.items():
    corpus+=val.splitlines()
# Note: ngram_range=(1,4) also admits 4-word phrases, while the task caps key-words at 3 words
tfidf = TfidfVectorizer(tokenizer=Tokenizer(),stop_words='english',ngram_range=(1,4),smooth_idf=True)
tf_model = tfidf.fit(corpus)
script = tf_model.transform([d["script.txt"]])
tf_feature_names = tf_model.get_feature_names()

In [791]:
def display_word_ranks(vector, feature_names, no_top_words):
    # no_top_words is accepted but not used; every non-zero term is returned
    ranking_dict = {
        feature_names[i]: -np.log(vector[i])  # key is the word, value is its score
        for i in vector.argsort()  # iterate over indices sorted by value
        if vector[i] > 0  # only keep non-zero values, i.e. terms that appear in the document
    }
    return ranking_dict
z = script.toarray()

In [696]:
z[0]


Out[696]:
array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [792]:
ranking_dict =(display_word_ranks(z[0],tf_feature_names,40))

In [795]:
S = pd.Series(ranking_dict)
S = S.sort_values(ascending=False)
S[S!=S.value_counts().values[0]][:20].plot.bar(title="Our top n words scored")


Out[795]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42e6eacf60>

In [789]:
T =S.sample(30).index
Y = S[T]
Y.sort_values(ascending=False).plot.bar(figsize=(15,5),title="Our top n words scored")


Out[789]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42e5ac6390>

In [750]:
S.rank(ascending=False,method='first')[:30]


Out[750]:
fast food         1.0
fast              2.0
meal              3.0
customer          4.0
establishment     5.0
area              6.0
business          7.0
deep              8.0
family            9.0
number           10.0
press            11.0
public           12.0
chef             13.0
order            14.0
potato           15.0
steaming         16.0
south            17.0
approximately    18.0
away             19.0
reducing         20.0
quick            21.0
kind             22.0
making           23.0
roasting         24.0
second           25.0
component        26.0
research         27.0
link             28.0
beverage         29.0
claim            30.0
dtype: float64

In [780]:
S[T].rank(ascending=False,method='first').sort_values()


Out[780]:
fast food                                    1.0
family                                       2.0
expensive                                    3.0
white                                        4.0
negative                                     5.0
great                                        6.0
avoid                                        7.0
mexican                                      8.0
kidney                                       9.0
analysis richard                            10.0
cancer caused carcinogen                    11.0
food preservative cured                     12.0
seafood steak                               13.0
mineral vitamin                             14.0
richard peto                                15.0
thing                                       16.0
week                                        17.0
concentration carcinogen anticarcinogens    18.0
used food preservative                      19.0
digital                                     20.0
choice                                      21.0
government                                  22.0
health food                                 23.0
analysis                                    24.0
considered                                  25.0
united                                      26.0
industry                                    27.0
high                                        28.0
source                                      29.0
animal                                      30.0
dtype: float64