Part 3

Using the models.ldamodel module from the gensim library, run topic modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle on the value that returns topics you consider meaningful at first sight.


In [1]:
# imports
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from gensim import corpora, models, utils
from nltk.stem import WordNetLemmatizer


/home/christophe/Application/anaconda3/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")

In [2]:
data = pd.read_csv('hillary-clinton-emails/Emails.csv', index_col=0).dropna()
texts = pd.concat((data.ExtractedBodyText ,data.ExtractedSubject), axis=1)

In [3]:
sw = ['re', 'fw', 'fvv', 'fwd']

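To improve the results of the LDA model, we group the mails by subject. Before grouping, we normalize each subject and remove the reply/forward markers 're', 'fw', 'fvv' and 'fwd', so that replies and forwards fall into the same group as the original mail.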


In [4]:
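def filt(row):
    # Tokenize and lowercase the subject, then drop the reply/forward markers.
    tokens = utils.simple_preprocess(row.ExtractedSubject)
    return ' '.join(t for t in tokens if t not in sw)

texts['ExtractedSubject'] = texts.apply(filt, axis=1)
# Concatenate all rows of each subject group. The subject column is concatenated
# as well, so a subject repeats once per grouped mail in the output below.
texts = texts.groupby(by='ExtractedSubject', as_index=False).apply(lambda x: (x + ' ').sum())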
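
The grouping step above can also be sketched without pandas or gensim. This is only an illustration under stated assumptions: the regular expression is a rough stand-in for `utils.simple_preprocess`, the helper names `normalize_subject` and `group_bodies_by_subject` are hypothetical, and the sample mails are made up.

```python
import re
from collections import defaultdict

# Reply/forward markers removed from subjects (same list as `sw` above).
REPLY_MARKERS = {'re', 'fw', 'fvv', 'fwd'}

def normalize_subject(subject):
    """Lowercase, keep alphabetic tokens, and drop reply/forward markers."""
    tokens = re.findall(r'[a-z]+', subject.lower())
    return ' '.join(t for t in tokens if t not in REPLY_MARKERS)

def group_bodies_by_subject(mails):
    """Concatenate the bodies of all mails that share a normalized subject."""
    groups = defaultdict(list)
    for subject, body in mails:
        groups[normalize_subject(subject)].append(body)
    return {subject: ' '.join(bodies) for subject, bodies in groups.items()}

# Made-up (subject, body) pairs, for illustration only.
sample = [
    ('RE: Aceh paper', 'First reply.'),
    ('Fw: Aceh Paper', 'Forwarded copy.'),
    ('Schedule', 'See attached.'),
]
grouped = group_bodies_by_subject(sample)
print(grouped)
```

The reply and the forward end up in one document, which gives LDA longer texts to work with than single short mails.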

In [5]:
texts.head()


Out[5]:
   ExtractedBodyText                                  ExtractedSubject
0  - Benghazi\nSounds good. Thx--be sure I see hi...
1  Jake, when you can, I think I speak for the gr...  aar
2  Not until I have a report from Mitchell. No on...  abu mazen abu mazen abu mazen abu mazen abu ma...
3  I just wanted to share how much the Secretary ...  aceh paper
4  See below - confidential.\nAlso for WJC.           acting srsg mulet

We concatenate the e-mail body with the subject.


In [6]:
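# Assign the result back instead of calling fillna(inplace=True) on an attribute,
# which newer pandas versions may apply to a temporary copy.
texts['ExtractedBodyText'] = texts['ExtractedBodyText'].fillna('')
texts['ExtractedSubject'] = texts['ExtractedSubject'].fillna('')
texts['SubjectBody'] = texts.ExtractedBodyText + ' ' + texts.ExtractedSubject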
mails = texts.SubjectBody

To get more meaningful topics, we filter out English stop words and some custom words that carry little meaning as topics. We also drop tokens of three characters or fewer and lemmatize the remaining ones.


In [7]:
documents = []

custom = ['like', 'think', 'know', 'want', 'sure', 'thing', 'send', 'sent', 'speech', 'print', 'time', 'said', 'maybe', 'today', 'tomorrow', 'thank', 'thanks']
english_stop_words = ["a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", 
"some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]
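# Merge every stop-word source into one set for fast membership tests.
sw = set(stopwords.words('english')) | set(sw) | set(custom) | set(english_stop_words)
lemmatizer = WordNetLemmatizer()  # create once and reuse for every mail
for text in mails:
    # Tokenize and lowercase, then drop stop words and tokens of 3 characters or fewer.
    tokens = [t for t in utils.simple_preprocess(text) if t not in sw and len(t) > 3]
    # Lemmatize, then filter again: a lemma can itself be a stop word or too short.
    lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
    documents.append([t for t in lemmatized if t not in sw and len(t) > 3])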
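
To see why the filter is applied both before and after lemmatization, here is a small self-contained sketch. It makes loud assumptions: `STOP_WORDS` is a tiny illustrative stop list, and the `LEMMAS` dictionary is a hypothetical stand-in for `WordNetLemmatizer`, so the exact lemmas a real lemmatizer returns may differ.

```python
# Tiny illustrative stop list (the notebook uses a much larger one).
STOP_WORDS = {'want', 'thank', 'the', 'for'}
# Hypothetical stand-in for a real lemmatizer: maps a few plural forms to a base form.
LEMMAS = {'wants': 'want', 'meetings': 'meeting', 'drafts': 'draft'}

def keep(token):
    """Keep tokens that are not stop words and are longer than 3 characters."""
    return token not in STOP_WORDS and len(token) > 3

def clean(tokens):
    """Filter, lemmatize, then filter again."""
    first_pass = [t for t in tokens if keep(t)]
    lemmas = [LEMMAS.get(t, t) for t in first_pass]
    return [t for t in lemmas if keep(t)]

tokens = ['the', 'secretary', 'wants', 'drafts', 'for', 'meetings', 'thank']
cleaned = clean(tokens)
print(cleaned)  # ['secretary', 'draft', 'meeting']
```

Here 'wants' survives the first pass but becomes the stop word 'want' after lemmatization, so only the second pass removes it.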

To use the LDA model, we need to transform every document into a bag of words: a list of (token ID, term frequency) tuples.


In [8]:
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
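
As a rough illustration of this representation, the plain-Python sketch below builds the same kind of (token ID, term frequency) lists. It is not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, the documents are made up, and the IDs gensim assigns to tokens may differ.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Made-up tokenized documents, for illustration only.
docs = [['haiti', 'relief', 'haiti'], ['schedule', 'haiti']]
vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)  # {'haiti': 0, 'relief': 1, 'schedule': 2}
print(bows)   # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded; only how often each token occurs in a document is kept.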

Now we train LDA models with different numbers of topics (5, 15, 25, 35 and 45) and print the ten highest-weighted words of each topic.


In [9]:
import pprint
pp = pprint.PrettyPrinter(depth=2)
for i in range(5, 50, 10):
    print('------------------------------------')
    print(i, 'topics')
    lda = models.LdaModel(corpus, num_topics=i, id2word = dictionary)
    pp.pprint(lda.print_topics(lda.num_topics))
    print()


------------------------------------
5 topics
[(0,
  '0.008*"good" + 0.007*"issue" + 0.007*"schedule" + 0.007*"draft" + '
  '0.006*"state" + 0.005*"meeting" + 0.005*"secretary" + 0.005*"sheet" + '
  '0.005*"africa" + 0.005*"make"'),
 (1,
  '0.007*"email" + 0.007*"list" + 0.007*"secretary" + 0.006*"state" + '
  '0.006*"schedule" + 0.005*"make" + 0.005*"work" + 0.005*"statement" + '
  '0.004*"draft" + 0.004*"need"'),
 (2,
  '0.012*"press" + 0.011*"state" + 0.010*"strategic" + 0.010*"dialogue" + '
  '0.010*"office" + 0.010*"list" + 0.009*"clip" + 0.008*"secretary" + '
  '0.007*"talk" + 0.007*"schedule"'),
 (3,
  '0.009*"haiti" + 0.007*"good" + 0.007*"work" + 0.006*"foreign" + '
  '0.006*"meeting" + 0.005*"draft" + 0.005*"make" + 0.005*"help" + '
  '0.005*"right" + 0.005*"article"'),
 (4,
  '0.006*"work" + 0.006*"schedule" + 0.005*"cheryl" + 0.005*"discus" + '
  '0.005*"lunch" + 0.005*"state" + 0.005*"sheet" + 0.005*"issue" + '
  '0.005*"letter" + 0.004*"dinner"')]

------------------------------------
15 topics
[(0,
  '0.020*"list" + 0.011*"secretary" + 0.010*"cheryl" + 0.009*"work" + '
  '0.007*"turkey" + 0.007*"lauren" + 0.007*"state" + 0.007*"armenia" + '
  '0.007*"ashton" + 0.007*"davutoglu"'),
 (1,
  '0.019*"dialogue" + 0.018*"press" + 0.018*"strategic" + 0.017*"clip" + '
  '0.012*"schedule" + 0.009*"draft" + 0.008*"friday" + 0.007*"copy" + '
  '0.007*"lunch" + 0.007*"state"'),
 (2,
  '0.013*"confirmed" + 0.013*"minister" + 0.009*"secretary" + 0.009*"karzai" + '
  '0.007*"lona" + 0.007*"make" + 0.007*"work" + 0.006*"point" + 0.006*"jake" + '
  '0.005*"prime"'),
 (3,
  '0.013*"update" + 0.011*"haiti" + 0.009*"good" + 0.007*"travel" + '
  '0.006*"office" + 0.005*"follow" + 0.005*"talk" + 0.005*"honduras" + '
  '0.005*"letter" + 0.005*"working"'),
 (4,
  '0.017*"statement" + 0.017*"issue" + 0.010*"brava" + 0.010*"bravo" + '
  '0.010*"meeting" + 0.008*"state" + 0.007*"lona" + 0.006*"clinton" + '
  '0.006*"secretary" + 0.005*"going"'),
 (5,
  '0.009*"work" + 0.009*"press" + 0.008*"good" + 0.007*"discus" + '
  '0.006*"issue" + 0.006*"talk" + 0.006*"release" + 0.005*"strategic" + '
  '0.005*"dialogue" + 0.005*"follow"'),
 (6,
  '0.011*"sheet" + 0.010*"draft" + 0.009*"email" + 0.007*"state" + '
  '0.007*"mubarak" + 0.007*"idea" + 0.007*"work" + 0.006*"foreign" + '
  '0.006*"secretary" + 0.006*"talk"'),
 (7,
  '0.014*"list" + 0.013*"schedule" + 0.007*"need" + 0.007*"followup" + '
  '0.007*"sheet" + 0.006*"week" + 0.005*"discus" + 0.005*"huma" + '
  '0.005*"question" + 0.005*"june"'),
 (8,
  '0.017*"schedule" + 0.012*"state" + 0.009*"work" + 0.008*"meeting" + '
  '0.008*"draft" + 0.008*"right" + 0.008*"conference" + 0.007*"good" + '
  '0.007*"discus" + 0.007*"lona"'),
 (9,
  '0.012*"issue" + 0.010*"state" + 0.007*"kerry" + 0.007*"jake" + 0.007*"talk" '
  '+ 0.007*"good" + 0.006*"statement" + 0.006*"email" + 0.005*"work" + '
  '0.005*"agree"'),
 (10,
  '0.015*"haiti" + 0.014*"help" + 0.011*"work" + 0.011*"ashton" + '
  '0.010*"friend" + 0.010*"conference" + 0.009*"secretary" + 0.009*"office" + '
  '0.009*"pakistan" + 0.008*"team"'),
 (11,
  '0.012*"development" + 0.009*"oprah" + 0.008*"india" + 0.007*"line" + '
  '0.007*"good" + 0.006*"haiti" + 0.006*"canada" + 0.006*"spoke" + '
  '0.006*"thought" + 0.006*"question"'),
 (12,
  '0.009*"year" + 0.009*"work" + 0.008*"mazen" + 0.007*"state" + 0.007*"make" '
  '+ 0.006*"great" + 0.006*"eikenberry" + 0.006*"jack" + 0.006*"talk" + '
  '0.005*"read"'),
 (13,
  '0.013*"office" + 0.012*"list" + 0.010*"state" + 0.009*"dinner" + '
  '0.009*"secretary" + 0.009*"draft" + 0.007*"going" + 0.007*"prayer" + '
  '0.007*"breakfast" + 0.005*"tech"'),
 (14,
  '0.011*"woman" + 0.009*"draft" + 0.009*"latest" + 0.008*"talk" + '
  '0.008*"video" + 0.008*"list" + 0.007*"health" + 0.007*"mother" + '
  '0.006*"good" + 0.006*"email"')]

------------------------------------
25 topics
[(0,
  '0.012*"draft" + 0.010*"year" + 0.009*"plan" + 0.009*"work" + 0.008*"issue" '
  '+ 0.008*"email" + 0.007*"schedule" + 0.007*"tonight" + 0.007*"honduras" + '
  '0.006*"follow"'),
 (1,
  '0.016*"press" + 0.013*"mazen" + 0.011*"need" + 0.010*"follow" + '
  '0.010*"maggie" + 0.009*"strategic" + 0.009*"dialogue" + 0.008*"health" + '
  '0.008*"abortion" + 0.008*"maternal"'),
 (2,
  '0.013*"schedule" + 0.009*"lona" + 0.008*"work" + 0.008*"confirmed" + '
  '0.008*"list" + 0.008*"talk" + 0.007*"latest" + 0.007*"woman" + '
  '0.007*"right" + 0.006*"state"'),
 (3,
  '0.029*"press" + 0.029*"dialogue" + 0.026*"strategic" + 0.018*"clip" + '
  '0.008*"state" + 0.008*"schedule" + 0.006*"ambassador" + 0.006*"foreign" + '
  '0.006*"make" + 0.006*"issue"'),
 (4,
  '0.023*"draft" + 0.012*"breakfast" + 0.012*"prayer" + 0.010*"work" + '
  '0.009*"height" + 0.009*"dorothy" + 0.009*"statement" + 0.009*"read" + '
  '0.008*"believe" + 0.008*"come"'),
 (5,
  '0.022*"list" + 0.012*"good" + 0.010*"work" + 0.008*"june" + 0.008*"talk" + '
  '0.007*"guinea" + 0.006*"article" + 0.006*"lavrov" + 0.006*"pakistan" + '
  '0.006*"family"'),
 (6,
  '0.013*"secretary" + 0.012*"office" + 0.011*"africa" + 0.010*"clinton" + '
  '0.007*"case" + 0.007*"state" + 0.007*"draft" + 0.006*"agenda" + '
  '0.006*"room" + 0.006*"video"'),
 (7,
  '0.015*"schedule" + 0.009*"india" + 0.008*"sheet" + 0.008*"uruguay" + '
  '0.008*"peru" + 0.008*"close" + 0.008*"concert" + 0.008*"consideration" + '
  '0.008*"coming" + 0.008*"latest"'),
 (8,
  '0.011*"oprah" + 0.009*"state" + 0.009*"help" + 0.008*"year" + '
  '0.008*"schedule" + 0.008*"discus" + 0.007*"pakistan" + 0.007*"forward" + '
  '0.007*"secretary" + 0.007*"little"'),
 (9,
  '0.024*"memo" + 0.014*"possible" + 0.012*"issue" + 0.010*"berlin" + '
  '0.010*"copy" + 0.009*"help" + 0.008*"clinton" + 0.008*"state" + '
  '0.008*"request" + 0.007*"jake"'),
 (10,
  '0.020*"work" + 0.020*"ashton" + 0.015*"haiti" + 0.014*"lady" + '
  '0.013*"friend" + 0.012*"conference" + 0.010*"good" + 0.010*"make" + '
  '0.008*"article" + 0.008*"discus"'),
 (11,
  '0.032*"cheryl" + 0.024*"friday" + 0.011*"haiti" + 0.009*"travel" + '
  '0.008*"idea" + 0.007*"office" + 0.007*"ashton" + 0.007*"later" + '
  '0.007*"eikenberry" + 0.007*"letter"'),
 (12,
  '0.019*"harry" + 0.017*"reid" + 0.016*"secretary" + 0.014*"sheet" + '
  '0.008*"need" + 0.008*"ashton" + 0.007*"africa" + 0.007*"office" + '
  '0.007*"message" + 0.006*"speak"'),
 (13,
  '0.030*"clip" + 0.023*"strategic" + 0.021*"press" + 0.020*"dialogue" + '
  '0.010*"state" + 0.010*"menendez" + 0.009*"schedule" + 0.007*"holbrooke" + '
  '0.007*"bolivia" + 0.007*"alexander"'),
 (14,
  '0.023*"sheet" + 0.017*"state" + 0.016*"armenia" + 0.016*"turkey" + '
  '0.016*"text" + 0.016*"davutoglu" + 0.016*"mubarak" + 0.015*"schedule" + '
  '0.009*"haiti" + 0.009*"house"'),
 (15,
  '0.014*"richard" + 0.014*"wrote" + 0.009*"karl" + 0.009*"karzai" + '
  '0.009*"caitlin" + 0.006*"schedule" + 0.005*"department" + 0.005*"benghazi" '
  '+ 0.005*"jones" + 0.005*"thursday"'),
 (16,
  '0.016*"issue" + 0.015*"statement" + 0.014*"bravo" + 0.014*"brava" + '
  '0.010*"clinton" + 0.008*"work" + 0.008*"secretary" + 0.007*"state" + '
  '0.007*"africa" + 0.007*"change"'),
 (17,
  '0.015*"state" + 0.014*"water" + 0.013*"lunch" + 0.010*"lona" + '
  '0.009*"schedule" + 0.009*"john" + 0.009*"haiti" + 0.009*"secretary" + '
  '0.008*"response" + 0.008*"come"'),
 (18,
  '0.012*"secretary" + 0.011*"dinner" + 0.011*"jake" + 0.010*"ready" + '
  '0.010*"email" + 0.009*"draft" + 0.008*"report" + 0.008*"tech" + '
  '0.008*"blair" + 0.008*"issue"'),
 (19,
  '0.012*"need" + 0.012*"schedule" + 0.010*"canada" + 0.009*"meeting" + '
  '0.008*"list" + 0.008*"line" + 0.008*"hang" + 0.008*"plane" + 0.007*"global" '
  '+ 0.006*"shuttle"'),
 (20,
  '0.016*"letter" + 0.013*"lautenberg" + 0.011*"house" + 0.008*"update" + '
  '0.007*"going" + 0.007*"email" + 0.006*"video" + 0.006*"obama" + '
  '0.006*"cooperation" + 0.006*"reject"'),
 (21,
  '0.027*"development" + 0.017*"reaction" + 0.013*"letter" + 0.013*"sign" + '
  '0.010*"work" + 0.010*"email" + 0.010*"list" + 0.010*"update" + '
  '0.007*"edits" + 0.007*"news"'),
 (22,
  '0.017*"list" + 0.013*"copy" + 0.009*"discus" + 0.007*"hatfield" + '
  '0.007*"dalton" + 0.007*"confirmed" + 0.007*"help" + 0.006*"asked" + '
  '0.006*"work" + 0.006*"issue"'),
 (23,
  '0.014*"good" + 0.013*"work" + 0.009*"haiti" + 0.008*"talk" + 0.008*"woman" '
  '+ 0.008*"mepp" + 0.008*"info" + 0.008*"state" + 0.007*"issue" + '
  '0.007*"meeting"'),
 (24,
  '0.020*"update" + 0.011*"talk" + 0.009*"message" + 0.009*"draft" + '
  '0.007*"jack" + 0.007*"ending" + 0.007*"email" + 0.007*"schedule" + '
  '0.007*"state" + 0.006*"secretary"')]

------------------------------------
35 topics
[(0,
  '0.015*"message" + 0.013*"june" + 0.013*"pakistan" + 0.011*"issue" + '
  '0.010*"texting" + 0.010*"received" + 0.009*"discus" + 0.009*"campaign" + '
  '0.009*"kenya" + 0.009*"saeb"'),
 (1,
  '0.021*"cheryl" + 0.021*"lunch" + 0.015*"holiday" + 0.014*"water" + '
  '0.012*"love" + 0.011*"report" + 0.011*"travel" + 0.011*"wait" + '
  '0.011*"lona" + 0.009*"hope"'),
 (2,
  '0.024*"schedule" + 0.015*"state" + 0.013*"water" + 0.008*"honduras" + '
  '0.007*"department" + 0.007*"pakistan" + 0.007*"august" + 0.007*"word" + '
  '0.006*"best" + 0.006*"resolution"'),
 (3,
  '0.012*"jack" + 0.012*"good" + 0.011*"work" + 0.009*"list" + 0.009*"meeting" '
  '+ 0.009*"news" + 0.009*"lavrov" + 0.009*"schedule" + 0.008*"woman" + '
  '0.008*"told"'),
 (4,
  '0.036*"list" + 0.018*"development" + 0.013*"work" + 0.010*"great" + '
  '0.010*"blair" + 0.010*"foreign" + 0.009*"king" + 0.008*"possible" + '
  '0.008*"right" + 0.008*"story"'),
 (5,
  '0.022*"friday" + 0.016*"schedule" + 0.012*"bangl" + 0.012*"funeral" + '
  '0.012*"india" + 0.010*"mini" + 0.010*"team" + 0.010*"medium" + 0.009*"mepp" '
  '+ 0.009*"visitation"'),
 (6,
  '0.015*"qatar" + 0.015*"sheikha" + 0.015*"mosa" + 0.015*"maggie" + '
  '0.011*"help" + 0.010*"release" + 0.010*"decide" + 0.010*"dinner" + '
  '0.010*"update" + 0.008*"night"'),
 (7,
  '0.012*"public" + 0.010*"work" + 0.009*"ending" + 0.009*"good" + '
  '0.008*"woman" + 0.008*"follow" + 0.007*"issue" + 0.007*"jack" + '
  '0.007*"summary" + 0.007*"state"'),
 (8,
  '0.023*"list" + 0.017*"case" + 0.015*"goldman" + 0.011*"care" + 0.011*"told" '
  '+ 0.010*"haiti" + 0.010*"briefing" + 0.009*"brazilian" + 0.008*"make" + '
  '0.008*"huma"'),
 (9,
  '0.012*"going" + 0.011*"woman" + 0.009*"nyse" + 0.009*"state" + 0.008*"good" '
  '+ 0.008*"funeral" + 0.008*"representation" + 0.008*"asap" + '
  '0.008*"confirmation" + 0.007*"make"'),
 (10,
  '0.049*"strategic" + 0.048*"press" + 0.044*"dialogue" + 0.044*"clip" + '
  '0.018*"letter" + 0.014*"lautenberg" + 0.009*"love" + 0.009*"food" + '
  '0.009*"release" + 0.009*"harry"'),
 (11,
  '0.014*"height" + 0.014*"dorothy" + 0.012*"draft" + 0.007*"version" + '
  '0.007*"domestic" + 0.007*"benghazi" + 0.007*"march" + 0.007*"islamist" + '
  '0.007*"militia" + 0.007*"libyan"'),
 (12,
  '0.016*"john" + 0.016*"kerry" + 0.015*"secretary" + 0.014*"plan" + '
  '0.013*"bolivia" + 0.009*"state" + 0.009*"tonight" + 0.009*"concert" + '
  '0.009*"agree" + 0.009*"chinese"'),
 (13,
  '0.020*"help" + 0.015*"contact" + 0.012*"dying" + 0.010*"patient" + '
  '0.010*"message" + 0.010*"feedback" + 0.010*"believe" + 0.010*"middle" + '
  '0.010*"east" + 0.010*"comment"'),
 (14,
  '0.018*"draft" + 0.018*"list" + 0.016*"going" + 0.013*"prayer" + '
  '0.013*"breakfast" + 0.010*"lunch" + 0.009*"ashton" + 0.008*"wrote" + '
  '0.008*"lady" + 0.008*"faxed"'),
 (15,
  '0.015*"draft" + 0.015*"text" + 0.015*"davutoglu" + 0.014*"turkey" + '
  '0.014*"armenia" + 0.014*"office" + 0.012*"haiti" + 0.011*"line" + '
  '0.010*"hang" + 0.010*"ashton"'),
 (16,
  '0.022*"secretary" + 0.020*"work" + 0.012*"reaction" + 0.012*"clinton" + '
  '0.012*"office" + 0.011*"state" + 0.008*"discussion" + 0.008*"article" + '
  '0.008*"meeting" + 0.008*"draft"'),
 (17,
  '0.017*"state" + 0.011*"draft" + 0.011*"update" + 0.009*"change" + '
  '0.009*"benghazi" + 0.008*"letter" + 0.008*"good" + 0.007*"department" + '
  '0.007*"email" + 0.007*"house"'),
 (18,
  '0.015*"schedule" + 0.011*"secretary" + 0.010*"state" + 0.009*"haiti" + '
  '0.008*"issue" + 0.008*"statement" + 0.007*"help" + 0.007*"trip" + '
  '0.007*"confirmed" + 0.007*"bravo"'),
 (19,
  '0.016*"family" + 0.014*"copy" + 0.012*"secretary" + 0.012*"hungarian" + '
  '0.012*"bajnai" + 0.012*"confirmed" + 0.012*"gordon" + 0.012*"minister" + '
  '0.012*"prime" + 0.012*"death"'),
 (20,
  '0.026*"dinner" + 0.021*"tech" + 0.016*"read" + 0.011*"leave" + 0.011*"list" '
  '+ 0.011*"later" + 0.011*"thursday" + 0.011*"update" + 0.011*"caitlin" + '
  '0.011*"korea"'),
 (21,
  '0.018*"issue" + 0.013*"statement" + 0.012*"draft" + 0.010*"guinea" + '
  '0.010*"french" + 0.010*"goodwill" + 0.009*"dollar" + 0.009*"little" + '
  '0.008*"pakistan" + 0.008*"billion"'),
 (22,
  '0.016*"brava" + 0.016*"statement" + 0.016*"africa" + 0.014*"list" + '
  '0.014*"issue" + 0.014*"bravo" + 0.011*"state" + 0.011*"clinton" + '
  '0.009*"abortion" + 0.009*"health"'),
 (23,
  '0.018*"canada" + 0.018*"mazen" + 0.012*"knee" + 0.010*"seen" + '
  '0.008*"decision" + 0.007*"original" + 0.007*"beat" + 0.007*"goldman" + '
  '0.007*"photo" + 0.007*"forward"'),
 (24,
  '0.022*"email" + 0.015*"make" + 0.013*"hope" + 0.013*"lissa" + '
  '0.013*"yellow" + 0.010*"secretary" + 0.010*"draft" + 0.010*"jake" + '
  '0.010*"holder" + 0.008*"morning"'),
 (25,
  '0.034*"sheet" + 0.020*"mubarak" + 0.020*"talk" + 0.013*"reid" + '
  '0.012*"report" + 0.011*"harry" + 0.011*"haiti" + 0.011*"statement" + '
  '0.011*"schedule" + 0.010*"video"'),
 (26,
  '0.013*"friday" + 0.010*"rest" + 0.008*"intervention" + 0.008*"lanka" + '
  '0.008*"update" + 0.008*"make" + 0.008*"told" + 0.007*"come" + '
  '0.007*"monday" + 0.007*"huma"'),
 (27,
  '0.032*"oprah" + 0.016*"eikenberry" + 0.016*"jack" + 0.011*"morning" + '
  '0.010*"update" + 0.010*"monday" + 0.009*"make" + 0.009*"budget" + '
  '0.008*"kurt" + 0.008*"secure"'),
 (28,
  '0.017*"schedule" + 0.013*"state" + 0.010*"idea" + 0.009*"secretary" + '
  '0.008*"asked" + 0.007*"menendez" + 0.007*"best" + 0.007*"great" + '
  '0.007*"email" + 0.007*"followup"'),
 (29,
  '0.013*"sheet" + 0.013*"expo" + 0.012*"letter" + 0.011*"copy" + 0.011*"peru" '
  '+ 0.011*"uruguay" + 0.011*"argentina" + 0.009*"company" + 0.009*"shanghai" '
  '+ 0.009*"follow"'),
 (30,
  '0.015*"draft" + 0.012*"billion" + 0.011*"pakistan" + 0.011*"little" + '
  '0.010*"dollar" + 0.010*"goodwill" + 0.009*"info" + 0.009*"issue" + '
  '0.008*"political" + 0.008*"conference"'),
 (31,
  '0.020*"haiti" + 0.016*"ashton" + 0.014*"talk" + 0.012*"june" + 0.011*"week" '
  '+ 0.010*"lady" + 0.009*"effort" + 0.009*"friday" + 0.009*"graduation" + '
  '0.009*"shaw"'),
 (32,
  '0.014*"work" + 0.013*"schedule" + 0.010*"good" + 0.010*"internet" + '
  '0.010*"freedom" + 0.010*"revision" + 0.010*"unga" + 0.008*"plane" + '
  '0.008*"need" + 0.007*"idea"'),
 (33,
  '0.034*"dialogue" + 0.031*"press" + 0.027*"clip" + 0.027*"strategic" + '
  '0.018*"list" + 0.013*"table" + 0.011*"office" + 0.011*"huma" + 0.010*"talk" '
  '+ 0.010*"latest"'),
 (34,
  '0.014*"need" + 0.008*"policy" + 0.008*"good" + 0.008*"follow" + '
  '0.008*"make" + 0.007*"point" + 0.007*"response" + 0.007*"posner" + '
  '0.007*"message" + 0.006*"state"')]

------------------------------------
45 topics
[(0,
  '0.023*"draft" + 0.018*"lissa" + 0.014*"morning" + 0.014*"foreign" + '
  '0.014*"affair" + 0.011*"help" + 0.010*"lamplighter" + 0.010*"ireland" + '
  '0.010*"story" + 0.009*"patient"'),
 (1,
  '0.021*"lunch" + 0.021*"mazen" + 0.013*"holiday" + 0.010*"point" + '
  '0.010*"year" + 0.009*"hope" + 0.009*"question" + 0.008*"jersey" + '
  '0.008*"qadahfi" + 0.008*"huma"'),
 (2,
  '0.030*"table" + 0.015*"office" + 0.015*"discussion" + 0.015*"round" + '
  '0.013*"development" + 0.012*"haiti" + 0.011*"list" + 0.009*"idea" + '
  '0.008*"come" + 0.008*"cheryl"'),
 (3,
  '0.016*"line" + 0.015*"hang" + 0.013*"health" + 0.012*"goldman" + '
  '0.012*"house" + 0.012*"maternal" + 0.012*"abortion" + 0.011*"bildt" + '
  '0.011*"case" + 0.010*"internet"'),
 (4,
  '0.028*"help" + 0.013*"holbrooke" + 0.013*"draft" + 0.013*"dying" + '
  '0.010*"menendez" + 0.010*"military" + 0.010*"patient" + 0.009*"state" + '
  '0.009*"coup" + 0.009*"honduras"'),
 (5,
  '0.014*"copy" + 0.013*"cheryl" + 0.012*"idea" + 0.012*"work" + 0.010*"asked" '
  '+ 0.009*"holder" + 0.009*"love" + 0.009*"follow" + 0.009*"need" + '
  '0.007*"article"'),
 (6,
  '0.017*"woman" + 0.011*"office" + 0.011*"travel" + 0.010*"state" + '
  '0.008*"told" + 0.008*"saeb" + 0.008*"mother" + 0.008*"bibi" + 0.006*"voice" '
  '+ 0.006*"come"'),
 (7,
  '0.023*"eikenberry" + 0.023*"haiti" + 0.023*"jack" + 0.017*"make" + '
  '0.013*"cheryl" + 0.012*"case" + 0.012*"thursday" + 0.011*"edits" + '
  '0.011*"intervention" + 0.011*"arturo"'),
 (8,
  '0.025*"draft" + 0.021*"breakfast" + 0.021*"prayer" + 0.021*"reaction" + '
  '0.017*"work" + 0.012*"paper" + 0.012*"faxed" + 0.008*"friday" + '
  '0.008*"special" + 0.008*"create"'),
 (9,
  '0.023*"list" + 0.021*"update" + 0.011*"lauren" + 0.011*"bolivia" + '
  '0.009*"huma" + 0.009*"honduras" + 0.009*"eastern" + 0.009*"mashabane" + '
  '0.009*"wheel" + 0.009*"mission"'),
 (10,
  '0.024*"goodwill" + 0.023*"dollar" + 0.023*"little" + 0.022*"issue" + '
  '0.021*"pakistan" + 0.020*"billion" + 0.013*"draft" + 0.012*"denis" + '
  '0.012*"note" + 0.010*"asked"'),
 (11,
  '0.024*"bangl" + 0.024*"india" + 0.018*"kurtzer" + 0.018*"conversation" + '
  '0.013*"follow" + 0.013*"good" + 0.012*"went" + 0.012*"confirmed" + '
  '0.012*"letter" + 0.012*"friday"'),
 (12,
  '0.029*"schedule" + 0.013*"event" + 0.013*"friday" + 0.012*"putin" + '
  '0.010*"meeting" + 0.010*"mini" + 0.010*"article" + 0.010*"comment" + '
  '0.009*"affair" + 0.009*"right"'),
 (13,
  '0.014*"good" + 0.014*"water" + 0.010*"john" + 0.010*"kerry" + 0.010*"sudan" '
  '+ 0.010*"meeting" + 0.009*"visit" + 0.008*"talk" + 0.007*"huma" + '
  '0.007*"draft"'),
 (14,
  '0.038*"ashton" + 0.026*"lady" + 0.023*"haiti" + 0.023*"conference" + '
  '0.023*"friend" + 0.018*"turkey" + 0.017*"text" + 0.017*"davutoglu" + '
  '0.015*"armenia" + 0.012*"message"'),
 (15,
  '0.028*"friday" + 0.017*"qatar" + 0.017*"sheikha" + 0.017*"briefing" + '
  '0.017*"mosa" + 0.011*"work" + 0.011*"help" + 0.011*"schedule" + '
  '0.011*"armenia" + 0.009*"davutoglu"'),
 (16,
  '0.019*"funeral" + 0.014*"visitation" + 0.009*"need" + 0.009*"refaming" + '
  '0.009*"domestic" + 0.009*"russian" + 0.009*"moines" + 0.009*"campbell" + '
  '0.009*"memo" + 0.009*"debate"'),
 (17,
  '0.017*"draft" + 0.017*"ending" + 0.013*"video" + 0.013*"vote" + '
  '0.013*"talking" + 0.013*"going" + 0.012*"morning" + 0.012*"statement" + '
  '0.012*"make" + 0.009*"schedule"'),
 (18,
  '0.021*"coat" + 0.021*"info" + 0.014*"love" + 0.014*"chinese" + '
  '0.014*"walker" + 0.014*"death" + 0.014*"spargo" + 0.014*"johnnie" + '
  '0.008*"huma" + 0.007*"visit"'),
 (19,
  '0.014*"report" + 0.014*"secretary" + 0.014*"schedule" + 0.010*"question" + '
  '0.010*"make" + 0.010*"hope" + 0.010*"state" + 0.009*"lona" + 0.008*"week" + '
  '0.008*"confirmed"'),
 (20,
  '0.018*"freedom" + 0.018*"message" + 0.018*"internet" + 0.017*"revision" + '
  '0.012*"talk" + 0.012*"make" + 0.012*"contact" + 0.012*"ashton" + '
  '0.009*"issue" + 0.007*"action"'),
 (21,
  '0.021*"harry" + 0.021*"reid" + 0.017*"pakistan" + 0.017*"development" + '
  '0.012*"texting" + 0.011*"received" + 0.011*"campaign" + 0.010*"talk" + '
  '0.009*"sheet" + 0.009*"donation"'),
 (22,
  '0.016*"lautenberg" + 0.015*"letter" + 0.013*"brazilian" + 0.013*"maggie" + '
  '0.011*"secretary" + 0.010*"email" + 0.009*"case" + 0.009*"need" + '
  '0.009*"copy" + 0.008*"posner"'),
 (23,
  '0.013*"lona" + 0.011*"letter" + 0.011*"cheryl" + 0.009*"need" + '
  '0.009*"good" + 0.008*"schedule" + 0.008*"august" + 0.007*"copying" + '
  '0.007*"follow" + 0.007*"told"'),
 (24,
  '0.017*"schedule" + 0.016*"work" + 0.014*"plane" + 0.011*"shuttle" + '
  '0.010*"knee" + 0.010*"state" + 0.010*"lona" + 0.008*"confirmed" + '
  '0.008*"going" + 0.008*"secretary"'),
 (25,
  '0.018*"good" + 0.014*"dinner" + 0.013*"state" + 0.012*"issue" + '
  '0.011*"work" + 0.011*"tech" + 0.009*"message" + 0.008*"lavrov" + '
  '0.008*"list" + 0.007*"security"'),
 (26,
  '0.023*"mini" + 0.021*"schedule" + 0.019*"photo" + 0.018*"friday" + '
  '0.010*"king" + 0.010*"look" + 0.010*"reading" + 0.010*"khartoum" + '
  '0.010*"recommend" + 0.010*"printing"'),
 (27,
  '0.027*"issue" + 0.022*"statement" + 0.019*"brava" + 0.019*"bravo" + '
  '0.012*"clinton" + 0.011*"state" + 0.010*"followup" + 0.008*"discus" + '
  '0.008*"secretary" + 0.007*"work"'),
 (28,
  '0.017*"state" + 0.016*"list" + 0.014*"schedule" + 0.012*"good" + '
  '0.012*"benghazi" + 0.011*"office" + 0.009*"house" + 0.009*"asking" + '
  '0.009*"work" + 0.008*"comm"'),
 (29,
  '0.022*"office" + 0.020*"secretary" + 0.019*"state" + 0.012*"talk" + '
  '0.012*"saudabayev" + 0.012*"public" + 0.011*"week" + 0.008*"department" + '
  '0.008*"best" + 0.008*"update"'),
 (30,
  '0.037*"list" + 0.026*"statement" + 0.012*"waiting" + 0.011*"draft" + '
  '0.010*"food" + 0.010*"melanne" + 0.010*"letter" + 0.009*"voice" + '
  '0.009*"woman" + 0.007*"meeting"'),
 (31,
  '0.038*"office" + 0.031*"secretary" + 0.020*"state" + 0.020*"haiti" + '
  '0.013*"schedule" + 0.012*"department" + 0.009*"phone" + 0.008*"meeting" + '
  '0.008*"agree" + 0.008*"morale"'),
 (32,
  '0.016*"woman" + 0.012*"event" + 0.011*"development" + 0.011*"best" + '
  '0.011*"folder" + 0.011*"worker" + 0.011*"yellow" + 0.011*"room" + '
  '0.010*"state" + 0.008*"talk"'),
 (33,
  '0.017*"possible" + 0.014*"foreign" + 0.011*"tony" + 0.011*"campolo" + '
  '0.011*"consideration" + 0.011*"supposed" + 0.011*"rieser" + 0.008*"good" + '
  '0.007*"best" + 0.007*"news"'),
 (34,
  '0.027*"draft" + 0.026*"water" + 0.018*"agenda" + 0.015*"discussion" + '
  '0.014*"morocco" + 0.014*"king" + 0.010*"right" + 0.010*"tell" + '
  '0.010*"latest" + 0.009*"secretary"'),
 (35,
  '0.024*"schedule" + 0.019*"state" + 0.014*"list" + 0.013*"sheet" + '
  '0.011*"office" + 0.011*"secretary" + 0.010*"peru" + 0.010*"argentina" + '
  '0.010*"uruguay" + 0.009*"discus"'),
 (36,
  '0.014*"blair" + 0.014*"follow" + 0.014*"woman" + 0.014*"followup" + '
  '0.014*"talk" + 0.011*"issue" + 0.011*"family" + 0.011*"death" + '
  '0.011*"work" + 0.009*"ready"'),
 (37,
  '0.033*"canada" + 0.026*"idea" + 0.013*"vote" + 0.013*"decide" + '
  '0.013*"iaea" + 0.013*"breakdown" + 0.013*"original" + 0.011*"travel" + '
  '0.009*"running" + 0.009*"mario"'),
 (38,
  '0.033*"list" + 0.017*"oprah" + 0.012*"work" + 0.009*"monday" + '
  '0.008*"future" + 0.008*"state" + 0.008*"team" + 0.008*"note" + '
  '0.006*"secretary" + 0.006*"speak"'),
 (39,
  '0.071*"strategic" + 0.070*"dialogue" + 0.070*"press" + 0.065*"clip" + '
  '0.023*"sheet" + 0.018*"mubarak" + 0.009*"work" + 0.008*"need" + '
  '0.008*"coverage" + 0.008*"march"'),
 (40,
  '0.015*"benghazi" + 0.012*"week" + 0.010*"comment" + 0.010*"update" + '
  '0.010*"discus" + 0.010*"local" + 0.010*"east" + 0.010*"believe" + '
  '0.010*"jack" + 0.010*"pakistan"'),
 (41,
  '0.019*"memo" + 0.015*"budget" + 0.014*"germany" + 0.014*"berlin" + '
  '0.010*"mission" + 0.010*"year" + 0.010*"update" + 0.010*"email" + '
  '0.010*"lunch" + 0.010*"statement"'),
 (42,
  '0.012*"session" + 0.012*"honduras" + 0.012*"sponsor" + 0.012*"resolution" + '
  '0.012*"asks" + 0.009*"washington" + 0.009*"create" + 0.009*"progress" + '
  '0.009*"date" + 0.009*"possible"'),
 (43,
  '0.012*"work" + 0.011*"meeting" + 0.011*"medium" + 0.011*"mepp" + '
  '0.010*"good" + 0.008*"state" + 0.008*"jake" + 0.008*"follow" + '
  '0.008*"update" + 0.007*"likely"'),
 (44,
  '0.016*"jake" + 0.016*"night" + 0.011*"benghazi" + 0.011*"reject" + '
  '0.011*"reilly" + 0.011*"emailed" + 0.011*"islamist" + 0.011*"march" + '
  '0.011*"goldman" + 0.011*"militia"')]

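With 25 topics, the topics seem to be the most meaningful. Note that LDA training is randomly initialized, so the model retrained below yields topics that differ from the 25-topic run above.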
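
The choice above is made by eye. One rough way to support such a judgment is to measure how much the top words of different topics overlap, since many near-duplicate topics suggest that the number of topics is too high. The sketch below is an assumption-laden aid, not part of the original analysis: it parses topic strings in the format printed above (three topics copied from the 25-topic run) and computes pairwise Jaccard similarity; `top_words` and `jaccard` are hypothetical helpers.

```python
import re
from itertools import combinations

def top_words(topic_string):
    """Extract the quoted words from a topic string such as '0.029*"press" + ...'."""
    return set(re.findall(r'\*"([^"]+)"', topic_string))

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Topics 3, 4 and 13 of the 25-topic run, copied from the output above.
topics = [
    (3, '0.029*"press" + 0.029*"dialogue" + 0.026*"strategic" + 0.018*"clip" + '
        '0.008*"state" + 0.008*"schedule" + 0.006*"ambassador" + 0.006*"foreign" + '
        '0.006*"make" + 0.006*"issue"'),
    (4, '0.023*"draft" + 0.012*"breakfast" + 0.012*"prayer" + 0.010*"work" + '
        '0.009*"height" + 0.009*"dorothy" + 0.009*"statement" + 0.009*"read" + '
        '0.008*"believe" + 0.008*"come"'),
    (13, '0.030*"clip" + 0.023*"strategic" + 0.021*"press" + 0.020*"dialogue" + '
         '0.010*"state" + 0.010*"menendez" + 0.009*"schedule" + 0.007*"holbrooke" + '
         '0.007*"bolivia" + 0.007*"alexander"'),
]

overlaps = {
    (i, j): jaccard(top_words(s), top_words(t))
    for (i, s), (j, t) in combinations(topics, 2)
}
for pair, score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Topics 3 and 13 share six of their ten top words (similarity 6/14), while topic 4 shares none with either. The same functions could be applied to the list returned by `lda.print_topics(...)` for each candidate number of topics.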


In [10]:
lda = models.LdaModel(corpus, num_topics=25, id2word = dictionary)
pp.pprint(lda.print_topics(lda.num_topics))


[(0,
  '0.014*"family" + 0.012*"text" + 0.012*"davutoglu" + 0.011*"armenia" + '
  '0.011*"schedule" + 0.011*"turkey" + 0.010*"state" + 0.010*"work" + '
  '0.009*"table" + 0.009*"info"'),
 (1,
  '0.026*"dinner" + 0.018*"tech" + 0.014*"health" + 0.013*"abortion" + '
  '0.013*"maternal" + 0.009*"list" + 0.009*"love" + 0.009*"idea" + '
  '0.009*"copying" + 0.009*"romano"'),
 (2,
  '0.014*"copy" + 0.010*"check" + 0.009*"state" + 0.009*"benghazi" + '
  '0.009*"message" + 0.009*"make" + 0.008*"need" + 0.008*"hatfield" + '
  '0.008*"dalton" + 0.007*"good"'),
 (3,
  '0.017*"talk" + 0.014*"benghazi" + 0.010*"saudabayev" + 0.010*"mexico" + '
  '0.009*"clinton" + 0.007*"expectation" + 0.007*"update" + 0.007*"decision" + '
  '0.007*"honduras" + 0.007*"coup"'),
 (4,
  '0.014*"water" + 0.013*"state" + 0.011*"little" + 0.010*"billion" + '
  '0.010*"pakistan" + 0.010*"dollar" + 0.010*"goodwill" + 0.008*"work" + '
  '0.008*"schedule" + 0.008*"friday"'),
 (5,
  '0.015*"make" + 0.015*"email" + 0.009*"mashabane" + 0.009*"going" + '
  '0.007*"change" + 0.007*"proposed" + 0.007*"foreign" + 0.006*"travel" + '
  '0.006*"discus" + 0.006*"update"'),
 (6,
  '0.017*"haiti" + 0.015*"ashton" + 0.015*"sheet" + 0.014*"list" + '
  '0.012*"conference" + 0.012*"mubarak" + 0.012*"lady" + 0.010*"friend" + '
  '0.009*"schedule" + 0.009*"team"'),
 (7,
  '0.010*"confirmed" + 0.010*"robinson" + 0.010*"minister" + 0.009*"john" + '
  '0.008*"haiti" + 0.008*"africa" + 0.008*"kerry" + 0.007*"plan" + '
  '0.007*"clean" + 0.007*"memo"'),
 (8,
  '0.016*"issue" + 0.014*"statement" + 0.011*"help" + 0.010*"bravo" + '
  '0.010*"brava" + 0.010*"question" + 0.009*"reid" + 0.009*"harry" + '
  '0.009*"year" + 0.008*"state"'),
 (9,
  '0.015*"good" + 0.012*"jake" + 0.012*"letter" + 0.011*"clinton" + '
  '0.010*"draft" + 0.010*"lautenberg" + 0.008*"africa" + 0.008*"meeting" + '
  '0.006*"latest" + 0.006*"prepared"'),
 (10,
  '0.018*"cheryl" + 0.017*"work" + 0.017*"oprah" + 0.012*"reaction" + '
  '0.011*"schedule" + 0.010*"jack" + 0.009*"morning" + 0.009*"draft" + '
  '0.008*"calling" + 0.008*"tonight"'),
 (11,
  '0.011*"talk" + 0.008*"funeral" + 0.008*"good" + 0.008*"issue" + '
  '0.007*"message" + 0.007*"budget" + 0.006*"meeting" + 0.006*"early" + '
  '0.006*"internet" + 0.006*"freedom"'),
 (12,
  '0.012*"update" + 0.009*"project" + 0.008*"vote" + 0.007*"pakistan" + '
  '0.006*"amory" + 0.006*"idea" + 0.006*"lovins" + 0.006*"economist" + '
  '0.006*"care" + 0.006*"travel"'),
 (13,
  '0.060*"strategic" + 0.056*"dialogue" + 0.055*"press" + 0.050*"clip" + '
  '0.011*"list" + 0.010*"latest" + 0.009*"memo" + 0.007*"berlin" + '
  '0.007*"coat" + 0.006*"love"'),
 (14,
  '0.013*"sheet" + 0.009*"expo" + 0.009*"peru" + 0.009*"uruguay" + '
  '0.009*"argentina" + 0.008*"honduras" + 0.007*"told" + 0.007*"good" + '
  '0.007*"resolution" + 0.007*"company"'),
 (15,
  '0.017*"draft" + 0.012*"breakfast" + 0.012*"prayer" + 0.010*"talk" + '
  '0.009*"jake" + 0.009*"going" + 0.009*"good" + 0.007*"faxed" + 0.007*"spoke" '
  '+ 0.007*"arturo"'),
 (16,
  '0.009*"jack" + 0.009*"week" + 0.008*"eikenberry" + 0.008*"brazilian" + '
  '0.007*"copying" + 0.007*"statement" + 0.007*"work" + 0.007*"lunch" + '
  '0.006*"american" + 0.006*"holder"'),
 (17,
  '0.011*"need" + 0.008*"list" + 0.008*"good" + 0.008*"office" + 0.008*"draft" '
  '+ 0.007*"briefing" + 0.007*"talk" + 0.006*"forward" + 0.006*"issue" + '
  '0.006*"copy"'),
 (18,
  '0.014*"list" + 0.012*"travel" + 0.010*"ending" + 0.009*"read" + '
  '0.009*"state" + 0.009*"secretary" + 0.008*"office" + 0.008*"leave" + '
  '0.008*"holbrooke" + 0.007*"menendez"'),
 (19,
  '0.017*"press" + 0.012*"idea" + 0.011*"clip" + 0.010*"dialogue" + '
  '0.009*"strategic" + 0.007*"hope" + 0.007*"good" + 0.007*"knee" + '
  '0.007*"lissa" + 0.007*"make"'),
 (20,
  '0.021*"list" + 0.013*"work" + 0.009*"washington" + 0.008*"canada" + '
  '0.008*"house" + 0.007*"email" + 0.007*"post" + 0.006*"best" + '
  '0.006*"editorial" + 0.006*"article"'),
 (21,
  '0.020*"schedule" + 0.011*"state" + 0.009*"plane" + 0.009*"update" + '
  '0.008*"mazen" + 0.008*"line" + 0.007*"secretary" + 0.007*"shuttle" + '
  '0.007*"king" + 0.007*"need"'),
 (22,
  '0.017*"secretary" + 0.011*"draft" + 0.010*"talk" + 0.009*"good" + '
  '0.009*"right" + 0.008*"letter" + 0.008*"come" + 0.008*"follow" + '
  '0.007*"year" + 0.006*"bolivia"'),
 (23,
  '0.025*"schedule" + 0.016*"lona" + 0.015*"state" + 0.014*"woman" + '
  '0.013*"friday" + 0.013*"mini" + 0.008*"bildt" + 0.007*"monday" + '
  '0.006*"voice" + 0.006*"huma"'),
 (24,
  '0.025*"office" + 0.024*"secretary" + 0.017*"state" + 0.016*"development" + '
  '0.008*"schedule" + 0.007*"discus" + 0.007*"article" + 0.006*"meeting" + '
  '0.006*"work" + 0.006*"department"')]
