nltk (already available in your Anaconda environment), implement a standard text pre-processing pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and cons (if any) of the two word clouds you generated.
Generate a word cloud based on the raw corpus
In [713]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud
from IPython.display import Image, display
data = pd.read_csv('hillary-clinton-emails/emails.csv')
data.iloc[:, :8].head()
Out[713]:
In [714]:
data.iloc[:, 8:].head()
Out[714]:
Let's keep only the text that is relevant for the raw body corpus: ExtractedSubject and ExtractedBodyText, which together cover most of what each message is about.
In [715]:
data_cut = data[['ExtractedSubject','ExtractedBodyText']].copy()
data_cut
Out[715]:
In [716]:
data_csv = data_cut.to_csv()  # with no path given, to_csv() returns the CSV contents as a string
In [739]:
# Credit to - https://github.com/amueller/word_cloud/blob/master/examples/simple.py
# Make the wordcloud
wordcloud = WordCloud(background_color='white', width=1000, height=500).generate(data_csv)
image = wordcloud.to_image()
image.save('image_raw.png')  # save so the file displayed below exists
display(Image('image_raw.png'))
#image.show()
From here, it's clear that words such as Re, Fw, and pm (i.e. referring to time), which are not related to the content of the messages but rather to email formatting, are disproportionately prominent in the full picture and should be taken out.
a) implement a standard text pre-processing pipeline (e.g., tokenization, stopword removal, stemming, etc.)
In [718]:
import re
import nltk
The purpose of tokenization is to chop up long strings into individual words or symbols. This allows for further processing of the words.
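As a quick illustration (the example string below is made up, not taken from the corpus), a regular-expression tokenizer splits a string into word tokens and drops the punctuation:
from nltk.tokenize import RegexpTokenizer
demo_tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, drop everything else
print(demo_tokenizer.tokenize("Re: Call w/ the Secretary at 3pm?"))
# ['Re', 'Call', 'w', 'the', 'Secretary', 'at', '3pm']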
Let's first put all of the values into one column called "Text" and get rid of the NaNs. At this point, we do not care to distinguish between the content of the subject and the body.
In [719]:
# Append each body text as a new row, so everything ends up in the subject column
for i in data_cut['ExtractedBodyText']:
    data_cut.loc[len(data_cut)] = [i, 'nan']
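For reference, an equivalent and more idiomatic way to build the single text column (a sketch starting from the original two-column data_cut, not the approach used above) is to stack the two columns with pd.concat:
text_series = pd.concat([data_cut['ExtractedSubject'], data_cut['ExtractedBodyText']],
                        ignore_index=True).dropna()
data_clean_alt = text_series.to_frame(name='Text')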
In [720]:
data_clean = data_cut.drop('ExtractedBodyText', axis=1)
data_clean.columns = ['Text']
data_clean = data_clean[pd.notnull(data_clean['Text'])]
data_clean.head()
Out[720]:
Convert all of the strings to lowercase while they are still whole, as we'll need this later on.
In [721]:
# Lowercase every entry in the Text column
data_clean['Text'] = data_clean['Text'].str.lower()
data_clean.head()
Out[721]:
Now let's do some tokenization.
In [722]:
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
In [723]:
data_tokenized = data_clean.copy()
# Tokenize each entry into a list of word tokens
data_tokenized['Text'] = data_tokenized['Text'].apply(tokenizer.tokenize)
data_tokenized.columns = ['TokenizedText']
data_tokenized.reset_index(drop=True, inplace=True)
data_tokenized.head()
Out[723]:
You can now see that the sentences / subject lines are broken up into lists of words, which we can now check against a stopword list.
Let's see which stopwords are in the nltk corpus that we can remove.
In [724]:
from nltk.corpus import stopwords # Import the stop word list
stop = stopwords.words('english')
We will also add "fw", "fwd", "fvv", "re", "pm", and "am" to the stopword list, as they are not helpful in this context (see the sketch after the next cell).
In [725]:
print(stopwords.words("english"))
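A minimal sketch of how the extra email-specific terms mentioned above can be appended to the same stop list used below:
# Extend the nltk stopword list with email-header terms that carry no content
stop.extend(['fw', 'fwd', 'fvv', 're', 'pm', 'am'])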
As a reference to check whether the stopwords have been removed, look at row 15, which contains "your".
In [726]:
# Remove stop words from "words"
data_no_stop = data_tokenized.copy()
data_no_stop['TokenizedText'] = data_no_stop['TokenizedText'].apply(lambda x: [item for item in x if item not in stop])
data_no_stop.tail(17)
Out[726]:
You can see that the stopword in row 15 is now gone, so we are set!
We need to change the words into more standard forms to reduce inflected forms, such as "forwarded" in row 12998 and "integrating" in row 12991.
In [727]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
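A quick sanity check of what the stemmer does (illustrative only; the words are picked by hand):
# e.g. "forwarded" should reduce to "forward"
print(stemmer.stem("forwarded"), stemmer.stem("running"))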
In [728]:
data_stemming = data_no_stop.copy()
# Stem every token in every row (same apply pattern as the stopword removal above)
data_stemming['TokenizedText'] = data_stemming['TokenizedText'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens])
In [730]:
data_stemming.tail(17)
Out[730]:
As you can see, "forwarded" in row 12998 is now "forward" and "integrating" in row 12991 is now "integ". However, the resulting words are not pretty.
In [734]:
data_csv_new = data_stemming.to_csv()
wordcloud_new = WordCloud(background_color='white', width=1000, height=500).generate(data_csv_new)
image_new = wordcloud_new.to_image()
image_new.save('image_new.png')  # save so the file displayed below exists
display(Image('image_new.png'))
#image_new.show()
Commentary: the resulting words are quite strange - Bloomberg is not something we expected to come up.
Remove anything shorter than four characters (to get rid of single numbers, single letters, and very short words).
In [732]:
data_extra = data_stemming.copy()
# Keep only tokens that are at least four characters long
data_extra['TokenizedText'] = data_extra['TokenizedText'].apply(
    lambda tokens: [token for token in tokens if len(token) >= 4])
data_extra.head()
Out[732]:
In [735]:
data_csv_new_lt = data_extra.to_csv()
wordcloud_lt = WordCloud(background_color='white', width=1000, height=500).generate(data_csv_new_lt)
image_new_lt = wordcloud_lt.to_image()
image_new_lt.save('image_new_lt.png')  # save so the file displayed below exists
display(Image('image_new_lt.png'))
#image_new_lt.show()
The words have a stray ' after them most likely because to_csv() serializes each token list in its Python repr (e.g. ['state', 'depart']), and WordCloud's default tokenization keeps the quote attached to the word; apart from that artifact, the results are as they should be.
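One way to avoid the quoting artifact (a sketch, not run above) is to join the token lists into plain space-separated text before generating the cloud:
# Join each row's tokens, then join all rows, so WordCloud never sees list punctuation
text_joined = ' '.join(data_extra['TokenizedText'].apply(' '.join))
wordcloud_joined = WordCloud(background_color='white', width=1000, height=500).generate(text_joined)
Let's compare the processed word cloud with the original raw corpus: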
In [740]:
display(Image('image_raw.png'))
When you compare the raw corpus with the tokenized, stopword-removed, and stemmed word clouds, you'll see that some new words have come up in the new corpus, such as: state, secretary, president, and others. In the old one, we had a lot of junk, like pm and re.
There are not many new insights to gain, as many of the words in the final word cloud are logically relevant to her campaign / work: secretary, Obama, call, state, time, etc.