In [1]:
# Some resources:
# http://www.nltk.org/book/
# http://victoria.lviv.ua/html/fl5/NaturalLanguageProcessingWithPython.pdf
# https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/
# https://radimrehurek.com/gensim/tutorial.html
# http://textacy.readthedocs.io/en/latest/

In [1]:
# Download the Large Movie Review Dataset from the URL below and extract it
# into the same directory as this notebook (it unpacks to ./aclImdb/).
# http://ai.stanford.edu/~amaas/data/sentiment/
import glob

neg_reviews = []
pos_reviews = []

for filename in glob.glob('./aclImdb/train/neg/*.txt'):
    with open(filename) as infile:
        neg_reviews.append(infile.read())
for filename in glob.glob('./aclImdb/train/pos/*.txt'):
    with open(filename) as infile:
        pos_reviews.append(infile.read())
        
print(len(neg_reviews))
len(pos_reviews)


12500
Out[1]:
12500
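
The loading cell above relies on the platform's default text encoding and repeats the same loop once per label. A minimal sketch of a reusable loader follows. `load_reviews` is a hypothetical helper name, and UTF-8 is an assumption about how the review files are encoded; the notebook itself does not state it.

```python
import glob
import io
import os


def load_reviews(directory, encoding='utf-8'):
    """Return the contents of every .txt file in `directory`, sorted by filename.

    Hypothetical helper; the UTF-8 default is an assumption about the dataset.
    """
    reviews = []
    # glob returns files in arbitrary order; sorting keeps indices reproducible.
    for filename in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        with io.open(filename, encoding=encoding) as infile:
            reviews.append(infile.read())
    return reviews
```

It could be called as `load_reviews('./aclImdb/train/neg')`. Because the files are sorted, the review found at a given index may differ from the one shown in the outputs of this notebook.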

In [2]:
neg_reviews[0]


Out[2]:
'Trying to conceive of something as insipid as THE SENTINEL would be pretty difficult. The problems are many. The result is terrible and loaded with plot holes.<br /><br />Michael Douglas stars as Pete Garrison, a Secret Service agent who "took one" for Reagan during the attempt on his life. Years later we find Pete assigned to the Whitehouse Family, mainly as a guard for the First Lady (Kim Basinger, L.A. CONFIDENTIAL). Troubles arise as we see Pete\'s close involvement with the First Lady, and a sudden threat against the President himself (David Rasche, UNITED 93). When Pete fails a polygraph test, he\'s singled out as a disgruntled agent by investigator David Breckinridge (Kiefer Sutherland, 24 TV series).<br /><br />As the presidential assassination plot unfolds, Pete finds himself on the run from his own people. His only confidant is the First Lady, and she\'s reluctant to tell anyone about their affections for one another (which is why Pete failed the polygraph in the first place). But is Pete really innocent? Or is he simply trying to buy time until he can kill the President? If he is innocent, how can he help prevent the assassination attempt while running from the Secret Service? <br /><br />The one, big, overwhelming problem with this film is that there\'s no justification for the reason behind the presidential threat. Isn\'t that what the movie\'s supposed to be about? One would think so! But the audience is never let in on why the assassin(s) want to kill the Prez. Hmm. Someone forget to put that in the script somewhere? <br /><br />And what\'s with David Breckinridge\'s (Kiefer\'s) new partner, Jill Marin (Eva Longoria, CARLITA\'S WAY)? Seems that she was put in the film strictly as a piece of a$$-candy. What was her purpose again? Did she do anything other than look nice in tight pants and a low-cut blouse?<br /><br />There are so many problems with the basic premise of The Sentinel as to be laughable. 
The action is too easily stymied by the "What the...?" responses sure to be uttered by those unfortunate enough to watch the movie.'

In [3]:
import bs4
soup = bs4.BeautifulSoup(neg_reviews[0], "lxml")
soup.text


Out[3]:
u'Trying to conceive of something as insipid as THE SENTINEL would be pretty difficult. The problems are many. The result is terrible and loaded with plot holes.Michael Douglas stars as Pete Garrison, a Secret Service agent who "took one" for Reagan during the attempt on his life. Years later we find Pete assigned to the Whitehouse Family, mainly as a guard for the First Lady (Kim Basinger, L.A. CONFIDENTIAL). Troubles arise as we see Pete\'s close involvement with the First Lady, and a sudden threat against the President himself (David Rasche, UNITED 93). When Pete fails a polygraph test, he\'s singled out as a disgruntled agent by investigator David Breckinridge (Kiefer Sutherland, 24 TV series).As the presidential assassination plot unfolds, Pete finds himself on the run from his own people. His only confidant is the First Lady, and she\'s reluctant to tell anyone about their affections for one another (which is why Pete failed the polygraph in the first place). But is Pete really innocent? Or is he simply trying to buy time until he can kill the President? If he is innocent, how can he help prevent the assassination attempt while running from the Secret Service? The one, big, overwhelming problem with this film is that there\'s no justification for the reason behind the presidential threat. Isn\'t that what the movie\'s supposed to be about? One would think so! But the audience is never let in on why the assassin(s) want to kill the Prez. Hmm. Someone forget to put that in the script somewhere? And what\'s with David Breckinridge\'s (Kiefer\'s) new partner, Jill Marin (Eva Longoria, CARLITA\'S WAY)? Seems that she was put in the film strictly as a piece of a$$-candy. What was her purpose again? Did she do anything other than look nice in tight pants and a low-cut blouse?There are so many problems with the basic premise of The Sentinel as to be laughable. The action is too easily stymied by the "What the...?" 
responses sure to be uttered by those unfortunate enough to watch the movie.'

In [4]:
print(soup.text)


Trying to conceive of something as insipid as THE SENTINEL would be pretty difficult. The problems are many. The result is terrible and loaded with plot holes.Michael Douglas stars as Pete Garrison, a Secret Service agent who "took one" for Reagan during the attempt on his life. Years later we find Pete assigned to the Whitehouse Family, mainly as a guard for the First Lady (Kim Basinger, L.A. CONFIDENTIAL). Troubles arise as we see Pete's close involvement with the First Lady, and a sudden threat against the President himself (David Rasche, UNITED 93). When Pete fails a polygraph test, he's singled out as a disgruntled agent by investigator David Breckinridge (Kiefer Sutherland, 24 TV series).As the presidential assassination plot unfolds, Pete finds himself on the run from his own people. His only confidant is the First Lady, and she's reluctant to tell anyone about their affections for one another (which is why Pete failed the polygraph in the first place). But is Pete really innocent? Or is he simply trying to buy time until he can kill the President? If he is innocent, how can he help prevent the assassination attempt while running from the Secret Service? The one, big, overwhelming problem with this film is that there's no justification for the reason behind the presidential threat. Isn't that what the movie's supposed to be about? One would think so! But the audience is never let in on why the assassin(s) want to kill the Prez. Hmm. Someone forget to put that in the script somewhere? And what's with David Breckinridge's (Kiefer's) new partner, Jill Marin (Eva Longoria, CARLITA'S WAY)? Seems that she was put in the film strictly as a piece of a$$-candy. What was her purpose again? Did she do anything other than look nice in tight pants and a low-cut blouse?There are so many problems with the basic premise of The Sentinel as to be laughable. The action is too easily stymied by the "What the...?" 
responses sure to be uttered by those unfortunate enough to watch the movie.
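
In the output above, removing the `<br />` tags glues neighbouring sentences together ("plot holes.Michael Douglas"), which is where tokens such as "plotplease" further down come from. One possible workaround, sketched here with the standard `re` module rather than BeautifulSoup, is to replace each tag with a space and then collapse the whitespace. `strip_tags` is a hypothetical helper, and a regex like this is only adequate for simple markup such as these reviews.

```python
import re

TAG_RE = re.compile(r'<[^>]*>')
SPACE_RE = re.compile(r'\s+')


def strip_tags(text):
    """Replace HTML tags with a space so adjacent words stay separated."""
    without_tags = TAG_RE.sub(' ', text)
    # Collapse the runs of whitespace left behind by consecutive tags.
    return SPACE_RE.sub(' ', without_tags).strip()


sample = 'loaded with plot holes.<br /><br />Michael Douglas stars'
print(strip_tags(sample))
```

With BeautifulSoup itself, `soup.get_text(separator=' ')` should have a similar effect, at the cost of possible double spaces.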

In [5]:
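import pyprind

# Strip HTML markup from every review in place (both lists hold 12500 reviews).
n_reviews = len(neg_reviews)
pbar = pyprind.ProgBar(n_reviews)
for index in range(n_reviews):
    neg_reviews[index] = bs4.BeautifulSoup(neg_reviews[index], 'lxml').text
    pos_reviews[index] = bs4.BeautifulSoup(pos_reviews[index], 'lxml').text
    pbar.update()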


0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:29

In [6]:
print(neg_reviews[1])


I'm just quite disappointed with "Soul Survivors". It doesn't worth even a comment in this forum. The script is very poor as well as all the "acting" and for our entertainment it features a pointless plot.Please, do yourselves a favor! Be a real "Survivor"...Don't waste your time in this piece of crap! Someday you'll thank me!

In [7]:
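import re

def preprocessor(text):
    # Drop anything that looks like an HTML tag.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D :P before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, turn runs of non-word characters into spaces, then append the
    # emoticons with their "noses" (hyphens) removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text.strip()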

In [8]:
preprocessor('<test> Hey guys!!1! :)</test> This is a test :P lolz')


Out[8]:
'hey guys 1 this is a test p lolz :) :P'

In [52]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
print(count.vocabulary_)
print(bag.toarray())


{u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2}
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
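
To make the matrix above less opaque, here is a minimal pure-Python sketch of the same counting. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, columns in alphabetical order). `bag_of_words` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w\w+\b')


def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
    # Map each distinct term to a column index in alphabetical order.
    terms = sorted(set(token for tokens in tokenized for token in tokens))
    vocabulary = {term: index for index, term in enumerate(terms)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in terms])
    return vocabulary, rows


docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one document and each column one vocabulary term, so the third row is the element-wise sum of the first two plus one count for "and".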

In [47]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cognizac/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[47]:
True

In [9]:
from nltk.corpus import stopwords
stops = stopwords.words('english')

In [10]:
stops


Out[10]:
[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your',
 u'yours',
 u'yourself',
 u'yourselves',
 u'he',
 u'him',
 u'his',
 u'himself',
 u'she',
 u'her',
 u'hers',
 u'herself',
 u'it',
 u'its',
 u'itself',
 u'they',
 u'them',
 u'their',
 u'theirs',
 u'themselves',
 u'what',
 u'which',
 u'who',
 u'whom',
 u'this',
 u'that',
 u'these',
 u'those',
 u'am',
 u'is',
 u'are',
 u'was',
 u'were',
 u'be',
 u'been',
 u'being',
 u'have',
 u'has',
 u'had',
 u'having',
 u'do',
 u'does',
 u'did',
 u'doing',
 u'a',
 u'an',
 u'the',
 u'and',
 u'but',
 u'if',
 u'or',
 u'because',
 u'as',
 u'until',
 u'while',
 u'of',
 u'at',
 u'by',
 u'for',
 u'with',
 u'about',
 u'against',
 u'between',
 u'into',
 u'through',
 u'during',
 u'before',
 u'after',
 u'above',
 u'below',
 u'to',
 u'from',
 u'up',
 u'down',
 u'in',
 u'out',
 u'on',
 u'off',
 u'over',
 u'under',
 u'again',
 u'further',
 u'then',
 u'once',
 u'here',
 u'there',
 u'when',
 u'where',
 u'why',
 u'how',
 u'all',
 u'any',
 u'both',
 u'each',
 u'few',
 u'more',
 u'most',
 u'other',
 u'some',
 u'such',
 u'no',
 u'nor',
 u'not',
 u'only',
 u'own',
 u'same',
 u'so',
 u'than',
 u'too',
 u'very',
 u's',
 u't',
 u'can',
 u'will',
 u'just',
 u'don',
 u'should',
 u'now',
 u'd',
 u'll',
 u'm',
 u'o',
 u're',
 u've',
 u'y',
 u'ain',
 u'aren',
 u'couldn',
 u'didn',
 u'doesn',
 u'hadn',
 u'hasn',
 u'haven',
 u'isn',
 u'ma',
 u'mightn',
 u'mustn',
 u'needn',
 u'shan',
 u'shouldn',
 u'wasn',
 u'weren',
 u'won',
 u'wouldn']

In [11]:
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stops])

In [12]:
text = remove_stopwords(neg_reviews[1].lower())
text


Out[12]:
u'i\'m quite disappointed "soul survivors". doesn\'t worth even comment forum. script poor well "acting" entertainment features pointless plot.please, favor! real "survivor"...don\'t waste time piece crap! someday you\'ll thank me!'
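
The output above still contains "i'm" and "me!" because `split()` leaves punctuation attached to the words, so they never match an entry in `stops`. A minimal sketch of one alternative: tokenize with a regex and test membership against a set. `SAMPLE_STOPS` is a small illustrative subset chosen for this example, not the full NLTK list, and `remove_stopwords_tokenized` is a hypothetical name.

```python
import re

# Illustrative subset only; in the notebook this would be set(stopwords.words('english')).
SAMPLE_STOPS = {'i', 'm', 'me', 'just', 'with', 'you', 'll', 'it', 'the', 'is', 'very'}
WORD_RE = re.compile(r'\w+')


def remove_stopwords_tokenized(text, stops=SAMPLE_STOPS):
    """Lowercase, split into word tokens, and drop tokens found in `stops`."""
    tokens = WORD_RE.findall(text.lower())
    return ' '.join(token for token in tokens if token not in stops)


print(remove_stopwords_tokenized("I'm just quite disappointed. Someday you'll thank me!"))
```

Using a set also makes each membership test constant time, which matters once this runs over all 25,000 reviews. The trade-off is that punctuation is discarded along with the stop words.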

In [54]:
import textacy

remove_stopwords(textacy.preprocess.preprocess_text(neg_reviews[1], no_contractions=True, no_punct=True, lowercase=True))


Out[54]:
u'quite disappointed soul survivors worth even comment forum script poor well acting entertainment features pointless plotplease favor real survivordo waste time piece crap someday thank'

In [55]:
text = textacy.preprocess.preprocess_text(neg_reviews[1], no_contractions=True, no_punct=True, lowercase=True)
text


Out[55]:
u'i am just quite disappointed with soul survivors it does not worth even a comment in this forum the script is very poor as well as all the acting and for our entertainment it features a pointless plotplease do yourselves a favor be a real survivordo not waste your time in this piece of crap someday you will thank me'

In [57]:
doc = textacy.Doc(text)

In [58]:
doc.to_bag_of_words(as_strings=True, normalize='lemma')


Out[58]:
{u'-PRON-': 7,
 u'acting': 1,
 u'comment': 1,
 u'crap': 1,
 u'disappointed': 1,
 u'entertainment': 1,
 u'favor': 1,
 u'feature': 1,
 u'forum': 1,
 u'piece': 1,
 u'plotplease': 1,
 u'pointless': 1,
 u'poor': 1,
 u'real': 1,
 u'script': 1,
 u'someday': 1,
 u'soul': 1,
 u'survivor': 1,
 u'survivordo': 1,
 u'thank': 1,
 u'time': 1,
 u'waste': 1,
 u'worth': 1}

In [66]:
list(textacy.extract.ngrams(doc, n=2, filter_stops=False))


Out[66]:
[i am,
 am just,
 just quite,
 quite disappointed,
 disappointed with,
 with soul,
 soul survivors,
 survivors it,
 it does,
 does not,
 not worth,
 worth even,
 even a,
 a comment,
 comment in,
 in this,
 this forum,
 forum the,
 the script,
 script is,
 is very,
 very poor,
 poor as,
 as well,
 well as,
 as all,
 all the,
 the acting,
 acting and,
 and for,
 for our,
 our entertainment,
 entertainment it,
 it features,
 features a,
 a pointless,
 pointless plotplease,
 plotplease do,
 do yourselves,
 yourselves a,
 a favor,
 favor be,
 be a,
 a real,
 real survivordo,
 survivordo not,
 not waste,
 waste your,
 your time,
 time in,
 in this,
 this piece,
 piece of,
 of crap,
 crap someday,
 someday you,
 you will,
 will thank,
 thank me]

In [68]:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')


Out[68]:
[u'bi grams', u'grams are', u'are cool']
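
The analyzer output above can be reproduced by sliding a window over the token list. A minimal sketch, assuming the same tokenization as `CountVectorizer`'s default (lowercase, tokens of at least two word characters); `word_ngrams` is a hypothetical helper.

```python
import re

TOKEN_RE = re.compile(r'\b\w\w+\b')


def word_ngrams(text, n=2):
    """Return the list of space-joined word n-grams in `text`."""
    tokens = TOKEN_RE.findall(text.lower())
    # Zipping n staggered views of the token list yields each window of n tokens.
    windows = zip(*[tokens[i:] for i in range(n)])
    return [' '.join(window) for window in windows]


print(word_ngrams('Bi-grams are cool!'))
```

A text with fewer than `n` tokens simply yields an empty list.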

In [71]:
bg = bigram_vectorizer.fit_transform(neg_reviews)

bg.get_shape()


Out[71]:
(12500, 829466)
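
With `min_df=1`, every bigram that occurs in at least one review becomes a column, which is why the matrix above has 829,466 of them. The sketch below illustrates, in plain Python, the kind of pruning a higher `min_df` performs: count in how many documents each bigram appears and keep those at or above the threshold. The function name and the toy documents are made up for this example.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w\w+\b')


def frequent_bigrams(docs, min_df=2):
    """Return the sorted bigrams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = TOKEN_RE.findall(doc.lower())
        # A set ensures each bigram is counted once per document.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return sorted(' '.join(pair) for pair, df in doc_freq.items() if df >= min_df)


toy_docs = ['a waste of time', 'such a waste of film', 'waste of money and time']
print(frequent_bigrams(toy_docs))
```

In the notebook, the equivalent would presumably be something like `CountVectorizer(ngram_range=(2, 2), min_df=5)`; the right threshold is a judgment call and is not given in the original.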

In [72]:
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
bigram_vectorizer.get_feature_names()


Out[72]:
[u'00 am',
 u'00 and',
 u'00 back',
 u'00 bin',
 u'00 budget',
 u'00 dollars',
 u'00 feet',
 u'00 for',
 u'00 in',
 u'00 it',
 u'00 keoni',
 u'00 know',
 u'00 loan',
 u'00 making',
 u'00 need',
 u'00 on',
 u'00 one',
 u'00 or',
 u'00 outlay',
 u'00 pm',
 u'00 price',
 u'00 rental',
 u'00 replacement',
 u'00 second',
 u'00 several',
 u'00 so',
 u'00 spent',
 u'00 they',
 u'00 to',
 u'00 will',
 u'00 you',
 u'000 00',
 u'000 000',
 u'000 10',
 u'000 and',
 u'000 annual',
 u'000 as',
 u'000 audience',
 u'000 bc',
 u'000 bet',
 u'000 budget',
 u'000 bullets',
 u'000 but',
 u'000 by',
 u'000 can',
 u'000 commands',
 u'000 continuity',
 u'000 corpses',
 u'000 could',
 u'000 crack',
 u'000 dollars',
 u'000 fathoms',
 u'000 feet',
 u'000 film',
 u'000 films',
 u'000 fingers',
 u'000 first',
 u'000 frock',
 u'000 from',
 u'000 get',
 u'000 gold',
 u'000 grand',
 u'000 he',
 u'000 heart',
 u'000 how',
 u'000 id',
 u'000 if',
 u'000 in',
 u'000 is',
 u'000 it',
 u'000 lbs',
 u'000 leagues',
 u'000 mega',
 u'000 miles',
 u'000 mocha',
 u'000 moment',
 u'000 movies',
 u'000 muscats',
 u'000 muslim',
 u'000 native',
 u'000 no',
 u'000 not',
 u'000 now',
 u'000 of',
 u'000 on',
 u'000 ouch',
 u'000 out',
 u'000 people',
 u'000 prison',
 u'000 project',
 u'000 pyramid',
 u'000 ray',
 u'000 reasons',
 u'000 retitled',
 u'000 rpms',
 u'000 seats',
 u'000 should',
 u'000 signatures',
 u'000 so',
 u'000 stupid',
 u'000 surly',
 u'000 the',
 u'000 then',
 u'000 thirteen',
 u'000 this',
 u'000 though',
 u'000 times',
 u'000 to',
 u'000 virtual',
 u'000 voters',
 u'000 watch',
 u'000 wedding',
 u'000 which',
 u'000 year',
 u'000 years',
 u'000 zulus',
 u'0000000000001 out',
 u'00001 10',
 u'00001 chances',
 u'00015 seconds',
 u'001 and',
 u'001 on',
 u'001 was',
 u'007 at',
 u'007 movie',
 u'007 movies',
 u'007 or',
 u'007 the',
 u'00am timeslot',
 u'00pm on',
 u'00s either',
 u'00s that',
 u'01 30',
 u'01 is',
 u'01 of',
 u'01 show',
 u'01pm you',
 u'02 out',
 u'02 to',
 u'029 keypunch',
 u'02i am',
 u'03 oct',
 u'03 wish',
 u'04 just',
 u'04 plays',
 u'04 was',
 u'04 what_the_bleep_',
 u'04 who',
 u'041 and',
 u'05 david',
 u'05 gave',
 u'05 jay',
 u'05 with',
 u'050 joke',
 u'06 and',
 u'06 bob',
 u'06 don',
 u'06 had',
 u'06 with',
 u'07 2004',
 u'07 fall',
 u'08 10',
 u'08 because',
 u'08 either',
 u'08 guess',
 u'08 in',
 u'087 sorter',
 u'09 after',
 u'09 griffith',
 u'09 since',
 u'0f 10',
 u'0s add',
 u'10 00',
 u'10 000',
 u'10 06',
 u'10 10',
 u'10 11',
 u'10 12',
 u'10 13',
 u'10 14',
 u'10 15',
 u'10 1979',
 u'10 20',
 u'10 20000',
 u'10 2002',
 u'10 21',
 u'10 28',
 u'10 29',
 u'10 30',
 u'10 31',
 u'10 about',
 u'10 absolute',
 u'10 absolutely',
 u'10 across',
 u'10 acting',
 u'10 add',
 u'10 admit',
 u'10 after',
 u'10 all',
 u'10 also',
 u'10 am',
 u'10 an',
 u'10 and',
 u'10 another',
 u'10 ap3',
 u'10 are',
 u'10 army',
 u'10 as',
 u'10 ashamed',
 u'10 at',
 u'10 atmosphere',
 u'10 average',
 u'10 avoid',
 u'10 awful',
 u'10 bad',
 u'10 bah',
 u'10 barely',
 u'10 based',
 u'10 because',
 u'10 become',
 u'10 before',
 u'10 being',
 u'10 below',
 u'10 best',
 u'10 bethany',
 u'10 bit',
 u'10 bleah',
 u'10 bo',
 u'10 breasts',
 u'10 british',
 u'10 btw',
 u'10 bucks',
 u'10 but',
 u'10 by',
 u'10 cameras',
 u'10 car',
 u'10 carlito',
 u'10 cars',
 u'10 catastrophic',
 u'10 cents',
 u'10 cg',
 u'10 character',
 u'10 characters',
 u'10 cinematography',
 u'10 clock',
 u'10 come',
 u'10 commandments',
 u'10 completely',
 u'10 could',
 u'10 crappiness',
 u'10 crouch',
 u'10 dare',
 u'10 day',
 u'10 days',
 u'10 deleted',
 u'10 denzel',
 u'10 derboiler',
 u'10 dermot',
 u'10 devoted',
 u'10 dialogue',
 u'10 did',
 u'10 didn',
 u'10 different',
 u'10 dir',
 u'10 direction',
 u'10 director',
 u'10 directors',
 u'10 disappointed',
 u'10 disgusting',
 u'10 do',
 u'10 does',
 u'10 dollars',
 u'10 don',
 u'10 dreadful',
 u'10 effects',
 u'10 either',
 u'10 entertainment',
 u'10 episodes',
 u'10 etc',
 u'10 even',
 u'10 excellent',
 u'10 explanations',
 u'10 extremely',
 u'10 failure',
 u'10 fair',
 u'10 families',
 u'10 far',
 u'10 fav',
 u'10 favorite',
 u'10 feel',
 u'10 feet',
 u'10 film',
 u'10 films',
 u'10 first',
 u'10 following',
 u'10 foot',
 u'10 for',
 u'10 found',
 u'10 from',
 u'10 ft',
 u'10 garbage',
 u'10 gave',
 u'10 genetic',
 u'10 get',
 u'10 girls',
 u'10 give',
 u'10 given',
 u'10 good',
 u'10 got',
 u'10 grade',
 u'10 grand',
 u'10 great',
 u'10 grenades',
 u'10 griffith',
 u'10 groundhog',
 u'10 guess',
 u'10 guilty',
 u'10 guys',
 u'10 had',
 u'10 hadn',
 u'10 hang',
 u'10 hardcore',
 u'10 has',
 u'10 have',
 u'10 here',
 u'10 his',
 u'10 hmmmm',
 u'10 hope',
 u'10 horror',
 u'10 hour',
 u'10 hours',
 u'10 how',
 u'10 however',
 u'10 http',
 u'10 hype',
 u'10 if',
 u'10 im',
 u'10 in',
 u'10 indemnity',
 u'10 independent',
 u'10 instead',
 u'10 is',
 u'10 isn',
 u'10 it',
 u'10 items',
 u'10 its',
 u'10 jaws',
 u'10 jokes',
 u'10 judging',
 u'10 just',
 u'10 kids',
 u'10 know',
 u'10 last',
 u'10 laurel',
 u'10 less',
 u'10 like',
 u'10 liked',
 u'10 limping',
 u'10 line',
 u'10 lines',
 u'10 lisa',
 u'10 little',
 u'10 ll',
 u'10 long',
 u'10 looks',
 u'10 low',
 u'10 made',
 u'10 mainly',
 u'10 major',
 u'10 makes',
 u'10 max',
 u'10 may',
 u'10 maybe',
 u'10 mean',
 u'10 meaning',
 u'10 means',
 u'10 meat',
 u'10 men',
 u'10 might',
 u'10 miles',
 u'10 million',
 u'10 millions',
 u'10 min',
 u'10 minus',
 u'10 minute',
 u'10 minutes',
 u'10 modern',
 u'10 month',
 u'10 most',
 u'10 movie',
 u'10 movies',
 u'10 ms',
 u'10 music',
 u'10 my',
 u'10 no',
 u'10 nominations',
 u'10 nostalgia',
 u'10 not',
 u'10 now',
 u'10 nudity',
 u'10 obviously',
 u'10 of',
 u'10 oh',
 u'10 on',
 u'10 one',
 u'10 only',
 u'10 option',
 u'10 or',
 u'10 other',
 u'10 out',
 u'10 overall',
 u'10 pack',
 u'10 page',
 u'10 pages',
 u'10 parking',
 u'10 patiently',
 u'10 people',
 u'10 percent',
 u'10 pints',
 u'10 plan',
 u'10 plot',
 u'10 point',
 u'10 points',
 u'10 possible',
 u'10 pounds',
 u'10 probably',
 u'10 purely',
 u'10 quid',
 u'10 ranking',
 u'10 rate',
 u'10 rated',
 u'10 rating',
 u'10 ratings',
 u'10 really',
 u'10 reasons',
 u'10 remember',
 u'10 resulting',
 u'10 rjt',
 u'10 sadly',
 u'10 sam',
 u'10 save',
 u'10 saw',
 u'10 scale',
 u'10 scales',
 u'10 score',
 u'10 scores',
 u'10 script',
 u'10 seasons',
 u'10 seaton',
 u'10 second',
 u'10 seconds',
 u'10 seem',
 u'10 sentences',
 u'10 seriously',
 u'10 set',
 u'10 shame',
 u'10 she',
 u'10 ship',
 u'10 should',
 u'10 simply',
 u'10 since',
 u'10 snoozes',
 u'10 so',
 u'10 social',
 u'10 solely',
 u'10 some',
 u'10 sometimes',
 u'10 sorry',
 u'10 sounds',
 u'10 spacey',
 u'10 special',
 u'10 speed',
 u'10 spoilers',
 u'10 star',
 u'10 stars',
 u'10 steps',
 u'10 story',
 u'10 storyline',
 u'10 strictly',
 u'10 strongly',
 u'10 sublime',
 u'10 sure',
 u'10 tend',
 u'10 terrible',
 u'10 than',
 u'10 that',
 u'10 the',
 u'10 their',
 u'10 then',
 u'10 there',
 u'10 they',
 u'10 things',
 u'10 think',
 u'10 this',
 u'10 those',
 u'10 though',
 u'10 thought',
 u'10 thousand',
 u'10 threw',
 u'10 thumbs',
 u'10 time',
 u'10 times',
 u'10 to',
 u'10 too',
 u'10 total',
 u'10 truly',
 u'10 trying',
 u'10 turds',
 u'10 two',
 u'10 unnecessary',
 u'10 until',
 u'10 unwatchable',
 u'10 updated',
 u'10 used',
 u'10 very',
 u'10 viewers',
 u'10 vote',
 u'10 votes',
 u'10 was',
 u'10 were',
 u'10 what',
 u'10 when',
 u'10 where',
 u'10 which',
 u'10 while',
 u'10 why',
 u'10 will',
 u'10 wish',
 u'10 with',
 u'10 without',
 u'10 won',
 u'10 words',
 u'10 works',
 u'10 worms',
 u'10 worst',
 u'10 would',
 u'10 writer',
 u'10 x10',
 u'10 year',
 u'10 years',
 u'10 yikes',
 u'10 you',
 u'10 yr',
 u'10 yuck',
 u'100 000',
 u'100 acting',
 u'100 action',
 u'100 adrenalin',
 u'100 agent',
 u'100 albanian',
 u'100 american',
 u'100 and',
 u'100 bad',
 u'100 because',
 u'100 behind',
 u'100 believable',
 u'100 better',
 u'100 bills',
 u'100 boring',
 u'100 but',
 u'100 can',
 u'100 cardboard',
 u'100 cartoons',
 u'100 certain',
 u'100 clunks',
 u'100 comedies',
 u'100 comes',
 u'100 concerned',
 u'100 crores',
 u'100 day',
 u'100 degrees',
 u'100 dry',
 u'100 episodes',
 u'100 every',
 u'100 fake',
 u'100 family',
 u'100 fantasy',
 u'100 feet',
 u'100 films',
 u'100 for',
 u'100 french',
 u'100 ft',
 u'100 full',
 u'100 generally',
 u'100 grand',
 u'100 grueling',
 u'100 guys',
 u'100 hand',
 u'100 hell',
 u'100 hiroshimas',
 u'100 hopefully',
 u'100 however',
 u'100 ideas',
 u'100 if',
 u'100 imdb',
 u'100 injured',
 u'100 inspiring',
 u'100 iq',
 u'100 is',
 u'100 it',
 u'100 kills',
 u'100 list',
 u'100 loyal',
 u'100 make',
 u'100 maniacs',
 u'100 manics',
 u'100 material',
 u'100 men',
 u'100 meters',
 u'100 miles',
 u'100 million',
 u'100 mins',
 u'100 minute',
 u'100 minutes',
 u'100 mistakes',
 u'100 mock',
 u'100 more',
 u'100 mormon',
 u'100 movies',
 u'100 mph',
 u'100 notes',
 u'100 odd',
 u'100 of',
 u'100 on',
 u'100 page',
 u'100 people',
 u'100 per',
 u'100 percent',
 u'100 perfection',
 u'100 planes',
 u'100 plus',
 u'100 pointless',
 u'100 police',
 u'100 policemen',
 u'100 poo',
 u'100 predictable',
 u'100 pure',
 u'100 quotes',
 u'100 ranked',
 u'100 really',
 u'100 retarded',
 u'100 serb',
 u'100 serious',
 u'100 so',
 u'100 somewhere',
 u'100 south',
 u'100 still',
 u'100 straws',
 u'100 super',
 u'100 sure',
 u'100 synonyms',
 u'100 tax',
 u'100 terrible',
 u'100 that',
 u'100 the',
 u'100 these',
 u'100 think',
 u'100 times',
 u'100 title',
 u'100 to',
 u'100 unrealistic',
 u'100 untrue',
 u'100 visual',
 u'100 votes',
 u'100 was',
 u'100 waste',
 u'100 west',
 u'100 without',
 u'100 words',
 u'100 worst',
 u'100 wouldn',
 u'100 yards',
 u'100 yds',
 u'100 year',
 u'100 years',
 u'100 yrds',
 u'100 zombies',
 u'1000 2000',
 u'1000 bullets',
 u'1000 corpses',
 u'1000 dollar',
 u'1000 in',
 u'1000 month',
 u'1000 movies',
 u'1000 of',
 u'1000 opportunities',
 u'1000 other',
 u'1000 the',
 u'1000 times',
 u'1000 to',
 u'1000 votes',
 u'1000 whoever',
 u'1000 word',
 u'1000 words',
 u'1000000 bucks',
 u'1000lb saber',
 u'1000s of',
 u'1001 questions',
 u'100b out',
 u'100min expect',
 u'100mph at',
 u'100s press',
 u'100th grade',
 u'100th time',
 u'100x better',
 u'100x what',
 u'100yards and',
 u'101 and',
 u'101 blind',
 u'101 dalamatians',
 u'101 dalmatians',
 u'101 dalmations',
 u'101 deliver',
 u'101 end',
 u'101 how',
 u'101 minute',
 u'101 minutes',
 u'101 other',
 u'101 scene',
 u'101 scrawl',
 u'101 script',
 u'101 teacher',
 u'101 tell',
 u'101 there',
 u'101 with',
 u'102 dalmatians',
 u'102 dalmations',
 u'102 delta',
 u'102 lets',
 u'102 minute',
 u'102 minutes',
 u'103 minutes',
 u'104 minutes',
 u'104 pal',
 u'105 lb',
 u'105 minutes',
 u'105 year',
 u'1050 us',
 u'105lbs bust',
 u'106 minutes',
 u'107 don',
 u'107 minute',
 u'107 minutes',
 u'108 films',
 u'108 minutes',
 u'108 movies',
 u'108 odd',
 u'109 minute',
 u'109 minutes',
 u'10another one',
 u'10as this',
 u'10cheesiness 10',
 u'10hulkamaniacs hulk',
 u'10i don',
 u'10lame meter',
 u'10lines never',
 u'10mil limited',
 u'10min then',
 u'10minutes all',
 u'10minutes on',
 u'10objectionable things',
 u'10overall 10another',
 u'10overall 10watch',
 u'10overall rating',
 u'10overall too',
 u'10p showing',
 u'10p simon',
 u'10p what',
 u'10pm so',
 u'10pm till',
 u'10quality 10',
 u'10rated lot',
 u'10rating as',
 u'10s flippantly',
 u'10s here',
 u'10s is',
 u'10syed shabbir',
 u'10th century',
 u'10th modesty',
 u'10th scene',
 u'10th the',
 u'10th time',
 u'10the alliance',
 u'10the dream',
 u'10the egg',
 u'10the vipers',
 u'10this has',
 u'10total 10syed',
 u'10umney last',
 u'10watch it',
 u'10www residenthazard',
 u'10x less',
 u'10x smaller',
 u'10yr old',
 u'11 00',
 u'11 07',
 u'11 10',
 u'11 12',
 u'11 13',
 u'11 2001',
 u'11 2002',
 u'11 2009',
 u'11 24',
 u'11 activism',
 u'11 actors',
 u'11 all',
 u'11 and',
 u'11 angelo',
 u'11 apostles',
 u'11 around',
 u'11 as',
 u'11 attacks',
 u'11 being',
 u'11 bucks',
 u'11 christian',
 u'11 decided',
 u'11 did',
 u'11 dollars',
 u'11 eur',
 u'11 exactly',
 u'11 filmed',
 u'11 for',
 u'11 friday',
 u'11 go',
 u'11 grease',
 u'11 happened',
 u'11 hit',
 u'11 if',
 u'11 in',
 u'11 is',
 u'11 jackson',
 u'11 just',
 u'11 like',
 u'11 luke',
 u'11 miles',
 u'11 minutes',
 u'11 mission',
 u'11 movie',
 u'11 of',
 u'11 on',
 u'11 or',
 u'11 other',
 u'11 pair',
 u'11 people',
 u'11 perpetrators',
 u'11 pm',
 u'11 reckless',
 u'11 satire',
 u'11 security',
 u'11 september',
 u'11 shame',
 u'11 she',
 u'11 shohei',
 u'11 so',
 u'11 style',
 u'11 the',
 u'11 there',
 u'11 this',
 u'11 to',
 u'11 using',
 u'11 uwe',
 u'11 was',
 u'11 wasn',
 u'11 watched',
 u'11 we',
 u'11 went',
 u'11 when',
 u'11 will',
 u'11 world',
 u'11 ya',
 u'11 year',
 u'11 years',
 u'11 yr',
 u'110 concentration',
 u'110 minute',
 u'110 minutes',
 u'110 story',
 u'110 year',
 u'1100 ad',
 u'1100ad and',
 u'111 minute',
 u'112 for',
 u'112 minutes',
 u'114 minutes',
 u'1146 the',
 u'115 minute',
 u'115 minutes',
 u'116 minutes',
 u'116 wasted',
 u'117 and',
 u'117 aside',
 u'117 spectacle',
 u'11f tigers',
 u'11m of',
 u'11th 2001',
 u'11th grader',
 u'11th or',
 u'12 000',
 u'12 00pm',
 u'12 01',
 u'12 01pm',
 u'12 10',
 u'12 100',
 u'12 13',
 u'12 14',
 u'12 15',
 u'12 1993',
 u'12 1am',
 u'12 2002',
 u'12 2009',
 u'12 24',
 u'12 26',
 u'12 32',
 u'12 actors',
 u'12 although',
 u'12 amazing',
 u'12 and',
 u'12 angry',
 u'12 apologise',
 u'12 but',
 u'12 clock',
 u'12 dancers',
 u'12 demerit',
 u'12 different',
 u'12 dollars',
 u'12 euros',
 u'12 featured',
 u'12 films',
 u'12 foot',
 u'12 for',
 u'12 ft',
 u'12 good',
 u'12 goodbye',
 u'12 guess',
 u'12 handcuff',
 u'12 hour',
 u'12 hours',
 u'12 if',
 u'12 in',
 u'12 it',
 u'12 miles',
 u'12 million',
 u'12 minute',
 u'12 minutes',
 u'12 monkeys',
 u'12 months',
 u'12 most',
 u'12 never',
 u'12 of',
 u'12 or',
 u'12 people',
 u'12 pounder',
 u'12 quite',
 u'12 really',
 u'12 remix',
 u'12 sept',
 u'12 sexy',
 u'12 she',
 u'12 should',
 u'12 so',
 u'12 that',
 u'12 the',
 u'12 this',
 u'12 through',
 u'12 times',
 u'12 to',
 u'12 us',
 u'12 was',
 u'12 well',
 u'12 what',
 u'12 when',
 u'12 which',
 u'12 who',
 u'12 worst',
 u'12 yards',
 u'12 year',
 u'12 years',
 u'120 minutes',
 u'120 page',
 u'120 years',
 u'1200 bc',
 u'1200 miles',
 u'1200f degree',
 u'1201 alarms',
 u'1202 and',
 u'123 000',
 u'123 million',
 u'123 minutes',
 u'12383499143743701 bullets',
 u'125 150',
 u'125 lbs',
 u'128 minutes',
 u'12a certificate',
 u'12a here',
 u'12a torchwood',
 u'12m was',
 u'12s for',
 u'12th 14th',
 u'12th and',
 u'12th century',
 ...]

In [93]:
# Boolean column marking the reviews in which feature 0 (the bigram '00 am') occurs.
t = bg[:, 0] > 0
np.array(neg_reviews)[t.toarray().ravel()]


Out[93]:
array([ u'I saw this film when it was first released. The memory of how bad it was has stayed with me almost forty years. I didn\'t want to trust my own sentiments about the movie when I saw it, so I consulted a movie review published in a major metropolitan newspaper the next day- sentiment confirmed, the reviewer wrote that the movie was incoherent, indecipherable, and uninspiring. A little research reveals that the producer was star Leslie Caron\'s husband, thus the whiff of nepotism suggests the beginning for this awful film. The film\'s roster of many capable actors - Caron, Warren Oates, Scatman Crothers, Gloria Grahame, and James Sikking among others - suggests that it holds some promise. But the death of this film is attributable to its terrible screenplay. The "mystery" implicated is so obscure and so little revealed throughout the film that the viewer is left perplexed from scene to scene. The movie seems torn between being a detective mystery and an espionage thriller, but never settles upon one or the other. The sense of suspense is entirely absent. The main characters settle on playing dry, emotionless types in a fashion that inspires no empathy whatsoever. The cinematography is pedestrian. The result is that the hapless viewer loses interest in the characters, the plot, and, in the end, the film itself. I am little surprised that there is no version of this pathetic film available to purchase. I hope that if TCM finds a print of this film and feels compelled to air it that it is safely relegated to the 4:00 am slot.',
       u'One of the five worst movies I have ever watched. And I\'m not exaggerating. In fact, I recommend watching it so you can get the same feeling of incredulity as you might by watching Showgirls.Out of 400 votes, the movie gets a user rating of 5.3/10. But there is a disproportionate number of voters who gave it a 10/10, probably due to the message of the movie - nuclear weapons are the bane of mankind. Chuck Murdock is an all-star little league pitcher who gives up baseball because there are nuclear weapons. Soon "Amazing Grace" Smith is an all-star Boston Celtic who is inspired by Chuck\'s story and gives up basketball. Soon all sports leagues from the professional level to college to high school to little league dismantle in a world-wide protest. Later all the children of the world go on a silence strike. This inspires the President of the United States to meet with the Soviet Premier, who in time agree to eliminate all nuclear weapons in time for the start of the next Little League season. The movie ends with Chuck about to throw out the first pitch, with the President telling his new best friend Chuck not to worry about striking out every batter, as he hasn\'t thrown a baseball in a year.Somewhere along the line a nefarious underworld boss kills Amazing Grace. When the President finds out he is told that the FBI can verify the killer but will never be able to prove it. So the President calls the underworld boss ("But it\'s one a.m." "I don\'t care, get him on the line") and tell him that he is to resign from all company boards that he sits on and sell all stocks that he has. And to not get out of line again.Honestly, this movie was so crappy that I couldn\'t turn it off. It was on television from 2:30 am to 4:00 am, and I watched it all. I wasn\'t turned off by the anti-nuclear weapons propaganda. I was turned off by the implausible break down of all organized sports. I don\'t even understand why "Amazing Grace" Smith was killed. 
And with all these famous athletes becoming Chuck\'s friends, why the father was constantly upset with his son taking a principled stand. And there was the clich\u0102\u0160 moment near the end when dad tells Chuck, "I never told you this, but I\'m proud of you." Cue hug.'], 
      dtype='<U8681')
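
The boolean mask above selects the reviews whose count in column 0 (the bigram "00 am") is non-zero. The same idea in plain Python, as a sketch on toy data with a hypothetical helper name:

```python
def docs_with_term(rows, docs, column):
    """Return the documents whose count in `column` is greater than zero."""
    return [doc for row, doc in zip(rows, docs) if row[column] > 0]


# Toy count matrix: one row per document, one column per term.
toy_rows = [[1, 0], [0, 2], [3, 1]]
toy_texts = ['first', 'second', 'third']
print(docs_with_term(toy_rows, toy_texts, 0))
```

This is handy for spot-checking what an odd-looking feature actually matches in the corpus.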

In [41]:
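# Run out of order (In [41]): judging by the output, `text` was then a textacy document built from neg_reviews[0], not the string assigned in In [55].
[t for t in text.tokens]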


Out[41]:
[Trying,
 to,
 conceive,
 of,
 something,
 as,
 insipid,
 as,
 THE,
 SENTINEL,
 would,
 be,
 pretty,
 difficult,
 .,
 The,
 problems,
 are,
 many,
 .,
 The,
 result,
 is,
 terrible,
 and,
 loaded,
 with,
 plot,
 holes,
 .,
 Michael,
 Douglas,
 stars,
 as,
 Pete,
 Garrison,
 ,,
 a,
 Secret,
 Service,
 agent,
 who,
 ",
 took,
 one,
 ",
 for,
 Reagan,
 during,
 the,
 attempt,
 on,
 his,
 life,
 .,
 Years,
 later,
 we,
 find,
 Pete,
 assigned,
 to,
 the,
 Whitehouse,
 Family,
 ,,
 mainly,
 as,
 a,
 guard,
 for,
 the,
 First,
 Lady,
 (,
 Kim,
 Basinger,
 ,,
 L.A.,
 CONFIDENTIAL,
 ),
 .,
 Troubles,
 arise,
 as,
 we,
 see,
 Pete,
 's,
 close,
 involvement,
 with,
 the,
 First,
 Lady,
 ,,
 and,
 a,
 sudden,
 threat,
 against,
 the,
 President,
 himself,
 (,
 David,
 Rasche,
 ,,
 UNITED,
 93,
 ),
 .,
 When,
 Pete,
 fails,
 a,
 polygraph,
 test,
 ,,
 he,
 's,
 singled,
 out,
 as,
 a,
 disgruntled,
 agent,
 by,
 investigator,
 David,
 Breckinridge,
 (,
 Kiefer,
 Sutherland,
 ,,
 24,
 TV,
 series).As,
 the,
 presidential,
 assassination,
 plot,
 unfolds,
 ,,
 Pete,
 finds,
 himself,
 on,
 the,
 run,
 from,
 his,
 own,
 people,
 .,
 His,
 only,
 confidant,
 is,
 the,
 First,
 Lady,
 ,,
 and,
 she,
 's,
 reluctant,
 to,
 tell,
 anyone,
 about,
 their,
 affections,
 for,
 one,
 another,
 (,
 which,
 is,
 why,
 Pete,
 failed,
 the,
 polygraph,
 in,
 the,
 first,
 place,
 ),
 .,
 But,
 is,
 Pete,
 really,
 innocent,
 ?,
 Or,
 is,
 he,
 simply,
 trying,
 to,
 buy,
 time,
 until,
 he,
 can,
 kill,
 the,
 President,
 ?,
 If,
 he,
 is,
 innocent,
 ,,
 how,
 can,
 he,
 help,
 prevent,
 the,
 assassination,
 attempt,
 while,
 running,
 from,
 the,
 Secret,
 Service,
 ?,
 The,
 one,
 ,,
 big,
 ,,
 overwhelming,
 problem,
 with,
 this,
 film,
 is,
 that,
 there,
 's,
 no,
 justification,
 for,
 the,
 reason,
 behind,
 the,
 presidential,
 threat,
 .,
 Is,
 n't,
 that,
 what,
 the,
 movie,
 's,
 supposed,
 to,
 be,
 about,
 ?,
 One,
 would,
 think,
 so,
 !,
 But,
 the,
 audience,
 is,
 never,
 let,
 in,
 on,
 why,
 the,
 assassin(s,
 ),
 want,
 to,
 kill,
 the,
 Prez,
 .,
 Hmm,
 .,
 Someone,
 forget,
 to,
 put,
 that,
 in,
 the,
 script,
 somewhere,
 ?,
 And,
 what,
 's,
 with,
 David,
 Breckinridge,
 's,
 (,
 Kiefer,
 's,
 ),
 new,
 partner,
 ,,
 Jill,
 Marin,
 (,
 Eva,
 Longoria,
 ,,
 CARLITA,
 'S,
 WAY,
 ),
 ?,
 Seems,
 that,
 she,
 was,
 put,
 in,
 the,
 film,
 strictly,
 as,
 a,
 piece,
 of,
 a$$-candy,
 .,
 What,
 was,
 her,
 purpose,
 again,
 ?,
 Did,
 she,
 do,
 anything,
 other,
 than,
 look,
 nice,
 in,
 tight,
 pants,
 and,
 a,
 low,
 -,
 cut,
 blouse?There,
 are,
 so,
 many,
 problems,
 with,
 the,
 basic,
 premise,
 of,
 The,
 Sentinel,
 as,
 to,
 be,
 laughable,
 .,
 The,
 action,
 is,
 too,
 easily,
 stymied,
 by,
 the,
 ",
 What,
 the,
 ...,
 ?,
 ",
 responses,
 sure,
 to,
 be,
 uttered,
 by,
 those,
 unfortunate,
 enough,
 to,
 watch,
 the,
 movie,
 .]
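
Note the tokens `series).As` and `blouse?There` in the output above. In the raw review these positions hold `<br /><br />`, so the tags appear to have been removed without leaving any whitespace behind, and the tokenizer cannot split the two words. A minimal sketch of one way to avoid this, assuming the raw review strings are cleaned before the document is built. `strip_br_tags` is a hypothetical helper, not part of this notebook or of textacy:

```python
import re

# One or more <br>, <br/> or <br /> tags, plus any whitespace around them.
BR_TAG = re.compile(r'\s*(?:<br\s*/?>\s*)+', flags=re.IGNORECASE)

def strip_br_tags(raw_review):
    """Hypothetical helper: replace each run of <br /> tags with one space."""
    return BR_TAG.sub(' ', raw_review).strip()

# Fragment copied from the first negative review.
sample = ('Kiefer Sutherland, 24 TV series).<br /><br />'
          'As the presidential assassination plot unfolds')
cleaned = strip_br_tags(sample)
print(cleaned)
```

With the tag replaced by a space, the fragment reads `... 24 TV series). As the presidential ...`, which gives the tokenizer an ordinary word boundary to work with.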

In [42]:
# materialize the document's sentences as a list so the notebook displays them
list(text.sents)


Out[42]:
[Trying to conceive of something as insipid as THE SENTINEL would be pretty difficult.,
 The problems are many.,
 The result is terrible and loaded with plot holes.,
 Michael Douglas stars as Pete Garrison, a Secret Service agent who "took one" for Reagan during the attempt on his life.,
 Years later we find Pete assigned to the Whitehouse Family, mainly as a guard for the First Lady (Kim Basinger, L.A. CONFIDENTIAL).,
 Troubles arise as we see Pete's close involvement with the First Lady, and a sudden threat against the President himself (David Rasche, UNITED 93).,
 When Pete fails a polygraph test, he's singled out as a disgruntled agent by investigator David Breckinridge (Kiefer Sutherland, 24 TV series).As the presidential assassination plot unfolds, Pete finds himself on the run from his own people.,
 His only confidant is the First Lady, and she's reluctant to tell anyone about their affections for one another (which is why Pete failed the polygraph in the first place).,
 But is Pete really innocent?,
 Or is he simply trying to buy time until he can kill the President?,
 If he is innocent, how can he help prevent the assassination attempt while running from the Secret Service?,
 The one, big, overwhelming problem with this film is that there's no justification for the reason behind the presidential threat.,
 Isn't that what the movie's supposed to be about?,
 One would think so!,
 But the audience is never let in on why the assassin(s) want to kill the Prez.,
 Hmm.,
 Someone forget to put that in the script somewhere?,
 And what's with David Breckinridge's (Kiefer's) new partner, Jill Marin (Eva Longoria, CARLITA'S WAY)?,
 Seems that she was put in the film strictly as a piece of a$$-candy.,
 What was her purpose again?,
 Did she do anything other than look nice in tight pants and a low-cut blouse?There are so many problems with the basic premise of The Sentinel as to be laughable.,
 The action is too easily stymied by the "What the...?" responses sure to be uttered by those unfortunate enough to watch the movie.]
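
Two entries in the output above each contain two sentences: the ones with `series).As` and `blouse?There`. Both fall where the raw review had `<br /><br />`. A rough way to spot such cases is to look for sentence-ending punctuation immediately followed by a capital letter. The sketch below is only a heuristic, and `find_fused_sentences` is a hypothetical helper that works on plain strings (for spaCy spans, one would pass their text):

```python
import re

# A lowercase letter or ")" followed by . ? or ! and then a capital letter,
# with no space in between. Requiring the lowercase letter or ")" first
# keeps abbreviations such as "L.A." from matching.
FUSED_BOUNDARY = re.compile(r'[a-z)][.?!][A-Z]')

def find_fused_sentences(sentences):
    """Hypothetical helper: return sentences that look like two fused ones."""
    return [s for s in sentences if FUSED_BOUNDARY.search(s)]

# Three sentences copied from the output above.
sentences = [
    'But is Pete really innocent?',
    'Years later we find Pete assigned to the Whitehouse Family, mainly as '
    'a guard for the First Lady (Kim Basinger, L.A. CONFIDENTIAL).',
    'Did she do anything other than look nice in tight pants and a low-cut '
    'blouse?There are so many problems with the basic premise of '
    'The Sentinel as to be laughable.',
]
flagged = find_fused_sentences(sentences)
print(len(flagged))
```

Only the third sentence is flagged here. The pattern will miss fused boundaries that follow a digit or a quote, so it is a quick sanity check rather than a replacement for cleaning the input.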