Dataset

Download the dataset (IMDB comments) and save it to a directory of your choice.


In [1]:
import pandas as pd # Used for dataframe functions
import json # Parse JSON strings
import nltk # Natural Language Toolkit for tokenization, TF-IDF, etc.
from bs4 import BeautifulSoup # Parse HTML strings to extract text
import re # Regex parser
import numpy as np # Linear algebra
from sklearn import * # Machine learning (brings in submodules such as metrics, linear_model, pipeline)
import matplotlib.pyplot as plt # Visualization


# WordCloud does not work on Windows.
# Comment out the imports below if you want to skip the word cloud visualization.
from wordcloud import WordCloud # Word cloud visualization
import scipy # Sparse matrices

np.set_printoptions(precision=4)
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 10
pd.options.display.float_format = lambda f: "%.4f" % f

%matplotlib inline

Run the following lines the first time you run this notebook on your system.


In [2]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('averaged_perceptron_tagger')
nltk.download("vader_lexicon")


[nltk_data] Downloading package punkt to /Users/abasar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abasar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/abasar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/abasar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/abasar/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Out[2]:
True

In [3]:
print(nltk.__version__)


3.4.5

Now let's see how to create a text classifier using NLTK and scikit-learn.


In [4]:
# The following shell command does not work on Windows systems
!head -n 1 /data/imdb-comments.json


{"label":"test","sentiment":"pos","name":"0_10.txt","content":"I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."}

In [5]:
data = []
# The file is in JSON Lines format: one JSON document per line
with open("/data/imdb-comments.json", "r", encoding="utf8") as f:
    for l in f:
        data.append(json.loads(l))

In [6]:
comments = pd.DataFrame.from_dict(data)
comments.sample(10)


Out[6]:
label sentiment name content
11965 test pos 951_10.txt if u haven't seen Vijay in "Ghillli", "Gilly" ...
9445 test pos 7251_9.txt Robert Carlyle excels again. The period was ca...
39395 train neg 11706_3.txt A novel by Remarque. A cast that looks great o...
1160 test pos 11044_7.txt 'Anne Christie' was Garbo's 14th film and the ...
21277 test neg 6650_1.txt This is one of those movies where I was rootin...
6957 test pos 5011_10.txt I saw the movie at the Nashville film festival...
49884 train neg 9898_1.txt This version of "Moby Dick" insults the audien...
26891 train pos 11702_10.txt A Thief in the Night has got to be the best ou...
24352 test neg 9418_2.txt When you see Barry Corbin in the cast of a mov...
38956 train neg 11310_1.txt First of all, yes, animals have emotions. If y...

In [7]:
comments.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
label        50000 non-null object
sentiment    50000 non-null object
name         50000 non-null object
content      50000 non-null object
dtypes: object(4)
memory usage: 1.5+ MB

In [8]:
comments.label.value_counts()


Out[8]:
train    25000
test     25000
Name: label, dtype: int64

In [9]:
comments.groupby(["label", "sentiment"]).content.count().unstack()


Out[9]:
sentiment neg pos
label
test 12500 12500
train 12500 12500

In [10]:
np.random.seed(1)
v = list(comments["content"].sample(1))[0]
v


Out[10]:
'When we started watching this series on cable, I had no idea how addictive it would be. Even when you hate a character, you hold back because they are so beautifully developed, you can almost understand why they react to frustration, fear, greed or temptation the way they do. It\'s almost as if the viewer is experiencing one of Christopher\'s learning curves.<br /><br />I can\'t understand why Adriana would put up with Christopher\'s abuse of her, verbally, physically and emotionally, but I just have to read the newspaper to see how many women can and do tolerate such behavior. Carmella has a dream house, endless supply of expensive things, but I\'m sure she would give it up for a loving and faithful husband - or maybe not. That\'s why I watch.<br /><br />It doesn\'t matter how many times you watch an episode, you can find something you missed the first five times. We even watch episodes out of sequence (watch season 1 on late night with commercials but all the language, A&E with language censored, reruns on the Movie Network) - whenever they\'re on, we\'re there. We\'ve been totally spoiled now.<br /><br />I also love the Malaprop\'s. "An albacore around my neck" is my favorite of Johnny Boy. When these jewels have entered our family vocabulary, it is a sign that I should get a life. I will when the series ends, and I have collected all the DVD\'s, and put the collection in my will.'

In [11]:
comments.head()


Out[11]:
label sentiment name content
0 test pos 0_10.txt I went and saw this movie last night after bei...
1 test pos 10000_7.txt Actor turned director Bill Paxton follows up h...
2 test pos 10001_9.txt As a recreational golfer with some knowledge o...
3 test pos 10002_8.txt I saw this film in a sneak preview, and it is ...
4 test pos 10003_8.txt Bill Paxton has taken the true story of the 19...

In [12]:
comments["content"].values[0]


Out[12]:
"I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."

VADER Sentiment Analysis


In [13]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [14]:
sia.polarity_scores(comments["content"].values[0])


Out[14]:
{'neg': 0.085, 'neu': 0.775, 'pos': 0.14, 'compound': 0.7956}
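
The compound field is VADER's overall score, normalized to the range [-1, 1]. As a quick sanity check (a minimal sketch), clearly positive and clearly negative toy sentences should land near the two ends of that range:


In [ ]:
# compound is normalized to [-1, 1]: near +1 for positive text, near -1 for negative text
for s in ["This movie was great!", "This movie was terrible."]:
    print(s, "->", sia.polarity_scores(s)["compound"])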

In [15]:
def sentiment_score(text):
    return sia.polarity_scores(text)["compound"]

sentiment_score(comments["content"].values[0])


Out[15]:
0.7956

In [16]:
%%time 
comments["vader_score"] = comments["content"].apply(lambda text: sentiment_score(text))


CPU times: user 1min 39s, sys: 192 ms, total: 1min 39s
Wall time: 1min 39s

In [17]:
comments["vader_sentiment"] = np.where(comments["vader_score"]>0, "pos", "neg")

In [18]:
comments.head()


Out[18]:
label sentiment name content vader_score vader_sentiment
0 test pos 0_10.txt I went and saw this movie last night after bei... 0.7956 pos
1 test pos 10000_7.txt Actor turned director Bill Paxton follows up h... 0.9948 pos
2 test pos 10001_9.txt As a recreational golfer with some knowledge o... 0.9726 pos
3 test pos 10002_8.txt I saw this film in a sneak preview, and it is ... 0.9266 pos
4 test pos 10003_8.txt Bill Paxton has taken the true story of the 19... 0.9906 pos

In [19]:
comments.vader_sentiment.value_counts()


Out[19]:
pos    33008
neg    16992
Name: vader_sentiment, dtype: int64

In [20]:
print(metrics.classification_report(comments["sentiment"], comments["vader_sentiment"]))


              precision    recall  f1-score   support

         neg       0.79      0.54      0.64     25000
         pos       0.65      0.86      0.74     25000

    accuracy                           0.70     50000
   macro avg       0.72      0.70      0.69     50000
weighted avg       0.72      0.70      0.69     50000

As we see above, the overall accuracy is around 0.70, and the VADER model performed noticeably better on positive reviews than on negative ones (recall of 0.86 vs 0.54); the quick check below makes that asymmetry concrete. A statistical model built on TF-IDF features generally performs better, so let's build one next.
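
A minimal sketch reusing the comments dataframe and the metrics module imported above; rows are the true labels and columns are VADER's predictions, in the order ["neg", "pos"]:


In [ ]:
# A large off-diagonal count in the first row means many negative reviews were labelled "pos"
print(metrics.confusion_matrix(comments["sentiment"],
                               comments["vader_sentiment"],
                               labels=["neg", "pos"]))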

Sentiment Analysis Using a Statistical Model with TF-IDF


In [21]:
def preprocess(text):
    
    # Remove HTML tags
    text = BeautifulSoup(text.lower(), "html5lib").text
    
    # Replace runs of one or more non-word characters with a single space (" ")
    text = re.sub(r"[\W]+", " ", text)
    return text

preprocess(v)


Out[21]:
'when we started watching this series on cable i had no idea how addictive it would be even when you hate a character you hold back because they are so beautifully developed you can almost understand why they react to frustration fear greed or temptation the way they do it s almost as if the viewer is experiencing one of christopher s learning curves i can t understand why adriana would put up with christopher s abuse of her verbally physically and emotionally but i just have to read the newspaper to see how many women can and do tolerate such behavior carmella has a dream house endless supply of expensive things but i m sure she would give it up for a loving and faithful husband or maybe not that s why i watch it doesn t matter how many times you watch an episode you can find something you missed the first five times we even watch episodes out of sequence watch season 1 on late night with commercials but all the language a e with language censored reruns on the movie network whenever they re on we re there we ve been totally spoiled now i also love the malaprop s an albacore around my neck is my favorite of johnny boy when these jewels have entered our family vocabulary it is a sign that i should get a life i will when the series ends and i have collected all the dvd s and put the collection in my will '

In [22]:
%%time
# Apply the preprocessing logic to all comments
comments["content"] = comments["content"].apply(preprocess)


CPU times: user 30.9 s, sys: 57.6 ms, total: 30.9 s
Wall time: 30.9 s

In [23]:
comments_train = comments[comments["label"] == "train"]
comments_train.sample(10)


Out[23]:
label sentiment name content vader_score vader_sentiment
25496 train pos 10447_10.txt there are many people in our lives that we mee... 0.9814 pos
41520 train neg 2369_4.txt sixth escapade for freddy krueger in which he ... 0.9802 pos
48525 train neg 8674_2.txt the first few minutes of the bodyguard do have... 0.4435 pos
26037 train pos 10934_10.txt after reviewing this intense martial arts movi... 0.6862 pos
43823 train neg 4441_3.txt triumph of love is proof that not every coméd... 0.9217 pos
39666 train neg 11950_2.txt this is without a doubt the worst movie i have... 0.1877 pos
40882 train neg 1795_2.txt i had the opportunity to preview this film as ... 0.6735 pos
43156 train neg 3841_1.txt roeg has done some great movies but this a tur... 0.6858 pos
37185 train pos 9718_7.txt possible spoiler in some way how to alienate f... 0.9909 pos
40330 train neg 1298_3.txt sogo ishii can be a skilled filmmaker under th... -0.9517 neg

In [24]:
comments_test = comments[comments["label"] == "test"]
comments_test.sample(10)


Out[24]:
label sentiment name content vader_score vader_sentiment
21277 test neg 6650_1.txt this is one of those movies where i was rootin... -0.9583 neg
12620 test neg 10108_1.txt spoilersi m going to be as kind as i can about... -0.1144 neg
4485 test pos 2788_9.txt as a former erasmus student i enjoyed this fil... 0.9514 pos
4744 test pos 301_10.txt francis ford coppola s masterpiece was a great... -0.8899 neg
10329 test pos 8047_10.txt some of the reviewers here have foolishly judg... 0.9353 pos
14366 test neg 11680_1.txt this is by far one of the most pretentious fil... -0.0498 neg
1007 test pos 10907_8.txt this is truly a funny movie his dance scene do... 0.9489 pos
4539 test pos 2836_8.txt this film released in 1951 has the usual eleme... 0.1714 pos
7864 test pos 5829_10.txt one of the best tv shows out there if not the ... 0.9692 pos
19582 test neg 5124_4.txt this movie was jerry bruckheimer s idea to sel... 0.9559 pos

In [25]:
X_train = comments_train["content"].values
y_train = np.where(comments_train.sentiment == "pos", 1, 0)

In [26]:
X_test = comments_test["content"].values
y_test = np.where(comments_test.sentiment == "pos", 1, 0)

In [27]:
# http://snowball.tartarus.org/algorithms/porter/stemmer.html
# http://www.nltk.org/howto/stem.html

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
print(SnowballStemmer.languages)


('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

In [28]:
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = nltk.wordnet.WordNetLemmatizer()

values = []

for s in nltk.word_tokenize("""
            revival
            allowance 
            inference 
            relational
            runner
            runs
            ran
            has
            having
            generously
            wasn't
            leaves
            swimming
            relative
            relating
            """):
    values.append((s, porter.stem(s)
          , snowball.stem(s), lemmatizer.lemmatize(s, "v")))
    
pd.DataFrame(values, columns = ["original", "porter", "snowball", "lemmatizer"])


Out[28]:
original porter snowball lemmatizer
0 revival reviv reviv revival
1 allowance allow allow allowance
2 inference infer infer inference
3 relational relat relat relational
4 runner runner runner runner
... ... ... ... ...
10 wasn't wasn't wasn't wasn't
11 leaves leav leav leave
12 swimming swim swim swim
13 relative rel relat relative
14 relating relat relat relate

15 rows × 4 columns


In [29]:
stopwords = nltk.corpus.stopwords.words("english")
print(len(stopwords), stopwords)


179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let's drop the following words from the stopword list, since they are likely good indicators of sentiment (a quick illustration follows the next cell).


In [30]:
stopwords.remove("no")
stopwords.remove("nor")
stopwords.remove("not")

In [31]:
sentence = """Financial Services revenues increased $0.5 billion, or 5%, primarily due to
lower impairments and volume growth, partially offset by lower gains."""

stemmer = SnowballStemmer("english")
#stemmer = PorterStemmer()
def my_tokenizer(s):
    terms = nltk.word_tokenize(s.lower())
    #terms = re.split("\s", s.lower())
    #terms = [re.sub(r"[\.!]", "", v) for v in terms if len(v)>2]
    #terms = [v for v in terms if len(v)>2]
    terms = [v for v in terms if v not in stopwords]
    terms = [stemmer.stem(w) for w in terms]
    #terms = [term for term in terms if len(term) > 2]
    return terms 
print(my_tokenizer(sentence))


['financi', 'servic', 'revenu', 'increas', '$', '0.5', 'billion', ',', '5', '%', ',', 'primarili', 'due', 'lower', 'impair', 'volum', 'growth', ',', 'partial', 'offset', 'lower', 'gain', '.']

In [32]:
tfidf = feature_extraction.text.TfidfVectorizer(tokenizer=my_tokenizer, max_df = 0.95, min_df=0.0001 
                                                , ngram_range=(1, 2))

corpus = ["Today is Wednesday"
          , "Delhi weather is hot today."
          , "Delhi roads are not busy in the morning"]

doc_term_matrix = tfidf.fit_transform(corpus)

# tfidf.vocabulary_ maps each term to its column index in the feature matrix
print("Vocabulary: ", tfidf.vocabulary_)


Vocabulary:  {'today': 13, 'wednesday': 18, 'today wednesday': 15, 'delhi': 3, 'weather': 16, 'hot': 6, '.': 0, 'delhi weather': 5, 'weather hot': 17, 'hot today': 7, 'today .': 14, 'road': 11, 'not': 9, 'busi': 1, 'morn': 8, 'delhi road': 4, 'road not': 12, 'not busi': 10, 'busi morn': 2}

In [33]:
columns = [None] * len(tfidf.vocabulary_)
for term in tfidf.vocabulary_:
    columns[tfidf.vocabulary_[term]] = term
columns
scores = pd.DataFrame(doc_term_matrix.toarray()
                      , columns= columns)
scores


Out[33]:
. busi busi morn delhi delhi road delhi weather hot hot today morn not not busi road road not today today . today wednesday weather weather hot wednesday
0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.4736 0.0000 0.6228 0.0000 0.0000 0.6228
1 0.3501 0.0000 0.0000 0.2663 0.0000 0.3501 0.3501 0.3501 0.0000 0.0000 0.0000 0.0000 0.0000 0.2663 0.3501 0.0000 0.3501 0.3501 0.0000
2 0.0000 0.3414 0.3414 0.2597 0.3414 0.0000 0.0000 0.0000 0.3414 0.3414 0.3414 0.3414 0.3414 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

In [34]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [35]:
X_test_tfidf.shape, y_test.shape, X_train_tfidf.shape, y_train.shape


Out[35]:
((25000, 193057), (25000,), (25000, 193057), (25000,))

Let's estimate the memory requirement if the data were represented in dense matrix format.


In [36]:
cell_count = np.product(X_train_tfidf.shape)
bytes = cell_count * 4  # assumes 4-byte (float32) values; the default float64 dtype would double this
GBs = bytes / (1024 ** 3)
GBs


Out[36]:
17.979834228754044

In [37]:
sparsity = 1 - X_train_tfidf.count_nonzero() / cell_count
sparsity


Out[37]:
0.9992081565547999

In [38]:
1 - X_train_tfidf.nnz / cell_count


Out[38]:
0.9992081565547999

In [39]:
print("Type of doc_term_matrix", type(X_train_tfidf))


Type of doc_term_matrix <class 'scipy.sparse.csr.csr_matrix'>

Byte size of the sparse training document-term matrix


In [40]:
print(X_train_tfidf.data.nbytes / (1024.0 ** 3), "GB")


0.02847442775964737 GB
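
data.nbytes counts only the stored nonzero values; a CSR matrix also keeps indices and indptr arrays. A fuller estimate (a minimal sketch) is still far below the ~18 GB dense estimate above:


In [ ]:
# Total CSR footprint = nonzero values + their column indices + row pointers
total_bytes = (X_train_tfidf.data.nbytes
               + X_train_tfidf.indices.nbytes
               + X_train_tfidf.indptr.nbytes)
print(total_bytes / (1024.0 ** 3), "GB")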

Classification Model


In [41]:
%%time
lr = linear_model.LogisticRegression(C = 0.6, random_state = 1
                            , n_jobs = 8, solver="saga")
lr.fit(X_train_tfidf, y_train)
y_train_pred = lr.predict(X_train_tfidf)
y_test_pred = lr.predict(X_test_tfidf)
print("Training accuracy: ", metrics.accuracy_score(y_train, y_train_pred))
print("Test accuracy: ", metrics.accuracy_score(y_test, y_test_pred))


Training accuracy:  0.93476
Test accuracy:  0.88596
CPU times: user 1.01 s, sys: 5.75 ms, total: 1.02 s
Wall time: 1.07 s

In [42]:
fpr, tpr, thresholds = metrics.roc_curve(y_test,
                        lr.predict_proba(X_test_tfidf)[:, 1])
auc = metrics.auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.plot([0,1], [0,1], ls = "--", color = "k")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve, auc: %.4f" % auc);



In [43]:
%%time
from sklearn import naive_bayes, ensemble
bayes = naive_bayes.MultinomialNB(alpha=1)
bayes.fit(X_train_tfidf, y_train)
print("accuracy: ", bayes.score(X_test_tfidf, y_test))


accuracy:  0.86456
CPU times: user 144 ms, sys: 8.79 ms, total: 152 ms
Wall time: 45.1 ms

In [44]:
%%time
est = tree.DecisionTreeClassifier()
est.fit(X_train_tfidf, y_train)
print("accuracy: ", est.score(X_test_tfidf, y_test))


accuracy:  0.72388
CPU times: user 42.6 s, sys: 80.9 ms, total: 42.6 s
Wall time: 41.2 s

In [45]:
columns = [None] * len(tfidf.vocabulary_)
for term in tfidf.vocabulary_:
    columns[tfidf.vocabulary_[term]] = term
result = pd.DataFrame({"feature": columns
                    , "importance": est.feature_importances_})
result = result.sort_values("importance", ascending = False)
result = result[result.importance > 0.0]
print("Top 50 terms: ", list(result.feature[:50]))


Top 50 terms:  ['bad', 'worst', 'wast', 'great', 'poor', 'bore', 'aw', 'no', 'love', 'excel', 'noth', 'stupid', 'perfect', 'best', 'terribl', 'disappoint', 'suppos', 'beauti', 'enjoy', 'tri', 'wors', 'world', 'fail', 'stori', 'one', 'film', 'favorit', 'amaz', 'good', 'plot', 'act', 'annoy', 'like', 'brilliant', 'script', 'time', 'alway', '1', 'reason', '4 10', 'even', 'see', 'save', 'also', 'today', 'mani', 'dull', 'movi', 'show', 'ridicul']

Important terms for a document


In [46]:
vocab_by_term = tfidf.vocabulary_
vocab_by_idx = {idx: term for term, idx in vocab_by_term.items()}

In [47]:
str(vocab_by_term)[:100]


Out[47]:
"{'bromwel': 22087, 'high': 77073, 'cartoon': 24828, 'comedi': 31274, 'ran': 135967, 'time': 171635, "

In [48]:
str(vocab_by_idx)[:100]


Out[48]:
"{40948: 'dialogu wooden', 155508: 'situat two', 92122: 'less make', 110635: 'movi theatr', 3549: 'ac"

In [49]:
idx = 5
print("Content:\n", X_train[idx])
row = X_train_tfidf[idx]
terms = [(vocab_by_idx[idx], row.data[i])
             for i, idx in enumerate(row.indices)]
pd.Series(dict(terms)).sort_values(ascending = False)


Content:
 this isn t the comedic robin williams nor is it the quirky insane robin williams of recent thriller fame this is a hybrid of the classic drama without over dramatization mixed with robin s new love of the thriller but this isn t a thriller per se this is more a mystery suspense vehicle through which williams attempts to locate a sick boy and his keeper also starring sandra oh and rory culkin this suspense drama plays pretty much like a news report until william s character gets close to achieving his goal i must say that i was highly entertained though this movie fails to teach guide inspect or amuse it felt more like i was watching a guy williams as he was actually performing the actions from a third person perspective in other words it felt real and i was able to subscribe to the premise of the story all in all it s worth a watch though it s definitely not friday saturday night fare it rates a 7 7 10 from the fiend 
Out[49]:
william         0.2851
robin           0.2244
robin william   0.1723
thriller        0.1661
7 7             0.1222
                 ...  
stori           0.0269
charact         0.0264
get             0.0254
not             0.0188
movi            0.0180
Length: 131, dtype: float64

In [50]:
idx = 50
row = X_train_tfidf[idx]
terms = [(vocab_by_idx[idx], row.data[i])
                 for i, idx in enumerate(row.indices)]
top_terms= list(pd.Series(dict(terms))\
                .sort_values(ascending = False)[:50].index)
wc = WordCloud(background_color="white", 
    width=500, height=500, max_words=50).generate("+".join(top_terms))
plt.figure(figsize=(10, 10))
plt.imshow(wc)
plt.axis("off");


Build a Pipeline for the Classification Model


In [51]:
%%time
tfidf =feature_extraction.text.TfidfVectorizer(
              tokenizer=my_tokenizer
            , stop_words = stopwords
            , ngram_range=(1, 2)
        )

pipe = pipeline.Pipeline([
    ("tfidf", tfidf),
    ("est", linear_model.LogisticRegression(C = 1.0, random_state = 1
                            , n_jobs = 8, solver="saga"))
])
pipe.fit(X_train, y_train)


/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'s", 'could', 'might', 'must', "n't", 'need', 'sha', 'wo', 'would'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
CPU times: user 1min 15s, sys: 454 ms, total: 1min 15s
Wall time: 1min 13s
Out[51]:
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function my_tokenizer at 0x1a25b347a0>,
                                 use_idf=True, vocabulary=None)),
                ('est',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=8, penalty='l2',
                                    random_state=1, solver='saga', tol=0.0001,
                                    verbose=0, warm_start=False))],
         verbose=False)

In [52]:
import pickle

In [53]:
with open("/tmp/model.pkl", "wb") as f:
    pickle.dump(pipe, f)

In [54]:
!ls -lh /tmp/model.pkl


-rw-r--r--  1 abasar  wheel   125M Oct 25 12:05 /tmp/model.pkl

In [55]:
with open("/tmp/model.pkl", "rb") as f:
    model = pickle.load(f)

In [56]:
doc1 = """when we started watching this series on 
cable i had no idea how addictive it would be 
even when you hate a character you hold back because 
they are so beautifully developed you can almost
understand why they react to frustration fear greed 
or temptation the way they do it s almost as if the
viewer is experiencing one of christopher s learning 
curves i can t understand why adriana would put up with 
christopher s abuse of her verbally physically and 
emotionally but i just have to read the newspaper to 
see how many women can and do tolerate such behavior 
carmella has a dream house endless supply of expensive 
things but i m sure she would give it up for a loving 
and faithful husband or maybe not that s why i watch 
it doesn t matter how many times you watch an episode
you can find something you missed the first five times 
we even watch episodes out of sequence watch season 1 
on late night with commercials but all the language a e 
with language censored reruns on the movie network whenever 
they re on we re there we ve been totally spoiled now i also 
love the malaprop s an albacore around my neck is my favorite of
johnny boy when these jewels have entered our family vocabulary 
it is a sign that i should get a life i will when the series
ends and i have collected all the dvd s and put the collection 
in my will"""
doc1 = preprocess(doc1)

In [57]:
model.predict_proba(np.array([doc1]))[:, 1]


/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'s", 'could', 'might', 'must', "n't", 'need', 'sha', 'wo', 'would'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
Out[57]:
array([0.7886])

Hashing Vectorizer

HashingVectorizer converts a collection of text documents to a matrix of token occurrences using a deterministic hash function (MurmurHash3).

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.

Advantages

  • it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle, as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit (see the sketch after this list)

Disadvantages

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.
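
A minimal sketch of the streaming use case mentioned above: because HashingVectorizer is stateless, each chunk of documents can be transformed independently and fed to an estimator that supports partial_fit (here SGDClassifier with a logistic loss; for brevity the default tokenizer is used instead of my_tokenizer):


In [ ]:
# Stateless vectorizer + incremental learner: no vocabulary to build,
# so chunks can be processed one at a time without seeing the full corpus
streaming_vectorizer = feature_extraction.text.HashingVectorizer(n_features=2 ** 18)
sgd = linear_model.SGDClassifier(loss="log", random_state=1)
chunk_size = 5000
for start in range(0, len(X_train), chunk_size):
    X_chunk = streaming_vectorizer.transform(X_train[start:start + chunk_size])
    sgd.partial_fit(X_chunk, y_train[start:start + chunk_size], classes=[0, 1])
print("Test accuracy:", sgd.score(streaming_vectorizer.transform(X_test), y_test))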

In [58]:
hashing_vectorizer = feature_extraction.text.HashingVectorizer(n_features=2 ** 3
                                            , tokenizer=my_tokenizer, ngram_range=(1, 2))

corpus = ["Today is Wednesday"
          , "Delhi weather is hot today."
          , "Delhi roads are not busy in the morning"]

doc_term_matrix = hashing_vectorizer.fit_transform(corpus)

pd.DataFrame(doc_term_matrix.toarray()) # Each cell is normalized (l2) row-wise


Out[58]:
0 1 2 3 4 5 6 7
0 0.0000 0.5774 0.0000 0.5774 0.0000 0.0000 -0.5774 0.0000
1 0.0000 0.3015 0.6030 0.3015 0.0000 -0.6030 -0.3015 0.0000
2 0.0000 -0.4472 0.0000 0.0000 0.4472 -0.4472 -0.4472 0.4472

In [59]:
%%time

n_features = int(X_train_tfidf.shape[1] * 0.8)

hashing_vectorizer = feature_extraction.text.HashingVectorizer(n_features=n_features
                                         , tokenizer=my_tokenizer, ngram_range=(1, 2))
X_train_hash = hashing_vectorizer.fit_transform(X_train)
X_test_hash = hashing_vectorizer.transform(X_test)


CPU times: user 2min 6s, sys: 389 ms, total: 2min 6s
Wall time: 1min 57s

In [60]:
X_train_hash


Out[60]:
<25000x193057 sparse matrix of type '<class 'numpy.float64'>'
	with 5383300 stored elements in Compressed Sparse Row format>

In [61]:
X_train_hash.shape, X_test_hash.shape


Out[61]:
((25000, 193057), (25000, 193057))

In [62]:
print(X_train_hash.data.nbytes / (1024.0 ** 3), "GB")


0.040108710527420044 GB

In [63]:
%%time
lr = linear_model.LogisticRegression(C = 1.0, random_state = 1,
                            solver = "liblinear")
lr.fit(X_train_hash, y_train)
y_train_pred = lr.predict(X_train_hash)
y_test_pred = lr.predict(X_test_hash)
print("Training accuracy: ", metrics.accuracy_score(y_train, y_train_pred))
print("Test accuracy: ", metrics.accuracy_score(y_test, y_test_pred))


Training accuracy:  0.92204
Test accuracy:  0.87676
CPU times: user 8.62 s, sys: 65.9 ms, total: 8.69 s
Wall time: 1.51 s

In [64]:
print(metrics.classification_report(y_test, y_test_pred))


              precision    recall  f1-score   support

           0       0.88      0.87      0.88     12500
           1       0.87      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

