In this notebook, we use a number of conventional feature-engineering approaches for sentiment classification.
In [1]:
import pandas as pd
import numpy as np
import os
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [2]:
sentences = pd.read_csv(
    'imdb_labelled.txt', sep=r'\|', header=None, engine='python',
    names=['sentence', 'sentiment_label']
)
In [3]:
sentences.head()
Out[3]:
In [4]:
sentences.shape
Out[4]:
Next we check for label imbalance.
In [5]:
counts = sentences.sentiment_label.value_counts()
ax = counts.plot(kind='bar', rot=0, title='Sentences per label')
So the dataset has an almost equal number of sentences per label.
Next we analyze the distribution of words in the two classes of sentences.
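Since the rendered bar chart is not reproduced here, the class proportions can also be checked numerically. A small optional sketch, not executed above:
In [ ]:
# Proportion of sentences per label; values near 0.5 indicate a balanced dataset.
print(sentences.sentiment_label.value_counts(normalize=True))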
In [6]:
import nltk
nltk.download('punkt')  # word_tokenize relies on the punkt tokenizer models
from nltk.tokenize import word_tokenize
sentences = sentences.assign(word_list=sentences.sentence.apply(word_tokenize))
sentences.head()
Out[6]:
In [7]:
# Create a counter for each label and update the relevant counter as we iterate over the entire dataframe
from collections import Counter
counters = {label: Counter() for label in sentences.sentiment_label.unique()}
for _, row in sentences.iterrows():
    c = counters[row.loc['sentiment_label']]
    c.update(row.loc['word_list'])

def make_df(label, c):
    """Make a DataFrame out of the counter with words in the index."""
    df = pd.DataFrame.from_dict(c, orient='index')
    df.columns = ['sentiment_{}'.format(label)]
    return df

counter_dfs = {label: make_df(label, c) for label, c in counters.items()}
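As a sanity check on the per-label counters, we could peek at the top of each frequency DataFrame before any stop-word filtering. An optional sketch, not part of the run above:
In [ ]:
# Show the five most frequent tokens for each sentiment label.
for label, df in counter_dfs.items():
    top = df.sort_values(by='sentiment_{}'.format(label), ascending=False)
    print(label)
    print(top.head())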
Now we look at the high-frequency and low-frequency words for sentiment label 0.
In [8]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))
print(english_stopwords)
In [9]:
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
sent_0_counts = counter_dfs[0].sort_values(by='sentiment_0', ascending=False)
sent_0_counts = sent_0_counts.drop(english_stopwords, errors='ignore')
sent_0_counts.head(30).plot(kind='barh', title='30 most frequent words in sentiment-0.', ax=ax[0], fontsize=14)
sent_0_counts.tail(30).plot(kind='barh', title='30 least frequent words in sentiment-0.', ax=ax[1], fontsize=14)
plt.tight_layout()
High-frequency and low-frequency words for sentiment label 1.
In [10]:
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
sent_1_counts = counter_dfs[1].sort_values(by='sentiment_1', ascending=False)
sent_1_counts = sent_1_counts.drop(english_stopwords, errors='ignore')
sent_1_counts.head(30).plot(kind='barh', title='30 most frequent words in sentiment-1.', ax=ax[0], fontsize=14)
sent_1_counts.tail(30).plot(kind='barh', title='30 least frequent words in sentiment-1.', ax=ax[1], fontsize=14)
plt.tight_layout()
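The two sets of bar charts can also be compared directly by joining the per-label frequency tables; words whose frequencies differ most between the labels are the most indicative ones. A rough sketch using the variables defined above, not executed here:
In [ ]:
# Join the per-label counts and use a smoothed frequency ratio as a crude polarity score.
freq = counter_dfs[0].join(counter_dfs[1], how='outer').fillna(0)
freq = freq.drop(english_stopwords, errors='ignore')
freq['ratio'] = (freq['sentiment_1'] + 1) / (freq['sentiment_0'] + 1)
print(freq.sort_values(by='ratio', ascending=False).head(10))  # words skewed towards label 1
print(freq.sort_values(by='ratio', ascending=True).head(10))   # words skewed towards label 0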
In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(alpha=1E-3, random_state=1234, max_iter=1000)
feature_calc = TfidfVectorizer(stop_words=english_stopwords)
pipeline = make_pipeline(feature_calc, model)
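Before committing to a single train/test split, a cross-validated accuracy estimate of the whole pipeline gives a quick sanity check. An optional sketch, not run above:
In [ ]:
# 5-fold cross-validated accuracy of the TF-IDF + SGDClassifier pipeline.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, sentences.sentence.values, sentences.sentiment_label.values, cv=5)
print('mean accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))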
In [12]:
X_train = sentences.sample(frac=0.7, replace=False, random_state=1234).loc[:, 'sentence']
Y_train = sentences.sentiment_label.loc[X_train.index]
In [13]:
pipeline = pipeline.fit(X_train.values, Y_train.values)
In [14]:
X_test = sentences.drop(X_train.index, axis=0).loc[:, 'sentence']
Y_test = sentences.sentiment_label.loc[X_test.index]
predictions = pipeline.predict(X_test.values)
In [15]:
from sklearn.metrics import classification_report
results = classification_report(Y_test.values, predictions)
print(results)
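Beyond the per-class precision and recall above, a confusion matrix makes the error types explicit. An optional follow-up, not executed here:
In [ ]:
# Rows are true labels, columns are predicted labels.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test.values, predictions))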