Text Classification Assessment

This assessment closely mirrors the Text Classification Project we just completed, and it uses a very similar dataset.

The moviereviews2.tsv dataset contains the text of 6000 movie reviews: 3000 positive and 3000 negative, preprocessed into a tab-delimited file. As before, labels are given as pos and neg.

We've included 20 reviews that contain either NaN data or strings made up entirely of whitespace.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

Task #1: Perform imports and load the dataset into a pandas DataFrame

For this exercise you can load the dataset from '../TextFiles/moviereviews2.tsv'.


In [6]:
import numpy as np
import pandas as pd

data = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')

data.head()


Out[6]:
  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...

Task #2: Check for missing values:


In [9]:
# Check for NaN values:

data.isnull().sum()


Out[9]:
label      0
review    20
dtype: int64
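
If you'd like to see which rows are affected before dropping them, an optional check (just a sketch, assuming the column names above) is:

# Inspect the rows whose review text is missing
data[data['review'].isnull()]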

In [10]:
# Check for whitespace strings (it's OK if there aren't any!):

white_spaces = []

# itertuples() yields (index, label, review) for each row
for i, lb, rw in data.itertuples():
    if isinstance(rw, str):       # skip NaN entries, which are floats
        if rw.isspace():          # flag reviews that are pure whitespace
            white_spaces.append(i)

len(white_spaces)


Out[10]:
0
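
If you prefer to avoid the explicit loop, an equivalent vectorized check (a sketch; .str.isspace() returns NaN for missing entries, so we fill those with False) is:

# Vectorized alternative: count reviews that are pure whitespace
data['review'].str.isspace().fillna(False).sum()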

Task #3: Remove NaN values:


In [47]:
data.drop(white_spaces, inplace=True)  # drop any whitespace-only reviews (none were found above)
data.dropna(inplace=True)              # drop the 20 reviews with NaN text
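
A quick sanity check after the drops (a sketch): the NaN counts should now be zero, and with 20 bad rows removed the DataFrame should hold 5980 reviews.

# Verify the cleanup: expect zero NaNs and 5980 remaining rows
print(data.isnull().sum())
print(len(data))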

Task #4: Take a quick look at the label column:


In [48]:
data.groupby('label').count()


Out[48]:
       review
label
neg      2990
pos      2990

In [49]:
data['label'].value_counts()


Out[49]:
neg    2990
pos    2990
Name: label, dtype: int64

Task #5: Split the data into train & test sets:

You may use whatever settings you like. To compare your results to the solution notebook, use test_size=0.33 and random_state=42.


In [50]:
from sklearn.model_selection import train_test_split

X = data['review']
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train.head()
#print(X_train.shape, " ", y_train.shape)
#print(X_test.shape, " ", y_test.shape)


Out[50]:
192     Why do people who do not know what a particula...
4691    Drum scene is wild! Cook, Jr. is unsung hero o...
5398    For long time I haven't seen such a good fanta...
4646    Although it got some favorable press after pla...
5001    Not a bad word to say about this film really. ...
Name: review, dtype: object

Task #6: Build a pipeline to vectorize the data, then train and fit a model

You may use whatever model you like. To compare your results to the solution notebook, use LinearSVC.


In [51]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)


Out[51]:
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
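
TF-IDF weighting paired with a linear SVM is a strong, fast baseline for sentiment text. If you want to experiment, TfidfVectorizer also accepts a stop_words argument; a variant pipeline (just a sketch, not required to match the solution notebook) might look like:

# Variant pipeline that filters English stop words before TF-IDF weighting
text_clf_sw = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                        ('clf', LinearSVC())])
text_clf_sw.fit(X_train, y_train)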

Task #7: Run predictions and analyze the results


In [53]:
# Form a prediction set

predictions = text_clf.predict(X_test)

In [54]:
# Report the confusion matrix

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)


Out[54]:
array([[900,  91],
       [ 63, 920]])
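
Rows and columns follow the sorted class labels, so rows are the true neg/pos counts and columns the predicted ones. Wrapping the array in a DataFrame makes that explicit (a small sketch):

# Label the confusion matrix for readability: rows = actual, columns = predicted
pd.DataFrame(confusion_matrix(y_test, predictions),
             index=['neg', 'pos'], columns=['neg', 'pos'])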

In [56]:
# Print a classification report

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974


In [58]:
# Print the overall accuracy

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions) * 100)


92.19858156028369
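
Because the vectorizer is part of the fitted pipeline, you can score raw, unseen text directly (a sketch with a made-up review):

# Predict the sentiment of a brand-new review (hypothetical example text)
text_clf.predict(["This movie was an absolute joy to watch from start to finish."])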

Great job!