This assessment closely follows the Text Classification Project we just completed, and the dataset is very similar.
The moviereviews2.tsv dataset contains the text of 6000 movie reviews: 3000 are positive, 3000 are negative, and the text has been preprocessed and saved as a tab-delimited file. As before, labels are given as pos and neg.
We've included 20 reviews that are either NaN or consist only of whitespace.
For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/
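If you want to see the tab-delimited layout for yourself, a quick peek at the raw file works (a sketch, assuming the same file path used in the read_csv call below):

with open('../TextFiles/moviereviews2.tsv') as f:   # path assumed from the cell below
    for _ in range(2):
        print(f.readline().rstrip())                # header line, then the first review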
In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()
Out[1]:
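As a quick sanity check (my own addition, not part of the assessment), you can confirm the shape and column names match the description above:

# Expecting 6000 rows and 'label'/'review' columns, per the dataset description
print(df.shape)
print(df.columns.tolist())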
In [3]:
# Check for NaN values:
df.isnull().sum()
Out[3]:
In [2]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = [] # start with an empty list
for i, lb, rv in df.itertuples():    # iterate over the DataFrame
    if type(rv) == str:              # avoid NaN values
        if rv.isspace():             # test 'review' for whitespace
            blanks.append(i)         # add matching index numbers to the list
len(blanks)
Out[2]:
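An equivalent, more concise check (a vectorized sketch, not the approach used in the course) relies on pandas string methods; NaN entries compare as False, so they are skipped automatically:

# Count reviews that consist only of whitespace
blank_count = df['review'].str.isspace().eq(True).sum()
print(blank_count)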
In [4]:
df.dropna(inplace=True)
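If the whitespace check above had found any blank reviews, those rows would need to be dropped as well; dropna only removes the NaN entries. A minimal sketch (a no-op when blanks is empty):

# Remove any whitespace-only reviews found earlier
df.drop(blanks, inplace=True)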
In [5]:
df['label'].value_counts()
Out[5]:
In [6]:
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
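Because the classes are balanced, a plain random split works well here. If you want to guarantee the same pos/neg ratio in both splits, train_test_split accepts a stratify argument (an optional variation, not required by the assessment):

# Optional: preserve the pos/neg ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)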
In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)
Out[7]:
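For comparison (my own addition, not part of the original notebook), the same pipeline can be built around a naive Bayes classifier; LinearSVC typically edges it out on TF-IDF features, but it makes a useful baseline:

from sklearn.naive_bayes import MultinomialNB

# Alternative pipeline with a naive Bayes classifier as a baseline
nb_clf = Pipeline([('tfidf', TfidfVectorizer()),
                   ('clf', MultinomialNB()),
])
nb_clf.fit(X_train, y_train)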
In [8]:
# Form a prediction set
predictions = text_clf.predict(X_test)
In [9]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))
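To make the raw matrix easier to read, it can be wrapped in a labeled DataFrame (a presentation choice on my part, not part of the assessment):

# Label the confusion matrix rows/columns for readability
labels = ['neg', 'pos']
cm = metrics.confusion_matrix(y_test, predictions, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))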
In [10]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))
In [11]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
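As a final sanity check (my own addition, with a made-up review string), the fitted pipeline can classify new text directly:

# Classify an unseen review (example text is hypothetical)
print(text_clf.predict(["This movie was a wonderful surprise with a great cast."]))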