Data Preprocessing

Use a small subset of the data to experiment with preprocessing and feature extraction. Test the csv module and look at the data.


In [ ]:
import csv
import re

# Peek at the subset: print the label and the text of each row.
with open("SAsubset.csv", "r") as subsetData:
    for row in csv.DictReader(subsetData):
        print(row['Sentiment'], row['SentimentText'])

Typical noisy data

  • escape characters
  • URLs
  • @handles
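The cleanup later in this notebook only strips punctuation, so URLs and @handles survive as word fragments. If you want to remove them explicitly, here is a minimal sketch (the `strip_noise` helper and its regex patterns are illustrative, not part of this notebook's pipeline):

```python
import re

def strip_noise(text):
    # Drop URLs (http/https/www) and @handles, then collapse whitespace.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_noise("@bob check http://t.co/xyz now!"))  # -> check now!
```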

In [ ]:
def getData(csvFname):
    # Collect sentiment labels and tweet texts into parallel lists.
    sent = []
    tweet = []
    with open(csvFname, "r") as dataSource:
        for row in csv.DictReader(dataSource):
            sent.append(row['Sentiment'])
            tweet.append(row['SentimentText'])
    return sent, tweet

In [ ]:
sent, tweet = getData("SAsubset.csv")

# scipy.stats.itemfreq was removed in SciPy 1.3; np.unique gives the same counts.
import numpy as np
np.unique(sent, return_counts=True)

In [ ]:
tweet

Ballpark preprocessing: unescape HTML entities, lowercase, remove all punctuation.


In [ ]:
tweet[15]

In [ ]:
# HTMLParser.unescape is Python 2; html.unescape is the Python 3 equivalent.
import html
print(html.unescape(tweet[15]))

In [ ]:
import html
re.sub(r"[^\w\s]", " ", html.unescape(tweet[15])).lower()

Modify getData a little and use it on the 200K-tweet dataset.


In [ ]:
import html

def getData(csvFname):
    # Unescape HTML entities, keep only letters and whitespace, lowercase.
    corpus = []
    with open(csvFname, "r") as dataSource:
        for row in csv.DictReader(dataSource):
            try:
                corpus.append({"tweet": re.sub(r"[^a-zA-Z\s]", " ", html.unescape(row['SentimentText'])).lower(),
                               "sent": int(row['Sentiment'])})
            except (KeyError, ValueError):
                # Skip malformed rows rather than silently swallowing every error.
                continue
    return corpus
corpus = getData("SA200K.csv")

In [ ]:
print(len(corpus))
print(corpus[2])

Feature extraction

Convert the tweets to a bag-of-words (BOW) feature matrix, using only the default settings of CountVectorizer.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

In [ ]:
X = vectorizer.fit_transform([item['tweet'] for item in corpus])
X

In [ ]:
# X.toarray()  # converts to dense; too large to materialize for the full corpus

In [ ]:
vectorizer.get_feature_names_out()  # vocabulary terms (get_feature_names() in older scikit-learn)

In [ ]:
y = [item['sent'] for item in corpus]

Randomly split the X and y into training and test set


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1697)

In [ ]:
X_train

In [ ]:
X_test

In [ ]:
y_train

Training

Try to fit a naive Bayes classifier $$h_{\theta}(X)$$ Naive Bayes convergence rate (in the number of training examples): $$\sim O(\log{n})$$


In [ ]:
from sklearn.naive_bayes import MultinomialNB
hx_nb = MultinomialNB()

In [ ]:
hx_nb.fit(X_train, y_train)

In [ ]:
hx_nb.predict(X_train)

Evaluate the effectiveness of hx_nb using the F1 score


In [ ]:
from sklearn.metrics import confusion_matrix, f1_score

In [ ]:
print(confusion_matrix(y_train, hx_nb.predict(X_train)))
print(f1_score(y_train, hx_nb.predict(X_train)))

Do the same on the test set


In [ ]:
print(confusion_matrix(y_test, hx_nb.predict(X_test)))
print(f1_score(y_test, hx_nb.predict(X_test)))

Classify a new tweet


In [ ]:
newTweetFeatureVector = vectorizer.transform(["I feel so bad now. Let's go to hell!"])

In [ ]:
newTweetFeatureVector

In [ ]:
hx_nb.predict(newTweetFeatureVector)

In [ ]:
newTweetFeatureVector = vectorizer.transform(["scikit learn is so cool!"])
hx_nb.predict(newTweetFeatureVector)

In [ ]:
newTweetFeatureVector = vectorizer.transform(["I am feeling not good with scikit learn"])
hx_nb.predict(newTweetFeatureVector)

In [ ]:
hx_nb.predict_proba(newTweetFeatureVector)

Logistic regression with regularization (C is the inverse of the regularization strength: smaller C means stronger regularization). Convergence rate: $$\sim O(n)$$


In [ ]:
from sklearn.linear_model import LogisticRegression

In [ ]:
hx_log = LogisticRegression(C=0.6)

In [ ]:
hx_log.fit(X_train, y_train)

In [ ]:
confusion_matrix(y_train, hx_log.predict(X_train))

In [ ]:
print("Training set F1: %s" % f1_score(y_train, hx_log.predict(X_train)))
print("Test set F1: %s" % f1_score(y_test, hx_log.predict(X_test)))

Tuning

Tune the value of C in the LogisticRegression model above
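One way to tune C is a cross-validated grid search on the F1 score. A sketch on a tiny stand-in corpus (assumed here so the cell runs on its own; in the notebook, call `grid.fit(X_train, y_train)` instead, and note the candidate C values are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical mini-corpus standing in for the 200K tweets.
docs = ["good great love", "bad awful hate", "love it", "hate it",
        "so good", "so bad", "great stuff", "awful stuff"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
X_demo = CountVectorizer().fit_transform(docs)

# Search over a handful of C values, scoring each by cross-validated F1.
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 0.6, 1.0, 10.0]},
                    scoring="f1", cv=2)
grid.fit(X_demo, labels)
print(grid.best_params_, grid.best_score_)
```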

Bigram tokenization


In [ ]:
bigramvect = CountVectorizer(ngram_range = (1,2))

In [ ]:
X_bi = bigramvect.fit_transform([item['tweet'] for item in corpus])

In [ ]:
X_bi

In [ ]:
X

In [ ]:
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X_bi, y, test_size = 0.3, random_state=1697)

In [ ]:
bnb = MultinomialNB()
bi_nbhx = bnb.fit(X_train_bi, y_train_bi)

In [ ]:
confusion_matrix(y_train_bi, bi_nbhx.predict(X_train_bi))

In [ ]:
f1_score(y_train_bi, bi_nbhx.predict(X_train_bi))

In [ ]:
f1_score(y_test_bi, bi_nbhx.predict(X_test_bi))

In [ ]:
newTweetFeatureVector = bigramvect.transform(["I am feeling not good with scikit learn"])
bi_nbhx.predict(newTweetFeatureVector)[0]

Your move

Create a bot which talks back based on the sentiment of your input sentence.
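One possible starting point: wrap predict in a function that picks a canned reply. The mini training set, reply strings, and `sentiment_bot` name below are all illustrative; in the notebook, reuse the `vectorizer` and `hx_nb` trained above instead of this stand-in classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in training data so this cell runs on its own.
docs = ["i love this", "so great", "really good",
        "i hate this", "so bad", "really awful"]
labels = [1, 1, 1, 0, 0, 0]
vect = CountVectorizer().fit(docs)
clf = MultinomialNB().fit(vect.transform(docs), labels)

def sentiment_bot(text):
    # Predict sentiment (1 = positive, 0 = negative) and answer accordingly.
    label = clf.predict(vect.transform([text]))[0]
    return "Glad to hear it!" if label == 1 else "Sorry to hear that."

print(sentiment_bot("this is so great"))
```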


In [ ]: