The sentiment analysis program we wrote earlier (in Session 1) uses a non-machine-learning approach. That is, it relies on a fixed definition of which words carry good or bad sentiment and assumes that every word it needs is already listed in the word_sentiment.csv file.
Machine learning (ML) is a class of data-driven algorithms: unlike "normal" algorithms, it is the data that "tells" us what the "good answer" is. A machine learning algorithm has no hard-coded definition of good and bad sentiment; instead it "learns by example". That is, you show it several words that have been labelled as good or bad sentiment, and a good ML algorithm will eventually learn to predict whether an unseen word has a good or bad sentiment. This particular kind of sentiment analysis is "supervised", which means your example words must be labelled, i.e. they must explicitly say which words are good and which are bad.
In unsupervised learning, on the other hand, the example words are not labelled. In that case the algorithm cannot "invent" what a good sentiment is, but it can try to cluster the data into groups, e.g. it can figure out that words appearing close to certain words differ from words appearing close to others (e.g. words close to the word "mother" are most likely good). There are also "intermediate" forms of supervision, namely semi-supervised and active learning. Technically these are supervised methods that use some "smart" way to avoid needing a large number of labelled examples.
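Concretely, a supervised learner's training data is just a collection of examples paired with labels, while an unsupervised learner sees the examples alone. A minimal sketch (these words and labels are made up for illustration):
In [ ]:
# Hypothetical labelled examples for supervised sentiment learning: each
# training example pairs an input (a word) with its sentiment label.
labeled_examples = [('love', 'positive'), ('hate', 'negative'),
                    ('wonderful', 'positive'), ('terrible', 'negative')]
# A supervised learner trains on both the words and the labels;
# an unsupervised learner would only see the unlabelled words below.
unlabeled_words = [word for (word, label) in labeled_examples]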
The NLTK module is built for working with language data. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning. We will use NLTK's naive Bayes classifier to classify words as carrying either positive or negative sentiment. You can also use other modules specifically meant for ML, e.g. the sklearn module.
In [ ]:
def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    start_letter = word[0]
    last_letter = word[-1]
    return {'start_letter': start_letter, 'last_letter': last_letter}

def main():
    print(feature_extractor('poonacha'))

main()
In [ ]:
import csv

def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    start_letter = word[0]
    last_letter = word[-1]
    return {'start_letter': start_letter, 'last_letter': last_letter}

def ML_train(sentiment_corpus):
    """Create a feature set from the corpus given to it."""
    feature_set = []
    with open(sentiment_corpus, 'rt', encoding='utf-8') as sentobj:
        sentiment_handle = csv.reader(sentobj)
        for sentiment in sentiment_handle:
            new_row = []
            new_row.append(feature_extractor(sentiment[0]))  # get the dictionary of features for a word
            if int(sentiment[1]) >= 0:  # collapse the sentiment values (-5 to +5) into just positive or negative
                new_row.append('positive')
            else:
                new_row.append('negative')
            feature_set.append(new_row)
    print(feature_set)

def main():
    sentiment_csv = "C:/Users/kmpoo/Dropbox/HEC/Teaching/Python for PhD Mar 2018/python4phd/Session 3/Sent/word_sentiment.csv"
    ML_train(sentiment_csv)

main()
We will split the feature set into a training set and a test set. The training set is used to train our ML model, and the test set is then used to check how good the model is. It is common to hold out about 20% of the data for testing. In our case we will use the first 1500 words for training and the rest for testing.
In [ ]:
import csv
import random

def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    start_letter = word[0]
    last_letter = word[-1]
    return {'start_letter': start_letter, 'last_letter': last_letter}

def ML_train(sentiment_corpus):
    """Create a feature set from the corpus given to it. Split the feature set into training and testing sets."""
    feature_set = []
    with open(sentiment_corpus, 'rt', encoding='utf-8') as sentobj:
        sentiment_handle = csv.reader(sentobj)
        for sentiment in sentiment_handle:
            new_row = []
            new_row.append(feature_extractor(sentiment[0]))  # get the dictionary of features for a word
            if int(sentiment[1]) >= 0:  # collapse the sentiment values (-5 to +5) into just positive or negative
                new_row.append('positive')
            else:
                new_row.append('negative')
            feature_set.append(new_row)
    # We need to shuffle the features since word_sentiment.csv has its words in alphabetical order
    random.shuffle(feature_set)
    train_set = feature_set[:1500]  # the first 1500 words become our training set
    test_set = feature_set[1500:]
    print(len(test_set))

def main():
    sentiment_csv = "C:/Users/kmpoo/Dropbox/HEC/Teaching/Python for PhD Mar 2018/python4phd/Session 3/Sent/word_sentiment.csv"
    ML_train(sentiment_csv)

main()
In [ ]:
import csv
import random
import nltk

def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    start_letter = word[0]
    last_letter = word[-1]
    return {'start_letter': start_letter, 'last_letter': last_letter}

def ML_train(sentiment_corpus):
    """Create a feature set from the corpus given to it. Split the feature set into training and testing sets.
    Train the classifier using the naive Bayes model and return the classifier."""
    feature_set = []
    with open(sentiment_corpus, 'rt', encoding='utf-8') as sentobj:
        sentiment_handle = csv.reader(sentobj)
        for sentiment in sentiment_handle:
            new_row = []
            new_row.append(feature_extractor(sentiment[0]))  # get the dictionary of features for a word
            if int(sentiment[1]) >= 0:  # collapse the sentiment values (-5 to +5) into just positive or negative
                new_row.append('positive')
            else:
                new_row.append('negative')
            feature_set.append(new_row)
    # We need to shuffle the features since word_sentiment.csv has its words in alphabetical order
    random.shuffle(feature_set)
    train_set = feature_set[:1500]  # the first 1500 words become our training set
    test_set = feature_set[1500:]
    # Note: to train the classifier we need to provide ONLY the dictionary of features and the label
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier

def main():
    sentiment_csv = "C:/Users/kmpoo/Dropbox/HEC/Teaching/Python for PhD Mar 2018/python4phd/Session 3/Sent/word_sentiment.csv"
    classifier = ML_train(sentiment_csv)
    input_word = input('Enter a word ').lower()
    sentiment = classifier.classify(feature_extractor(input_word))
    print('Sentiment of word "', input_word, '" is : ', sentiment)

main()
Next we find out how good the model is at predicting the labels. Make sure the test set is distinct from the training corpus: if we simply re-used the training set as the test set, a model that merely memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores. The function nltk.classify.accuracy() calculates the accuracy of a classifier model on a given test set.
In [ ]:
import csv
import random
import nltk

def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    start_letter = word[0]
    last_letter = word[-1]
    return {'start_letter': start_letter, 'last_letter': last_letter}

def ML_train(sentiment_corpus):
    """Create a feature set from the corpus given to it. Split the feature set into training and testing sets.
    Train the classifier using the naive Bayes model and return the classifier."""
    feature_set = []
    with open(sentiment_corpus, 'rt', encoding='utf-8') as sentobj:
        sentiment_handle = csv.reader(sentobj)
        for sentiment in sentiment_handle:
            new_row = []
            new_row.append(feature_extractor(sentiment[0]))  # get the dictionary of features for a word
            if int(sentiment[1]) >= 0:  # collapse the sentiment values (-5 to +5) into just positive or negative
                new_row.append('positive')
            else:
                new_row.append('negative')
            feature_set.append(new_row)
    # We need to shuffle the features since word_sentiment.csv has its words in alphabetical order
    random.shuffle(feature_set)
    train_set = feature_set[:1500]  # the first 1500 words become our training set
    test_set = feature_set[1500:]
    # Note: to train the classifier we need to provide ONLY the dictionary of features and the label
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print('Test accuracy of the classifier = ', nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features()  # prints its own output, so it needs no print() around it
    return classifier

def main():
    sentiment_csv = "C:/Users/kmpoo/Dropbox/HEC/Teaching/Python for PhD Mar 2018/python4phd/Session 3/Sent/word_sentiment.csv"
    classifier = ML_train(sentiment_csv)
    input_word = input('Enter a word ').lower()
    sentiment = classifier.classify(feature_extractor(input_word))
    print('Sentiment of word "', input_word, '" is : ', sentiment)

main()
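As noted above, modules built specifically for ML, such as sklearn, can do the same job. Below is a minimal sketch (not part of the original notebook) of the same naive Bayes idea in sklearn, reusing the feature_set list of [features, label] pairs built in ML_train; DictVectorizer one-hot encodes our feature dictionaries into the numeric arrays sklearn expects.
In [ ]:
# A sketch of the same classification with scikit-learn instead of NLTK,
# assuming feature_set is the shuffled list of [features, label] pairs.
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

def sklearn_train(feature_set):
    features = [row[0] for row in feature_set]  # the feature dictionaries
    labels = [row[1] for row in feature_set]    # 'positive' / 'negative'
    vectorizer = DictVectorizer()               # one-hot encode the dict features
    X = vectorizer.fit_transform(features)
    classifier = BernoulliNB()
    classifier.fit(X[:1500], labels[:1500])     # train on the first 1500 words
    print('Test accuracy =', classifier.score(X[1500:], labels[1500:]))
    return vectorizer, classifier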
In [ ]:
#Enter code here
#
Using a separate dev-test set, we can generate a list of the errors the classifier makes when predicting the sentiment. We can then examine the individual cases where the model predicted the wrong label and try to determine what additional information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.
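A minimal sketch of such an error analysis, assuming the same word_sentiment.csv and features as above (the 1000/500 train/dev-test split sizes are illustrative, not from the original notebook):
In [ ]:
import csv
import random
import nltk

def feature_extractor(word):
    """Extract the features for a given word and return a dictionary of the features"""
    return {'start_letter': word[0], 'last_letter': word[-1]}

def error_analysis(sentiment_corpus):
    """Train on one slice of the corpus, then list the dev-test words the
    classifier gets wrong, so we can look for better features."""
    labeled_words = []
    with open(sentiment_corpus, 'rt', encoding='utf-8') as sentobj:
        for sentiment in csv.reader(sentobj):
            label = 'positive' if int(sentiment[1]) >= 0 else 'negative'
            labeled_words.append((sentiment[0], label))
    random.shuffle(labeled_words)
    train_words = labeled_words[:1000]        # illustrative split sizes
    devtest_words = labeled_words[1000:1500]  # kept as (word, label) so we can inspect the word
    train_set = [(feature_extractor(word), label) for (word, label) in train_words]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    errors = []
    for (word, label) in devtest_words:
        guess = classifier.classify(feature_extractor(word))
        if guess != label:
            errors.append((label, guess, word))
    for (label, guess, word) in sorted(errors):
        print('correct = {:<8} guess = {:<8} word = {}'.format(label, guess, word))

def main():
    sentiment_csv = "C:/Users/kmpoo/Dropbox/HEC/Teaching/Python for PhD Mar 2018/python4phd/Session 3/Sent/word_sentiment.csv"
    error_analysis(sentiment_csv)

main()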