Natural language processing is a booming field in the finance industry because of the massive amount of user-generated data that has recently become available for analysis. Websites such as Twitter, StockTwits, and Facebook, if analyzed correctly, can yield very relevant insights for predictions. Natural language processing can be a daunting task because of how many layers of analysis you can add and the immense size of the English language. NLTK is a Python module and data repository that serves as a tool for easy language processing. This paper proposes a simple method of deriving either a positive or negative sentiment orientation and highlights a few of the useful tools NLTK offers.
First, we need to download the NLTK data, which is achieved by running nltk.download(). You can choose to download all text examples, or download just the movie reviews used in this example into the corpus folder.
In [125]:
import nltk
#nltk.download()
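If you prefer not to open the interactive downloader, you can also fetch individual packages by name; the identifiers below are the standard NLTK package names for the data used in this notebook.
In [ ]:
# download only the corpora and models this notebook uses
for pkg in ['movie_reviews', 'wordnet', 'stopwords', 'punkt']:
    nltk.download(pkg)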
WordNet is an NLTK subpackage that links words with their respective synonyms, antonyms, parts of speech, and definitions. This is a very powerful tool, because building a similar resource from scratch would require a significant amount of database work and organization.
In [126]:
from nltk.corpus import wordnet
# example word; you may change this to any word that has synonyms and antonyms
test_word = 'good'
syns = wordnet.synsets(test_word)
# a lemma is a canonical word form within a synset; lemmas act like synonym entries
print(syns[1].lemmas()[1].name())  # one lemma from the second synset of 'good'
Here we verify that we are using the sense of the word we think we are by pulling the definition of the first synset.
In [127]:
print(syns[0].definition())
We can also test the word using the .examples() method, which yields example sentences for the word in question.
In [128]:
#examples
print(syns[0].examples())
For each word, we can build a comprehensive list of all synonyms and antonyms with a for loop.
In [129]:
synonyms = []
antonyms = []
for syn in wordnet.synsets(test_word):
    for l in syn.lemmas():
        # gather all lemmas of each synset in the synonym set
        synonyms.append(l.name())
        if l.antonyms():
            # gather the first antonym of each lemma, if one exists
            antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
In the English language, there are many ways of expressing an idea. People often assume that swapping in a synonym leaves the meaning of a sentence unchanged. Under many circumstances this is true; however, to the computer, a slight word change can make a big difference in the returned list of lemmas and antonyms. One way to determine the similarity between two words, and so make sure any syntax changes we make don't alter the meaning, is the Wu and Palmer measure of semantic similarity, available through the wup_similarity method.
In [130]:
# yields % similarity of the words
w1 = wordnet.synset('ship.n.01')  # 'ship', the noun sense, 1st entry of similar words
w2 = wordnet.synset('boat.n.01')  # boat
print('word similarity is', w1.wup_similarity(w2) * 100, '%')
In [131]:
w1 = wordnet.synset('ship.n.01')  # 'ship', the noun sense, 1st entry of similar words
w2 = wordnet.synset('car.n.01')  # car
print('word similarity is', w1.wup_similarity(w2) * 100, '%')
In [132]:
w1 = wordnet.synset('ship.n.01')  # 'ship', the noun sense, 1st entry of similar words
w2 = wordnet.synset('cat.n.01')  # cat
print('word similarity is', w1.wup_similarity(w2) * 100, '%')
TurnItIn as a use case
Many students try to buy essays online. The services that sell those papers often use a form of natural language processing to swap words with synonyms. The method above could detect whether that has been occurring by gauging similarity to other papers.
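Below is a minimal sketch of that idea, assuming a crude scoring heuristic of our own design (not an established plagiarism metric): score each recognizable word in one sentence by its best wup_similarity match in the other, then average. The sentence pair is hypothetical.
In [ ]:
def sentence_similarity(s1, s2):
    # keep only words WordNet recognizes, taking each word's first synset
    syns1 = [s[0] for s in (wordnet.synsets(w) for w in s1.split()) if s]
    syns2 = [s[0] for s in (wordnet.synsets(w) for w in s2.split()) if s]
    scores = []
    for a in syns1:
        # best Wu-Palmer match in the other sentence (None counts as 0)
        scores.append(max((a.wup_similarity(b) or 0) for b in syns2))
    return sum(scores) / len(scores)

# two paraphrased sentences should score noticeably higher than unrelated ones
print(sentence_similarity('the ship sailed home', 'the boat traveled home'))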
Now we are going to begin developing our module for determining sentiment. We will achieve this by tokenizing the movie reviews in our set. Tokenizing is the process of splitting blocks of text into lists of individual words. By doing this, we can determine whether the occurrence of a particular word can be used as an indicator of positive or negative sentiment. When doing this, we hope to see results that do NOT rely on non-substantial words such as articles and conjunctions (e.g., 'the', 'an', 'or', 'and').
This method does not have to be used only for determining sentiment. Other potential use cases include determining the subject of a block of text, or inferring an author's origin from indicators of local slang. There are two main benefits to using this method.
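Before building the full feature extractor, here is a quick look at tokenizing in isolation: NLTK's word_tokenize (which depends on the punkt tokenizer data downloaded above) splits raw text into word and punctuation tokens. The review snippet is made up for illustration.
In [ ]:
from nltk.tokenize import word_tokenize
# a hypothetical review snippet, split into individual tokens
print(word_tokenize("The movie was surprisingly good, not stupid at all."))
# ['The', 'movie', 'was', 'surprisingly', 'good', ',', 'not', 'stupid', 'at', 'all', '.']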
In [133]:
import random
from nltk.corpus import movie_reviews  # 2,000 movie reviews labeled positive or negative
documents = [(list(movie_reviews.words(fileid)), category)  # list of (words, label) tuples for features
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
In [134]:
#documents[1] # prints the first movie review in tokenized format
In [135]:
random.shuffle(documents)  # shuffle so the reviews are not ordered by label
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)  # build a frequency distribution
print(all_words.most_common(10))  # prints (word, frequency) for the 10 most common words
print("stupid occurred", all_words['stupid'], 'times')  # frequency of 'stupid'
Above, we created a list all_words of all the words used, turned it into a frequency distribution, printed the 10 most common words, and then printed how many times the word 'stupid' occurred. Notice that the most common entries are all punctuation, articles, prepositions, and other non-substantial words such as 'the' and 'that', as well as a hyphen.
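One way to screen such words out, noted here as an aside rather than applied to the feature set below, is NLTK's built-in English stop word list (part of the stopwords corpus downloaded earlier).
In [ ]:
from nltk.corpus import stopwords
# drop stop words and punctuation from the frequency distribution's keys
stop_words = set(stopwords.words('english'))
content_words = [w for w in all_words if w.isalpha() and w not in stop_words]
print(content_words[:10])  # a sample of the remaining substantive words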
Below, we are going to create our training set and testing set. First, we pull the keys out of our all_words frequency distribution. Since all_words maps each word to its frequency, this gives us each word. Our feature set, which is the raw data transformed to show our distinguishing features, records which of those words appear in each review.
In [136]:
word_features = list(all_words.keys())[:3000]  # first 3,000 distinct words (note: insertion order, not frequency order)
def find_features(document):
    words = set(document)  # a set keeps each unique word, discarding repeat counts
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
featuresets = [(find_features(rev), category) for (rev, category) in documents]
'''
Here we create a testing and training set by arbitrarily splitting
the 2,000 feature sets (one per review)
'''
training_set = featuresets[1900:]
testing_set = featuresets[:1900]
featuresets[0][1]
# indexing the (features, label) tuple gives the second entry: 'pos' or 'neg'
First, we are going to test a simple NaiveBayesClassifier provided by NLTK. We will have it return the prediction accuracy and the most informative features. We hope to see two things that demonstrate the effectiveness of this approach.
In [137]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("original naive bayes algo accuracy:", nltk.classify.accuracy(classifier, testing_set) * 100)
classifier.show_most_informative_features(15)
As you can see above, the Naive Bayes classifier is not a very accurate algorithm on its own. To increase accuracy, we are going to test the data on as many different classifier methods as possible. From there, we will define a class that creates a voting system in which each classifier votes; the outcome is the majority vote. For example, if 5 of 7 classifiers say positive, we classify positive. We will also print the classification and the confidence of the ensemble.
In [138]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB , BernoulliNB
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
In [139]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
In [140]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)
BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
#Another possible test, but is known to be largely inaccurate
#SVC_classifier = SklearnClassifier(SVC())
#SVC_classifier.train(training_set)
#print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
Above, we printed each classifier's accuracy. Despite their individual accuracies, it is important to note that they were generally in agreement. Below, we print the confidence, which is calculated as the share of classifiers in agreement. This smooths over cases where an individual classifier was weak at predicting a certain scenario. However, because every classifier is weighted equally, an individual classifier that is more accurate than the rest can still be outvoted by the majority.
Accuracy optimization depends on the data set in question. To optimize, you should run all the individual classifiers and then selectively remove those that fail to meet sufficient accuracy. For further investigation, one could group the positive and negative reviews separately in order to test accuracy on each class alone; you may find that some classifiers are biased toward one side and can be removed for better overall accuracy. Regardless of any such investigation, you will also likely be able to increase the accuracy and applicability of the algorithm by using a larger training set. Remember that, in practice, a larger training set increases the time and processing power necessary for completion, which matters when speed of execution is important.
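As a sketch of that per-class check (the grouping logic here is our own, reusing the (features, label) structure of testing_set and the Naive Bayes classifier trained above):
In [ ]:
# split the testing set by label and measure accuracy on each group separately
pos_set = [(feats, label) for (feats, label) in testing_set if label == 'pos']
neg_set = [(feats, label) for (feats, label) in testing_set if label == 'neg']
print("accuracy on positive reviews:", nltk.classify.accuracy(classifier, pos_set) * 100)
print("accuracy on negative reviews:", nltk.classify.accuracy(classifier, neg_set) * 100)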
In [141]:
voted_classifier = VoteClassifier(classifier,
NuSVC_classifier,
LinearSVC_classifier,
SGDClassifier_classifier,
MNB_classifier,
BernoulliNB_classifier,
LogisticRegression_classifier)
print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)
In [142]:
#Example Classifications of the first two tests
print("Classification 0:", voted_classifier.classify(testing_set[0][0]), "Confidence %:",voted_classifier.confidence(testing_set[0][0])*100)
print("Classification 1:", voted_classifier.classify(testing_set[1][0]), "Confidence %:",voted_classifier.confidence(testing_set[1][0])*100)
In [143]:
def make_confidence(number_of_tests):
    confidence = []
    for x in range(0, number_of_tests):
        confidence.append(voted_classifier.confidence(testing_set[x][0]) * 100)
    return confidence

import matplotlib.pyplot as plt
y = make_confidence(1000)  # use the first 1000 tests
x = range(0, len(y))
# create a dictionary mapping each confidence level to its frequency
count_data = dict((i, y.count(i)) for i in y)
Now we can look at the distribution of confidence, along with its mean and standard deviation, to get an idea of the ensemble's reliability.
In [144]:
%matplotlib inline
from decimal import Decimal
import collections
#Sort the Dictionary
od = collections.OrderedDict(sorted(count_data.items()))
plt.style.use('fivethirtyeight')
plt.bar(range(len(od)), od.values(), align='center')
plt.xticks(range(len(od)), [float(Decimal("%.2f" % key)) for key in od.keys()])
plt.show()
This data is interesting. First, the x labels show only four possible values. That is because the voting process decides by simple majority, which in this case means at least 4 out of 7 votes.
57.14% represents 4/7, 71.43% is 5/7, 85.71% is 6/7, and 100.0% is 7/7.
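Those percentages follow directly from the vote counts:
In [ ]:
# confidence = (votes for the majority class) / (7 classifiers)
for votes in range(4, 8):
    print(f"{votes}/7 = {votes / 7 * 100:.2f}%")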
In [145]:
labels = ['4/7', '5/7', '6/7', '7/7']
# use the sorted dictionary od so the slices line up with the labels
plt.pie(list(od.values()), labels=labels, autopct='%1.1f%%')
plt.show()
This shows that all 7 tests agree only a small fraction of the time, while the largest share of the time only 5 of 7 tests agree. This points to the limitations of the classifiers we use in this example.
In [146]:
import numpy as np
mean = np.mean(y, dtype=np.float64)
print("The mean confidence level is: ", mean)
stdev = np.std(y, dtype = np.float64)
print("The standard deviation of the confidence is: ", stdev)
In [147]:
#Linear Regression
import statsmodels.api as sm
from statsmodels import regression
Factor models explain a result as a linear combination of an inherent alpha plus exposures to other indicators. The general form of a factor model is
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
In [148]:
# Let's define everything in familiar regression terms
X = x
Y = y
def linreg(x, y):
    # add a constant so that we can also fit an intercept (alpha) to the model
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y, x).fit()
    # remove the constant now that we're done
    x = x[:, 1]
    #print(model.params)
    return model.params[0], model.params[1]  # alpha and beta
alpha, beta = linreg(X, Y)
print('alpha: ' + str(alpha))
print('beta: ' + str(beta))
In [149]:
X2 = np.linspace(X.start, X.stop)
Y_hat = X2 * beta + alpha
plt.scatter(X, Y, alpha=0.25) # Plot the raw data
plt.xlabel("Number of Tests", fontsize = "16")
plt.ylabel("Confidence", fontsize = "16")
plt.axis('tight')
# Add the regression line, colored in red
plt.plot(X2, Y_hat, 'r', alpha=0.9);
In the future, we hope to process Twitter feeds, newspaper articles, and other public data for sentiment analysis on any topic. We can adjust the parameters of each machine learning classifier to optimize how the data is read, trading processing power for better results.
We hope you learned something about natural language processing. We admit that there is definite room for improvement; these are only the first steps toward using and implementing machine learning and natural language processing. Still, work like this is a step toward machines that understand the English language much the way humans do, and perhaps one day even better.
Special thanks to pythonprogramming.com for online guidance through this project.