In [1]:
from adjective_phrase_tagger.adj_phrase_tagger import AdjectivePhraseTagger
from estnltk import Text
Create a Text object and an AdjectivePhraseTagger object, then tag adjective phrases as a new layer of the Text object.
In [2]:
tagger = AdjectivePhraseTagger(return_layer=True) # return_layer=True returns only the adjective phrase layer
sent = Text("Peaaegu 8-aastane koer oli väga energiline ja mänguhimuline.")
tagger.tag(sent)
Out[2]:
type is the type of the phrase:
measurement_adj means that the adjective in the phrase either contains a number or some other type of measurement
intersects_with_verb signifies whether the found adjective phrase intersects with a verb phrase in the text; this happens mostly in the case of participles as in the following sentence:
In [3]:
tagger.tag(Text("Ta oli väga üllatunud."))
Out[3]:
adverb_class marks the intensity of the adverb in the phrase. Each class has also been assigned a weight (adverb_weight) noting its intensity. Currently there are 6 classes with their corresponding weights:
Not all adverbs have been divided into classes, so some have unknown as their adverb_class and adverb_weight.
Adjective phrases can be used for sentiment analysis, that is, for determining the polarity of a text. While this is often done using only adjectives, phrases consisting of an adverb and an adjective can give more precise results, because the adverbs in such phrases usually act as intensifiers. For this purpose, the most frequent adverbs are already divided into classes and assigned weights based on their intensifying properties (see above).
To illustrate this, let's build a very simple system for sentiment analysis. For this, we can use the hinnavaatlus.csv dataset, which contains user reviews and their ratings (positive, negative and neutral).
First, let's extract adjectives from the user reviews and create separate frequency lists of adjectives appearing in positive and negative reviews.
In [4]:
import csv
from collections import defaultdict

# adjectives[label][lemma] -> number of occurrences in reviews with that label
adjectives = defaultdict(lambda: defaultdict(int))

with open('data/hinnavaatlus.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for idx, row in enumerate(reader):
        # row[1] is used as the review text, row[2] as its rating label
        tagged = tagger.tag(Text(row[1]))
        label = row[2]
        for tag in tagged:
            if len(tag) > 0:
                # the last lemma of the phrase is the adjective itself
                adj = tag['lemmas'][-1]
                adjectives[label][adj] += 1
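To make the shape of the resulting structure concrete, here is a minimal sketch of the same counting step on hand-made data, without any EstNLTK calls. The phrase dicts and the helper name count_final_lemmas are hypothetical; they only imitate the 'lemmas' field used above.

```python
from collections import Counter, defaultdict


def count_final_lemmas(labelled_phrases):
    """Count the last lemma of each phrase, grouped by review label."""
    counts = defaultdict(Counter)
    for label, phrases in labelled_phrases:
        for phrase in phrases:
            lemmas = phrase.get('lemmas')
            if lemmas:  # skip phrases that carry no lemmas
                counts[label][lemmas[-1]] += 1
    return counts


# Hypothetical, hand-made input standing in for the tagger output.
toy_reviews = [
    ('Positiivne', [{'lemmas': ['väga', 'hea']}, {'lemmas': ['kiire']}]),
    ('Positiivne', [{'lemmas': ['hea']}]),
    ('Negatiivne', [{'lemmas': ['liiga', 'aeglane']}]),
]

toy_counts = count_final_lemmas(toy_reviews)
print(toy_counts['Positiivne'].most_common(1))
```

A Counter per label behaves like the nested defaultdict(int), and additionally offers most_common() for a quick look at the most frequent adjectives.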
Of course, not every adjective used in a positive review is positive, and not every adjective used in a negative review is negative. To overcome this problem, we can use the volcanoplot (tutorial here) tool, which visualises the two lists and helps us find the words that are over-represented in each. For this, we need to save both frequency lists as CSV files.
In [5]:
# newline='' is what the csv module expects for files it writes to
with open("neg.csv", "w", newline='') as fout:
    writer = csv.writer(fout, dialect='excel')
    for row in adjectives['Negatiivne']:
        writer.writerow([row, adjectives['Negatiivne'][row]])
In [6]:
with open("pos.csv", "w", newline='') as fout:
    writer = csv.writer(fout, dialect='excel')
    for row in adjectives['Positiivne']:
        writer.writerow([row, adjectives['Positiivne'][row]])
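To get a rough feel for what "over-represented" means before turning to the visualisation, one can compare relative frequencies directly. The sketch below is not what volcanoplot computes; it is a simple smoothed log-ratio, and the function name log_ratio, the smoothing value and the toy counts are all assumptions made for illustration.

```python
import math


def log_ratio(word, pos_counts, neg_counts, smoothing=1.0):
    """Smoothed log2 ratio of a word's relative frequency (positive vs. negative)."""
    vocab = set(pos_counts) | set(neg_counts)
    # Add-one style smoothing keeps words missing from one list from dividing by zero.
    pos_total = sum(pos_counts.values()) + smoothing * len(vocab)
    neg_total = sum(neg_counts.values()) + smoothing * len(vocab)
    pos_rate = (pos_counts.get(word, 0) + smoothing) / pos_total
    neg_rate = (neg_counts.get(word, 0) + smoothing) / neg_total
    return math.log2(pos_rate / neg_rate)


# Hypothetical frequency lists, shaped like adjectives['Positiivne'] and
# adjectives['Negatiivne'].
toy_pos = {'hea': 8, 'kiire': 3, 'aeglane': 1}
toy_neg = {'hea': 1, 'kiire': 3, 'aeglane': 8}

for word in sorted(toy_pos):
    print(word, round(log_ratio(word, toy_pos, toy_neg), 2))
```

Values well above zero suggest a word is typical of positive reviews, values well below zero of negative ones, and values near zero carry little polarity information.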
From volcanoplot we save two lexicons: one for positive words (data/positive.txt) and one for negative words (data/negative.txt). Now let's decide that an adjective appearing in the positive lexicon has a score of 1 and an adjective in the negative lexicon has a score of -1. Adjectives not present in either lexicon have a score of 0.
In [7]:
# Load each lexicon (one word per line) into a set for fast membership tests
with open("data/negative.txt", "r") as fin:
    negative = set(word.strip() for word in fin)

with open("data/positive.txt", "r") as fin:
    positive = set(word.strip() for word in fin)
Now we can assign a score to each adjective and compute scores for phrases consisting of an adverb and an adjective by multiplying the score of the adjective by the weight of the preceding adverb. By summing the scores of all the phrases in a review, we can calculate the polarity of the review.
In [8]:
with open('data/hinnavaatlus.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    review_scores = {}
    for idx, row in enumerate(reader):
        tagged = tagger.tag(Text(row[1]))
        total_score = []
        # print details only for the first 10 reviews
        if idx < 10:
            print(row[1])
        for i in tagged:
            # the last lemma of the phrase is the adjective
            if i['lemmas'][-1] in positive:
                if 'adverb_weight' in i:
                    score = 1*i['adverb_weight']
                else:
                    score = 1
            elif i['lemmas'][-1] in negative:
                if 'adverb_weight' in i:
                    score = -1*i['adverb_weight']
                else:
                    score = -1
            else:
                score = 0
            if idx < 10:
                print(i['lemmas'], ' ', score)
            total_score.append(score)
        if idx < 10:
            print("Total score: ", str(sum(total_score)))
            print("-----------------------------")
        # keyed by review text, so identical reviews share one entry
        review_scores[row[1]] = sum(total_score)
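The scoring rule can also be isolated into two small functions, which makes it easy to try out on hand-made phrases. This is only a sketch: the names phrase_score and review_score, the toy lexicons and the weights 2 and 0.5 are hypothetical, and treating a non-numeric adverb_weight as 1 is an assumption about how to handle unclassified adverbs, not documented tagger behaviour.

```python
def phrase_score(phrase, positive, negative):
    """Polarity of the adjective (+1, -1 or 0) times the adverb weight."""
    lemmas = phrase.get('lemmas') or []
    if not lemmas:
        return 0
    adjective = lemmas[-1]  # the last lemma is taken to be the adjective
    if adjective in positive:
        polarity = 1
    elif adjective in negative:
        polarity = -1
    else:
        return 0
    weight = phrase.get('adverb_weight', 1)
    # Assumption: a missing or non-numeric weight leaves the polarity unchanged.
    if not isinstance(weight, (int, float)):
        weight = 1
    return polarity * weight


def review_score(phrases, positive, negative):
    """Sum of the phrase scores of one review."""
    return sum(phrase_score(p, positive, negative) for p in phrases)


# Hypothetical lexicons and phrases; the weights are made up for illustration.
toy_positive = {'hea'}
toy_negative = {'halb'}
toy_phrases = [
    {'lemmas': ['väga', 'hea'], 'adverb_weight': 2},        # +1 * 2
    {'lemmas': ['halb']},                                   # -1, no adverb
    {'lemmas': ['natuke', 'punane'], 'adverb_weight': 0.5}, # not in a lexicon
]

print(review_score(toy_phrases, toy_positive, toy_negative))
```

Keeping the rule in one function means a change such as a different default weight only has to be made in one place.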
As we saved the reviews and their scores to the dict review_scores, we can sort it and find reviews that have the highest and lowest scores.
In [9]:
from collections import OrderedDict
sorted_scores = OrderedDict(sorted(review_scores.items(), key=lambda t: t[1], reverse = True))
Let's print the 5 most positive reviews:
In [10]:
for idx, i in enumerate(sorted_scores):
    if idx < 5:
        print(i, sorted_scores[i])
        print()
And the 5 most negative reviews:
In [11]:
for idx, i in enumerate(OrderedDict(reversed(list(sorted_scores.items())))):
    if idx < 5:
        print(i, sorted_scores[i])
        print()
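When only the top and bottom few reviews are needed, sorting the whole dict is not strictly necessary. As an alternative sketch, the standard library's heapq.nlargest and heapq.nsmallest pick them out directly; the dict toy_scores below is hypothetical and stands in for review_scores.

```python
import heapq

# Hypothetical review -> score mapping, shaped like review_scores.
toy_scores = {'review A': 3, 'review B': -2, 'review C': 0.5, 'review D': -4}

# Select by score (the second element of each (review, score) pair).
most_positive = heapq.nlargest(2, toy_scores.items(), key=lambda item: item[1])
most_negative = heapq.nsmallest(2, toy_scores.items(), key=lambda item: item[1])

for review, score in most_positive:
    print(review, score)
print()
for review, score in most_negative:
    print(review, score)
```

Both calls return lists of (review, score) pairs ordered from the most extreme score inwards, so no reversed copy of the data is needed.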