Below is a small system for training and testing a Support Vector classifier on sentiment analysis data from SemEval-2017 Task 4 (Subtask A), consisting of English tweets labeled as positive, negative or neutral.
Currently the system contains only a single feature type: each tweet is represented by the set of words it contains. More specifically, a binary feature is created for each word in the vocabulary of the full training set, and the value of that feature for a given tweet is 1 if the word is present and 0 otherwise.
Your task is to improve the performance of the system by implementing other binary features. (If you want to include non-binary features, you will also have to modify the provided code.)
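For example (a toy illustration with a made-up three-word vocabulary, not part of the system below):
In [ ]:
# Toy illustration of the binary encoding: a hypothetical three-word
# vocabulary and one tweet represented as the set of its words
vocab = ['good', 'bad', 'movie']
tweet_words = {'good', 'movie'}
print([1 if w in tweet_words else 0 for w in vocab])  # prints [1, 0, 1]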
Before we start, let's download the dataset:
In [67]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/4a.tgz
And extract the files:
In [68]:
!tar xvvf 4a.tgz
4a.train and 4a.dev are the full datasets for training and testing; test.train and test.dev are small samples from these that you may want to use while debugging your solution.
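Each line is tab-separated, with the label in the second field and the tweet text in the third (this is what the read_data function below assumes; the first field is presumably a tweet ID). You can peek at the sample file to verify:
In [ ]:
!head -3 test.dev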
Before you get started, let's walk through the main components of the system.
The Featurizer class implements features as static methods and also converts train and test data to data structures handled by sklearn, the library we use for training an SVC model.
In [69]:
import numpy as np
import scipy
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
class Featurizer():

    @staticmethod
    def bag_of_words(text):
        # one binary feature per token in the tweet
        for word in word_tokenize(text):
            yield word

    feature_functions = [
        'bag_of_words']

    def __init__(self):
        self.labels = {}
        self.labels_by_id = {}
        self.features = {}
        self.features_by_id = {}
        self.next_feature_id = 0
        self.next_label_id = 0

    def to_sparse(self, events):
        """convert sets of ints to a scipy.sparse.csr_matrix"""
        data, row_ind, col_ind = [], [], []
        for event_index, event in enumerate(events):
            for feature in event:
                data.append(1)
                row_ind.append(event_index)
                col_ind.append(feature)
        n_features = self.next_feature_id
        n_events = len(events)
        matrix = scipy.sparse.csr_matrix(
            (data, (row_ind, col_ind)), shape=(n_events, n_features))
        return matrix

    def featurize(self, dataset, allow_new_features=False):
        events, labels = [], []
        n_events = len(dataset)
        for c, (text, label) in enumerate(dataset):
            if c % 2000 == 0:
                print("{0:.0%}...".format(c / n_events), end='')
            if label not in self.labels:
                self.labels[label] = self.next_label_id
                self.labels_by_id[self.next_label_id] = label
                self.next_label_id += 1
            labels.append(self.labels[label])
            events.append(set())
            for function_name in Featurizer.feature_functions:
                function = getattr(Featurizer, function_name)
                for feature in function(text):
                    if feature not in self.features:
                        # unseen features are only added when featurizing
                        # the training set; at test time they are skipped
                        if not allow_new_features:
                            continue
                        self.features[feature] = self.next_feature_id
                        self.features_by_id[self.next_feature_id] = feature
                        self.next_feature_id += 1
                    feat_id = self.features[feature]
                    events[-1].add(feat_id)
        print('done, sparsifying...', end='')
        events_sparse = self.to_sparse(events)
        labels_array = np.array(labels)
        print('done!')
        return events_sparse, labels_array
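To see what to_sparse produces, here is a quick sanity check on hand-made feature sets (not part of the experiment):
In [ ]:
# Sanity check: two hand-made events over a pretend four-feature space
f = Featurizer()
f.next_feature_id = 4  # pretend four features have been registered
print(f.to_sparse([{0, 2}, {1, 2, 3}]).toarray())
# [[1 0 1 0]
#  [0 1 1 1]]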
We'll need to evaluate our output against the gold data, using the metrics defined for the competition: accuracy, recall averaged over the three classes, and the average F-score of the positive and negative classes:
In [70]:
from collections import defaultdict
def evaluate(predictions, dev_labels):
    stats_by_label = defaultdict(lambda: defaultdict(int))
    for i, gold in enumerate(dev_labels):
        auto = predictions[i]
        if auto == gold:
            stats_by_label[auto]['tp'] += 1
        else:
            stats_by_label[auto]['fp'] += 1
            stats_by_label[gold]['fn'] += 1

    print("{:>8} {:>8} {:>8} {:>9} {:>8} {:>8}".format(
        'label', 'n_true', 'n_tagged', 'precision', 'recall', 'F-score'))
    for label, stats in stats_by_label.items():
        all_tagged = stats['tp'] + stats['fp']
        stats['prec'] = stats['tp'] / all_tagged if all_tagged else 0
        all_true = stats['tp'] + stats['fn']
        stats['rec'] = stats['tp'] / all_true if all_true else 0
        # harmonic mean of precision and recall
        stats['f'] = (2 / ((1 / stats['prec']) + (1 / stats['rec']))
                      if stats['prec'] > 0 and stats['rec'] > 0 else 0)
        print("{:>8} {:>8} {:>8} {:>9.2f} {:>8.2f} {:>8.2f}".format(
            label, all_true, all_tagged, stats['prec'], stats['rec'],
            stats['f']))

    accuracy = (
        sum([stats_by_label[label]['tp'] for label in stats_by_label]) /
        len(predictions)) if predictions else 0
    # recall averaged over the three classes
    av_rec = sum([stats['rec'] for stats in stats_by_label.values()]) / 3
    # average F-score of the positive and negative classes
    f_pn = (stats_by_label['positive']['f'] +
            stats_by_label['negative']['f']) / 2
    print()
    print("{:>10} {:>.4f}".format('Acc:', accuracy))
    print("{:>10} {:>.4f}".format('P/N av. F:', f_pn))
    print("{:>10} {:>.4f}".format('Av.rec:', av_rec))
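As a quick illustration (with made-up predictions, not model output), here is evaluate on three toy tweets, two of which are tagged correctly:
In [ ]:
# Toy illustration: two of three made-up predictions are correct
evaluate(['positive', 'negative', 'positive'],
         ['positive', 'negative', 'neutral'])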
We need a small function to read the data from file:
In [71]:
def read_data(fn):
    data = []
    with open(fn) as f:
        for line in f:
            # skip empty lines and stray quote characters in the data
            if not line.strip() or line.strip() == '"':
                continue
            fields = line.strip().split('\t')
            # the label is the second field, the tweet text the third
            answer, text = fields[1:3]
            data.append((text, answer))
    return data
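A quick check that the reader works on the small sample file (assuming the files extracted above):
In [ ]:
read_data('test.dev')[:2]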
And finally a main function to run an experiment:
In [72]:
from sklearn import svm
def sa_exp(train_file, dev_file):
    print('reading data...')
    train_data = read_data(train_file)
    dev_data = read_data(dev_file)

    print('featurizing train...')
    featurizer = Featurizer()
    train_events, train_labels = featurizer.featurize(
        train_data, allow_new_features=True)
    # dev featurization must not grow the feature space
    print('featurizing dev...')
    dev_events, dev_labels = featurizer.featurize(
        dev_data, allow_new_features=False)

    print('training...')
    model = svm.LinearSVC()
    model.fit(train_events, train_labels)

    print('predicting...')
    predictions = model.predict(dev_events)
    predicted_labels = [
        featurizer.labels_by_id[label] for label in predictions]
    dev_labels = [
        featurizer.labels_by_id[label] for label in dev_labels]

    print('evaluating...')
    print()
    evaluate(predicted_labels, dev_labels)
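While developing new features, you may want to run the experiment on the small samples first; it is much faster:
In [ ]:
sa_exp('test.train', 'test.dev')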
Let's see how the system performs currently:
In [73]:
sa_exp('4a.train', '4a.dev')
Now it's time to get started! Try to improve the main performance figures by implementing new features in the Featurizer class. Make sure that each feature function is a generator and that you add its name to the class variable feature_functions. Some ideas for features are listed below, but you should also come up with ideas of your own:
Sentiment lexicons: there are many on the internet (Google is your friend); just pick a couple and use them (see the sketch below).
Try to get more info on rare or unseen words; you may even want to reuse the code from last week's exercise.
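To make the mechanics concrete, here is a minimal sketch of two possible new feature functions: a lexicon feature using a tiny hand-made word list (a hypothetical stand-in for a real lexicon downloaded from the web), and a suffix feature as one crude way to get signal from rare or unseen words. The names lexicon, suffixes, POS_WORDS and NEG_WORDS are all made up for this illustration:
In [ ]:
# Sketch of two possible new feature functions; the tiny hand-made
# lexicon below is a hypothetical stand-in for a real downloaded one.
POS_WORDS = {'good', 'great', 'love', 'happy', 'awesome'}
NEG_WORDS = {'bad', 'awful', 'hate', 'sad', 'terrible'}


def lexicon(text):
    # binary indicators: does the tweet contain any lexicon word?
    words = set(word_tokenize(text.lower()))
    if words & POS_WORDS:
        yield 'HAS_POS_WORD'
    if words & NEG_WORDS:
        yield 'HAS_NEG_WORD'


def suffixes(text):
    # crude back-off for rare or unseen words: their last three characters
    for word in word_tokenize(text):
        yield 'SUF_' + word[-3:]


# register the new functions on the class; featurize looks them up by
# name via getattr, so they must be attached as static methods
# (note: re-running this cell would add duplicate entries)
Featurizer.lexicon = staticmethod(lexicon)
Featurizer.suffixes = staticmethod(suffixes)
Featurizer.feature_functions.extend(['lexicon', 'suffixes'])
After registering new functions this way (or simply editing the class definition above), re-run sa_exp to measure their effect.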