Below is a small system for training and testing a Support Vector classifier on sentiment analysis data from SemEval-2017 Task 4 (Subtask A), consisting of English tweets labeled as positive, negative or neutral.
Currently the system contains only a single feature type: each tweet is represented by the set of words it contains. More specifically, a binary feature is created for each word in the vocabulary of the full training set, and the value of that feature for a given tweet is 1 if the word is present and 0 otherwise.
Your task is to improve the performance of the system by implementing other binary features. (If you want to include non-binary features, you will also have to modify the provided code.)
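For example (a toy illustration with a made-up three-word vocabulary, not part of the system below):
In [ ]:
# Toy illustration of the binary encoding: a hypothetical three-word
# vocabulary and one tweet represented as the set of its words
vocab = ['good', 'bad', 'movie']
tweet_words = {'good', 'movie'}
print([1 if w in tweet_words else 0 for w in vocab])  # prints [1, 0, 1]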
Before we start, let's download the dataset:
In [67]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/4a.tgz
And extract the files:
In [68]:
!tar xvvf 4a.tgz
4a.train and 4a.dev are the full datasets for training and testing; test.train and test.dev are small samples from these that you may want to use while debugging your solution.
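Each line is tab-separated, with the label in the second field and the tweet text in the third (this is what the read_data function below assumes; the first field is presumably a tweet ID). You can peek at the sample file to verify:
In [ ]:
!head -3 test.dev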
Before you get started, let's walk through the main components of the system.
The Featurizer class implements features as static methods and also converts train and test data to data structures handled by sklearn, the library we use for training an SVC model.
In [69]:
import numpy as np
import scipy
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
class Featurizer():

    @staticmethod
    def bag_of_words(text):
        # one binary feature per token in the tweet
        for word in word_tokenize(text):
            yield word

    feature_functions = [
        'bag_of_words']

    def __init__(self):
        self.labels = {}
        self.labels_by_id = {}
        self.features = {}
        self.features_by_id = {}
        self.next_feature_id = 0
        self.next_label_id = 0

    def to_sparse(self, events):
        """convert sets of ints to a scipy.sparse.csr_matrix"""
        data, row_ind, col_ind = [], [], []
        for event_index, event in enumerate(events):
            for feature in event:
                data.append(1)
                row_ind.append(event_index)
                col_ind.append(feature)
        n_features = self.next_feature_id
        n_events = len(events)
        matrix = scipy.sparse.csr_matrix(
            (data, (row_ind, col_ind)), shape=(n_events, n_features))
        return matrix

    def featurize(self, dataset, allow_new_features=False):
        events, labels = [], []
        n_events = len(dataset)
        for c, (text, label) in enumerate(dataset):
            if c % 2000 == 0:
                print("{0:.0%}...".format(c / n_events), end='')
            if label not in self.labels:
                self.labels[label] = self.next_label_id
                self.labels_by_id[self.next_label_id] = label
                self.next_label_id += 1
            labels.append(self.labels[label])
            events.append(set())
            for function_name in Featurizer.feature_functions:
                function = getattr(Featurizer, function_name)
                for feature in function(text):
                    if feature not in self.features:
                        # unseen features are only added when featurizing
                        # the training set; at test time they are skipped
                        if not allow_new_features:
                            continue
                        self.features[feature] = self.next_feature_id
                        self.features_by_id[self.next_feature_id] = feature
                        self.next_feature_id += 1
                    feat_id = self.features[feature]
                    events[-1].add(feat_id)
        print('done, sparsifying...', end='')
        events_sparse = self.to_sparse(events)
        labels_array = np.array(labels)
        print('done!')
        return events_sparse, labels_array
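To see what to_sparse produces, here is a quick sanity check on hand-made feature sets (not part of the experiment):
In [ ]:
# Sanity check: two hand-made events over a pretend four-feature space
f = Featurizer()
f.next_feature_id = 4  # pretend four features have been registered
print(f.to_sparse([{0, 2}, {1, 2, 3}]).toarray())
# [[1 0 1 0]
#  [0 1 1 1]]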
We'll need to evaluate our output against the gold data, using the metrics defined for the competition: accuracy, recall averaged over the three classes, and the average F-score of the positive and negative classes:
In [70]:
from collections import defaultdict
def evaluate(predictions, dev_labels):
    stats_by_label = defaultdict(lambda: defaultdict(int))
    for i, gold in enumerate(dev_labels):
        auto = predictions[i]
        if auto == gold:
            stats_by_label[auto]['tp'] += 1
        else:
            stats_by_label[auto]['fp'] += 1
            stats_by_label[gold]['fn'] += 1

    print("{:>8} {:>8} {:>8} {:>9} {:>8} {:>8}".format(
        'label', 'n_true', 'n_tagged', 'precision', 'recall', 'F-score'))
    for label, stats in stats_by_label.items():
        all_tagged = stats['tp'] + stats['fp']
        stats['prec'] = stats['tp'] / all_tagged if all_tagged else 0
        all_true = stats['tp'] + stats['fn']
        stats['rec'] = stats['tp'] / all_true if all_true else 0
        # harmonic mean of precision and recall
        stats['f'] = (2 / ((1 / stats['prec']) + (1 / stats['rec']))
                      if stats['prec'] > 0 and stats['rec'] > 0 else 0)
        print("{:>8} {:>8} {:>8} {:>9.2f} {:>8.2f} {:>8.2f}".format(
            label, all_true, all_tagged, stats['prec'], stats['rec'],
            stats['f']))

    accuracy = (
        sum([stats_by_label[label]['tp'] for label in stats_by_label]) /
        len(predictions)) if predictions else 0
    # recall averaged over the three classes
    av_rec = sum([stats['rec'] for stats in stats_by_label.values()]) / 3
    # average F-score of the positive and negative classes
    f_pn = (stats_by_label['positive']['f'] +
            stats_by_label['negative']['f']) / 2
    print()
    print("{:>10} {:>.4f}".format('Acc:', accuracy))
    print("{:>10} {:>.4f}".format('P/N av. F:', f_pn))
    print("{:>10} {:>.4f}".format('Av.rec:', av_rec))
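As a quick illustration (with made-up predictions, not model output), here is evaluate on three toy tweets, two of which are tagged correctly:
In [ ]:
# Toy illustration: two of three made-up predictions are correct
evaluate(['positive', 'negative', 'positive'],
         ['positive', 'negative', 'neutral'])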
We need a small function to read the data from file:
In [71]:
def read_data(fn):
    data = []
    with open(fn) as f:
        for line in f:
            # skip empty lines and stray quote characters in the data
            if not line.strip() or line.strip() == '"':
                continue
            fields = line.strip().split('\t')
            # the label is the second field, the tweet text the third
            answer, text = fields[1:3]
            data.append((text, answer))
    return data
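A quick check that the reader works on the small sample file (assuming the files extracted above):
In [ ]:
read_data('test.dev')[:2]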
And finally a main function to run an experiment:
In [72]:
from sklearn import svm
def sa_exp(train_file, dev_file):
    print('reading data...')
    train_data = read_data(train_file)
    dev_data = read_data(dev_file)

    print('featurizing train...')
    featurizer = Featurizer()
    train_events, train_labels = featurizer.featurize(
        train_data, allow_new_features=True)
    # dev featurization must not grow the feature space
    print('featurizing dev...')
    dev_events, dev_labels = featurizer.featurize(
        dev_data, allow_new_features=False)

    print('training...')
    model = svm.LinearSVC()
    model.fit(train_events, train_labels)

    print('predicting...')
    predictions = model.predict(dev_events)
    predicted_labels = [
        featurizer.labels_by_id[label] for label in predictions]
    dev_labels = [
        featurizer.labels_by_id[label] for label in dev_labels]

    print('evaluating...')
    print()
    evaluate(predicted_labels, dev_labels)
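While developing new features, you may want to run the experiment on the small samples first; it is much faster:
In [ ]:
sa_exp('test.train', 'test.dev')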
Let's see how the system performs currently:
In [73]:
sa_exp('4a.train', '4a.dev')
Now it's time to get started! Try to improve the main performance figures by implementing new features in the Featurizer class. Make sure that each feature function is a generator and that you add its name to the class variable feature_functions. Some ideas for features are listed below, but you should also come up with ideas of your own:
Sentiment lexicons: there are many on the internet (Google is your friend); just pick a couple and use them (see the sketch below).
Try to get more info on rare or unseen words; you may even want to reuse the code from last week's exercise.
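To make the mechanics concrete, here is a minimal sketch of two possible new feature functions: a lexicon feature using a tiny hand-made word list (a hypothetical stand-in for a real lexicon downloaded from the web), and a suffix feature as one crude way to get signal from rare or unseen words. The names lexicon, suffixes, POS_WORDS and NEG_WORDS are all made up for this illustration:
In [ ]:
# Sketch of two possible new feature functions; the tiny hand-made
# lexicon below is a hypothetical stand-in for a real downloaded one.
POS_WORDS = {'good', 'great', 'love', 'happy', 'awesome'}
NEG_WORDS = {'bad', 'awful', 'hate', 'sad', 'terrible'}


def lexicon(text):
    # binary indicators: does the tweet contain any lexicon word?
    words = set(word_tokenize(text.lower()))
    if words & POS_WORDS:
        yield 'HAS_POS_WORD'
    if words & NEG_WORDS:
        yield 'HAS_NEG_WORD'


def suffixes(text):
    # crude back-off for rare or unseen words: their last three characters
    for word in word_tokenize(text):
        yield 'SUF_' + word[-3:]


# register the new functions on the class; featurize looks them up by
# name via getattr, so they must be attached as static methods
# (note: re-running this cell would add duplicate entries)
Featurizer.lexicon = staticmethod(lexicon)
Featurizer.suffixes = staticmethod(suffixes)
Featurizer.feature_functions.extend(['lexicon', 'suffixes'])
After registering new functions this way (or simply editing the class definition above), re-run sa_exp to measure their effect.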