A Baseline Named Entity Recognizer for Twitter

In this notebook I follow the example presented in Named entities and random fields to train a conditional random field (CRF) to recognize named entities in Twitter data. The data and some of the code below are taken from a programming assignment in the amazing Natural Language Processing class offered on Coursera. The assignment showed how to build a named entity recognizer with a bidirectional LSTM, which is a fairly complicated deep learning approach, so I wanted a baseline model to see what accuracy to expect on this data.

1. Preparing the Data

First load the text and tags for training, validation and test data:


In [1]:
def read_data(file_path):
    """Read a file with one "token tag" pair per line and a blank line between tweets."""
    tokens = []
    tags = []

    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line ends the current tweet
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                # Replace all urls with <URL> token
                # Replace all users with <USR> token
                if token.startswith(("http://", "https://")):
                    token = "<URL>"
                elif token.startswith("@"):
                    token = "<USR>"
                tweet_tokens.append(token)
                tweet_tags.append(tag)
    # Keep the last tweet if the file does not end with a blank line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)
    return tokens, tags

train_tokens, train_tags = read_data('data/train.txt')
validation_tokens, validation_tags = read_data('data/validation.txt')
test_tokens, test_tags = read_data('data/test.txt')
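
To make the expected file layout concrete, here is a minimal sketch that parses a small in-memory sample the same way: one token and one tag per line, with a blank line between tweets. The sample tokens and the BIO-style tag names are invented for illustration, and `normalize_token` is a hypothetical helper that mirrors the URL and user replacement done in `read_data`.

```python
import io

def normalize_token(token):
    """Map URLs and user mentions to placeholder tokens, as read_data does."""
    if token.startswith(("http://", "https://")):
        return "<URL>"
    if token.startswith("@"):
        return "<USR>"
    return token

# Hypothetical file contents: tokens and tags are made up, not taken from the dataset.
sample = io.StringIO(
    "@alice O\n"
    "loves O\n"
    "New B-geo-loc\n"
    "York I-geo-loc\n"
    "\n"
    "see O\n"
    "https://example.com O\n"
    "\n"
)

tweets, tweet = [], []
for line in sample:
    line = line.strip()
    if line:
        token, tag = line.split()
        tweet.append((normalize_token(token), tag))
    elif tweet:
        # A blank line closes the current tweet
        tweets.append(tweet)
        tweet = []

for t in tweets:
    print(t)
```

The user mention and the URL come out as `<USR>` and `<URL>`, so the model sees one shared token for each kind instead of thousands of unique strings.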

The CRF model uses part-of-speech tags as features, so we need to add those to the datasets.


In [2]:
%%time
import nltk

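def build_sentence(tokens, tags):
    # nltk.pos_tag returns (token, pos_tag) pairs; keep only the tag.
    # It needs NLTK's tagger model, which can be fetched with nltk.download.
    pos_tags = [item[-1] for item in nltk.pos_tag(tokens)]
    return list(zip(tokens, pos_tags, tags))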

def build_sentences(tokens_set, tags_set):
    return [build_sentence(tokens, tags) for tokens, tags in zip(tokens_set, tags_set)]

train_sents = build_sentences(train_tokens, train_tags)
validation_sents = build_sentences(validation_tokens, validation_tags)
test_sents = build_sentences(test_tokens, test_tags)


CPU times: user 7.06 s, sys: 192 ms, total: 7.26 s
Wall time: 7.26 s

2. Computing Features


In [22]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    return [token for token, postag, label in sent]


X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_validation = [sent2features(s) for s in validation_sents]
y_validation = [sent2labels(s) for s in validation_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
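
To see what the CRF actually receives, the sketch below builds feature dictionaries for a made-up three-token sentence. `toy_word2features` is a reduced, hypothetical version of `word2features` that keeps only a few current-word features and the BOS/EOS flags; the sentence and its tags are invented and do not come from the dataset.

```python
def toy_word2features(sent, i):
    """Reduced version of word2features: a few current-word features plus BOS/EOS."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i == 0:
        features['BOS'] = True  # first token of the tweet
    if i == len(sent) - 1:
        features['EOS'] = True  # last token of the tweet
    return features

# Made-up (token, POS tag, label) triples in the same shape as train_sents entries.
toy_sent = [('Visiting', 'VBG', 'O'), ('Paris', 'NNP', 'B-geo-loc'), ('tomorrow', 'NN', 'O')]
toy_features = [toy_word2features(toy_sent, i) for i in range(len(toy_sent))]

for token_features in toy_features:
    print(token_features)
```

Each token becomes a dictionary of string, boolean and numeric features, and each sentence becomes a list of such dictionaries, which is the input format `sklearn_crfsuite` expects.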

3. Train the Model


In [4]:
import sklearn_crfsuite

In [5]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.12,
    c2=0.01,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)


Out[5]:
CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.12, c2=0.01,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

4. Evaluate the Model

We evaluate the model using the CoNLL shared task evaluation script, which scores whole entity phrases rather than individual tokens.


In [20]:
from evaluation import precision_recall_f1

def eval_conll(model, features, tags, short_report=True):
    """Computes NER quality measures using the CoNLL shared task script."""
    tags_pred = model.predict(features)
    # Flatten per-sentence tag lists into one long sequence each
    y_true = [y for s in tags for y in s]
    y_pred = [y for s in tags_pred for y in s]
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results

In [23]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(crf, X_train, y_train, short_report=False)

print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = eval_conll(crf, X_validation, y_validation, short_report=False)

print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(crf, X_test, y_test, short_report=False)


-------------------- Train set quality: --------------------
processed 99983 tokens with 4489 phrases; found: 4476 phrases; correct: 4433.

precision:  99.04%; recall:  98.75%; F1:  98.90

	     company: precision:   98.75%; recall:   98.13%; F1:   98.44; predicted:   639

	    facility: precision:   97.76%; recall:   97.13%; F1:   97.44; predicted:   312

	     geo-loc: precision:   99.20%; recall:   99.40%; F1:   99.30; predicted:   998

	       movie: precision:  100.00%; recall:  100.00%; F1:  100.00; predicted:    68

	 musicartist: precision:   97.85%; recall:   98.28%; F1:   98.06; predicted:   233

	       other: precision:   98.94%; recall:   98.68%; F1:   98.81; predicted:   755

	      person: precision:   99.32%; recall:   98.98%; F1:   99.15; predicted:   883

	     product: precision:   99.68%; recall:   99.06%; F1:   99.37; predicted:   316

	  sportsteam: precision:  100.00%; recall:   99.54%; F1:   99.77; predicted:   216

	      tvshow: precision:  100.00%; recall:   96.55%; F1:   98.25; predicted:    56

-------------------- Validation set quality: --------------------
processed 12112 tokens with 537 phrases; found: 317 phrases; correct: 213.

precision:  67.19%; recall:  39.66%; F1:  49.88

	     company: precision:   78.67%; recall:   56.73%; F1:   65.92; predicted:    75

	    facility: precision:   76.92%; recall:   29.41%; F1:   42.55; predicted:    13

	     geo-loc: precision:   76.25%; recall:   53.98%; F1:   63.21; predicted:    80

	       movie: precision:  100.00%; recall:   14.29%; F1:   25.00; predicted:     1

	 musicartist: precision:   55.56%; recall:   17.86%; F1:   27.03; predicted:     9

	       other: precision:   52.17%; recall:   29.63%; F1:   37.80; predicted:    46

	      person: precision:   67.12%; recall:   43.75%; F1:   52.97; predicted:    73

	     product: precision:   14.29%; recall:    5.88%; F1:    8.33; predicted:    14

	  sportsteam: precision:   33.33%; recall:   10.00%; F1:   15.38; predicted:     6

	      tvshow: precision:    0.00%; recall:    0.00%; F1:    0.00; predicted:     0

-------------------- Test set quality: --------------------
processed 12534 tokens with 604 phrases; found: 383 phrases; correct: 273.

precision:  71.28%; recall:  45.20%; F1:  55.32

	     company: precision:   85.71%; recall:   57.14%; F1:   68.57; predicted:    56

	    facility: precision:   72.73%; recall:   51.06%; F1:   60.00; predicted:    33

	     geo-loc: precision:   82.91%; recall:   58.79%; F1:   68.79; predicted:   117

	       movie: precision:    0.00%; recall:    0.00%; F1:    0.00; predicted:     2

	 musicartist: precision:   33.33%; recall:    7.41%; F1:   12.12; predicted:     6

	       other: precision:   55.07%; recall:   36.89%; F1:   44.19; predicted:    69

	      person: precision:   64.20%; recall:   50.00%; F1:   56.22; predicted:    81

	     product: precision:   37.50%; recall:   10.71%; F1:   16.67; predicted:     8

	  sportsteam: precision:   81.82%; recall:   29.03%; F1:   42.86; predicted:    11

	      tvshow: precision:    0.00%; recall:    0.00%; F1:    0.00; predicted:     0
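
The overall figures in the reports above follow directly from the phrase counts on the "processed ... tokens" lines. As a sanity check, here is a small sketch that recomputes them; `phrase_metrics` is a hypothetical helper written for this illustration and is not part of the evaluation script.

```python
def phrase_metrics(n_true, n_found, n_correct):
    """Phrase-level precision, recall and F1 (in percent) from raw counts."""
    precision = 100.0 * n_correct / n_found if n_found else 0.0
    recall = 100.0 * n_correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts copied from the validation report: 537 phrases, 317 found, 213 correct.
precision, recall, f1 = phrase_metrics(n_true=537, n_found=317, n_correct=213)
print(f"precision: {precision:6.2f}%; recall: {recall:6.2f}%; F1: {f1:6.2f}")
```

Precision is the share of predicted phrases that are correct, recall is the share of true phrases that were found, and F1 is their harmonic mean. Plugging in the test counts (604, 383, 273) reproduces the test-set line in the same way.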

5. Tuning Parameters

I tried tuning the model's c1 and c2 parameters (the L1 and L2 regularization coefficients) using randomized search, but could not improve on the results above. I plan to try GPyOpt to see whether it does better, but I don't have time to do that here.
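
For reference, the idea behind a randomized search over c1 and c2 can be sketched in a few lines. This is not the code I ran: `random_search` and `toy_score` are hypothetical names, the exponential sampling scales are assumptions, and `toy_score` is an arbitrary stand-in so the sketch runs on its own. In practice the scoring function would train a CRF with the sampled c1 and c2 and return its F1 on the validation set.

```python
import random

def random_search(score_fn, n_iter=20, seed=0):
    """Sample (c1, c2) pairs at random and keep the pair with the highest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Exponential draws favour small penalties; the mean values 0.5 and 0.05
        # are assumptions chosen for illustration.
        params = {"c1": rng.expovariate(1 / 0.5), "c2": rng.expovariate(1 / 0.05)}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(c1, c2):
    """Arbitrary stand-in objective; a real one would return validation F1."""
    return -((c1 - 0.12) ** 2 + (c2 - 0.01) ** 2)

best_params, best_score = random_search(toy_score, n_iter=50, seed=42)
print(best_params, best_score)
```

Because each trial is independent, the number of iterations directly trades compute time for coverage of the parameter space.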

