Training a Sentiment Analysis LSTM Using Noisy Crowd Labels

This is a version of the crowdsourcing tutorial that uses PySpark using Pandas instead of SparkSQL.

In this tutorial, we'll provide a simple walkthrough of how to use Snorkel to resolve conflicts in a noisy crowdsourced dataset for a sentiment analysis task, and then use these denoised labels to train an LSTM sentiment analysis model which can be applied to new, unseen data to automatically make predictions!

  1. Creating basic Snorkel objects: Candidates, Contexts, and Labels
  2. Training the GenerativeModel to resolve labeling conflicts
  3. Training a simple LSTM sentiment analysis model, which can then be used on new, unseen data!

Note that this is a simple tutorial meant to give an overview of the mechanics of using Snorkel-- we'll note places where more careful fine-tuning could be done!

Task Detail: Weather Sentiments in Tweets

In this tutorial we focus on the Weather sentiment task from Crowdflower.

In this task, contributors were asked to grade the sentiment of a particular tweet relating to the weather. Contributors could choose among the following categories:

  1. Positive
  2. Negative
  3. I can't tell
  4. Neutral / author is just sharing information
  5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet. Thus, in many cases contributors assigned conflicting sentiment labels to the same tweet.

The task comes with two data files (to be found in the data directory of the tutorial:

  1. weather-non-agg-DFE.csv contains the raw contributor answers for each of the 1,000 tweets.
  2. weather-evaluated-agg-DFE.csv contains gold sentiment labels by trusted workers for each of the 1,000 tweets.

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Step 1: Preprocessing - Data Loading

We load the raw data for our crowdsourcing task (stored in a local csv file) into a dataframe


In [3]:
import pandas as pd

# Load Raw Crowdsourcing Data
raw_crowd_answers = pd.read_csv("data/weather-non-agg-DFE.csv")

# Load Groundtruth Crowdsourcing Data
gold_crowd_answers = pd.read_csv("data/weather-evaluated-agg-DFE.csv")
# Filter out low-confidence answers
gold_answers = gold_crowd_answers[['tweet_id', 'sentiment', 'tweet_body']][(gold_crowd_answers.correct_category == 'Yes') & (gold_crowd_answers.correct_category_conf == 1)] 

# Keep Only the Tweets with Available Groundtruth
candidate_labeled_tweets = raw_crowd_answers.join(gold_answers.set_index('tweet_id',drop=False),on=['tweet_id'],lsuffix='.raw',rsuffix='.gold',how='inner')
candidate_labeled_tweets = candidate_labeled_tweets[['tweet_id.raw','tweet_body.raw','worker_id','emotion']]
candidate_labeled_tweets.columns = ['tweet_id','tweet_body','worker_id','emotion']

As mentioned above, contributors can provide conflicting labels for the same tweet:


In [4]:
candidate_labeled_tweets.sort_values(['worker_id','tweet_id']).head()


Out[4]:
tweet_id tweet_body worker_id emotion
1527 79195142 G'morning, Sunshine: 60s and partly sunny? OK! 6332651 Neutral / author is just sharing information
1512 79196060 I woke up to a beautiful ball of red coming ov... 6332651 Positive
1517 80054061 @mention It's supposed to go up to 70 today. S... 6332651 Neutral / author is just sharing information
1546 80056390 @mention that was the effect I was hoping for ... 6332651 Negative
1526 81215474 Dateline: Elkhart Lake, WI - It's cloudy, chil... 6332651 Positive

Step 2: Generating Snorkel Objects

Candidates

Candidates are the core objects in Snorkel representing objects to be classified. We'll use a helper function to create a custom Candidate sub-class, Tweet, with values representing the possible labels that it can be classified with:


In [5]:
from snorkel.models import candidate_subclass

values = list(candidate_labeled_tweets.emotion.unique())

Tweet = candidate_subclass('Tweet', ['tweet'], values=values)

Contexts

All Candidate objects point to one or more Context objects, which represent the raw data that they are rooted in. In this case, our candidates will each point to a single Context object representing the raw text of the tweet.

Once we have defined the Context for each Candidate, we can commit them to the database. Note that we also split into two sets while doing this:

  1. Training set (split=0): The tweets for which we have noisy, conflicting crowd labels; we will resolve these conflicts using the GenerativeModel and then use them as training data for the LSTM

  2. Test set (split=1): We will pretend that we do not have any crowd labels for this split of the data, and use these to test the LSTM's performance on unseen data


In [6]:
from snorkel.models import Context, Candidate
from snorkel.contrib.models.text import RawText

# Make sure DB is cleared
session.query(Context).delete()
session.query(Candidate).delete()

# Now we create the candidates with a simple loop
tweet_bodies = candidate_labeled_tweets \
    [["tweet_id", "tweet_body"]] \
    .sort_values("tweet_id") \
    .drop_duplicates()

# Generate and store the tweet candidates to be classified
# Note: We split the tweets in two sets: one for which the crowd 
# labels are not available to Snorkel (test, 10%) and one for which we assume
# crowd labels are obtained (to be used for training, 90%)
total_tweets = len(tweet_bodies)
tweet_list = []
test_split = total_tweets*0.1
for i, t in tweet_bodies.iterrows():
    split = 1 if i <= test_split else 0
    raw_text = RawText(stable_id=t.tweet_id, name=t.tweet_id, text=t.tweet_body)
    tweet = Tweet(tweet=raw_text, split=split)
    tweet_list.append(tweet)
    session.add(tweet)
session.commit()

Labels

Next, we'll store the labels for each of the training candidates in a sparse matrix (which will also automatically be saved to the Snorkel database), with one row for each candidate and one column for each crowd worker:


In [7]:
from snorkel.annotations import LabelAnnotator
from collections import defaultdict

# Extract worker votes
# Cache locally to speed up for this small set
worker_labels = candidate_labeled_tweets[["tweet_id", "worker_id", "emotion"]]
wls = defaultdict(list)
for i, row in worker_labels.iterrows():
    wls[str(row.tweet_id)].append((str(row.worker_id), row.emotion))
    

# Create a label generator
def worker_label_generator(t):
    """A generator over the different (worker_id, label_id) pairs for a Tweet."""
    for worker_id, label in wls[t.tweet.name]:
        yield worker_id, label
        
labeler = LabelAnnotator(label_generator=worker_label_generator)
%time L_train = labeler.apply(split=0)
L_train


  0%|          | 0/629 [00:00<?, ?it/s]
Clearing existing...
Running UDF...
100%|██████████| 629/629 [00:05<00:00, 123.33it/s]
CPU times: user 5.14 s, sys: 44 ms, total: 5.18 s
Wall time: 5.42 s

Out[7]:
<629x102 sparse matrix of type '<type 'numpy.int64'>'
	with 12580 stored elements in Compressed Sparse Row format>

Finally, we load the ground truth ("gold") labels for both the training and test sets, and store as numpy arrays"


In [8]:
gold_labels = defaultdict(list)

# Get gold labels in verbose form
verbose_labels = dict([(str(t.tweet_id), t.sentiment) 
                       for i, t in gold_answers[["tweet_id", "sentiment"]].iterrows()])

# Iterate over splits, align with Candidate ordering
for split in range(2):
    cands = session.query(Tweet).filter(Tweet.split == split).order_by(Tweet.id).all() 
    for c in cands:
        # Think this is just an odd way of label encoding between 1 and 5?
        gold_labels[split].append(values.index(verbose_labels[c.tweet.name]) + 1) 
        
train_cand_labels = np.array(gold_labels[0])
test_cand_labels = np.array(gold_labels[1])

Step 3: Resolving Crowd Conflicts with the Generative Model

Until now we have converted the raw crowdsourced data into a labeling matrix that can be provided as input to Snorkel. We will now show how to:

  1. Use Snorkel's generative model to learn the accuracy of each crowd contributor.
  2. Use the learned model to estimate a marginal distribution over the domain of possible labels for each task.
  3. Use the estimated marginal distribution to obtain the maximum a posteriori probability estimate for the label that each task takes.

In [9]:
# Imports
from snorkel.learning.gen_learning import GenerativeModel

# Initialize Snorkel's generative model for
# learning the different worker accuracies.
gen_model = GenerativeModel(lf_propensity=True)

In [10]:
# Train the generative model
gen_model.train(
    L_train,
    reg_type=2,
    reg_param=0.1,
    epochs=30
)


Inferred cardinality: 5

Infering the MAP assignment for each task

Each task corresponds to an independent random variable. Thus, we can simply associate each task with the most probably label based on the estimated marginal distribution and get an accuracy score:


In [11]:
accuracy = gen_model.score(L_train, train_cand_labels)
print("Accuracy: {:.10f}".format(accuracy))


Accuracy: 0.9952305246

Majority vote

It seems like we did well- but how well? Given that this is a fairly simple task--we have 20 contributors per tweet (and most of them are far better than random)--we expect majority voting to perform extremely well, so we can check against majority vote:


In [12]:
from collections import Counter

# Collect the majority vote answer for each tweet
mv = []
for i in range(L_train.shape[0]):
    c = Counter([L_train[i,j] for j in L_train[i].nonzero()[1]])
    mv.append(c.most_common(1)[0][0])
mv = np.array(mv)

# Count the number correct by majority vote
n_correct = np.sum([1 for i in range(L_train.shape[0]) if mv[i] == train_cand_labels[i]])
print ("Accuracy:{}".format(n_correct / float(L_train.shape[0])))
print ("Number incorrect:{}".format(L_train.shape[0] - n_correct))


Accuracy:0.985691573927
Number incorrect:9

We see that while majority vote makes 10 errors, the Snorkel model makes only 3! What about an average crowd worker?

Average human accuracy

We see that the average accuracy of a single crowd worker is in fact much lower:


In [13]:
accs = []
for j in range(L_train.shape[1]):
    n_correct = np.sum([1 for i in range(L_train.shape[0]) if L_train[i,j] == train_cand_labels[i]])
    acc = n_correct / float(L_train[:,j].nnz)
    accs.append(acc)
print( "Mean Accuracy:{}".format( np.mean(accs)))


Mean Accuracy:0.729664764868

Step 4: Training an ML Model with Snorkel for Sentiment Analysis over Unseen Tweets

In the previous step, we saw that Snorkel's generative model can help to denoise crowd labels automatically. However, what happens when we don't have noisy crowd labels for a tweet?

In this step, we'll use the estimates of the generative model as probabilistic training labels to train a simple LSTM sentiment analysis model, which takes as input a tweet for which no crowd labels are available and predicts its sentiment.

First, we get the probabilistic training labels (training marginals) which are just the marginal estimates of the generative model:


In [14]:
train_marginals = gen_model.marginals(L_train)

In [15]:
from snorkel.annotations import save_marginals
save_marginals(session, L_train, train_marginals)


Saved 629 marginals

Next, we'll train a simple LSTM:


In [16]:
from snorkel.learning.tensorflow import TextRNN


/dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

In [17]:
train_kwargs = {
    'lr':         0.01,
    'dim':        100,
    'n_epochs':   200,
    'dropout':    0.2,
    'print_freq': 5
}

lstm = TextRNN(seed=1701, cardinality=Tweet.cardinality)
train_cands = session.query(Tweet).filter(Tweet.split == 0).order_by(Tweet.id).all()
lstm.train(train_cands, train_marginals, **train_kwargs)


/dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/snorkel/learning/tensorflow/rnn/rnn_base.py:36: UserWarning: Candidate 618 has argument past max length for model:	[arg ends at index 28; max len 28]
  warnings.warn('\t'.join([w.format(i), info]))
WARNING:tensorflow:From /dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py:430: calling reverse_sequence (from tensorflow.python.ops.array_ops) with seq_dim is deprecated and will be removed in a future version.
Instructions for updating:
seq_dim is deprecated, use seq_axis instead
WARNING:tensorflow:From /dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py:454: calling reverse_sequence (from tensorflow.python.ops.array_ops) with batch_dim is deprecated and will be removed in a future version.
Instructions for updating:
batch_dim is deprecated, use batch_axis instead
WARNING:tensorflow:From /dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/snorkel/learning/tensorflow/noise_aware_model.py:77: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

/dfs/scratch0/paroma/anaconda2/envs/babble/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
[TextRNN] Training model
[TextRNN] n_train=629  #epochs=200  batch size=256
[TextRNN] Epoch 0 (1.05s)	Average loss=1.547467
[TextRNN] Epoch 5 (2.06s)	Average loss=0.177643
[TextRNN] Epoch 10 (3.15s)	Average loss=0.050926
[TextRNN] Epoch 15 (4.16s)	Average loss=0.051957
[TextRNN] Epoch 20 (5.14s)	Average loss=0.027945
[TextRNN] Epoch 25 (6.11s)	Average loss=0.025162
[TextRNN] Epoch 30 (7.07s)	Average loss=0.026350
[TextRNN] Epoch 35 (8.03s)	Average loss=0.037287
[TextRNN] Epoch 40 (8.95s)	Average loss=0.026180
[TextRNN] Epoch 45 (9.89s)	Average loss=0.028730
[TextRNN] Epoch 50 (10.80s)	Average loss=0.023424
[TextRNN] Epoch 55 (11.76s)	Average loss=0.023584
[TextRNN] Epoch 60 (12.70s)	Average loss=0.021186
[TextRNN] Epoch 65 (13.68s)	Average loss=0.028317
[TextRNN] Epoch 70 (14.62s)	Average loss=0.021332
[TextRNN] Epoch 75 (15.65s)	Average loss=0.020816
[TextRNN] Epoch 80 (16.63s)	Average loss=0.020187
[TextRNN] Epoch 85 (17.56s)	Average loss=0.020813
[TextRNN] Epoch 90 (18.49s)	Average loss=0.034834
[TextRNN] Epoch 95 (19.49s)	Average loss=0.019741
[TextRNN] Epoch 100 (20.49s)	Average loss=0.038676
[TextRNN] Epoch 105 (21.50s)	Average loss=0.025656
[TextRNN] Epoch 110 (22.48s)	Average loss=0.034486
[TextRNN] Epoch 115 (23.43s)	Average loss=0.021657
[TextRNN] Epoch 120 (24.31s)	Average loss=0.021864
[TextRNN] Epoch 125 (25.24s)	Average loss=0.018930
[TextRNN] Epoch 130 (26.20s)	Average loss=0.019719
[TextRNN] Epoch 135 (27.15s)	Average loss=0.020109
[TextRNN] Epoch 140 (28.11s)	Average loss=0.019114
[TextRNN] Epoch 145 (29.10s)	Average loss=0.018110
[TextRNN] Epoch 150 (30.09s)	Average loss=0.019963
[TextRNN] Epoch 155 (31.03s)	Average loss=0.018345
[TextRNN] Epoch 160 (31.99s)	Average loss=0.020997
[TextRNN] Epoch 165 (32.97s)	Average loss=0.019281
[TextRNN] Epoch 170 (33.84s)	Average loss=0.019457
[TextRNN] Epoch 175 (34.78s)	Average loss=0.017222
[TextRNN] Epoch 180 (35.77s)	Average loss=0.020338
[TextRNN] Epoch 185 (36.76s)	Average loss=0.020534
[TextRNN] Epoch 190 (37.72s)	Average loss=0.019646
[TextRNN] Epoch 195 (38.69s)	Average loss=0.020606
[TextRNN] Epoch 199 (39.46s)	Average loss=0.015863
[TextRNN] Training done (39.46s)

In [18]:
test_cands = session.query(Tweet).filter(Tweet.split == 1).order_by(Tweet.id).all()
accuracy = lstm.score(test_cands, test_cand_labels)
print("Accuracy: {:.10f}".format(accuracy))


Accuracy: 0.6666666667

We see that we're already close to the accuracy of an average crowd worker! If we wanted to improve the score, we could tune the LSTM model using grid search (see the Intro tutorial), use pre-trained word embeddings, or many other common techniques for getting state-of-the-art scores. Notably, we're doing this without using gold labels, but rather noisy crowd-labels!

For more, checkout the other tutorials!