In [ ]:
__author__ = "Bill MacCartney and Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
This homework and associated bake-off are devoted to developing effective relation extraction systems using distant supervision.
As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system to enter into the bake-off.
See the first notebook in this unit for set-up instructions.
In [ ]:
import numpy as np
import os
import rel_ext
from sklearn.linear_model import LogisticRegression
import utils
As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:
In [ ]:
rel_ext_data_home = os.path.join('data', 'rel_ext_data')
In [ ]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))
In [ ]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))
In [ ]:
dataset = rel_ext.Dataset(corpus, kb)
You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test set, so all of the data in dataset is fair game:
In [ ]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)
In [ ]:
splits
In [ ]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter
In [ ]:
featurizers = [simple_bag_of_words_featurizer]
In [ ]:
model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')
In [ ]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    verbose=True)
Studying model weights might yield insights:
In [ ]:
rel_ext.examine_model_weights(baseline_results)
This simple baseline sums the GloVe vector representations for all of the words in the "middle" span and feeds those representations into the standard LogisticRegression-based model_factory. The crucial parameter that enables this is vectorize=False. This essentially says to rel_ext.experiment that your featurizer or your model will do the work of turning examples into vectors; in that case, rel_ext.experiment just organizes these representations by relation type.
In [ ]:
GLOVE_HOME = os.path.join('data', 'glove.6B')
In [ ]:
glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))
In [ ]:
def glove_middle_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    # A random representation of the right dimensionality if the
    # example happens not to overlap with GloVe's vocabulary:
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))
        return utils.randvec(n=dim)
    else:
        return np_func(reps, axis=0)
In [ ]:
glove_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[glove_middle_featurizer],
    vectorize=False,  # Crucial for this featurizer!
    verbose=True)
With the same basic code design, one can also use the PyTorch models included in the course repo, or write new ones that are better aligned with the task. For those models, it's likely that the featurizer will just return a list of tokens (or perhaps a list of lists of tokens), and the model will map those into vectors using an embedding.
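As a rough sketch of that design, a featurizer intended for use with vectorize=False might simply collect the middle tokens for each example and leave the embedding step to the model. The names middle_tokens_featurizer, ToyCorpus, ToyExample and ToyTriple below are hypothetical and exist only to keep the snippet self-contained; with the real data you would pass the corpus object built above, and the model would need to accept lists of tokens.

```python
from collections import namedtuple

# Hypothetical stand-ins so the sketch runs without the course data.
ToyExample = namedtuple('ToyExample', ['middle'])
ToyTriple = namedtuple('ToyTriple', ['rel', 'sbj', 'obj'])


class ToyCorpus:
    """Minimal imitation of the `get_examples_for_entities` lookup."""

    def __init__(self, examples):
        # Maps (first_entity, second_entity) to a list of examples.
        self.examples = examples

    def get_examples_for_entities(self, e1, e2):
        return self.examples.get((e1, e2), [])


def middle_tokens_featurizer(kbt, corpus):
    """Return a list of token lists, one per example, covering both
    entity orders. A downstream model would be responsible for
    mapping these tokens to vectors."""
    token_lists = []
    for pair in [(kbt.sbj, kbt.obj), (kbt.obj, kbt.sbj)]:
        for ex in corpus.get_examples_for_entities(*pair):
            token_lists.append(ex.middle.split())
    return token_lists


toy_corpus = ToyCorpus({
    ('xkcd', 'Randall_Munroe'): [ToyExample('is a webcomic created by')]})
toy_kbt = ToyTriple('worked_at', 'Randall_Munroe', 'xkcd')
print(middle_tokens_featurizer(toy_kbt, toy_corpus))
```

Returning one token list per example, rather than a single flat list, keeps the example boundaries available to a model that wants to encode each sentence separately.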
The code in rel_ext makes it very easy to experiment with other classifier models: one need only redefine the model_factory argument. This question asks you to assess a Support Vector Classifier.
To submit: A wrapper function run_svm_model_factory that does the following:
1. Uses rel_ext.experiment with the model factory set to one based on an SVC with kernel='linear' and all other arguments left with default values.
2. Trains on the train part of splits.
3. Assesses on the dev part of splits.
4. Uses featurizers as defined above.
5. Returns the return value of rel_ext.experiment for this set-up.
The function test_run_svm_model_factory will check that your function conforms to these general specifications.
In [ ]:
def run_svm_model_factory():
    ##### YOUR CODE HERE
In [ ]:
def test_run_svm_model_factory(run_svm_model_factory):
    results = run_svm_model_factory()
    assert 'featurizers' in results, \
        "The return value of `run_svm_model_factory` seems not to be correct"
    # Check one of the models to make sure it's an SVC:
    assert 'SVC' in results['models']['adjoins'].__class__.__name__, \
        "It looks like the model factory wasn't set to use an SVC."
In [ ]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_svm_model_factory(run_svm_model_factory)
The current bag-of-words representation makes no distinction between "forward" and "reverse" examples. But, intuitively, there is a big difference between "X and his son Y" and "Y and his son X". This question asks you to modify simple_bag_of_words_featurizer to capture these differences.
To submit:
1. A feature function directional_bag_of_words_featurizer that is just like simple_bag_of_words_featurizer except that it distinguishes "forward" and "reverse". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example. The included function test_directional_bag_of_words_featurizer should help verify that you've done this correctly.
2. A call to rel_ext.experiment with directional_bag_of_words_featurizer as the only featurizer. (Aside from this, use all the default values for rel_ext.experiment as exemplified above in this notebook.)
3. rel_ext.experiment returns some of the core objects used in the experiment. How many feature names does the vectorizer have for the experiment run in the previous step? Include the code needed for getting this value. (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it!)
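To make the idea of direction-marked features concrete, here is a toy illustration that works on plain strings rather than on the course corpus object. The helper name count_with_suffix and the example middles are made up for illustration; only the "_SO" and "_OS" suffixes come from the starter code for this question.

```python
from collections import defaultdict


def count_with_suffix(middles, suffix, feature_counter):
    """Add one count per word in each middle span, tagging every
    key with `suffix` so the two entity orders stay distinct."""
    for middle in middles:
        for word in middle.split(' '):
            feature_counter[word + suffix] += 1
    return feature_counter


counts = defaultdict(int)
# Hypothetical middles: one subject-object and one object-subject example.
count_with_suffix(['and his son'], '_SO', counts)
count_with_suffix(['is the son of'], '_OS', counts)
print(dict(counts))
```

Note that "son" now yields two separate features, son_SO and son_OS, so a linear model can assign them different weights.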
In [ ]:
def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):
    # Append these to the end of the keys you add/access in
    # `feature_counter` to distinguish the two orders. You'll
    # need to use exactly these strings in order to pass
    # `test_directional_bag_of_words_featurizer`.
    subject_object_suffix = "_SO"
    object_subject_suffix = "_OS"

    ##### YOUR CODE HERE

    return feature_counter


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE
In [ ]:
def test_directional_bag_of_words_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['is_OS'] += 5
    feature_counter = directional_bag_of_words_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'is_OS': 6, 'a_OS': 1, 'webcomic_OS': 1, 'created_OS': 1, 'by_OS': 1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)
In [ ]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_directional_bag_of_words_featurizer(corpus)
Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on middle_POS.
To submit:
1. A feature function middle_bigram_pos_tag_featurizer that is just like simple_bag_of_words_featurizer except that it creates a feature for bigram POS sequences. For example, given
The/DT dog/N napped/V
we obtain the list of bigram POS sequences
b = ['<s> DT', 'DT N', 'N V', 'V </s>'].
Of course, middle_bigram_pos_tag_featurizer should return count dictionaries defined in terms of such bigram POS lists, on the model of simple_bag_of_words_featurizer. Don't forget the start and end tags, to model those environments properly! The included function test_middle_bigram_pos_tag_featurizer should help verify that you've done this correctly.
2. A call to rel_ext.experiment with middle_bigram_pos_tag_featurizer as the only featurizer. (Aside from this, use all the default values for rel_ext.experiment as exemplified above in this notebook.)
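The padding-and-pairing step described above can be illustrated on the toy sentence from the question. This is only one possible way to form the bigrams, and the function name pos_bigrams is a hypothetical choice for this sketch; it is not part of rel_ext.

```python
def pos_bigrams(tags, start_symbol='<s>', end_symbol='</s>'):
    """Pad the tag sequence with boundary symbols and return the
    adjacent pairs as space-joined strings."""
    padded = [start_symbol] + list(tags) + [end_symbol]
    return [' '.join(pair) for pair in zip(padded, padded[1:])]


tagged = 'The/DT dog/N napped/V'
# Split each word/POS item on its rightmost slash and keep the tag.
tags = [item.rsplit('/', 1)[1] for item in tagged.split()]
print(pos_bigrams(tags))  # ['<s> DT', 'DT N', 'N V', 'V </s>']
```

With this padding, an empty tag sequence still produces the single bigram '<s> </s>'; whether that is desirable for empty middles is a design decision left to you.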
In [ ]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):

    ##### YOUR CODE HERE

    return feature_counter


def get_tag_bigrams(s):
    """Suggested helper method for `middle_bigram_pos_tag_featurizer`.
    This should be defined so that it returns a list of str, where each
    element is a POS bigram."""
    # The values of `start_symbol` and `end_symbol` are defined
    # here so that you can use `test_middle_bigram_pos_tag_featurizer`.
    start_symbol = "<s>"
    end_symbol = "</s>"

    ##### YOUR CODE HERE


def get_tags(s):
    """Given a sequence of word/POS elements (lemmas), this function
    returns a list containing just the POS elements, in order.
    """
    return [parse_lem(lem)[1] for lem in s.strip().split(' ') if lem]


def parse_lem(lem):
    """Helper method for parsing word/POS elements. It just splits
    on the rightmost / and returns [word, POS] as a list of str."""
    return lem.strip().rsplit('/', 1)


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE
In [ ]:
def test_middle_bigram_pos_tag_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['<s> VBZ'] += 5
    feature_counter = middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'<s> VBZ': 6, 'VBZ DT': 1, 'DT JJ': 1, 'JJ VBN': 1, 'VBN IN': 1, 'IN </s>': 1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)
In [ ]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_middle_bigram_pos_tag_featurizer(corpus)
The following uses NLTK's WordNet API to get the synsets compatible with dog when it is used as a noun:
from nltk.corpus import wordnet as wn
dog = wn.synsets('dog', pos='n')
dog
[Synset('dog.n.01'),
Synset('frump.n.01'),
Synset('dog.n.03'),
Synset('cad.n.01'),
Synset('frank.n.02'),
Synset('pawl.n.01'),
Synset('andiron.n.01')]
This question asks you to create synset-based features from the word/tag pairs in middle_POS.
To submit:
1. A feature function synset_featurizer that is just like simple_bag_of_words_featurizer except that it returns a list of synsets derived from middle_POS. Stringify these objects with str so that they can be dict keys. Use convert_tag (included below) to convert tags to pos arguments usable by wn.synsets. The included function test_synset_featurizer should help verify that you've done this correctly.
2. A call to rel_ext.experiment with synset_featurizer as the only featurizer. (Aside from this, use all the default values for rel_ext.experiment.)
In [ ]:
from nltk.corpus import wordnet as wn


def synset_featurizer(kbt, corpus, feature_counter):

    ##### YOUR CODE HERE

    return feature_counter


def get_synsets(s):
    """Suggested helper method for `synset_featurizer`. This should
    be completed so that it returns a list of stringified Synsets
    associated with elements of `s`.
    """
    # Use `parse_lem` from the previous question to get a list of
    # (word, POS) pairs. Remember to convert the POS strings.
    wt = [parse_lem(lem) for lem in s.strip().split(' ') if lem]

    ##### YOUR CODE HERE


def convert_tag(t):
    """Converts tags so that they can be used by WordNet:

    | Tag begins with | WordNet tag |
    |-----------------|-------------|
    | `N`             | `n`         |
    | `V`             | `v`         |
    | `J`             | `a`         |
    | `R`             | `r`         |
    | Otherwise       | `None`      |
    """
    if t[0].lower() in {'n', 'v', 'r'}:
        return t[0].lower()
    elif t[0].lower() == 'j':
        return 'a'
    else:
        return None


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE
In [ ]:
def test_synset_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter["Synset('be.v.01')"] += 5
    feature_counter = synset_featurizer(kbt, corpus, feature_counter)
    # The full return values for this tend to be long, so we just
    # test a few examples to avoid cluttering up this notebook.
    test_cases = {
        "Synset('be.v.01')": 6,
        "Synset('embody.v.02')": 1
    }
    for ss, expected in test_cases.items():
        result = feature_counter[ss]
        assert result == expected, \
            "Incorrect count for {}: Expected {}; Got {}".format(ss, expected, result)
In [ ]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_synset_featurizer(corpus)
There are many options, and this could easily grow into a project. Here are a few ideas:
sklearn and elsewhere.
In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.
In [ ]:
# Enter your system description in this cell.
# Please do not remove this comment.
For the bake-off, we will release a test set. The announcement will go out on the discussion forum. You will evaluate your custom model from the previous question on this new test set using the function rel_ext.bake_off_experiment. Rules:
1. The cells below this one constitute your bake-off entry.
2. People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.
3. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.
The announcement will include the details on where to submit your entry.
In [ ]:
# Enter your bake-off assessment code in this cell.
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your code in the scope of the above conditional.
    ##### YOUR CODE HERE
In [ ]:
# On an otherwise blank line in this cell, please enter
# your macro-average f-score (an F_0.5 score) as reported
# by the code above. Please enter only a number between
# 0 and 1 inclusive. Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your score in the scope of the above conditional.
    ##### YOUR CODE HERE
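For reference, the macro-averaged F_0.5 score mentioned in the comment above can be sketched as follows. This assumes the standard F_beta definition and an unweighted mean over relations; it is not a copy of how rel_ext computes its report, and the precision and recall values are invented for illustration.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F_beta: a weighted harmonic mean of precision and
    recall. With beta < 1, precision counts for more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_average(scores):
    """Unweighted mean of per-relation scores."""
    return sum(scores) / len(scores)


# Hypothetical per-relation (precision, recall) pairs.
per_relation = {'adjoins': (0.8, 0.4), 'author': (0.5, 0.5)}
scores = [f_beta(p, r) for p, r in per_relation.values()]
print(round(macro_average(scores), 4))
```

Because the average is unweighted, a rare relation influences the final number as much as a frequent one.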