This notebook shows how to implement a basic CNN part-of-speech tagging model in Thinc (without external dependencies) and train it on the Universal Dependencies AnCora corpus. The tutorial shows three different workflows:

1. Composing the model in code.
2. Composing the model almost entirely via a config file.
3. A hybrid approach: wrapping the model definition in a registered function and configuring it via the config.
In [ ]:
!pip install "thinc>=8.0.0a0" ml_datasets "tqdm>=4.41"
We start by making sure the computation is performed on the GPU, if one is available. prefer_gpu should be called right after importing Thinc, and it returns a boolean indicating whether the GPU has been activated.
In [ ]:
from thinc.api import prefer_gpu
prefer_gpu()
We also define the following helper functions for loading the data, and training and evaluating a given model. Don't forget to call model.initialize with a batch of input and output data to initialize the model and fill in any missing shapes.
In [ ]:
import ml_datasets
from tqdm.notebook import tqdm
from thinc.api import fix_random_seed
fix_random_seed(0)
def train_model(model, optimizer, n_iter, batch_size):
    (train_X, train_y), (dev_X, dev_y) = ml_datasets.ud_ancora_pos_tags()
    # Initialize with a sample batch so any missing shapes can be inferred
    model.initialize(X=train_X[:5], Y=train_y[:5])
    for n in range(n_iter):
        loss = 0.0
        batches = model.ops.multibatch(batch_size, train_X, train_y, shuffle=True)
        for X, Y in tqdm(batches, leave=False):
            Yh, backprop = model.begin_update(X)
            d_loss = []
            for i in range(len(Yh)):
                # Gradient of the squared-error loss with respect to the output
                d_loss.append(Yh[i] - Y[i])
                loss += ((Yh[i] - Y[i]) ** 2).sum()
            backprop(d_loss)
            model.finish_update(optimizer)
        score = evaluate(model, dev_X, dev_y, batch_size)
        print(f"{n}\t{loss:.2f}\t{score:.3f}")
def evaluate(model, dev_X, dev_Y, batch_size):
    correct = 0
    total = 0
    for X, Y in model.ops.multibatch(batch_size, dev_X, dev_Y):
        Yh = model.predict(X)
        for yh, y in zip(Yh, Y):
            # Compare the predicted tag (argmax) against the gold tag per token
            correct += (y.argmax(axis=1) == yh.argmax(axis=1)).sum()
            total += y.shape[0]
    return float(correct / total)
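The multibatch helper used above pairs up the inputs and labels and yields matching slices of both in batches. Here's a quick sketch of its behavior on toy data; it uses get_current_ops to get a backend ops object, since no model is defined yet at this point:

In [ ]:
from thinc.api import get_current_ops

ops = get_current_ops()
# Each batch pairs a slice of the inputs with the matching slice of the labels:
for X, Y in ops.multibatch(2, ["a", "b", "c"], [1, 2, 3]):
    print(X, Y)  # ['a', 'b'] [1, 2], then ['c'] [3]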
Here's the model definition, using the >> operator for the chain combinator. The strings2arrays transform converts a sequence of strings to a list of arrays. with_array transforms sequences (here, the sequences of arrays) into a contiguous 2-dimensional array on the way into and out of the model it wraps. This means our model has the following signature: Model[Sequence[str], Sequence[Array2d]].
In [ ]:
from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax, Adam, warmup_linear
width = 32
vector_width = 16
nr_classes = 17
learn_rate = 0.001
n_iter = 10
batch_size = 128
with Model.define_operators({">>": chain}):
    model = strings2arrays() >> with_array(
        HashEmbed(nO=width, nV=vector_width, column=0)
        >> expand_window(window_size=1)
        >> Relu(nO=width, nI=width * 3)
        >> Relu(nO=width, nI=width)
        >> Softmax(nO=nr_classes, nI=width)
    )
optimizer = Adam(learn_rate)
In [ ]:
train_model(model, optimizer, n_iter, batch_size)
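Once trained, the model can tag new tokenized text directly. A small usage sketch (the sentence is made up, and the output indices map to the corpus' 17 tag classes):

In [ ]:
words = [["El", "gato", "duerme"]]  # hypothetical tokenized sentence
tag_probs = model.predict(words)  # one (n_tokens, nr_classes) array per sentence
tag_probs[0].argmax(axis=1)  # most probable tag index for each token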
Thinc's config system lets you describe arbitrary trees of objects. The config can include values like hyperparameters or training settings, or references to functions and the values of their arguments. Thinc will then construct the config bottom-up – so you can define one function with its arguments, and then pass the return value into another function.
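As a minimal illustration of this bottom-up construction (a hypothetical two-section config, using only functions that also appear later in this tutorial): the interpolated value is filled in first, and the [optimizer] block is then built from it.

In [ ]:
from thinc.api import Config, registry

MINI_CONFIG = """
[hyper_params]
learn_rate = 0.001

[optimizer]
@optimizers = "Adam.v1"
learn_rate = ${hyper_params:learn_rate}
"""

C = registry.make_from_config(Config().from_str(MINI_CONFIG))
C["optimizer"]  # an Adam optimizer instance, constructed from the config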
If we want to rebuild the model defined above in a config file, we first need to break down its structure:

- chain (any number of positional arguments)
  - strings2arrays (no arguments)
  - with_array (one argument layer)
    - layer: chain (any number of positional arguments)
      - HashEmbed
      - expand_window
      - Relu
      - Relu
      - Softmax
chain takes a variable number of positional arguments (the layers to compose). In the config, positional arguments can be expressed using * in the dot notation. For example, model.layer could describe a function passed to model as the argument layer, while model.*.relu defines a positional argument passed to model. The name of the argument, e.g. relu, doesn't matter in this case – it just needs to be unique.
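For example, a hypothetical config composing two layers positionally could look like this (the section names embed and output, and the shape values, are chosen freely – it resolves to chain(HashEmbed(...), Softmax(...))):

[model]
@layers = "chain.v1"

[model.*.embed]
@layers = "HashEmbed.v1"
nO = 8
nV = 100
column = 0

[model.*.output]
@layers = "Softmax.v1"
nO = 2
nI = 8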
⚠️ Important note: This example is mostly intended to show what's possible. We don't recommend "programming via config files" as shown here, since it doesn't really solve any problem and makes the model definition just as complicated. Instead, we recommend a hybrid approach: wrap the model definition in a registered function and configure it via the config.
In [ ]:
CONFIG = """
[hyper_params]
width = 32
vector_width = 16
learn_rate = 0.001
[training]
n_iter = 10
batch_size = 128
[model]
@layers = "chain.v1"
[model.*.strings2arrays]
@layers = "strings2arrays.v1"
[model.*.with_array]
@layers = "with_array.v1"
[model.*.with_array.layer]
@layers = "chain.v1"
[model.*.with_array.layer.*.hashembed]
@layers = "HashEmbed.v1"
nO = ${hyper_params:width}
nV = ${hyper_params:vector_width}
column = 0
[model.*.with_array.layer.*.expand_window]
@layers = "expand_window.v1"
window_size = 1
[model.*.with_array.layer.*.relu1]
@layers = "Relu.v1"
nO = ${hyper_params:width}
nI = 96
[model.*.with_array.layer.*.relu2]
@layers = "Relu.v1"
nO = ${hyper_params:width}
nI = ${hyper_params:width}
[model.*.with_array.layer.*.softmax]
@layers = "Softmax.v1"
nO = 17
nI = ${hyper_params:width}
[optimizer]
@optimizers = "Adam.v1"
learn_rate = ${hyper_params:learn_rate}
"""
When the config is loaded, it's first parsed as a dictionary, and all references to values from other sections, e.g. ${hyper_params:width}, are replaced. The result is a nested dictionary describing the objects defined in the config.
In [ ]:
from thinc.api import registry, Config
config = Config().from_str(CONFIG)
config
registry.make_from_config then creates the objects and calls the functions bottom-up.
In [ ]:
C = registry.make_from_config(config)
C
We now have a model, optimizer and training settings, built from the config, and can use them to train the model.
In [ ]:
model = C["model"]
optimizer = C["optimizer"]
n_iter = C["training"]["n_iter"]
batch_size = C["training"]["batch_size"]
train_model(model, optimizer, n_iter, batch_size)
The @thinc.registry decorator lets you register your own layers and model definitions, which can then be referenced in config files. This approach gives you the most flexibility, while also keeping your config and model definitions concise.
💡 The function you register will be filled in by the config – e.g. the value of width defined in the config block will be passed in as the argument width. If arguments are missing, you'll see a validation error. If you're using type hints in the function, the values will be parsed to ensure they always have the right type. If they're invalid – e.g. if you're passing in a list as the value of width – you'll see an error. This makes it easier to prevent bugs caused by incorrect values lower down in the network.
In [ ]:
import thinc
from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax, Adam, warmup_linear
@thinc.registry.layers("cnn_tagger.v1")
def create_cnn_tagger(width: int, vector_width: int, nr_classes: int = 17):
    with Model.define_operators({">>": chain}):
        model = strings2arrays() >> with_array(
            HashEmbed(nO=width, nV=vector_width, column=0)
            >> expand_window(window_size=1)
            >> Relu(nO=width, nI=width * 3)
            >> Relu(nO=width, nI=width)
            >> Softmax(nO=nr_classes, nI=width)
        )
    return model
The config would then only need to define one model block with @layers = "cnn_tagger.v1" and the function arguments. Whether you move them out to a section like [hyper_params] or just hard-code them into the block is up to you. The advantage of a separate section is that the values are preserved in the parsed config object (and not just passed into the function), so you can always print and view them.
In [ ]:
CONFIG = """
[hyper_params]
width = 32
vector_width = 16
learn_rate = 0.001
[training]
n_iter = 10
batch_size = 128
[model]
@layers = "cnn_tagger.v1"
width = ${hyper_params:width}
vector_width = ${hyper_params:vector_width}
nr_classes = 17
[optimizer]
@optimizers = "Adam.v1"
learn_rate = ${hyper_params:learn_rate}
"""
In [ ]:
C = registry.make_from_config(Config().from_str(CONFIG))
C
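As noted above, the values from [hyper_params] come through as a plain section in the resolved output (they aren't only passed into the functions), so they can be inspected directly:

In [ ]:
C["hyper_params"]  # {'width': 32, 'vector_width': 16, 'learn_rate': 0.001}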
In [ ]:
model = C["model"]
optimizer = C["optimizer"]
n_iter = C["training"]["n_iter"]
batch_size = C["training"]["batch_size"]
train_model(model, optimizer, n_iter, batch_size)
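Finally, to illustrate the validation behavior described in the note above: because create_cnn_tagger annotates width as an int, resolving a config that passes an incompatible value fails with a validation error instead of building a broken model. A hedged sketch (the bad value is made up; the resolving line is left commented out so the notebook runs through):

In [ ]:
BAD_CONFIG = """
[model]
@layers = "cnn_tagger.v1"
width = [1, 2, 3]
vector_width = 16
"""
# Uncommenting the next line raises a config validation error for `width`:
# registry.make_from_config(Config().from_str(BAD_CONFIG))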