Basic CNN part-of-speech tagger with Thinc

This notebook shows how to implement a basic CNN part-of-speech tagging model in Thinc (without external dependencies) and how to train it on the Universal Dependencies AnCora corpus. The tutorial shows three different workflows:

  1. Composing the model in code (basic usage)
  2. Composing the model via a config file only (mostly to demonstrate advanced usage of configs)
  3. Composing the model in code and configuring it via config (recommended)

In [ ]:
!pip install "thinc>=8.0.0a0" ml_datasets "tqdm>=4.41"

We start by making sure the computation is performed on GPU if available. prefer_gpu should be called right after importing Thinc, and it returns a boolean indicating whether the GPU has been activated.


In [ ]:
from thinc.api import prefer_gpu

prefer_gpu()

We also define the following helper functions for loading the data and for training and evaluating a given model. Don't forget to call model.initialize with a batch of input and output data to initialize the model and fill in any missing shapes.


In [ ]:
import ml_datasets
from tqdm.notebook import tqdm
from thinc.api import fix_random_seed

fix_random_seed(0)

def train_model(model, optimizer, n_iter, batch_size):
    (train_X, train_y), (dev_X, dev_y) = ml_datasets.ud_ancora_pos_tags()
    # Initialize the model with a sample batch so missing shapes can be inferred
    model.initialize(X=train_X[:5], Y=train_y[:5])
    for n in range(n_iter):
        loss = 0.0
        batches = model.ops.multibatch(batch_size, train_X, train_y, shuffle=True)
        for X, Y in tqdm(batches, leave=False):
            Yh, backprop = model.begin_update(X)
            # Squared-error loss: the gradient w.r.t. the output is just the
            # difference between the predicted and the true (one-hot) scores
            d_loss = []
            for i in range(len(Yh)):
                d_loss.append(Yh[i] - Y[i])
                loss += ((Yh[i] - Y[i]) ** 2).sum()
            backprop(d_loss)
            model.finish_update(optimizer)
        score = evaluate(model, dev_X, dev_y, batch_size)
        print(f"{n}\t{loss:.2f}\t{score:.3f}")

def evaluate(model, dev_X, dev_Y, batch_size):
    # Per-token accuracy: compare the argmax of the predicted scores against
    # the argmax of the one-hot tags
    correct = 0
    total = 0
    for X, Y in model.ops.multibatch(batch_size, dev_X, dev_Y):
        Yh = model.predict(X)
        for yh, y in zip(Yh, Y):
            correct += (y.argmax(axis=1) == yh.argmax(axis=1)).sum()
            total += y.shape[0]
    return float(correct / total)
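
Before training, it can help to peek at what ud_ancora_pos_tags returns. The following is a quick sanity check, a sketch assuming (as the helpers above do) that each X is a list of tokenized sentences and each y is a one-hot array with one row of tag scores per token:


In [ ]:
# Quick sanity check (a sketch): inspect the shapes of the AnCora data
(train_X, train_y), (dev_X, dev_y) = ml_datasets.ud_ancora_pos_tags()
print(len(train_X), len(dev_X))  # number of sentences per split
print(train_X[0][:5])            # first tokens of the first sentence
print(train_y[0].shape)          # (n_tokens, n_classes) one-hot tag array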

1. Composing the model in code

Here's the model definition, using the >> operator for the chain combinator. The strings2arrays transform converts a sequence of string sequences to a list of arrays. with_array transforms those sequences (here, the sequences of arrays) into a contiguous 2-dimensional array on the way into and out of the model it wraps. This means our model has the following signature: Model[Sequence[Sequence[str]], Sequence[Array2d]].


In [ ]:
from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax, Adam

width = 32
vector_width = 16
nr_classes = 17
learn_rate = 0.001
n_iter = 10
batch_size = 128

with Model.define_operators({">>": chain}):
    model = strings2arrays() >> with_array(
        HashEmbed(nO=width, nV=vector_width, column=0)
        >> expand_window(window_size=1)
        >> Relu(nO=width, nI=width * 3)
        >> Relu(nO=width, nI=width)
        >> Softmax(nO=nr_classes, nI=width)
    )
optimizer = Adam(learn_rate)
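
To make that signature concrete, here's a quick shape check, a sketch that runs the untrained model on a toy batch of two tokenized "sentences". Calling model.initialize with just X works here because all output dimensions are specified, and train_model re-initializes the model with real data anyway.


In [ ]:
# Shape check (a sketch): one Array2d of class scores per input sentence
toy_X = [["Sobre", "la", "mesa"], ["Hola"]]
model.initialize(X=toy_X)
print([y.shape for y in model.predict(toy_X)])  # [(3, 17), (1, 17)]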

In [ ]:
train_model(model, optimizer, n_iter, batch_size)

2. Composing the model via a config file

Thinc's config system lets you describe arbitrary trees of objects. The config can include values like hyperparameters or training settings, or references to functions and the values of their arguments. Thinc will then construct the config bottom-up, so you can define one function with its arguments, and then pass the return value into another function.
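
For example, here's a minimal config, a sketch using two of Thinc's built-in functions: the warmup_linear.v1 schedule is constructed first, and its return value is passed into Adam.v1 as the learn_rate argument.


In [ ]:
from thinc.api import Config, registry

SCHEDULE_CONFIG = """
[optimizer]
@optimizers = "Adam.v1"

[optimizer.learn_rate]
@schedules = "warmup_linear.v1"
initial_rate = 0.001
warmup_steps = 1000
total_steps = 10000
"""
# The [optimizer.learn_rate] block is resolved first; its return value
# (a schedule) becomes the learn_rate argument of Adam.v1
registry.make_from_config(Config().from_str(SCHEDULE_CONFIG))["optimizer"]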

If we want to rebuild the model defined above in a config file, we first need to break down its structure:

  • chain (any number of positional arguments)
    • strings2arrays (no arguments)
    • with_array (one argument layer)
      • layer: chain (any number of positional arguments)
        • HashEmbed
        • expand_window
        • Relu
        • Relu
        • Softmax

chain takes a variable number of positional arguments (the layers to compose). In the config, positional arguments can be expressed using * in the dot notation. For example, model.layer could describe a function passed to model as the argument layer, while model.*.relu defines a positional argument passed to model. The name of the argument, relu in this example, doesn't matter in this case; it just needs to be unique.
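
To see the * notation in isolation, here's a tiny sketch that chains two of Thinc's built-in Relu.v1 layers. The names relu1 and relu2 are arbitrary; both blocks are passed to chain.v1 as positional arguments, in order.


In [ ]:
from thinc.api import Config, registry

STAR_CONFIG = """
[model]
@layers = "chain.v1"

[model.*.relu1]
@layers = "Relu.v1"
nO = 4
nI = 4

[model.*.relu2]
@layers = "Relu.v1"
nO = 2
nI = 4
"""
# The resolved model is equivalent to chain(Relu(4, 4), Relu(2, 4))
registry.make_from_config(Config().from_str(STAR_CONFIG))["model"]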

⚠️ Important note: This example is mostly intended to show what's possible. We don't recommend "programming via config files" as shown here, since it doesn't really solve any problem and makes the model definition just as complicated. Instead, we recommend a hybrid approach: wrap the model definition in a registered function and configure it via the config.


In [ ]:
CONFIG = """
[hyper_params]
width = 32
vector_width = 16
learn_rate = 0.001

[training]
n_iter = 10
batch_size = 128

[model]
@layers = "chain.v1"

[model.*.strings2arrays]
@layers = "strings2arrays.v1"

[model.*.with_array]
@layers = "with_array.v1"

[model.*.with_array.layer]
@layers = "chain.v1"

[model.*.with_array.layer.*.hashembed]
@layers = "HashEmbed.v1"
nO = ${hyper_params:width}
nV = ${hyper_params:vector_width}
column = 0

[model.*.with_array.layer.*.expand_window]
@layers = "expand_window.v1"
window_size = 1

[model.*.with_array.layer.*.relu1]
@layers = "Relu.v1"
nO = ${hyper_params:width}
# 96 = width * 3: expand_window with window_size = 1 concatenates each vector
# with its two neighbours, and interpolation can't express the arithmetic
nI = 96

[model.*.with_array.layer.*.relu2]
@layers = "Relu.v1"
nO = ${hyper_params:width}
nI = ${hyper_params:width}

[model.*.with_array.layer.*.softmax]
@layers = "Softmax.v1"
nO = 17
nI = ${hyper_params:width}

[optimizer]
@optimizers = "Adam.v1"
learn_rate = ${hyper_params:learn_rate}
"""

When the config is loaded, it's first parsed as a dictionary, and all references to values from other sections, e.g. ${hyper_params:width}, are replaced. The result is a nested dictionary describing the objects defined in the config.


In [ ]:
from thinc.api import registry, Config

config = Config().from_str(CONFIG)
config

registry.make_from_config then creates the objects and calls the functions bottom-up.


In [ ]:
C = registry.make_from_config(config)
C

We now have a model, optimizer and training settings, built from the config, and can use them to train the model.


In [ ]:
model = C["model"]
optimizer = C["optimizer"]
n_iter = C["training"]["n_iter"]
batch_size = C["training"]["batch_size"]
train_model(model, optimizer, n_iter, batch_size)

3. Composing the model with code and config

The @thinc.registry decorator lets you register your own layers and model definitions, which can then be referenced in config files. This approach gives you the most flexibility, while also keeping your config and model definitions concise.

💡 The function you register will be filled in by the config – e.g. the value of width defined in the config block will be passed in as the argument width. If arguments are missing, you'll see a validation error. If you're using type hints in the function, the values will be parsed to ensure they always have the right type. If they're invalid – e.g. if you're passing in a list as the value of width – you'll see an error. This makes it easier to prevent bugs caused by incorrect values lower down in the network.
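
To see this validation in action, here's a sketch using a hypothetical toy function, toy_fn.v1, registered just for illustration. Passing a list where the type hint says int makes make_from_config raise a validation error.


In [ ]:
import thinc
from thinc.api import Config, registry

@thinc.registry.layers("toy_fn.v1")  # hypothetical, for illustration only
def toy_fn(width: int):
    return width * 2

BAD_CONFIG = """
[doubled]
@layers = "toy_fn.v1"
width = [1, 2]
"""
try:
    registry.make_from_config(Config().from_str(BAD_CONFIG))
except Exception as err:
    print(f"{type(err).__name__}: {err}")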


In [ ]:
import thinc
from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax

@thinc.registry.layers("cnn_tagger.v1")
def create_cnn_tagger(width: int, vector_width: int, nr_classes: int = 17):
    with Model.define_operators({">>": chain}):
        model = strings2arrays() >> with_array(
            HashEmbed(nO=width, nV=vector_width, column=0)
            >> expand_window(window_size=1)
            >> Relu(nO=width, nI=width * 3)
            >> Relu(nO=width, nI=width)
            >> Softmax(nO=nr_classes, nI=width)
        )
    return model

The config would then only need to define one model block with @layers = "cnn_tagger.v1" and the function arguments. Whether you move them out to a section like [hyper_params] or just hard-code them into the block is up to you. The advantage of a separate section is that the values are preserved in the parsed config object (and not just passed into the function), so you can always print and view them.


In [ ]:
CONFIG = """
[hyper_params]
width = 32
vector_width = 16
learn_rate = 0.001

[training]
n_iter = 10
batch_size = 128

[model]
@layers = "cnn_tagger.v1"
width = ${hyper_params:width}
vector_width = ${hyper_params:vector_width}
nr_classes = 17

[optimizer]
@optimizers = "Adam.v1"
learn_rate = ${hyper_params:learn_rate}
"""

In [ ]:
C = registry.make_from_config(Config().from_str(CONFIG))
C

In [ ]:
model = C["model"]
optimizer = C["optimizer"]
n_iter = C["training"]["n_iter"]
batch_size = C["training"]["batch_size"]
train_model(model, optimizer, n_iter, batch_size)
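
Finally, a quick usage sketch: once trained, the model tags new text directly via predict. The class ids index into the 17 Universal Dependencies POS tags, though this notebook doesn't keep the id-to-tag mapping around.


In [ ]:
# Tag a new sentence (a sketch): argmax over the score rows gives one
# predicted class id per token
doc = [["Esta", "es", "una", "frase"]]
scores = model.predict(doc)[0]
print(scores.argmax(axis=1))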