MatchZoo Quick Start


In [1]:
import matchzoo as mz
print(mz.__version__)


Using TensorFlow backend.
2.1.0

Define Task

There are two types of tasks available in MatchZoo: mz.tasks.Ranking and mz.tasks.Classification. We will use a ranking task for this demo.


In [2]:
task = mz.tasks.Ranking()
print(task)


Ranking Task

Prepare Data


In [3]:
train_raw = mz.datasets.toy.load_data(stage='train', task=task)
test_raw = mz.datasets.toy.load_data(stage='test', task=task)

In [4]:
type(train_raw)


Out[4]:
matchzoo.data_pack.data_pack.DataPack

DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three pandas.DataFrame objects: left, right, and relation.


In [5]:
train_raw.left.head()


Out[5]:
text_left
id_left
Q1 how are glacier caves formed?
Q2 How are the directions of the velocity and for...
Q5 how did apollo creed die
Q6 how long is the term for federal judges
Q7 how a beretta model 21 pistols magazines works

In [6]:
train_raw.right.head()


Out[6]:
text_right
id_right
D1-0 A partly submerged glacier cave on Perito More...
D1-1 The ice facade is approximately 60 m high
D1-2 Ice formations in the Titlis glacier cave
D1-3 A glacier cave is a cave formed within the ice...
D1-4 Glacier caves are often called ice caves , but...

In [7]:
train_raw.relation.head()


Out[7]:
id_left id_right label
0 Q1 D1-0 0.0
1 Q1 D1-1 0.0
2 Q1 D1-2 0.0
3 Q1 D1-3 1.0
4 Q1 D1-4 0.0

It is also possible to convert a DataPack into a single pandas.DataFrame that holds all information.


In [8]:
train_raw.frame().head()


Out[8]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... 0.0
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high 0.0
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave 0.0
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... 1.0
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... 0.0

However, such a pandas.DataFrame consumes much more memory when the texts contain many duplicates, which is exactly why we use DataPack. For more details about data handling, consult matchzoo/tutorials/data_handling.ipynb.
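The memory argument can be made concrete with a small sketch. It uses plain Python containers rather than MatchZoo or pandas, the toy texts and the helper name total_text_chars are made up for illustration, and it only counts how many characters of text each layout would hold if every row kept its own copy.

```python
# Toy comparison of a normalized (DataPack-like) layout and a flat, joined layout.
# Hypothetical data; this does not use MatchZoo's actual DataPack class.

left = {"Q1": "how are glacier caves formed?"}
right = {
    "D1-0": "A partly submerged glacier cave",
    "D1-1": "The ice facade is approximately 60 m high",
    "D1-2": "Ice formations in the Titlis glacier cave",
}
# Each relation row stores only ids and a label, never the text itself.
relation = [("Q1", "D1-0", 0.0), ("Q1", "D1-1", 0.0), ("Q1", "D1-2", 1.0)]


def total_text_chars(texts):
    """Count the characters of text that a layout has to hold."""
    return sum(len(text) for text in texts)


# Normalized layout: every distinct text is stored once.
normalized_chars = total_text_chars(left.values()) + total_text_chars(right.values())

# Flat layout: the left text is repeated on every row it participates in.
flat_rows = [
    (left[id_left], right[id_right], label) for id_left, id_right, label in relation
]
flat_chars = total_text_chars(row[0] for row in flat_rows) + total_text_chars(
    row[1] for row in flat_rows
)

print("normalized:", normalized_chars)
print("flat:", flat_chars)
```

The gap grows with the number of candidates per query, since each extra relation row repeats the full query text in the flat layout.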

Preprocessing

MatchZoo preprocessors are used to convert a raw DataPack into a DataPack that is ready to be fed into a model.


In [9]:
preprocessor = mz.preprocessors.BasicPreprocessor()

There are two steps to using a preprocessor: first fit, then transform. fit only changes the preprocessor's inner state; it does not change the input DataPack.


In [10]:
preprocessor.fit(train_raw)


Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 1194.41it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4426.94it/s]
Processing text_right with append: 100%|██████████| 100/100 [00:00<00:00, 160516.80it/s]
Building FrequencyFilter from a datapack.: 100%|██████████| 100/100 [00:00<00:00, 69742.33it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 96067.43it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 14364.05it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 129854.61it/s]
Building Vocabulary from a datapack.: 100%|██████████| 1665/1665 [00:00<00:00, 1703712.16it/s]
Out[10]:
<matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor at 0x1333081d0>

fit gathers useful information into the preprocessor's context, which is used later by transform or to set the hyper-parameters of a model.


In [11]:
preprocessor.context


Out[11]:
{'filter_unit': <matchzoo.preprocessors.units.frequency_filter.FrequencyFilter at 0x133308a20>,
 'vocab_unit': <matchzoo.preprocessors.units.vocabulary.Vocabulary at 0x133515cf8>,
 'vocab_size': 285,
 'embedding_input_dim': 285,
 'input_shapes': [(30,), (30,)]}

Once fit, the preprocessor has enough information to transform. transform changes neither the preprocessor's inner state nor the input DataPack; it returns a new, transformed DataPack.


In [12]:
train_processed = preprocessor.transform(train_raw)
test_processed = preprocessor.transform(test_raw)


Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 6229.40it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4721.45it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 38168.20it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 20127.70it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 106158.04it/s]
Processing length_left with len: 100%|██████████| 13/13 [00:00<00:00, 20568.07it/s]
Processing length_right with len: 100%|██████████| 100/100 [00:00<00:00, 146398.05it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 24954.67it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 66010.45it/s]
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 3/3 [00:00<00:00, 1892.74it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 20/20 [00:00<00:00, 3610.80it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 32948.19it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 6275.77it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 35833.44it/s]
Processing length_left with len: 100%|██████████| 3/3 [00:00<00:00, 1872.74it/s]
Processing length_right with len: 100%|██████████| 20/20 [00:00<00:00, 36776.01it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 3333.22it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 23838.04it/s]

In [13]:
train_processed.left.head()


Out[13]:
text_left length_left
id_left
Q1 [263, 117, 232, 112, 21, 0, 0, 0, 0, 0, 0, 0, ... 5
Q2 [263, 117, 89, 194, 22, 89, 225, 186, 195, 105... 15
Q5 [263, 275, 268, 236, 158, 0, 0, 0, 0, 0, 0, 0,... 5
Q6 [263, 101, 157, 89, 50, 37, 274, 141, 0, 0, 0,... 8
Q7 [263, 102, 63, 58, 164, 3, 38, 222, 0, 0, 0, 0... 8

As we can see, text_left is already in the sequence form that neural networks love.

Just to make sure we have the correct sequence:


In [14]:
vocab_unit = preprocessor.context['vocab_unit']
print('Orig Text:', train_raw.left.loc['Q1']['text_left'])
sequence = train_processed.left.loc['Q1']['text_left']
print('Transformed Indices:', sequence)
print('Transformed Indices Meaning:',
      '_'.join([vocab_unit.state['index_term'][i] for i in sequence]))


Orig Text: how are glacier caves formed?
Transformed Indices: [263, 117, 232, 112, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Transformed Indices Meaning: how_are_glacier_caves_formed_________________________
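To make the fit/transform contract and the zero-padded index sequences more tangible, here is a toy preprocessor written from scratch. It is not MatchZoo's BasicPreprocessor: the class name ToyPreprocessor, its fixed_length argument, and the choice to silently drop unknown terms are all assumptions made for this sketch.

```python
# Toy illustration of the fit/transform contract; not MatchZoo's BasicPreprocessor.
class ToyPreprocessor:
    """Builds a vocabulary in fit() and maps text to padded indices in transform()."""

    def __init__(self, fixed_length=8):
        self.fixed_length = fixed_length
        self.context = {}  # filled by fit(), only read by transform()

    def fit(self, texts):
        # Index 0 is reserved for padding, so real terms start at 1.
        terms = sorted({term for text in texts for term in text.lower().split()})
        self.context["term_index"] = {term: i for i, term in enumerate(terms, start=1)}
        self.context["vocab_size"] = len(terms) + 1
        return self

    def transform(self, texts):
        # Returns new lists; neither self.context nor the input is modified.
        term_index = self.context["term_index"]
        result = []
        for text in texts:
            # Simplification: terms not seen during fit() are dropped.
            indices = [term_index[t] for t in text.lower().split() if t in term_index]
            indices = indices[: self.fixed_length]
            result.append(indices + [0] * (self.fixed_length - len(indices)))
        return result


raw = ["how are glacier caves formed", "how did apollo creed die"]
toy = ToyPreprocessor().fit(raw)
processed = toy.transform(raw)
print(processed[0])
print(toy.context["vocab_size"])
```

Every transformed sequence has the same length, which is what lets a context entry such as input_shapes describe the model input ahead of time.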

For more details about preprocessing, consult matchzoo/tutorials/data_handling.ipynb.

Build Model

MatchZoo provides many built-in text matching models.


In [15]:
mz.models.list_available()


Out[15]:
[matchzoo.models.naive.Naive,
 matchzoo.models.dssm.DSSM,
 matchzoo.models.cdssm.CDSSM,
 matchzoo.models.dense_baseline.DenseBaseline,
 matchzoo.models.arci.ArcI,
 matchzoo.models.arcii.ArcII,
 matchzoo.models.match_pyramid.MatchPyramid,
 matchzoo.models.knrm.KNRM,
 matchzoo.models.duet.DUET,
 matchzoo.models.drmmtks.DRMMTKS,
 matchzoo.models.drmm.DRMM,
 matchzoo.models.anmm.ANMM,
 matchzoo.models.mvlstm.MVLSTM,
 matchzoo.contrib.models.match_lstm.MatchLSTM,
 matchzoo.models.conv_knrm.ConvKNRM]

Let's use mz.models.DenseBaseline for our demo.


In [16]:
model = mz.models.DenseBaseline()

The model is initialized with a hyper-parameter table in which only some of the values are filled in. To view the parameters and their values, use print.


In [17]:
print(model.params)


model_class                   <class 'matchzoo.models.dense_baseline.DenseBaseline'>
input_shapes                  None
task                          None
optimizer                     adam
with_multi_layer_perceptron   True
mlp_num_units                 256
mlp_num_layers                3
mlp_num_fan_out               64
mlp_activation_func           relu

to_frame gives you more information in addition to just names and values.


In [18]:
model.params.to_frame()[['Name', 'Description', 'Value']]


Out[18]:
Name Description Value
0 model_class Model class. Used internally for save/load. Ch... <class 'matchzoo.models.dense_baseline.DenseBa...
1 input_shapes Dependent on the model and data. Should be set... None
2 task Decides model output shape, loss, and metrics. None
3 optimizer None adam
4 with_multi_layer_perceptron A flag of whether a multiple layer perceptron ... True
5 mlp_num_units Number of units in first `mlp_num_layers` layers. 256
6 mlp_num_layers Number of layers of the multiple layer percetron. 3
7 mlp_num_fan_out Number of units of the layer that connects the... 64
8 mlp_activation_func Activation function used in the multiple layer... relu

To set a hyper-parameter:


In [19]:
model.params['task'] = task
model.params['mlp_num_units'] = 3
print(model.params)


model_class                   <class 'matchzoo.models.dense_baseline.DenseBaseline'>
input_shapes                  None
task                          Ranking Task
optimizer                     adam
with_multi_layer_perceptron   True
mlp_num_units                 3
mlp_num_layers                3
mlp_num_fan_out               64
mlp_activation_func           relu

Notice that we are still missing input_shapes; that information is stored in the preprocessor's context.


In [20]:
print(preprocessor.context['input_shapes'])


[(30,), (30,)]

We may use update to load a preprocessor's context into a model's hyper-parameter table.


In [21]:
model.params.update(preprocessor.context)

Now we have a completed hyper-parameter table.


In [22]:
model.params.completed()


Out[22]:
True
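The interplay of setting values, update, and completed can be sketched with a toy table. This is not MatchZoo's ParamTable: the class ToyParamTable is hypothetical, and it assumes that update only fills parameters the table already defines, which would explain why extra context entries such as vocab_size cause no trouble.

```python
# Toy stand-in for a hyper-parameter table; not MatchZoo's ParamTable.
class ToyParamTable:
    def __init__(self, **defaults):
        self._params = dict(defaults)

    def __setitem__(self, name, value):
        # Only parameters declared up front may be set.
        if name not in self._params:
            raise KeyError(name)
        self._params[name] = value

    def __getitem__(self, name):
        return self._params[name]

    def update(self, other):
        # Assumption: keys the table does not define are ignored.
        for name, value in other.items():
            if name in self._params:
                self._params[name] = value

    def completed(self):
        # The table is complete once no parameter is left as None.
        return all(value is not None for value in self._params.values())


params = ToyParamTable(input_shapes=None, task=None, optimizer="adam", mlp_num_units=256)
params["task"] = "ranking"
before = params.completed()  # input_shapes is still None here

context = {"vocab_size": 285, "input_shapes": [(30,), (30,)]}
params.update(context)
after = params.completed()

print(before, after)
```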

With all parameters filled in, we can now build and compile the model.


In [23]:
model.build()
model.compile()

MatchZoo models are wrappers around Keras models, and the backend property of a model gives you the actual Keras model that was built.


In [24]:
model.backend.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
text_left (InputLayer)          (None, 30)           0                                            
__________________________________________________________________________________________________
text_right (InputLayer)         (None, 30)           0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 60)           0           text_left[0][0]                  
                                                                 text_right[0][0]                 
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 3)            183         concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 3)            12          dense_1[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 3)            12          dense_2[0][0]                    
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 64)           256         dense_3[0][0]                    
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 1)            65          dense_4[0][0]                    
==================================================================================================
Total params: 528
Trainable params: 528
Non-trainable params: 0
__________________________________________________________________________________________________
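As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below relies only on the standard rule for a fully connected layer (one weight per input-unit pair plus one bias per unit); the helper name dense_params is ours, and the layer sizes are read off the summary above.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


# (inputs, units) for dense_1 .. dense_5, taken from the summary above.
# The 60 inputs of dense_1 come from concatenating the two length-30 sequences.
layers = [(60, 3), (3, 3), (3, 3), (3, 64), (64, 1)]

counts = [dense_params(inputs, units) for inputs, units in layers]
print(counts)       # [183, 12, 12, 256, 65]
print(sum(counts))  # 528
```

The three layers of width 3 reflect mlp_num_layers = 3 with mlp_num_units = 3, and the 64-unit layer corresponds to mlp_num_fan_out.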

For more details about models, consult matchzoo/tutorials/models.ipynb.

Train, Evaluate, Predict

A DataPack can unpack itself into data that can be directly used to train a MatchZoo model.


In [25]:
x, y = train_processed.unpack()
test_x, test_y = test_processed.unpack()

In [26]:
model.fit(x, y, batch_size=32, epochs=5)


Epoch 1/5
100/100 [==============================] - 0s 3ms/step - loss: 2.7378
Epoch 2/5
100/100 [==============================] - 0s 63us/step - loss: 0.3293
Epoch 3/5
100/100 [==============================] - 0s 58us/step - loss: 0.3964
Epoch 4/5
100/100 [==============================] - 0s 60us/step - loss: 0.2228
Epoch 5/5
100/100 [==============================] - 0s 69us/step - loss: 0.1432
Out[26]:
<keras.callbacks.History at 0x133584b38>

An alternative way to train a model is to use a DataGenerator. This is useful for delaying expensive preprocessing steps or doing real-time data augmentation. For some models that need dynamic batch-wise information, using a DataGenerator is required. For more details about DataGenerator, consult matchzoo/tutorials/data_handling.ipynb.


In [27]:
data_generator = mz.DataGenerator(train_processed, batch_size=32)

In [28]:
model.fit_generator(data_generator, epochs=5, use_multiprocessing=True, workers=4)


Epoch 1/5
4/4 [==============================] - 0s 14ms/step - loss: 0.1749
Epoch 2/5
4/4 [==============================] - 0s 32ms/step - loss: 0.1149
Epoch 3/5
4/4 [==============================] - 0s 30ms/step - loss: 0.0773
Epoch 4/5
4/4 [==============================] - 0s 31ms/step - loss: 0.0625
Epoch 5/5
4/4 [==============================] - 0s 30ms/step - loss: 0.0984
Out[28]:
<keras.callbacks.History at 0x133bd15f8>
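Note that the progress bar now reads 4/4 instead of 100/100. Presumably this is because the generator is consumed batch by batch, so an epoch is counted in batches rather than in samples. The following sketch shows the arithmetic with a hand-written batching helper; iter_batches is a hypothetical name, not part of MatchZoo or Keras.

```python
import math


def iter_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(100))  # stand-in for the 100 training pairs
batch_size = 32

batches = list(iter_batches(examples, batch_size))
steps_per_epoch = math.ceil(len(examples) / batch_size)

print(steps_per_epoch)            # 4
print([len(b) for b in batches])  # [32, 32, 32, 4]
```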

In [29]:
model.evaluate(test_x, test_y)


Out[29]:
{mean_average_precision(0.0): 0.16666666666666666}
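The reported metric is mean average precision. As a rough guide to reading that number, here is a minimal sketch of how the metric is commonly computed; it is not MatchZoo's implementation. It assumes that the 0.0 in the metric name is a relevance threshold (labels above it count as relevant) and that predictions are grouped per query before averaging.

```python
def average_precision(labels, scores, threshold=0.0):
    """Average of precision values taken at the rank of each relevant item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label > threshold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, threshold=0.0):
    """Mean of the per-query average precision."""
    values = [average_precision(labels, scores, threshold) for labels, scores in queries]
    return sum(values) / len(values)


# Two made-up queries, each given as (labels, predicted scores).
queries = [
    ([0.0, 0.0, 1.0], [0.9, 0.5, 0.1]),  # the relevant item is ranked 3rd
    ([1.0, 0.0], [0.2, 0.8]),            # the relevant item is ranked 2nd
]
toy_map = mean_average_precision(queries)
print(toy_map)
```

A low value therefore means that relevant documents tend to be ranked near the bottom for their query, which is not surprising after a few epochs on a toy dataset.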

In [30]:
model.predict(test_x)


Out[30]:
array([[-0.00927439],
       [ 0.06645402],
       [ 0.00299546],
       [ 0.06593451],
       [ 0.19827756],
       [-0.00519839],
       [-0.04881426],
       [-0.07771388],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.07235113],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.04881426],
       [-0.08091632]], dtype=float32)

A Shortcut to Preprocessing and Model Building

Since data preprocessing and model building are laborious, and the special setups of some models make this even worse, MatchZoo provides prepare, a unified interface that handles the interaction among data, model, and preprocessor automatically.

More specifically, prepare does the following:

  • create a default preprocessor of the model class (if not given one)
  • fit the preprocessor using the raw data
  • create an embedding matrix
  • instantiate a model and fill in hyper-parameters
  • build the model
  • instantiate a DataGeneratorBuilder that will build a correctly formed DataGenerator given a DataPack

It also performs special handling for specific models, but we will not go into the details of that here.


In [31]:
for model_class in mz.models.list_available():
    print(model_class)
    model, preprocessor, data_generator_builder, embedding_matrix = mz.auto.prepare(
        task=task,
        model_class=model_class,
        data_pack=train_raw,
    )
    train_processed = preprocessor.transform(train_raw, verbose=0)
    test_processed = preprocessor.transform(test_raw, verbose=0)
    train_gen = data_generator_builder.build(train_processed)
    test_gen = data_generator_builder.build(test_processed)
    model.fit_generator(train_gen, epochs=1)
    model.evaluate_generator(test_gen)
    print()


<class 'matchzoo.models.naive.Naive'>
Epoch 1/1
1/1 [==============================] - 0s 139ms/step - loss: 62835.0703

<class 'matchzoo.models.dssm.DSSM'>
Epoch 1/1
1/1 [==============================] - 1s 556ms/step - loss: 0.0507

<class 'matchzoo.models.cdssm.CDSSM'>
Epoch 1/1
1/1 [==============================] - 2s 2s/step - loss: 0.1703

<class 'matchzoo.models.dense_baseline.DenseBaseline'>
Epoch 1/1
1/1 [==============================] - 0s 458ms/step - loss: 259.2346

<class 'matchzoo.models.arci.ArcI'>
Epoch 1/1
1/1 [==============================] - 1s 617ms/step - loss: 0.0480

<class 'matchzoo.models.arcii.ArcII'>
Epoch 1/1
1/1 [==============================] - 1s 717ms/step - loss: 0.0546

<class 'matchzoo.models.match_pyramid.MatchPyramid'>
Epoch 1/1
1/1 [==============================] - 1s 559ms/step - loss: 0.0518

<class 'matchzoo.models.knrm.KNRM'>
Epoch 1/1
1/1 [==============================] - 1s 1s/step - loss: 9780.9414

<class 'matchzoo.models.duet.DUET'>
Epoch 1/1
1/1 [==============================] - 1s 1s/step - loss: 1.4414

<class 'matchzoo.models.drmmtks.DRMMTKS'>
Epoch 1/1
1/1 [==============================] - 1s 1s/step - loss: 0.0691

<class 'matchzoo.models.drmm.DRMM'>
Epoch 1/1
1/1 [==============================] - 1s 1s/step - loss: 0.0770

<class 'matchzoo.models.anmm.ANMM'>
Epoch 1/1
1/1 [==============================] - 1s 953ms/step - loss: 0.0463

<class 'matchzoo.models.mvlstm.MVLSTM'>
Epoch 1/1
1/1 [==============================] - 3s 3s/step - loss: 0.0487

<class 'matchzoo.contrib.models.match_lstm.MatchLSTM'>
Epoch 1/1
1/1 [==============================] - 5s 5s/step - loss: 0.0475

<class 'matchzoo.models.conv_knrm.ConvKNRM'>
Epoch 1/1
1/1 [==============================] - 7s 7s/step - loss: 540.5453

Save and Load the Model


In [32]:
model.save('my-model')
loaded_model = mz.load_model('my-model')