In [1]:
import matchzoo as mz
import pandas as pd
print(mz.__version__)


Using TensorFlow backend.
2.1.0

DataPack

Structure

matchzoo.DataPack is a MatchZoo-native data structure that most MatchZoo data handling processes build upon. A matchzoo.DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.


In [2]:
data_pack = mz.datasets.toy.load_data()

In [3]:
data_pack.left.head()


Out[3]:
text_left
id_left
Q1 how are glacier caves formed?
Q2 How are the directions of the velocity and for...
Q5 how did apollo creed die
Q6 how long is the term for federal judges
Q7 how a beretta model 21 pistols magazines works

In [4]:
data_pack.right.head()


Out[4]:
text_right
id_right
D1-0 A partly submerged glacier cave on Perito More...
D1-1 The ice facade is approximately 60 m high
D1-2 Ice formations in the Titlis glacier cave
D1-3 A glacier cave is a cave formed within the ice...
D1-4 Glacier caves are often called ice caves , but...

In [5]:
data_pack.relation.head()


Out[5]:
id_left id_right label
0 Q1 D1-0 0.0
1 Q1 D1-1 0.0
2 Q1 D1-2 0.0
3 Q1 D1-3 1.0
4 Q1 D1-4 0.0

The main reason for using a matchzoo.DataPack instead of a pandas.DataFrame is efficiency: each unique text is stored only once, so we save space on duplicate texts and save time by not processing them repeatedly.
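To make the saving concrete, here is a quick check on the toy data loaded above (a minimal sketch that only inspects the three parts shown above): the relation table refers to texts by id, so each text is stored exactly once no matter how many pairs it appears in.

# each unique text lives once in `left`/`right`; `relation` refers to them by id
print(len(data_pack.relation))  # number of (left, right) pairs
print(len(data_pack.left))      # unique left texts
print(len(data_pack.right))     # unique right texts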

DataPack.FrameView

However, since a big table is easier to understand and manage, we provide frame, which merges the three parts into a single pandas.DataFrame when called.


In [6]:
data_pack.frame().head()


Out[6]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... 0.0
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high 0.0
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave 0.0
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... 1.0
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... 0.0

Notice that frame is not a method, but a property that returns a matchzoo.DataPack.FrameView object.


In [7]:
type(data_pack.frame)


Out[7]:
matchzoo.data_pack.data_pack.DataPack.FrameView

This view reflects changes in the data pack, and can be called to create a pandas.DataFrame at any time.


In [8]:
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1

In [9]:
frame().head()


Out[9]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... 1.0
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high 1.0
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave 1.0
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... 2.0
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... 1.0

Slicing a DataPack

You may use [] to slice a matchzoo.DataPack, similar to slicing a list. Like list slicing, this returns a shallow copy of the sliced data.


In [10]:
data_slice = data_pack[5:10]

A sliced data pack's relation will directly reflect the slicing.


In [11]:
data_slice.relation


Out[11]:
id_left id_right label
0 Q2 D2-0 1.0
1 Q2 D2-1 1.0
2 Q2 D2-2 1.0
3 Q2 D2-3 1.0
4 Q2 D2-4 1.0

In addition, left and right will be processed so that only the relevant information is kept.


In [12]:
data_slice.left


Out[12]:
text_left
id_left
Q2 How are the directions of the velocity and for...

In [13]:
data_slice.right


Out[13]:
text_right
id_right
D2-0 In physics , circular motion is a movement of ...
D2-1 It can be uniform, with constant angular rate ...
D2-2 The rotation around a fixed axis of a three-di...
D2-3 The equations of motion describe the movement ...
D2-4 Examples of circular motion include: an artifi...

It is also possible to slice a frame view object.


In [14]:
data_pack.frame[5:10]


Out[14]:
id_left text_left id_right text_right label
0 Q2 How are the directions of the velocity and for... D2-0 In physics , circular motion is a movement of ... 1.0
1 Q2 How are the directions of the velocity and for... D2-1 It can be uniform, with constant angular rate ... 1.0
2 Q2 How are the directions of the velocity and for... D2-2 The rotation around a fixed axis of a three-di... 1.0
3 Q2 How are the directions of the velocity and for... D2-3 The equations of motion describe the movement ... 1.0
4 Q2 How are the directions of the velocity and for... D2-4 Examples of circular motion include: an artifi... 1.0

Slicing the frame view like this is equivalent to slicing the data pack first and then taking its frame, since both are based on the relation table.


In [15]:
data_slice.frame() == data_pack.frame[5:10]


Out[15]:
id_left text_left id_right text_right label
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True

Slicing is extremely useful for partitioning data into training and testing sets.


In [16]:
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]

Transforming Texts

Use apply_on_text to transform texts in a matchzoo.DataPack. Check the documentation for more information.


In [17]:
data_slice.apply_on_text(len).frame()


Processing text_left with len: 100%|██████████| 1/1 [00:00<00:00, 2347.12it/s]
Processing text_right with len: 100%|██████████| 5/5 [00:00<00:00, 10270.09it/s]
Out[17]:
id_left text_left id_right text_right label
0 Q2 85 D2-0 126 1.0
1 Q2 85 D2-1 128 1.0
2 Q2 85 D2-2 99 1.0
3 Q2 85 D2-3 78 1.0
4 Q2 85 D2-4 312 1.0

In [18]:
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()


Processing left_length with len: 100%|██████████| 1/1 [00:00<00:00, 1883.39it/s]
Processing right_length with len: 100%|██████████| 5/5 [00:00<00:00, 14276.05it/s]
Out[18]:
id_left text_left left_length id_right text_right right_length label
0 Q2 How are the directions of the velocity and for... 85 D2-0 In physics , circular motion is a movement of ... 126 1.0
1 Q2 How are the directions of the velocity and for... 85 D2-1 It can be uniform, with constant angular rate ... 128 1.0
2 Q2 How are the directions of the velocity and for... 85 D2-2 The rotation around a fixed axis of a three-di... 99 1.0
3 Q2 How are the directions of the velocity and for... 85 D2-3 The equations of motion describe the movement ... 78 1.0
4 Q2 How are the directions of the velocity and for... 85 D2-4 Examples of circular motion include: an artifi... 312 1.0

Since adding a column indicating text length is quite common, you may simply do:


In [19]:
data_slice.append_text_length().frame()


Processing length_left with len: 100%|██████████| 1/1 [00:00<00:00, 2288.22it/s]
Processing length_right with len: 100%|██████████| 5/5 [00:00<00:00, 7361.01it/s]
Out[19]:
id_left text_left length_left id_right text_right length_right label
0 Q2 How are the directions of the velocity and for... 85 D2-0 In physics , circular motion is a movement of ... 126 1.0
1 Q2 How are the directions of the velocity and for... 85 D2-1 It can be uniform, with constant angular rate ... 128 1.0
2 Q2 How are the directions of the velocity and for... 85 D2-2 The rotation around a fixed axis of a three-di... 99 1.0
3 Q2 How are the directions of the velocity and for... 85 D2-3 The equations of motion describe the movement ... 78 1.0
4 Q2 How are the directions of the velocity and for... 85 D2-4 Examples of circular motion include: an artifi... 312 1.0

To one-hot encode the labels:


In [20]:
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()


Out[20]:
id_left text_left id_right text_right label
0 Q17 how much are the harry potter movies worth D17-8 According to Rowling, the main theme is death. [0, 1, 0]
1 Q2 How are the directions of the velocity and for... D2-1 It can be uniform, with constant angular rate ... [0, 1, 0]
2 Q5 how did apollo creed die D5-1 He was played by Carl Weathers . [0, 1, 0]
3 Q9 how a vul works D9-1 In a VUL, the cash value can be invested in a ... [0, 1, 0]
4 Q5 how did apollo creed die D5-6 Rocky Balboa is often wrongly credited with po... [0, 1, 0]

Building Your Own DataPack

Use matchzoo.pack to build your own data pack. Check the documentation for more information.


In [21]:
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()


Out[21]:
id_left text_left id_right text_right
0 L-0 A R-0 a
1 L-1 R R-1 r
2 L-2 S R-2 s
3 L-0 A R-3 t
4 L-0 A R-4 e
5 L-1 R R-5 n
6 L-2 S R-6 u
7 L-0 A R-2 s

Unpack

unpack formats data so that MatchZoo models can directly fit on it. For more details, consult matchzoo/tutorials/models.ipynb.


In [22]:
x, y = data_pack[:3].unpack()

In [23]:
x


Out[23]:
{'id_left': array(['Q17', 'Q2', 'Q5'], dtype='<U3'),
 'text_left': array(['how much are the harry potter movies worth',
        'How are the directions of the velocity and force vectors related in a circular motion',
        'how did apollo creed die'], dtype='<U85'),
 'id_right': array(['D17-8', 'D2-1', 'D5-1'], dtype='<U5'),
 'text_right': array(['According to Rowling, the main theme is death.',
        'It can be uniform, with constant angular rate of rotation (and constant speed), or non-uniform with a changing rate of rotation.',
        'He was played by Carl Weathers .'], dtype='<U128')}

In [24]:
y


Out[24]:
array([[1],
       [1],
       [1]])

Data Sets

MatchZoo incorporates various datasets that can be loaded as MatchZoo native data structures.


In [25]:
mz.datasets.list_available()


Out[25]:
['toy', 'wiki_qa', 'embeddings', 'quora_qp', 'snli']

The toy dataset doesn't need to be downloaded and can be directly used. It's the best choice to get things rolling.


In [26]:
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()


Out[26]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... 0.0
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high 0.0
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave 0.0
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... 1.0
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... 0.0

In [27]:
toy_train_classification, classes = mz.datasets.toy.load_data(
    stage='train', task='classification', return_classes=True)
toy_train_classification.frame().head()


Out[27]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... [1, 0]
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high [1, 0]
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave [1, 0]
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... [0, 1]
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... [1, 0]

In [28]:
classes


Out[28]:
[False, True]

Other, larger datasets will be downloaded automatically the first time you use them. Run the following lines to trigger the download.


In [29]:
wiki_dev_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_rank.frame().head()


Out[29]:
id_left text_left id_right text_right label
0 Q8 How are epithelial tissues joined together? D8-0 Cross section of sclerenchyma fibers in plant ... 0
1 Q8 How are epithelial tissues joined together? D8-1 Microscopic view of a histologic specimen of h... 0
2 Q8 How are epithelial tissues joined together? D8-2 In Biology , Tissue is a cellular organization... 0
3 Q8 How are epithelial tissues joined together? D8-3 A tissue is an ensemble of similar cells from ... 0
4 Q8 How are epithelial tissues joined together? D8-4 Organs are then formed by the functional group... 0

In [30]:
snli_test_classification, classes = mz.datasets.snli.load_data(
    stage='test', task='classification', return_classes=True)
snli_test_classification.frame().head()


Out[30]:
id_left text_left id_right text_right label
0 L-0 This church choir sings to the masses as they ... R-0 The church has cracks in the ceiling. [0, 0, 1, 0]
1 L-0 This church choir sings to the masses as they ... R-1 The church is filled with song. [1, 0, 0, 0]
2 L-0 This church choir sings to the masses as they ... R-2 A choir singing at a baseball game. [0, 1, 0, 0]
3 L-1 A woman with a green headscarf, blue shirt and... R-3 The woman is young. [0, 0, 1, 0]
4 L-1 A woman with a green headscarf, blue shirt and... R-4 The woman is very happy. [1, 0, 0, 0]

In [31]:
classes


Out[31]:
['entailment', 'contradiction', 'neutral', '-']

Preprocessing

Preprocessors

matchzoo.preprocessors are responsible for transforming data into the forms that matchzoo.models expect. BasicPreprocessor works for models with common input forms, while some other models have customized preprocessors made just for them.


In [32]:
mz.preprocessors.list_available()


Out[32]:
[matchzoo.preprocessors.dssm_preprocessor.DSSMPreprocessor,
 matchzoo.preprocessors.naive_preprocessor.NaivePreprocessor,
 matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor,
 matchzoo.preprocessors.cdssm_preprocessor.CDSSMPreprocessor]

When in doubt, use the default preprocessor a model class provides.


In [33]:
preprocessor = mz.models.Naive.get_default_preprocessor()

A preprocessor is used in two steps: first fit, then transform. fit collects information into context, which includes everything the preprocessor needs for transform, together with other useful information for later use. fit only changes the preprocessor's inner state, not the input data. In contrast, transform returns a modified copy of the input data without changing the preprocessor's inner state.


In [34]:
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
preprocessor.fit(train_raw)
preprocessor.context


Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 1283.21it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4023.97it/s]
Processing text_right with append: 100%|██████████| 100/100 [00:00<00:00, 98527.23it/s]
Building FrequencyFilter from a datapack.: 100%|██████████| 100/100 [00:00<00:00, 33645.95it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 105596.78it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 27962.03it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 106725.29it/s]
Building Vocabulary from a datapack.: 100%|██████████| 1665/1665 [00:00<00:00, 1289661.34it/s]
Out[34]:
{'filter_unit': <matchzoo.preprocessors.units.frequency_filter.FrequencyFilter at 0x1331c92e8>,
 'vocab_unit': <matchzoo.preprocessors.units.vocabulary.Vocabulary at 0x133f96e80>,
 'vocab_size': 285,
 'embedding_input_dim': 285,
 'input_shapes': [(30,), (30,)]}

In [35]:
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)


Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 7532.25it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4073.29it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 79182.63it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 22174.03it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 55180.95it/s]
Processing length_left with len: 100%|██████████| 13/13 [00:00<00:00, 22776.09it/s]
Processing length_right with len: 100%|██████████| 100/100 [00:00<00:00, 171546.18it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 15552.18it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 73856.38it/s]
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 3/3 [00:00<00:00, 2962.08it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 20/20 [00:00<00:00, 4877.66it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 32896.50it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 4703.89it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 37600.22it/s]
Processing length_left with len: 100%|██████████| 3/3 [00:00<00:00, 9612.61it/s]
Processing length_right with len: 100%|██████████| 20/20 [00:00<00:00, 22221.48it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 4048.56it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 20198.91it/s]
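As a side note, when the data you fit on is also the data you want transformed, the two steps can be combined; a minimal sketch using the fit_transform shorthand (it also appears later in this tutorial):

# fit on train_raw and get its transformed copy in one call
train_preprocessed = preprocessor.fit_transform(train_raw)
# the test set still goes through transform only, reusing the fitted context
test_preprocessed = preprocessor.transform(test_raw)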

In [36]:
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)


Parameter "task" set to Ranking Task.
Parameter "input_shapes" set to [(30,), (30,)].
Epoch 1/1
100/100 [==============================] - 0s 970us/step - loss: 31720.4902
Out[36]:
{mean_average_precision(0.0): 0.08333333333333333}

Processor Units

Preprocessors rely on processor units (mz.preprocessors.units) to transform data. Each processor unit corresponds to a specific transformation, and you may use them independently to preprocess a data pack.


In [37]:
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()


Out[37]:
id_left text_left id_right text_right label
0 Q1 how are glacier caves formed? D1-0 A partly submerged glacier cave on Perito More... 0.0
1 Q1 how are glacier caves formed? D1-1 The ice facade is approximately 60 m high 0.0
2 Q1 how are glacier caves formed? D1-2 Ice formations in the Titlis glacier cave 0.0
3 Q1 how are glacier caves formed? D1-3 A glacier cave is a cave formed within the ice... 1.0
4 Q1 how are glacier caves formed? D1-4 Glacier caves are often called ice caves , but... 0.0

In [38]:
tokenizer = mz.preprocessors.units.Tokenize()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]


Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 7794.99it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 5312.00it/s]
Out[38]:
id_left text_left id_right text_right label
0 Q1 [how, are, glacier, caves, formed, ?] D1-0 [A, partly, submerged, glacier, cave, on, Peri... 0.0
1 Q1 [how, are, glacier, caves, formed, ?] D1-1 [The, ice, facade, is, approximately, 60, m, h... 0.0
2 Q1 [how, are, glacier, caves, formed, ?] D1-2 [Ice, formations, in, the, Titlis, glacier, cave] 0.0
3 Q1 [how, are, glacier, caves, formed, ?] D1-3 [A, glacier, cave, is, a, cave, formed, within... 1.0
4 Q1 [how, are, glacier, caves, formed, ?] D1-4 [Glacier, caves, are, often, called, ice, cave... 0.0

In [39]:
lower_caser = mz.preprocessors.units.Lowercase()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]


Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 17737.79it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 63811.11it/s]
Out[39]:
id_left text_left id_right text_right label
0 Q1 [how, are, glacier, caves, formed, ?] D1-0 [a, partly, submerged, glacier, cave, on, peri... 0.0
1 Q1 [how, are, glacier, caves, formed, ?] D1-1 [the, ice, facade, is, approximately, 60, m, h... 0.0
2 Q1 [how, are, glacier, caves, formed, ?] D1-2 [ice, formations, in, the, titlis, glacier, cave] 0.0
3 Q1 [how, are, glacier, caves, formed, ?] D1-3 [a, glacier, cave, is, a, cave, formed, within... 1.0
4 Q1 [how, are, glacier, caves, formed, ?] D1-4 [glacier, caves, are, often, called, ice, cave... 0.0

Or use chain_transform to apply multiple processor units in one pass.


In [40]:
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.preprocessors.units.Tokenize(),
                           mz.preprocessors.units.Lowercase()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]


Processing text_left with chain_transform of Tokenize => Lowercase: 100%|██████████| 13/13 [00:00<00:00, 6659.25it/s]
Processing text_right with chain_transform of Tokenize => Lowercase: 100%|██████████| 100/100 [00:00<00:00, 4481.48it/s]
Out[40]:
id_left text_left id_right text_right label
0 Q1 [how, are, glacier, caves, formed, ?] D1-0 [a, partly, submerged, glacier, cave, on, peri... 0.0
1 Q1 [how, are, glacier, caves, formed, ?] D1-1 [the, ice, facade, is, approximately, 60, m, h... 0.0
2 Q1 [how, are, glacier, caves, formed, ?] D1-2 [ice, formations, in, the, titlis, glacier, cave] 0.0
3 Q1 [how, are, glacier, caves, formed, ?] D1-3 [a, glacier, cave, is, a, cave, formed, within... 1.0
4 Q1 [how, are, glacier, caves, formed, ?] D1-4 [glacier, caves, are, often, called, ice, cave... 0.0

Notice that some processor units are stateful, so we have to fit them before using their transform.


In [41]:
mz.preprocessors.units.Vocabulary.__base__


Out[41]:
matchzoo.preprocessors.units.stateful_unit.StatefulUnit

In [42]:
vocab_unit = mz.preprocessors.units.Vocabulary()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)

Such a stateful unit (a subclass of StatefulUnit) saves information in its state when fitted, similar to a preprocessor's context. In our case, the vocabulary unit saves a term-to-index mapping and an index-to-term mapping, called term_index and index_term respectively. We can then proceed to transform a data pack.


In [43]:
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])


how 153
are 604
glacier 55
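The reverse mapping described above works the same way; a quick sketch (assuming index_term is keyed by the integer indices shown above):

# map a term to its index, then back again through the reverse mapping
index = vocab_unit.state['term_index']['glacier']
print(vocab_unit.state['index_term'][index])  # expected to print 'glacier'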

In [44]:
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]


Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 18211.74it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 58408.36it/s]
Out[44]:
id_left text_left id_right text_right label
0 Q1 [153, 604, 55, 448, 752, 629] D1-0 [688, 278, 896, 55, 165, 25, 493, 851, 55, 509] 0.0
1 Q1 [153, 604, 55, 448, 752, 629] D1-1 [371, 800, 827, 185, 87, 76, 901, 639] 0.0
2 Q1 [153, 604, 55, 448, 752, 629] D1-2 [800, 378, 394, 371, 213, 55, 165] 0.0
3 Q1 [153, 604, 55, 448, 752, 629] D1-3 [688, 55, 165, 185, 688, 165, 752, 712, 371, 8... 1.0
4 Q1 [153, 604, 55, 448, 752, 629] D1-4 [55, 448, 604, 856, 389, 800, 448, 808, 72, 33... 0.0

Since this usage is quite common, we wrapped it into a function. For other stateful units, consult their documentation and try mz.build_unit_from_data_pack.


In [45]:
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]


Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 18205.66it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 146398.05it/s]
Building Vocabulary from a datapack.: 100%|██████████| 13893/13893 [00:00<00:00, 3564222.00it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 12546.24it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 23825.86it/s]
Out[45]:
id_left text_left id_right text_right label
0 Q1 [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,... D1-0 [12, 29, 66, 24, 72, 54, 33, 51, 29, 21, 25, 1... 0.0
1 Q1 [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,... D1-1 [37, 4, 32, 29, 7, 49, 32, 29, 68, 24, 49, 24,... 0.0
2 Q1 [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,... D1-2 [8, 49, 32, 29, 68, 1, 72, 70, 24, 54, 7, 1, 2... 0.0
3 Q1 [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,... D1-3 [12, 29, 11, 33, 24, 49, 7, 32, 72, 29, 49, 24... 1.0
4 Q1 [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,... D1-4 [6, 33, 24, 49, 7, 32, 72, 29, 49, 24, 14, 32,... 0.0
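For reference, here is a minimal sketch of the generic helper named above, shown with a Vocabulary unit; the argument order (unit first, then the data pack) is an assumption, and other stateful units may expect extra arguments, so check their documentation:

# fit an arbitrary stateful unit on the texts of a data pack and get it back fitted
data_pack = mz.datasets.toy.load_data()
unit = mz.preprocessors.units.Vocabulary()
unit = mz.build_unit_from_data_pack(unit, data_pack)  # argument order assumed
data_pack.apply_on_text(unit.transform, inplace=True)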

DataGenerator

Some MatchZoo models (e.g. DRMM, MatchPyramid) require batch-wise information for training, so fit_generator has to be used instead of fit. In addition, sometimes memory can't hold all the transformed data, so part of the preprocessing has to be delayed until batch time.

MatchZoo provides DataGenerator for these cases. Instead of fit, you may call fit_generator with a data generator that unpacks data on the fly.
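A DataGenerator wraps a data pack and serves it batch by batch; here is a small sketch of that interface (the Sequence-style indexing is an assumption about the implementation):

# wrap the preprocessed training data and inspect a single batch
data_gen = mz.DataGenerator(train_preprocessed, batch_size=32)
print(len(data_gen))            # number of batches per epoch
x_batch, y_batch = data_gen[0]  # one batch, unpacked on the fly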


In [46]:
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)


Epoch 1/1
100/100 [==============================] - 0s 18us/step - loss: 30618.9355
Out[46]:
<keras.callbacks.History at 0x1085f8438>

In [47]:
data_gen = mz.DataGenerator(train_preprocessed)
model.fit_generator(data_gen)


Epoch 1/1
1/1 [==============================] - 0s 18ms/step - loss: 29543.1270
Out[47]:
<keras.callbacks.History at 0x10860dba8>

DSSM's data preprocessing eats a lot of memory, but we can work around that using the callback hook of DataGenerator.


In [48]:
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(train_raw, verbose=0)

In [49]:
dssm = mz.models.DSSM()
dssm.params['task'] = mz.tasks.Ranking()
dssm.params.update(preprocessor.context)
dssm.build()
dssm.compile()

In [50]:
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.preprocessors.units.WordHashing(term_index)
data_generator = mz.DataGenerator(
    data,
    batch_size=4,
    callbacks=[
        mz.data_generator.callbacks.LambdaCallback(
            on_batch_data_pack=lambda dp: dp.apply_on_text(
                hashing_unit.transform, inplace=True, verbose=0)
        )
    ]
)

In [51]:
dssm.fit_generator(data_generator)


Epoch 1/1
25/25 [==============================] - 1s 33ms/step - loss: 0.0471
Out[51]:
<keras.callbacks.History at 0x1337f78d0>

In addition, losses like RankHingeLoss and RankCrossEntropyLoss have to be used with a DataGenerator in mode='pair', since the batch-wise information they need is computed on the fly.


In [52]:
num_neg = 4
task = mz.tasks.Ranking(loss=mz.losses.RankHingeLoss(num_neg=num_neg))
preprocessor = model.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_raw)


Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 3234.62it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4000.52it/s]
Processing text_right with append: 100%|██████████| 100/100 [00:00<00:00, 194993.21it/s]
Building FrequencyFilter from a datapack.: 100%|██████████| 100/100 [00:00<00:00, 98898.94it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 89698.55it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 28622.55it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 195995.51it/s]
Building Vocabulary from a datapack.: 100%|██████████| 1665/1665 [00:00<00:00, 2223341.66it/s]
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 6020.98it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 5079.76it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 106373.42it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 21274.27it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 56397.79it/s]
Processing length_left with len: 100%|██████████| 13/13 [00:00<00:00, 24863.64it/s]
Processing length_right with len: 100%|██████████| 100/100 [00:00<00:00, 142228.01it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 21526.23it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 84767.66it/s]

In [53]:
model = mz.models.Naive()
model.params['task'] = task
model.params.update(preprocessor.context)
model.build()
model.compile()

In [54]:
data_gen = mz.DataGenerator(
    train_processed,
    mode='pair',
    num_neg=num_neg,
    num_dup=2,
    batch_size=32
)

In [55]:
model.fit_generator(data_gen)


Epoch 1/1
1/1 [==============================] - 0s 222ms/step - loss: 28.1434
Out[55]:
<keras.callbacks.History at 0x1332f9ba8>