In [1]:

    
import matchzoo as mz
import pandas as pd
print(mz.__version__)









    



Using TensorFlow backend.






    



2.1.0

DataPack

Structure

matchzoo.DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A matchzoo.DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.



In [2]:

    
data_pack = mz.datasets.toy.load_data()



In [3]:

    
data_pack.left.head()









    Out[3]:







  
    
      
id_left  text_left
Q1       how are glacier caves formed?
Q2       How are the directions of the velocity and for...
Q5       how did apollo creed die
Q6       how long is the term for federal judges
Q7       how a beretta model 21 pistols magazines works



In [4]:

    
data_pack.right.head()









    Out[4]:







  
    
      
id_right  text_right
D1-0      A partly submerged glacier cave on Perito More...
D1-1      The ice facade is approximately 60 m high
D1-2      Ice formations in the Titlis glacier cave
D1-3      A glacier cave is a cave formed within the ice...
D1-4      Glacier caves are often called ice caves , but...



In [5]:

    
data_pack.relation.head()

The main reason for using a matchzoo.DataPack instead of a single pandas.DataFrame is efficiency: each text is stored once and processed once, so we save the space and time that duplicate texts would otherwise cost.
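
To make the saving concrete, here is a minimal sketch of the same three-part layout in plain Python. It is not MatchZoo's implementation: the dictionaries `left`, `right` and `relation` and the helper `to_frame` are illustrative names, and the texts are copied from the truncated toy outputs above.

```python
# Plain-Python stand-in for the left / right / relation layout.
# Each text is stored once and referenced by id from the relation rows.
left = {"Q1": "how are glacier caves formed?"}
right = {
    "D1-0": "A partly submerged glacier cave on Perito More...",
    "D1-3": "A glacier cave is a cave formed within the ice...",
}
relation = [
    {"id_left": "Q1", "id_right": "D1-0", "label": 0.0},
    {"id_left": "Q1", "id_right": "D1-3", "label": 1.0},
]


def to_frame(left, right, relation):
    """Join the three parts into one flat table (a list of row dicts)."""
    return [
        {
            "id_left": row["id_left"],
            "text_left": left[row["id_left"]],
            "id_right": row["id_right"],
            "text_right": right[row["id_right"]],
            "label": row["label"],
        }
        for row in relation
    ]


rows = to_frame(left, right, relation)
print(len(left), "stored question,", len(rows), "joined rows")
```

The question text is held once in `left` but appears in every joined row, which is the duplication the three-part layout avoids.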

DataPack.FrameView

However, since one big table is easier to understand and manage, we provide frame, which merges the three parts into a single pandas.DataFrame when called.



In [6]:

    
data_pack.frame().head()









    Out[6]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       how are glacier caves formed?  D1-0      A partly submerged glacier cave on Perito More...  0.0
1  Q1       how are glacier caves formed?  D1-1      The ice facade is approximately 60 m high          0.0
2  Q1       how are glacier caves formed?  D1-2      Ice formations in the Titlis glacier cave          0.0
3  Q1       how are glacier caves formed?  D1-3      A glacier cave is a cave formed within the ice...  1.0
4  Q1       how are glacier caves formed?  D1-4      Glacier caves are often called ice caves , but...  0.0

Notice that frame is not a method, but a property that returns a matchzoo.DataPack.FrameView object.



In [7]:

    
type(data_pack.frame)









    Out[7]:





matchzoo.data_pack.data_pack.DataPack.FrameView

This view reflects changes in the data pack, and can be called to create a pandas.DataFrame at any time.



In [8]:

    
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1



In [9]:

    
frame().head()









    Out[9]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       how are glacier caves formed?  D1-0      A partly submerged glacier cave on Perito More...  1.0
1  Q1       how are glacier caves formed?  D1-1      The ice facade is approximately 60 m high          1.0
2  Q1       how are glacier caves formed?  D1-2      Ice formations in the Titlis glacier cave          1.0
3  Q1       how are glacier caves formed?  D1-3      A glacier cave is a cave formed within the ice...  2.0
4  Q1       how are glacier caves formed?  D1-4      Glacier caves are often called ice caves , but...  1.0
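
The behaviour shown in the last two cells can be imitated with a property that returns a callable view. The sketch below is only an illustration of that pattern; `PackSketch` and `FrameViewSketch` are hypothetical names and do not reproduce MatchZoo's classes.

```python
class FrameViewSketch:
    """Callable view that rebuilds the table from the pack on every call."""

    def __init__(self, pack):
        self._pack = pack  # keep a reference to the pack, not a copy

    def __call__(self):
        # Read the pack's current contents each time the view is called.
        return [dict(row) for row in self._pack.relation]


class PackSketch:
    """Toy container whose `frame` is a property, not a method."""

    def __init__(self, relation):
        self.relation = relation

    @property
    def frame(self):
        # Accessing the property returns a view object, not the table.
        return FrameViewSketch(self)


pack = PackSketch([{"label": 0.0}, {"label": 1.0}])
view = pack.frame              # view taken before the change
for row in pack.relation:
    row["label"] += 1          # modify the pack afterwards
print(view())                  # the view reflects the new labels
```

Because the view holds a reference and does its work only when called, it never goes stale, at the cost of rebuilding the table on each call.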

Slicing a DataPack

You may use [] to slice a matchzoo.DataPack, just as you would slice a list. As with a list, the result is a shallow copy of the sliced data.



In [10]:

    
data_slice = data_pack[5:10]
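
As a reminder of what "shallow copy" means for an ordinary Python list, consider the short example below. It demonstrates list semantics only; how much a sliced DataPack shares with its parent internally is not verified here.

```python
rows = [{"label": 0.0}, {"label": 1.0}, {"label": 0.0}]

part = rows[1:3]             # slicing creates a new list object ...
assert part is not rows
assert part[0] is rows[1]    # ... that holds the same row objects

part.append({"label": 1.0})  # growing the slice leaves the original alone
print(len(rows), len(part))
```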

A sliced data pack's relation will directly reflect the slicing.



In [11]:

    
data_slice.relation

In addition, left and right are filtered so that only the texts referenced by the sliced relation are kept.



In [12]:

    
data_slice.left









    Out[12]:







  
    
      
id_left  text_left
Q2       How are the directions of the velocity and for...



In [13]:

    
data_slice.right









    Out[13]:







  
    
      
id_right  text_right
D2-0      In physics , circular motion is a movement of ...
D2-1      It can be uniform, with constant angular rate ...
D2-2      The rotation around a fixed axis of a three-di...
D2-3      The equations of motion describe the movement ...
D2-4      Examples of circular motion include: an artifi...
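
The pruning of left and right seen above can be sketched in plain Python as follows. The function `slice_pack` and the toy data are hypothetical and stand in for the three DataFrames; this is a sketch of the idea, not MatchZoo's code.

```python
def slice_pack(left, right, relation, start, stop):
    """Slice the relation rows, then keep only the texts they reference."""
    rel = relation[start:stop]
    left_ids = {row["id_left"] for row in rel}
    right_ids = {row["id_right"] for row in rel}
    return (
        {k: v for k, v in left.items() if k in left_ids},
        {k: v for k, v in right.items() if k in right_ids},
        rel,
    )


left = {"Q1": "q1 text", "Q2": "q2 text"}
right = {"D1-0": "a", "D1-1": "b", "D2-0": "c", "D2-1": "d"}
relation = [
    {"id_left": "Q1", "id_right": "D1-0"},
    {"id_left": "Q1", "id_right": "D1-1"},
    {"id_left": "Q2", "id_right": "D2-0"},
    {"id_left": "Q2", "id_right": "D2-1"},
]

sub_left, sub_right, sub_rel = slice_pack(left, right, relation, 2, 4)
print(sorted(sub_left), sorted(sub_right))
```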

It is also possible to slice a frame view object.



In [14]:

    
data_pack.frame[5:10]









    Out[14]:







  
    
      
   id_left  text_left                                          id_right  text_right                                         label
0  Q2       How are the directions of the velocity and for...  D2-0      In physics , circular motion is a movement of ...  1.0
1  Q2       How are the directions of the velocity and for...  D2-1      It can be uniform, with constant angular rate ...  1.0
2  Q2       How are the directions of the velocity and for...  D2-2      The rotation around a fixed axis of a three-di...  1.0
3  Q2       How are the directions of the velocity and for...  D2-3      The equations of motion describe the movement ...  1.0
4  Q2       How are the directions of the velocity and for...  D2-4      Examples of circular motion include: an artifi...  1.0

This is equivalent to slicing the data pack first and then calling its frame, since both kinds of slicing select rows of relation.



In [15]:

    
data_slice.frame() == data_pack.frame[5:10]









    Out[15]:







  
    
      
   id_left  text_left  id_right  text_right  label
0  True     True       True      True        True
1  True     True       True      True        True
2  True     True       True      True        True
3  True     True       True      True        True
4  True     True       True      True        True

Slicing is extremely useful for partitioning data for training vs testing.



In [16]:

    
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
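
The same shuffle-then-slice pattern, written with the standard library only so that it can be run without MatchZoo. The integers stand in for relation rows, and the fixed seed is an addition of this sketch to make the split repeatable.

```python
import random

rows = list(range(10))            # stand-ins for relation rows
random.Random(0).shuffle(rows)    # fixed seed so the split is repeatable

num_train = int(len(rows) * 0.8)
train_rows = rows[:num_train]
test_rows = rows[num_train:]
print(len(train_rows), len(test_rows))
```

Shuffling first, as the cell above does, makes the split random rather than positional; the two slices never overlap and together cover every row.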

Transforming Texts

Use apply_on_text to transform texts in a matchzoo.DataPack. Check the documentation for more information.



In [17]:

    
data_slice.apply_on_text(len).frame()









    



Processing text_left with len: 100%|██████████| 1/1 [00:00<00:00, 2347.12it/s]
Processing text_right with len: 100%|██████████| 5/5 [00:00<00:00, 10270.09it/s]






    Out[17]:







  
    
      
   id_left  text_left  id_right  text_right  label
0  Q2       85         D2-0      126         1.0
1  Q2       85         D2-1      128         1.0
2  Q2       85         D2-2      99          1.0
3  Q2       85         D2-3      78          1.0
4  Q2       85         D2-4      312         1.0



In [18]:

    
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()









    



Processing left_length with len: 100%|██████████| 1/1 [00:00<00:00, 1883.39it/s]
Processing right_length with len: 100%|██████████| 5/5 [00:00<00:00, 14276.05it/s]






    Out[18]:







  
    
      
   id_left  text_left                                          left_length  id_right  text_right                                         right_length  label
0  Q2       How are the directions of the velocity and for...  85           D2-0      In physics , circular motion is a movement of ...  126           1.0
1  Q2       How are the directions of the velocity and for...  85           D2-1      It can be uniform, with constant angular rate ...  128           1.0
2  Q2       How are the directions of the velocity and for...  85           D2-2      The rotation around a fixed axis of a three-di...  99            1.0
3  Q2       How are the directions of the velocity and for...  85           D2-3      The equations of motion describe the movement ...  78            1.0
4  Q2       How are the directions of the velocity and for...  85           D2-4      Examples of circular motion include: an artifi...  312           1.0
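
The difference between the two calls above (overwrite the text columns, or write to renamed columns) can be sketched on a flat list of rows. `apply_on_text_sketch` is a hypothetical helper written for this illustration; the real method works on the left and right tables of a data pack.

```python
def apply_on_text_sketch(rows, func, rename=None):
    """Apply func to both text columns of each row and return new rows.

    With rename=None the text columns are overwritten; with a
    (left_name, right_name) pair the results go into new columns.
    """
    left_name, right_name = rename or ("text_left", "text_right")
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        new_row[left_name] = func(row["text_left"])
        new_row[right_name] = func(row["text_right"])
        out.append(new_row)
    return out


rows = [{"text_left": "how did apollo creed die",
         "text_right": "He was played by Carl Weathers ."}]

overwritten = apply_on_text_sketch(rows, len)
renamed = apply_on_text_sketch(rows, len, rename=("left_length", "right_length"))
print(overwritten)
print(renamed)
```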

Since adding a column that holds the text length is quite common, you may simply do:



In [19]:

    
data_slice.append_text_length().frame()









    



Processing length_left with len: 100%|██████████| 1/1 [00:00<00:00, 2288.22it/s]
Processing length_right with len: 100%|██████████| 5/5 [00:00<00:00, 7361.01it/s]






    Out[19]:







  
    
      
   id_left  text_left                                          length_left  id_right  text_right                                         length_right  label
0  Q2       How are the directions of the velocity and for...  85           D2-0      In physics , circular motion is a movement of ...  126           1.0
1  Q2       How are the directions of the velocity and for...  85           D2-1      It can be uniform, with constant angular rate ...  128           1.0
2  Q2       How are the directions of the velocity and for...  85           D2-2      The rotation around a fixed axis of a three-di...  99            1.0
3  Q2       How are the directions of the velocity and for...  85           D2-3      The equations of motion describe the movement ...  78            1.0
4  Q2       How are the directions of the velocity and for...  85           D2-4      Examples of circular motion include: an artifi...  312           1.0

To one-hot encode the labels:



In [20]:

    
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()









    Out[20]:







  
    
      
   id_left  text_left                                          id_right  text_right                                         label
0  Q17      how much are the harry potter movies worth         D17-8     According to Rowling, the main theme is death.     [0, 1, 0]
1  Q2       How are the directions of the velocity and for...  D2-1      It can be uniform, with constant angular rate ...  [0, 1, 0]
2  Q5       how did apollo creed die                           D5-1      He was played by Carl Weathers .                   [0, 1, 0]
3  Q9       how a vul works                                    D9-1      In a VUL, the cash value can be invested in a ...  [0, 1, 0]
4  Q5       how did apollo creed die                           D5-6      Rocky Balboa is often wrongly credited with po...  [0, 1, 0]
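
For reference, one-hot encoding itself is a small operation. The helper below is a generic sketch, not MatchZoo's code. It also suggests why the cell above casts the labels to int first: the label is used as a position, and the toy labels are stored as floats.

```python
def one_hot(label, num_classes):
    """Return a list with a 1 at position `label` and 0 everywhere else."""
    if label not in range(num_classes):
        raise ValueError(f"label {label!r} is not a class in 0..{num_classes - 1}")
    return [1 if i == label else 0 for i in range(num_classes)]


print(one_hot(1, num_classes=3))
```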

Building Your own DataPack

Use matchzoo.pack to build your own data pack. Check the documentation for more information.



In [21]:

    
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()
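
Conceptually, packing splits a flat table into the three parts described at the start of this tutorial. The sketch below shows one way that could work in plain Python. The helper `pack_sketch` and its id scheme ('L-0', 'R-0', with identical texts sharing one id) are assumptions made for illustration, not a description of what mz.pack does internally.

```python
def pack_sketch(rows):
    """Split flat (text_left, text_right) rows into left, right, relation."""
    left, right, relation = {}, {}, []
    left_ids, right_ids = {}, {}  # text -> generated id
    for row in rows:
        # setdefault reuses the id of a text that was already seen.
        lid = left_ids.setdefault(row["text_left"], f"L-{len(left_ids)}")
        rid = right_ids.setdefault(row["text_right"], f"R-{len(right_ids)}")
        left[lid] = row["text_left"]
        right[rid] = row["text_right"]
        relation.append({"id_left": lid, "id_right": rid})
    return left, right, relation


# Same toy texts as the cell above.
rows = [{"text_left": l, "text_right": r}
        for l, r in zip("ARSAARSA", "arstenus")]
left, right, relation = pack_sketch(rows)
print(len(left), len(right), len(relation))
```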

Unpack

unpack formats the data so that MatchZoo models can fit it directly. For more details, consult matchzoo/tutorials/models.ipynb.



In [22]:

    
x, y = data_pack[:3].unpack()



In [23]:

    
x









    Out[23]:





{'id_left': array(['Q17', 'Q2', 'Q5'], dtype='<U3'),
 'text_left': array(['how much are the harry potter movies worth',
        'How are the directions of the velocity and force vectors related in a circular motion',
        'how did apollo creed die'], dtype='<U85'),
 'id_right': array(['D17-8', 'D2-1', 'D5-1'], dtype='<U5'),
 'text_right': array(['According to Rowling, the main theme is death.',
        'It can be uniform, with constant angular rate of rotation (and constant speed), or non-uniform with a changing rate of rotation.',
        'He was played by Carl Weathers .'], dtype='<U128')}



In [24]:

    
y









    Out[24]:





array([[1],
       [1],
       [1]])
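
The shape of the result (a dictionary of columns for x, a column of labels for y) can be reproduced with a few lines of plain Python. `unpack_sketch` is a hypothetical helper that returns lists, whereas the real method returns NumPy arrays as shown above.

```python
def unpack_sketch(rows, feature_names, label_name="label"):
    """Turn row dicts into (x, y).

    x maps each feature name to a list of values; y is a list of
    one-element lists, mirroring the (n, 1) label array shown above.
    """
    x = {name: [row[name] for row in rows] for name in feature_names}
    y = [[row[label_name]] for row in rows]
    return x, y


rows = [
    {"id_left": "Q17", "id_right": "D17-8", "label": 1},
    {"id_left": "Q2", "id_right": "D2-1", "label": 1},
    {"id_left": "Q5", "id_right": "D5-1", "label": 1},
]
x, y = unpack_sketch(rows, ["id_left", "id_right"])
print(x["id_left"], y)
```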

Data Sets

MatchZoo incorporates various datasets that can be loaded as MatchZoo native data structures.



In [25]:

    
mz.datasets.list_available()









    Out[25]:





['toy', 'wiki_qa', 'embeddings', 'quora_qp', 'snli']

The toy dataset doesn't need to be downloaded and can be directly used. It's the best choice to get things rolling.



In [26]:

    
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()









    Out[26]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       how are glacier caves formed?  D1-0      A partly submerged glacier cave on Perito More...  0.0
1  Q1       how are glacier caves formed?  D1-1      The ice facade is approximately 60 m high          0.0
2  Q1       how are glacier caves formed?  D1-2      Ice formations in the Titlis glacier cave          0.0
3  Q1       how are glacier caves formed?  D1-3      A glacier cave is a cave formed within the ice...  1.0
4  Q1       how are glacier caves formed?  D1-4      Glacier caves are often called ice caves , but...  0.0



In [27]:

    
toy_train_classification, classes = mz.datasets.toy.load_data(
    stage='train', task='classification', return_classes=True)
toy_train_classification.frame().head()









    Out[27]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       how are glacier caves formed?  D1-0      A partly submerged glacier cave on Perito More...  [1, 0]
1  Q1       how are glacier caves formed?  D1-1      The ice facade is approximately 60 m high          [1, 0]
2  Q1       how are glacier caves formed?  D1-2      Ice formations in the Titlis glacier cave          [1, 0]
3  Q1       how are glacier caves formed?  D1-3      A glacier cave is a cave formed within the ice...  [0, 1]
4  Q1       how are glacier caves formed?  D1-4      Glacier caves are often called ice caves , but...  [1, 0]



In [28]:

    
classes









    Out[28]:





[False, True]

Other, larger datasets are downloaded automatically the first time you use them. Run the following lines to trigger the download.



In [29]:

    
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()









    Out[29]:







  
    
      
   id_left  text_left                                    id_right  text_right                                         label
0  Q8       How are epithelial tissues joined together?  D8-0      Cross section of sclerenchyma fibers in plant ...  0
1  Q8       How are epithelial tissues joined together?  D8-1      Microscopic view of a histologic specimen of h...  0
2  Q8       How are epithelial tissues joined together?  D8-2      In Biology , Tissue is a cellular organization...  0
3  Q8       How are epithelial tissues joined together?  D8-3      A tissue is an ensemble of similar cells from ...  0
4  Q8       How are epithelial tissues joined together?  D8-4      Organs are then formed by the functional group...  0



In [30]:

    
snli_test_classification, classes = mz.datasets.snli.load_data(
    stage='test', task='classification', return_classes=True)
snli_test_classification.frame().head()









    Out[30]:







  
    
      
   id_left  text_left                                          id_right  text_right                             label
0  L-0      This church choir sings to the masses as they ...  R-0       The church has cracks in the ceiling.  [0, 0, 1, 0]
1  L-0      This church choir sings to the masses as they ...  R-1       The church is filled with song.        [1, 0, 0, 0]
2  L-0      This church choir sings to the masses as they ...  R-2       A choir singing at a baseball game.    [0, 1, 0, 0]
3  L-1      A woman with a green headscarf, blue shirt and...  R-3       The woman is young.                    [0, 0, 1, 0]
4  L-1      A woman with a green headscarf, blue shirt and...  R-4       The woman is very happy.               [1, 0, 0, 0]



In [31]:

    
classes









    Out[31]:





['entailment', 'contradiction', 'neutral', '-']

Preprocessing

Preprocessors

matchzoo.preprocessors are responsible for transforming data into the forms that matchzoo.models expect. BasicPreprocessor is used for models with common input forms, and some other models have customized preprocessors made just for them.



In [32]:

    
mz.preprocessors.list_available()









    Out[32]:





[matchzoo.preprocessors.dssm_preprocessor.DSSMPreprocessor,
 matchzoo.preprocessors.naive_preprocessor.NaivePreprocessor,
 matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor,
 matchzoo.preprocessors.cdssm_preprocessor.CDSSMPreprocessor]

When in doubt, use the default preprocessor a model class provides.



In [33]:

    
preprocessor = mz.models.Naive.get_default_preprocessor()

A preprocessor should be used in two steps: first fit, then transform. fit collects information into context, which holds everything the preprocessor needs for transform, along with other information that is useful later (for example, vocab_size). fit changes only the preprocessor's inner state, not the input data. In contrast, transform returns a modified copy of the input data without changing the preprocessor's inner state.



In [34]:

    
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
preprocessor.fit(train_raw)
preprocessor.context









    



Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 1283.21it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4023.97it/s]
Processing text_right with append: 100%|██████████| 100/100 [00:00<00:00, 98527.23it/s]
Building FrequencyFilter from a datapack.: 100%|██████████| 100/100 [00:00<00:00, 33645.95it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 105596.78it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 27962.03it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 106725.29it/s]
Building Vocabulary from a datapack.: 100%|██████████| 1665/1665 [00:00<00:00, 1289661.34it/s]






    Out[34]:





{'filter_unit': <matchzoo.preprocessors.units.frequency_filter.FrequencyFilter at 0x1331c92e8>,
 'vocab_unit': <matchzoo.preprocessors.units.vocabulary.Vocabulary at 0x133f96e80>,
 'vocab_size': 285,
 'embedding_input_dim': 285,
 'input_shapes': [(30,), (30,)]}



In [35]:

    
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)









    



Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 7532.25it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4073.29it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 79182.63it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 22174.03it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 55180.95it/s]
Processing length_left with len: 100%|██████████| 13/13 [00:00<00:00, 22776.09it/s]
Processing length_right with len: 100%|██████████| 100/100 [00:00<00:00, 171546.18it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 15552.18it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 73856.38it/s]
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 3/3 [00:00<00:00, 2962.08it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 20/20 [00:00<00:00, 4877.66it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 32896.50it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 4703.89it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 37600.22it/s]
Processing length_left with len: 100%|██████████| 3/3 [00:00<00:00, 9612.61it/s]
Processing length_right with len: 100%|██████████| 20/20 [00:00<00:00, 22221.48it/s]
Processing text_left with transform: 100%|██████████| 3/3 [00:00<00:00, 4048.56it/s]
Processing text_right with transform: 100%|██████████| 20/20 [00:00<00:00, 20198.91it/s]
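
The fit/transform contract used in the cells above can be sketched with a toy class. `LowercaseVocabPreprocessor` is hypothetical and far simpler than any MatchZoo preprocessor; reserving index 0 for unseen tokens is an assumption of this sketch.

```python
class LowercaseVocabPreprocessor:
    """Toy preprocessor: fit builds a vocabulary, transform maps tokens to ids."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        # Only the preprocessor's own state changes; `texts` is left as-is.
        terms = sorted({tok for text in texts for tok in text.lower().split()})
        # Index 0 is reserved for tokens that were not seen during fit.
        self.context["term_index"] = {t: i + 1 for i, t in enumerate(terms)}
        self.context["vocab_size"] = len(terms) + 1
        return self

    def transform(self, texts):
        # Returns a new list; neither `texts` nor self.context is modified.
        index = self.context["term_index"]
        return [[index.get(tok, 0) for tok in text.lower().split()]
                for text in texts]


train = ["how did apollo creed die", "how a vul works"]
test = ["how did rocky die"]

pre = LowercaseVocabPreprocessor().fit(train)  # fit on training data only
train_ids = pre.transform(train)
test_ids = pre.transform(test)                 # reuse the fitted state
print(test_ids, pre.context["vocab_size"])
```

Fitting on the training data only and reusing the fitted state for the test data keeps both sets in the same index space; "rocky" was never seen during fit, so it maps to the reserved index.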



In [36]:

    
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)









    



Parameter "task" set to Ranking Task.
Parameter "input_shapes" set to [(30,), (30,)].
Epoch 1/1
100/100 [==============================] - 0s 970us/step - loss: 31720.4902






    Out[36]:





{mean_average_precision(0.0): 0.08333333333333333}

Processor Units

Preprocessors use the processor units in mz.preprocessors.units to transform data. Each processor unit corresponds to one specific transformation, and you may use units independently to preprocess a data pack.



In [37]:

    
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()









    Out[37]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       how are glacier caves formed?  D1-0      A partly submerged glacier cave on Perito More...  0.0
1  Q1       how are glacier caves formed?  D1-1      The ice facade is approximately 60 m high          0.0
2  Q1       how are glacier caves formed?  D1-2      Ice formations in the Titlis glacier cave          0.0
3  Q1       how are glacier caves formed?  D1-3      A glacier cave is a cave formed within the ice...  1.0
4  Q1       how are glacier caves formed?  D1-4      Glacier caves are often called ice caves , but...  0.0



In [38]:

    
tokenizer = mz.preprocessors.units.Tokenize()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]









    



Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 7794.99it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 5312.00it/s]






    Out[38]:







  
    
      
   id_left  text_left                              id_right  text_right                                         label
0  Q1       [how, are, glacier, caves, formed, ?]  D1-0      [A, partly, submerged, glacier, cave, on, Peri...  0.0
1  Q1       [how, are, glacier, caves, formed, ?]  D1-1      [The, ice, facade, is, approximately, 60, m, h...  0.0
2  Q1       [how, are, glacier, caves, formed, ?]  D1-2      [Ice, formations, in, the, Titlis, glacier, cave]  0.0
3  Q1       [how, are, glacier, caves, formed, ?]  D1-3      [A, glacier, cave, is, a, cave, formed, within...  1.0
4  Q1       [how, are, glacier, caves, formed, ?]  D1-4      [Glacier, caves, are, often, called, ice, cave...  0.0



In [39]:

    
lower_caser = mz.preprocessors.units.Lowercase()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]









    



Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 17737.79it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 63811.11it/s]






    Out[39]:







  
    
      
   id_left  text_left                              id_right  text_right                                         label
0  Q1       [how, are, glacier, caves, formed, ?]  D1-0      [a, partly, submerged, glacier, cave, on, peri...  0.0
1  Q1       [how, are, glacier, caves, formed, ?]  D1-1      [the, ice, facade, is, approximately, 60, m, h...  0.0
2  Q1       [how, are, glacier, caves, formed, ?]  D1-2      [ice, formations, in, the, titlis, glacier, cave]  0.0
3  Q1       [how, are, glacier, caves, formed, ?]  D1-3      [a, glacier, cave, is, a, cave, formed, within...  1.0
4  Q1       [how, are, glacier, caves, formed, ?]  D1-4      [glacier, caves, are, often, called, ice, cave...  0.0

Alternatively, use chain_transform to apply multiple processor units in a single pass.



In [40]:

    
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.preprocessors.units.Tokenize(),
                           mz.preprocessors.units.Lowercase()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]









    



Processing text_left with chain_transform of Tokenize => Lowercase: 100%|██████████| 13/13 [00:00<00:00, 6659.25it/s]
Processing text_right with chain_transform of Tokenize => Lowercase: 100%|██████████| 100/100 [00:00<00:00, 4481.48it/s]






    Out[40]:







  
    
      
   id_left  text_left                              id_right  text_right                                         label
0  Q1       [how, are, glacier, caves, formed, ?]  D1-0      [a, partly, submerged, glacier, cave, on, peri...  0.0
1  Q1       [how, are, glacier, caves, formed, ?]  D1-1      [the, ice, facade, is, approximately, 60, m, h...  0.0
2  Q1       [how, are, glacier, caves, formed, ?]  D1-2      [ice, formations, in, the, titlis, glacier, cave]  0.0
3  Q1       [how, are, glacier, caves, formed, ?]  D1-3      [a, glacier, cave, is, a, cave, formed, within...  1.0
4  Q1       [how, are, glacier, caves, formed, ?]  D1-4      [glacier, caves, are, often, called, ice, cave...  0.0
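
Chaining transforms is ordinary function composition. The sketch below composes two simple callables; `chain`, `tokenize` and `lowercase` are illustrative names, and whitespace splitting is a much simpler tokenizer than the Tokenize unit used above.

```python
from functools import reduce


def chain(*funcs):
    """Compose funcs left to right, so chain(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, func: func(acc), funcs, value)


def tokenize(text):
    return text.split()  # whitespace split, for illustration only


def lowercase(tokens):
    return [token.lower() for token in tokens]


tokenize_then_lower = chain(tokenize, lowercase)
tokens = tokenize_then_lower("Ice formations in the Titlis glacier cave")
print(tokens)
```

Order matters: `lowercase` expects a list of tokens, so it must come after `tokenize`.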

Notice that some processor units are stateful so we have to fit them before using their transform.



In [41]:

    
mz.preprocessors.units.Vocabulary.__base__









    Out[41]:





matchzoo.preprocessors.units.stateful_unit.StatefulUnit



In [42]:

    
vocab_unit = mz.preprocessors.units.Vocabulary()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)

A StatefulUnit like this saves information in its state when fit, much as a preprocessor saves information in its context. Here, the vocabulary unit saves a term-to-index mapping and an index-to-term mapping, called term_index and index_term respectively. We can then proceed to transform a data pack.



In [43]:

    
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])









    



how 153
are 604
glacier 55
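
A stateful unit of this kind can be sketched in a few lines. `VocabularySketch` is a hypothetical stand-in: it numbers terms in first-seen order and reserves index 0 for unknown terms, both of which are assumptions of the sketch, so its indices will not match the ones printed above.

```python
class VocabularySketch:
    """Toy stateful unit: fit fills `state`, transform only reads it."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        term_index = {}
        for term in tokens:
            # A repeated term keeps the index it received the first time.
            term_index.setdefault(term, len(term_index) + 1)
        self.state["term_index"] = term_index
        self.state["index_term"] = {i: t for t, i in term_index.items()}

    def transform(self, tokens):
        term_index = self.state["term_index"]
        return [term_index.get(term, 0) for term in tokens]


unit = VocabularySketch()
unit.fit(["how", "are", "glacier", "caves", "formed", "?", "glacier"])
indices = unit.transform(["how", "are", "ice", "caves", "formed", "?"])
print(indices, unit.state["index_term"][3])
```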



In [44]:

    
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]









    



Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 18211.74it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 58408.36it/s]






    Out[44]:







  
    
      
   id_left  text_left                      id_right  text_right                                         label
0  Q1       [153, 604, 55, 448, 752, 629]  D1-0      [688, 278, 896, 55, 165, 25, 493, 851, 55, 509]    0.0
1  Q1       [153, 604, 55, 448, 752, 629]  D1-1      [371, 800, 827, 185, 87, 76, 901, 639]             0.0
2  Q1       [153, 604, 55, 448, 752, 629]  D1-2      [800, 378, 394, 371, 213, 55, 165]                 0.0
3  Q1       [153, 604, 55, 448, 752, 629]  D1-3      [688, 55, 165, 185, 688, 165, 752, 712, 371, 8...  1.0
4  Q1       [153, 604, 55, 448, 752, 629]  D1-4      [55, 448, 604, 856, 389, 800, 448, 808, 72, 33...  0.0

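Since this usage is quite common, we provide a function, mz.build_vocab_unit, that does the same thing. For other stateful units, consult their documentation and try mz.build_unit_from_data_pack. Note that the cell below builds the vocabulary from the raw, untokenized texts, so the indices in its output appear to be per character rather than per term; tokenize the data pack first if you want term-level indices.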



In [45]:

    
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]









    



Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 18205.66it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 146398.05it/s]
Building Vocabulary from a datapack.: 100%|██████████| 13893/13893 [00:00<00:00, 3564222.00it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 12546.24it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 23825.86it/s]






    Out[45]:







  
    
      
   id_left  text_left                                          id_right  text_right                                         label
0  Q1       [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,...  D1-0      [12, 29, 66, 24, 72, 54, 33, 51, 29, 21, 25, 1...  0.0
1  Q1       [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,...  D1-1      [37, 4, 32, 29, 7, 49, 32, 29, 68, 24, 49, 24,...  0.0
2  Q1       [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,...  D1-2      [8, 49, 32, 29, 68, 1, 72, 70, 24, 54, 7, 1, 2...  0.0
3  Q1       [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,...  D1-3      [12, 29, 11, 33, 24, 49, 7, 32, 72, 29, 49, 24...  1.0
4  Q1       [4, 1, 55, 29, 24, 72, 32, 29, 11, 33, 24, 49,...  D1-4      [6, 33, 24, 49, 7, 32, 72, 29, 49, 24, 14, 32,...  0.0

DataGenerator

Some MatchZoo models (e.g. DRMM, MatchPyramid) require batch-wise information during training, so they must be trained with fit_generator instead of fit. In addition, the fully transformed data sometimes does not fit in memory, so part of the preprocessing has to be deferred until a batch is actually needed.

MatchZoo provides DataGenerator for both cases. Instead of fit, call fit_generator with a data generator that unpacks data on the fly.
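
Unpacking on the fly can be pictured as a Python generator that builds one batch at a time. The function `batch_generator` below is a hypothetical sketch of that idea using plain lists; it is not how DataGenerator is implemented.

```python
def batch_generator(rows, batch_size):
    """Yield (x, y) batches lazily instead of unpacking everything up front."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        x = {
            "text_left": [row["text_left"] for row in batch],
            "text_right": [row["text_right"] for row in batch],
        }
        y = [[row["label"]] for row in batch]
        yield x, y


rows = [{"text_left": f"q{i}", "text_right": f"d{i}", "label": i % 2}
        for i in range(10)]
sizes = [len(y) for _, y in batch_generator(rows, batch_size=4)]
print(sizes)
```

Only the batch currently being consumed exists in unpacked form, which is what keeps memory use bounded.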



In [46]:

    
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)









    



Epoch 1/1
100/100 [==============================] - 0s 18us/step - loss: 30618.9355






    Out[46]:





<keras.callbacks.History at 0x1085f8438>



In [47]:

    
data_gen = mz.DataGenerator(train_preprocessed)
model.fit_generator(data_gen)









    



Epoch 1/1
1/1 [==============================] - 0s 18ms/step - loss: 29543.1270






    Out[47]:





<keras.callbacks.History at 0x10860dba8>

The data preprocessing of DSSM uses a lot of memory, but we can work around that with the callback hook of DataGenerator.



In [48]:

    
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(train_raw, verbose=0)



In [49]:

    
dssm = mz.models.DSSM()
dssm.params['task'] = mz.tasks.Ranking()
dssm.params.update(preprocessor.context)
dssm.build()
dssm.compile()



In [50]:

    
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.preprocessors.units.WordHashing(term_index)
data_generator = mz.DataGenerator(
    data,
    batch_size=4,
    callbacks=[
        mz.data_generator.callbacks.LambdaCallback(
            on_batch_data_pack=lambda dp: dp.apply_on_text(
                hashing_unit.transform, inplace=True, verbose=0)
        )
    ]
)
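The idea behind the callback above can be pictured with a small standalone sketch. It is not MatchZoo's implementation: letter_trigrams, hash_tokens, generate and the tiny term_index are all hypothetical, and the hashing shown is a simplified letter-trigram count vector. What it illustrates is that the data stays in its compact form and is expanded into wide vectors only for the batch currently being produced.

```python
from typing import Callable, Dict, Iterator, List


def letter_trigrams(word: str) -> List[str]:
    """Split a word into letter trigrams, with '#' marking the boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hash_tokens(tokens: List[str], term_index: Dict[str, int]) -> List[int]:
    """Expand tokens into a trigram-count vector of length len(term_index)."""
    counts = [0] * len(term_index)
    for token in tokens:
        for gram in letter_trigrams(token):
            if gram in term_index:
                counts[term_index[gram]] += 1
    return counts


def generate(
    data: List[List[str]],
    batch_size: int,
    on_batch_item: Callable[[List[str]], List[int]],
) -> Iterator[List[List[int]]]:
    """Yield batches, applying the hook to each item only at batch time."""
    for start in range(0, len(data), batch_size):
        yield [on_batch_item(item) for item in data[start:start + batch_size]]


# Hypothetical vocabulary of trigrams and a tiny tokenized corpus.
term_index = {"#ca": 0, "cav": 1, "ave": 2, "ve#": 3,
              "#ic": 4, "ice": 5, "ce#": 6}
corpus = [["cave"], ["ice", "cave"], ["ice"]]

hashed_batches = list(
    generate(corpus, batch_size=2,
             on_batch_item=lambda tokens: hash_tokens(tokens, term_index))
)
```

Only batch_size wide vectors exist at any moment, which is why deferring the hashing step reduces peak memory.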



In [51]:

    
dssm.fit_generator(data_generator)









    



Epoch 1/1
25/25 [==============================] - 1s 33ms/step - loss: 0.0471






    Out[51]:





<keras.callbacks.History at 0x1337f78d0>

In addition, losses like RankHingeLoss and RankCrossEntropyLoss must be used with a DataGenerator in mode='pair', since the batch-wise information they need is computed on the fly.



In [52]:

    
num_neg = 4
task = mz.tasks.Ranking(loss=mz.losses.RankHingeLoss(num_neg=num_neg))
preprocessor = model.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_raw)









    



Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 3234.62it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 4000.52it/s]
Processing text_right with append: 100%|██████████| 100/100 [00:00<00:00, 194993.21it/s]
Building FrequencyFilter from a datapack.: 100%|██████████| 100/100 [00:00<00:00, 98898.94it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 89698.55it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 28622.55it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 195995.51it/s]
Building Vocabulary from a datapack.: 100%|██████████| 1665/1665 [00:00<00:00, 2223341.66it/s]
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 13/13 [00:00<00:00, 6020.98it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 100/100 [00:00<00:00, 5079.76it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 106373.42it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 21274.27it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 56397.79it/s]
Processing length_left with len: 100%|██████████| 13/13 [00:00<00:00, 24863.64it/s]
Processing length_right with len: 100%|██████████| 100/100 [00:00<00:00, 142228.01it/s]
Processing text_left with transform: 100%|██████████| 13/13 [00:00<00:00, 21526.23it/s]
Processing text_right with transform: 100%|██████████| 100/100 [00:00<00:00, 84767.66it/s]



In [53]:

    
model = mz.models.Naive()
model.params['task'] = task
model.params.update(preprocessor.context)
model.build()
model.compile()



In [54]:

    
data_gen = mz.DataGenerator(
    train_processed,
    mode='pair',
    num_neg=num_neg,
    num_dup=2,
    batch_size=32
)
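As a rough illustration of why pair mode and a ranking loss go together, the sketch below reorders relations so that every positive example is followed by num_neg negatives for the same query, repeated num_dup times, and then computes a hinge loss over those groups. This is a simplified sketch under stated assumptions, not MatchZoo's code: make_pairs and rank_hinge_loss are hypothetical helpers, negatives are sampled with replacement, and the loss compares each positive score with the mean of its negatives using a margin of 1.0.

```python
import random
from typing import Dict, List, Tuple

Relation = Tuple[str, str, float]  # (id_left, id_right, label)


def make_pairs(relations: List[Relation], num_neg: int, num_dup: int,
               seed: int = 0) -> List[Relation]:
    """Arrange relations as repeated groups of 1 positive + num_neg negatives."""
    rng = random.Random(seed)
    by_left: Dict[str, List[Relation]] = {}
    for rel in relations:
        by_left.setdefault(rel[0], []).append(rel)
    ordered: List[Relation] = []
    for rels in by_left.values():
        positives = [r for r in rels if r[2] > 0]
        negatives = [r for r in rels if r[2] <= 0]
        if not negatives:
            continue  # a query without negatives cannot form a pair
        for pos in positives:
            for _ in range(num_dup):
                ordered.append(pos)
                ordered.extend(rng.choices(negatives, k=num_neg))
    return ordered


def rank_hinge_loss(scores: List[float], num_neg: int,
                    margin: float = 1.0) -> float:
    """Mean of max(0, margin + mean(negative scores) - positive score)."""
    group = num_neg + 1
    losses = []
    for start in range(0, len(scores), group):
        pos = scores[start]
        neg_mean = sum(scores[start + 1:start + group]) / num_neg
        losses.append(max(0.0, margin + neg_mean - pos))
    return sum(losses) / len(losses)


relations = [("Q1", "D1-0", 0.0), ("Q1", "D1-1", 0.0), ("Q1", "D1-3", 1.0)]
pairs = make_pairs(relations, num_neg=2, num_dup=2)
loss = rank_hinge_loss([2.0, 0.5, 0.5, 0.0, 1.0, 1.0], num_neg=2)
```

Because the loss slices the score vector in fixed groups of num_neg + 1, the batch layout and the loss must agree on num_neg, which is why the same value is passed to both RankHingeLoss and DataGenerator in the cells above.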



In [55]:

    
model.fit_generator(data_gen)









    



Epoch 1/1
1/1 [==============================] - 0s 222ms/step - loss: 28.1434






    Out[55]:





<keras.callbacks.History at 0x1332f9ba8>
