In [1]:
import matchzoo as mz
import pandas as pd
print(mz.__version__)
matchzoo.DataPack is a MatchZoo-native data structure that most MatchZoo data handling processes build upon. A matchzoo.DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
In [2]:
data_pack = mz.datasets.toy.load_data()
In [3]:
data_pack.left.head()
Out[3]:
In [4]:
data_pack.right.head()
Out[4]:
In [5]:
data_pack.relation.head()
Out[5]:
The main reason for using a matchzoo.DataPack instead of a pandas.DataFrame is efficiency: we save space by storing each text only once, and save time by processing each text only once.
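The saving is easy to see on the toy data: relation can hold many more rows than there are unique texts. A quick check, using only the attributes shown above:
len(data_pack.relation)  # number of (left, right) pairs
len(data_pack.left)      # unique left texts, stored once each
len(data_pack.right)     # unique right texts, stored once each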
However, since a single big table is easier to understand and manage, we provide frame, which merges the three parts into one pandas.DataFrame when called.
In [6]:
data_pack.frame().head()
Out[6]:
Notice that frame is not a method, but a property that returns a matchzoo.DataPack.FrameView object.
In [7]:
type(data_pack.frame)
Out[7]:
This view reflects changes in the data pack, and can be called to create a pandas.DataFrame at any time.
In [8]:
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1
In [9]:
frame().head()
Out[9]:
You may use [] to slice a matchzoo.DataPack, much like slicing a list. As with a list, slicing returns a shallow copy of the sliced data.
In [10]:
data_slice = data_pack[5:10]
A sliced data pack's relation will directly reflect the slicing.
In [11]:
data_slice.relation
Out[11]:
In addition, left and right will be processed so that only the relevant entries are kept.
In [12]:
data_slice.left
Out[12]:
In [13]:
data_slice.right
Out[13]:
It is also possible to slice a frame view object.
In [14]:
data_pack.frame[5:10]
Out[14]:
This is equivalent to slicing the data pack first and then taking the frame, since both are driven by relation.
In [15]:
data_slice.frame() == data_pack.frame[5:10]
Out[15]:
Slicing is extremely useful for partitioning data into training and testing sets.
In [16]:
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
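A quick sanity check on the split (a minimal sketch; len of a data pack counts the rows of its relation, as used above):
assert len(train_slice) + len(test_slice) == len(data_pack)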
Use apply_on_text to transform texts in a matchzoo.DataPack. Check the documentation for more information.
In [17]:
data_slice.apply_on_text(len).frame()
Out[17]:
In [18]:
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()
Out[18]:
Since adding text-length columns is quite common, you may simply do:
In [19]:
data_slice.append_text_length().frame()
Out[19]:
To one-hot encode the labels:
In [20]:
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()
Out[20]:
Use matchzoo.pack to build your own data pack. Check the documentation for more information.
In [21]:
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()
Out[21]:
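Beyond the two text columns, mz.pack should also pick up an optional label column and place it in relation (an assumption here, mirroring the relation structure shown earlier):
labeled = pd.DataFrame({
    'text_left': list('AABB'),
    'text_right': list('abab'),
    'label': [1, 0, 1, 0]   # hypothetical labels for illustration
})
mz.pack(labeled).relation.head()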
unpack formats a data pack so that MatchZoo models can fit it directly. For more details, consult matchzoo/tutorials/models.ipynb.
In [22]:
x, y = data_pack[:3].unpack()
In [23]:
x
Out[23]:
In [24]:
y
Out[24]:
MatchZoo incorporates various datasets that can be loaded as MatchZoo native data structures.
In [25]:
mz.datasets.list_available()
Out[25]:
The toy dataset doesn't need to be downloaded and can be directly used. It's the best choice to get things rolling.
In [26]:
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()
Out[26]:
In [27]:
toy_dev_classification, classes = mz.datasets.toy.load_data(
    stage='train', task='classification', return_classes=True)
toy_dev_classification.frame().head()
Out[27]:
In [28]:
classes
Out[28]:
Other, larger datasets will be downloaded automatically the first time you use them. Run the following lines to trigger the downloads.
In [29]:
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()
Out[29]:
In [30]:
snli_test_classification, classes = mz.datasets.snli.load_data(
    stage='test', task='classification', return_classes=True)
snli_test_classification.frame().head()
Out[30]:
In [31]:
classes
Out[31]:
matchzoo.preprocessors are responsible for transforming data into the forms that matchzoo.models expect. BasicPreprocessor serves models with common input forms, and some other models have customized preprocessors made just for them.
In [32]:
mz.preprocessors.list_available()
Out[32]:
When in doubt, use the default preprocessor a model class provides.
In [33]:
preprocessor = mz.models.Naive.get_default_preprocessor()
A preprocessor is used in two steps: first fit, then transform. fit collects information into context, which holds everything the preprocessor needs for transform, along with other useful information for later use. fit only changes the preprocessor's inner state, not the input data. In contrast, transform returns a modified copy of the input data without changing the preprocessor's inner state.
In [34]:
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
preprocessor.fit(train_raw)
preprocessor.context
Out[34]:
In [35]:
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)
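When fitting and transforming the same data, the two steps can be collapsed into one call; fit_transform (used later in this tutorial) does exactly that:
train_preprocessed = preprocessor.fit_transform(train_raw)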
In [36]:
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)
Out[36]:
Preprocessors transform data by means of processor units (mz.preprocessors.units). Each processor unit corresponds to a specific transformation, and you may use them independently to preprocess a data pack.
In [37]:
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()
Out[37]:
In [38]:
tokenizer = mz.preprocessors.units.Tokenize()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]
Out[38]:
In [39]:
lower_caser = mz.preprocessors.units.Lowercase()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]
Out[39]:
Or use chain_transform to apply multiple processor units in one pass:
In [40]:
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.preprocessors.units.Tokenize(),
                            mz.preprocessors.units.Lowercase()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]
Out[40]:
Notice that some processor units are stateful, so we have to fit them before calling their transform.
In [41]:
mz.preprocessors.units.Vocabulary.__base__
Out[41]:
In [42]:
vocab_unit = mz.preprocessors.units.Vocabulary()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)
Such a stateful unit saves information in its state when fit, similar to the context of a preprocessor. In our case, the vocabulary unit saves a term-to-index mapping and an index-to-term mapping, called term_index and index_term respectively. We can then proceed to transform a data pack.
In [43]:
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])
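The inverse mapping works the same way; a minimal round-trip check using the index_term mapping named above:
index = vocab_unit.state['term_index']['how']
vocab_unit.state['index_term'][index]  # 'how'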
In [44]:
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]
Out[44]:
Since this usage is quite common, a wrapper function does the same thing in one call, as shown below. For other stateful units, consult their documentation and try mz.build_unit_from_data_pack.
In [45]:
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]
Out[45]:
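As a sketch of the general helper mentioned above (assuming it takes the unit and the data pack; build_vocab_unit is essentially this call with a fresh Vocabulary unit):
vocab_unit = mz.build_unit_from_data_pack(
    mz.preprocessors.units.Vocabulary(), data_pack)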
Some MatchZoo models (e.g. DRMM, MatchPyramid) require batch-wise information for training, so fit_generator has to be used instead of fit. In addition, sometimes memory simply can't hold all the transformed data, so part of the preprocessing has to be delayed until batch time.
MatchZoo provides DataGenerator for both cases. Instead of fit, you may call fit_generator with a data generator that unpacks data on the fly.
In [46]:
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
Out[46]:
In [47]:
data_gen = mz.DataGenerator(train_preprocessed)
model.fit_generator(data_gen)
Out[47]:
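A DataGenerator also behaves like a sequence of ready-made batches (a sketch, assuming the keras-style Sequence interface such generators typically implement):
len(data_gen)                   # number of batches per epoch
x_batch, y_batch = data_gen[0]  # one batch, already in unpacked form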
The data preprocessing of DSSM eats a lot of memory, but we can work around that using the callback hook of DataGenerator.
In [48]:
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(train_raw, verbose=0)
In [49]:
dssm = mz.models.DSSM()
dssm.params['task'] = mz.tasks.Ranking()
dssm.params.update(preprocessor.context)
dssm.build()
dssm.compile()
In [50]:
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.preprocessors.units.WordHashing(term_index)
data_generator = mz.DataGenerator(
    data,
    batch_size=4,
    callbacks=[
        mz.data_generator.callbacks.LambdaCallback(
            on_batch_data_pack=lambda dp: dp.apply_on_text(
                hashing_unit.transform, inplace=True, verbose=0)
        )
    ]
)
In [51]:
dssm.fit_generator(data_generator)
Out[51]:
In addition, losses like RankHingeLoss and RankCrossEntropyLoss have to be used with a DataGenerator in mode='pair', since the batch-wise information they need is computed on the fly.
In [52]:
num_neg = 4
task = mz.tasks.Ranking(loss=mz.losses.RankHingeLoss(num_neg=num_neg))
preprocessor = model.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_raw)
In [53]:
model = mz.models.Naive()
model.params['task'] = task
model.params.update(preprocessor.context)
model.build()
model.compile()
In [54]:
data_gen = mz.DataGenerator(
    train_processed,
    mode='pair',
    num_neg=num_neg,
    num_dup=2,
    batch_size=32
)
In [55]:
model.fit_generator(data_gen)
Out[55]: