In [1]:
import matchzoo as mz
import pandas as pd
print(mz.__version__)
matchzoo.DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A matchzoo.DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
In [2]:
data_pack = mz.datasets.toy.load_data()
In [3]:
data_pack.left.head()
Out[3]:
In [4]:
data_pack.right.head()
Out[4]:
In [5]:
data_pack.relation.head()
Out[5]:
The main reason for using a matchzoo.DataPack instead of a single pandas.DataFrame is efficiency: each text is stored and processed only once, even when it appears in many pairs, which saves both space and time.
However, since one big table is easier to understand and manage, we provide frame, which merges the three parts into a single pandas.DataFrame when called.
In [6]:
data_pack.frame().head()
Out[6]:
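To make the three-part layout concrete, here is a minimal pure-Python sketch of how separate left, right and relation tables could be joined into one flat table. It does not use MatchZoo: the function merge_parts, the sample texts and the id_left/id_right keys are assumptions made up for illustration.

```python
# Hypothetical miniature of the three-part layout (not MatchZoo code).
left = {"Q1": "how are glaciers formed"}
right = {"D1": "a glacier forms from snow", "D2": "ice is frozen water"}
relation = [
    {"id_left": "Q1", "id_right": "D1", "label": 1},
    {"id_left": "Q1", "id_right": "D2", "label": 0},
]


def merge_parts(left, right, relation):
    """Join each relation row with the texts it refers to."""
    return [
        {
            "id_left": row["id_left"],
            "text_left": left[row["id_left"]],
            "id_right": row["id_right"],
            "text_right": right[row["id_right"]],
            "label": row["label"],
        }
        for row in relation
    ]


flat = merge_parts(left, right, relation)
# The query text is stored once in `left` but appears in both flat rows.
for row in flat:
    print(row)
```

The flat table repeats the query text in every row, which is the duplication the three-part layout avoids.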
Notice that frame is not a method, but a property that returns a matchzoo.DataPack.FrameView object.
In [7]:
type(data_pack.frame)
Out[7]:
This view reflects changes in the data pack, and can be called to create a pandas.DataFrame at any time.
In [8]:
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1
In [9]:
frame().head()
Out[9]:
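The "live view" idea can be sketched in plain Python: the view keeps a reference to the underlying data and builds a fresh snapshot each time it is called. FrameViewSketch is a hypothetical stand-in written for this illustration, not the actual FrameView implementation.

```python
class FrameViewSketch:
    """Hypothetical stand-in for a live view: holds a reference, not a copy."""

    def __init__(self, relation):
        self._relation = relation  # same list object as the owner's

    def __call__(self):
        # Build a fresh snapshot from the current state on every call.
        return [dict(row) for row in self._relation]


relation = [{"label": 0}, {"label": 1}]
view = FrameViewSketch(relation)
before = view()

# Mutate the underlying data after the view was created.
for row in relation:
    row["label"] += 1

after = view()
print(before)  # snapshot taken before the change
print(after)   # reflects the incremented labels
```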
You may use [] to slice a matchzoo.DataPack, much like slicing a list. As with a list, this returns a shallow copy of the sliced data.
In [10]:
data_slice = data_pack[5:10]
A sliced data pack's relation will directly reflect the slicing.
In [11]:
data_slice.relation
Out[11]:
In addition, left and right will be processed so that only relevant information is kept.
In [12]:
data_slice.left
Out[12]:
In [13]:
data_slice.right
Out[13]:
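A rough sketch of what "only relevant information is kept" means: slice the relation rows, then keep only the left and right texts that those rows reference. The function slice_pack and the sample data are hypothetical and only mimic the behaviour described above.

```python
def slice_pack(left, right, relation, start, stop):
    """Slice relation rows, then keep only the texts those rows reference."""
    rel = relation[start:stop]
    left_ids = {row["id_left"] for row in rel}
    right_ids = {row["id_right"] for row in rel}
    return (
        {k: v for k, v in left.items() if k in left_ids},
        {k: v for k, v in right.items() if k in right_ids},
        rel,
    )


left = {"Q1": "first query", "Q2": "second query"}
right = {"D1": "doc one", "D2": "doc two", "D3": "doc three"}
relation = [
    {"id_left": "Q1", "id_right": "D1", "label": 1},
    {"id_left": "Q1", "id_right": "D2", "label": 0},
    {"id_left": "Q2", "id_right": "D3", "label": 1},
]

sub_left, sub_right, sub_relation = slice_pack(left, right, relation, 0, 2)
print(sorted(sub_left))   # ['Q1']
print(sorted(sub_right))  # ['D1', 'D2']
```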
It is also possible to slice a frame view object.
In [14]:
data_pack.frame[5:10]
Out[14]:
This is equivalent to slicing the data pack first and then building the frame, since both operations are based on the relation table.
In [15]:
data_slice.frame() == data_pack.frame[5:10]
Out[15]:
Slicing is extremely useful for partitioning data for training vs testing.
In [16]:
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
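Shuffling before slicing matters because a slice takes contiguous rows, so an ordered data set would otherwise give an unrepresentative split. The sketch below shows the same 80/20 pattern on a plain list; the fixed seed and the integer stand-ins for rows are assumptions made for repeatability.

```python
import random

rows = list(range(10))          # stand-in for ten relation rows
rng = random.Random(0)          # fixed seed so the split is repeatable
rng.shuffle(rows)

num_train = int(len(rows) * 0.8)
train_rows = rows[:num_train]
test_rows = rows[num_train:]

# The two parts are disjoint and together cover every row.
print(len(train_rows), len(test_rows))  # 8 2
```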
Use apply_on_text to transform texts in a matchzoo.DataPack. Check the documentation for more information.
In [17]:
data_slice.apply_on_text(len).frame()
Out[17]:
In [18]:
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()
Out[18]:
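The two calls above differ only in where the results go. The sketch below mimics that behaviour on a list of dictionaries: without rename the text columns are overwritten, with rename the results land in new columns. This apply_on_text is a hypothetical pure-Python imitation, not the MatchZoo method.

```python
def apply_on_text(rows, func, rename=None):
    """Apply `func` to both text columns and return new rows.

    With `rename`, results go into new columns; otherwise they overwrite.
    """
    left_key, right_key = rename or ("text_left", "text_right")
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row[left_key] = func(row["text_left"])
        new_row[right_key] = func(row["text_right"])
        result.append(new_row)
    return result


rows = [{"text_left": "abc", "text_right": "de"}]
overwritten = apply_on_text(rows, len)
appended = apply_on_text(rows, len, rename=("left_length", "right_length"))
print(overwritten)  # [{'text_left': 3, 'text_right': 2}]
print(appended)
```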
Since adding a column that holds the text length is quite common, you may simply do:
In [19]:
data_slice.append_text_length().frame()
Out[19]:
To one-hot encode the labels:
In [20]:
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()
Out[20]:
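For reference, one-hot encoding turns an integer class index into a vector with a single 1. The helper one_hot below is a hypothetical plain-Python sketch; it also shows why labels need to be integers smaller than num_classes.

```python
def one_hot(label, num_classes):
    """Return a list with a single 1 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [1 if i == label else 0 for i in range(num_classes)]


labels = [0, 2, 1]
encoded = [one_hot(label, num_classes=3) for label in labels]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```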
Use matchzoo.pack to build your own data pack. Check the documentation for more information.
In [21]:
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()
Out[21]:
Use unpack to format data so that MatchZoo models can fit it directly. For more details, consult matchzoo/tutorials/models.ipynb.
In [22]:
x, y = data_pack[:3].unpack()
In [23]:
x
Out[23]:
In [24]:
y
Out[24]:
MatchZoo incorporates various datasets that can be loaded as MatchZoo native data structures.
In [25]:
mz.datasets.list_available()
Out[25]:
The toy dataset doesn't need to be downloaded and can be directly used. It's the best choice to get things rolling.
In [26]:
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()
Out[26]:
In [27]:
toy_train_classification, classes = mz.datasets.toy.load_data(
    stage='train', task='classification', return_classes=True)
toy_train_classification.frame().head()
Out[27]:
In [28]:
classes
Out[28]:
Other, larger datasets will be automatically downloaded the first time you use them. Run the following lines to trigger downloading.
In [29]:
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()
Out[29]:
In [30]:
snli_test_classification, classes = mz.datasets.snli.load_data(
    stage='test', task='classification', return_classes=True)
snli_test_classification.frame().head()
Out[30]:
In [31]:
classes
Out[31]:
matchzoo.preprocessors are responsible for transforming data into the forms that matchzoo.models expect. BasicPreprocessor is used for models with common forms, and some other models have customized preprocessors made just for them.
In [32]:
mz.preprocessors.list_available()
Out[32]:
When in doubt, use the default preprocessor a model class provides.
In [33]:
preprocessor = mz.models.Naive.get_default_preprocessor()
A preprocessor should be used in two steps: first fit, then transform. fit collects information into context, which includes everything the preprocessor needs to transform, together with other useful information for later use. fit only changes the preprocessor's inner state, not the input data. In contrast, transform returns a modified copy of the input data without changing the preprocessor's inner state.
In [34]:
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
preprocessor.fit(train_raw)
preprocessor.context
Out[34]:
In [35]:
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)
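The fit-then-transform contract can be sketched with a toy class. LengthCapPreprocessor is hypothetical and far simpler than any MatchZoo preprocessor; it only illustrates that fit gathers context from the training data and that transform reuses that context on any data without modifying its input.

```python
class LengthCapPreprocessor:
    """Hypothetical two-step preprocessor: fit gathers context, transform uses it."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        # Only the preprocessor's own state changes here.
        self.context["max_len"] = max(len(t.split()) for t in texts)
        return self

    def transform(self, texts):
        # Returns new lists; neither `texts` nor `self.context` is modified.
        max_len = self.context["max_len"]
        return [t.split()[:max_len] for t in texts]


train = ["a b c", "d e"]
test = ["f g h i j"]

prep = LengthCapPreprocessor().fit(train)
test_out = prep.transform(test)
print(prep.context)  # {'max_len': 3}
print(test_out)      # [['f', 'g', 'h']]
```

Fitting on the training data only and then transforming both sets keeps the test data from influencing the learned context.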
In [36]:
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)
Out[36]:
Preprocessors utilize mz.preprocessors.units to transform data. Processor units correspond to specific transformations, and you may use them independently to preprocess a data pack.
In [37]:
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()
Out[37]:
In [38]:
tokenizer = mz.preprocessors.units.Tokenize()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]
Out[38]:
In [39]:
lower_caser = mz.preprocessors.units.Lowercase()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]
Out[39]:
Or use chain_transform to apply multiple processor units at one time:
In [40]:
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.preprocessors.units.Tokenize(),
                            mz.preprocessors.units.Lowercase()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]
Out[40]:
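Chaining amounts to function composition: each unit's output becomes the next unit's input, so the order matters. The sketch below uses a hypothetical chain helper and two toy functions, not the MatchZoo units.

```python
from functools import reduce


def chain(functions):
    """Compose functions left to right into a single callable."""
    def apply_all(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return apply_all


def tokenize(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


pipeline = chain([tokenize, lowercase])
result = pipeline("How Are Glaciers Formed")
print(result)  # ['how', 'are', 'glaciers', 'formed']
```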
Notice that some processor units are stateful, so we have to fit them before using their transform.
In [41]:
mz.preprocessors.units.Vocabulary.__base__
Out[41]:
In [42]:
vocab_unit = mz.preprocessors.units.Vocabulary()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)
Such a StatefulProcessorUnit will save information in its state when fit, similar to the context of a preprocessor. In our case here, the vocabulary unit will save a term-to-index mapping and an index-to-term mapping, called term_index and index_term respectively. Then we can proceed to transform a data pack.
In [43]:
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])
In [44]:
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]
Out[44]:
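A stateful unit of this kind can be sketched in a few lines. VocabularySketch is hypothetical: the alphabetical index order and the reserved index 0 for unseen terms are assumptions of this sketch, not a description of how the MatchZoo Vocabulary unit assigns indices.

```python
class VocabularySketch:
    """Hypothetical stateful unit: fit builds the mappings, transform uses them."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        # Index 0 is reserved here for unseen terms (an assumption of this sketch).
        terms = sorted(set(tokens))
        self.state["term_index"] = {t: i for i, t in enumerate(terms, start=1)}
        self.state["index_term"] = {i: t for t, i in self.state["term_index"].items()}

    def transform(self, tokens):
        return [self.state["term_index"].get(t, 0) for t in tokens]


unit = VocabularySketch()
unit.fit(["how", "are", "glaciers", "formed", "how"])
indices = unit.transform(["how", "are", "icebergs"])
print(unit.state["term_index"])
print(indices)  # [4, 1, 0]
```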
Since this usage is quite common, we wrapped it in a function that does the same thing. For other stateful units, consult their documentation and try mz.build_unit_from_data_pack.
In [45]:
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]
Out[45]:
Some MatchZoo models (e.g. DRMM, MatchPyramid) require batch-wise information for training, so using fit_generator instead of fit is necessary. In addition, sometimes your memory just can't hold all the transformed data, so part of the preprocessing has to be delayed.
MatchZoo provides DataGenerator as an alternative. Instead of fit, you may call fit_generator, which takes a data generator that unpacks data on the fly.
In [46]:
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
Out[46]:
In [47]:
data_gen = mz.DataGenerator(train_preprocessed)
model.fit_generator(data_gen)
Out[47]:
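The difference between the two cells above is when the data is unpacked. A Python generator illustrates the lazy variant: only one batch exists in unpacked form at a time. The function batch_generator and the sample rows are hypothetical and stand in for the idea, not for DataGenerator itself.

```python
def batch_generator(rows, batch_size):
    """Yield (x, y) one batch at a time instead of unpacking everything."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        x = [text for text, _ in batch]
        y = [label for _, label in batch]
        yield x, y


rows = [("q1 d1", 1), ("q1 d2", 0), ("q2 d3", 1), ("q2 d4", 0), ("q3 d5", 1)]
batches = list(batch_generator(rows, batch_size=2))
print(len(batches))   # 3
print(batches[-1])    # (['q3 d5'], [1])
```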
The data preprocessing of DSSM eats a lot of memory, but we can work around that using the callback hook of DataGenerator.
In [48]:
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(train_raw, verbose=0)
In [49]:
dssm = mz.models.DSSM()
dssm.params['task'] = mz.tasks.Ranking()
dssm.params.update(preprocessor.context)
dssm.build()
dssm.compile()
In [50]:
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.preprocessors.units.WordHashing(term_index)
data_generator = mz.DataGenerator(
    data,
    batch_size=4,
    callbacks=[
        mz.data_generator.callbacks.LambdaCallback(
            on_batch_data_pack=lambda dp: dp.apply_on_text(
                hashing_unit.transform, inplace=True, verbose=0)
        )
    ]
)
In [51]:
dssm.fit_generator(data_generator)
Out[51]:
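The pattern used above is to postpone the expensive transformation until a batch is about to be consumed. The sketch below shows that pattern with a per-batch hook and a letter-trigram splitter, which is the word hashing scheme usually associated with DSSM. Both letter_trigrams and generate_batches are hypothetical; whether the WordHashing unit produces exactly this representation is not confirmed here.

```python
def letter_trigrams(word):
    """Split a word into letter trigrams with '#' boundary marks."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def generate_batches(rows, batch_size, on_batch=None):
    """Yield batches, applying `on_batch` to each one just before it is used."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield on_batch(batch) if on_batch else batch


words = ["good", "cat", "dog"]
hashed = generate_batches(
    words,
    batch_size=2,
    on_batch=lambda batch: [letter_trigrams(w) for w in batch],
)
first = next(hashed)
print(first[0])  # ['#go', 'goo', 'ood', 'od#']
```

Only the current batch is ever held in its expanded form, which is what keeps memory use bounded.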
In addition, losses like RankHingeLoss and RankCrossEntropyLoss have to be used with a DataGenerator with mode='pair', since batch-wise information is needed and computed on the fly.
In [52]:
num_neg = 4
task = mz.tasks.Ranking(loss=mz.losses.RankHingeLoss(num_neg=num_neg))
preprocessor = mz.models.Naive.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_raw)
In [53]:
model = mz.models.Naive()
model.params['task'] = task
model.params.update(preprocessor.context)
model.build()
model.compile()
In [54]:
data_gen = mz.DataGenerator(
    train_processed,
    mode='pair',
    num_neg=num_neg,
    num_dup=2,
    batch_size=32
)
In [55]:
model.fit_generator(data_gen)
Out[55]:
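To see why pairwise losses need batches arranged in groups, here is a generic hinge ranking loss in plain Python. The layout (one positive score followed by num_neg negatives) and the averaging over negatives are assumptions of this sketch; rank_hinge and group_scores are hypothetical helpers, not the MatchZoo RankHingeLoss.

```python
def rank_hinge(pos_score, neg_scores, margin=1.0):
    """Hinge loss between one positive score and the mean of its negatives."""
    mean_neg = sum(neg_scores) / len(neg_scores)
    return max(0.0, margin + mean_neg - pos_score)


def group_scores(scores, num_neg):
    """Split a flat score list into (positive, negatives) groups."""
    step = num_neg + 1
    return [(scores[i], scores[i + 1:i + step]) for i in range(0, len(scores), step)]


# Two groups, each one positive followed by num_neg=2 negatives.
scores = [2.0, 0.5, 0.5, 0.25, 0.5, 1.0]
losses = [rank_hinge(pos, negs) for pos, negs in group_scores(scores, num_neg=2)]
print(losses)  # [0.0, 1.5]
```

The first group already ranks its positive above the negatives by more than the margin, so its loss is zero. The second group does not, so it is penalised.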