In [1]:
import matchzoo as mz
print(mz.__version__)
There are two types of tasks available in MatchZoo: mz.tasks.Ranking
and mz.tasks.Classification
. We will use a ranking task for this demo.
In [2]:
task = mz.tasks.Ranking()
print(task)
In [3]:
train_raw = mz.datasets.toy.load_data(stage='train', task=task)
test_raw = mz.datasets.toy.load_data(stage='test', task=task)
In [4]:
type(train_raw)
Out[4]:
DataPack
is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack
consists of three pandas.DataFrame
:
In [5]:
train_raw.left.head()
Out[5]:
In [6]:
train_raw.right.head()
Out[6]:
In [7]:
train_raw.relation.head()
Out[7]:
It is also possible to convert a DataPack
into a single pandas.DataFrame
that holds all information.
In [8]:
train_raw.frame().head()
Out[8]:
However, such a pandas.DataFrame
consumes much more memory if there are many duplicates in the texts, and that is exactly why we use DataPack
. For more details about data handling, consult matchzoo/tutorials/data_handling.ipynb
.
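To see why the normalized layout saves memory, here is a plain-Python sketch (not MatchZoo code; the texts, ids, and labels are made up): `left` and `right` each store a text once, while `relation` stores only id pairs, so a flattened frame duplicates the left text once per related right text.

```python
# Illustrative sketch of the normalized DataPack-style layout.
left = {'Q1': 'how are glacier caves formed'}
right = {'D1': 'a glacier cave is formed by water', 'D2': 'ice forms inside'}
relation = [('Q1', 'D1', 1.0), ('Q1', 'D2', 0.0)]

# A flattened frame repeats the left text once per related right text.
flat = [(lid, left[lid], rid, right[rid], label)
        for lid, rid, label in relation]
# The 'Q1' text now appears twice in `flat`, but only once in `left`.
```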
MatchZoo preprocessors are used to convert a raw DataPack
into a DataPack
that is ready to be fed into a model.
In [9]:
preprocessor = mz.preprocessors.BasicPreprocessor()
There are two steps to use a preprocessor. First, fit
. Then, transform
. fit
only changes the preprocessor's inner state, not the input DataPack
.
In [10]:
preprocessor.fit(train_raw)
Out[10]:
fit
will gather useful information into its context
, which will be used later in a transform
or used to set hyper-parameters of a model.
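The fit/transform split can be sketched with a toy preprocessor (illustrative only; this is not MatchZoo's implementation, and the `term_index` name is just mirroring the context key shown below):

```python
class TinyPreprocessor:
    """Toy fit/transform preprocessor: fit builds a vocabulary into
    `context`; transform maps texts to index sequences using it."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        vocab = sorted({tok for text in texts for tok in text.split()})
        # reserve index 0, a common convention for padding
        self.context['term_index'] = {term: i + 1 for i, term in enumerate(vocab)}
        return self

    def transform(self, texts):
        # read-only with respect to `self.context` and `texts`
        idx = self.context['term_index']
        return [[idx[tok] for tok in text.split() if tok in idx]
                for text in texts]
```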
In [11]:
preprocessor.context
Out[11]:
Once fit
, the preprocessor has enough information to transform
. transform
will change neither the preprocessor's inner state nor the input DataPack
, but return a transformed DataPack
.
In [12]:
train_processed = preprocessor.transform(train_raw)
test_processed = preprocessor.transform(test_raw)
In [13]:
train_processed.left.head()
Out[13]:
As we can see, text_left
is already in the sequence form that neural networks love.
Just to make sure we have the correct sequence:
In [14]:
vocab_unit = preprocessor.context['vocab_unit']
sequence = train_processed.left.loc['Q1']['text_left']
print('Transformed Indices:', sequence)
print('Transformed Indices Meaning:',
'_'.join([vocab_unit.state['index_term'][i] for i in sequence]))
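The index/term roundtrip above boils down to inverting a term-to-index mapping (a toy vocabulary here, not the real one built by the preprocessor):

```python
# Toy term_index; the preprocessor's vocab_unit holds a (much larger) real one.
term_index = {'glacier': 1, 'cave': 2, 'formed': 3}

# Inverting it gives index_term, which decodes a transformed sequence.
index_term = {index: term for term, index in term_index.items()}
sequence = [1, 2, 3]
decoded = '_'.join(index_term[i] for i in sequence)
```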
For more details about preprocessing, consult matchzoo/tutorials/data_handling.ipynb
.
MatchZoo provides many built-in text matching models.
In [15]:
mz.models.list_available()
Out[15]:
Let's use mz.models.DenseBaseline
for our demo.
In [16]:
model = mz.models.DenseBaseline()
The model is initialized with a hyper-parameter table, in which values are partially filled. To view parameters and their values, use print
.
In [17]:
print(model.params)
to_frame
gives you more information in addition to just names and values.
In [18]:
model.params.to_frame()[['Name', 'Description', 'Value']]
Out[18]:
To set a hyper-parameter:
In [19]:
model.params['task'] = task
model.params['mlp_num_units'] = 3
print(model.params)
Notice that we are still missing input_shapes
, and that information is stored in the preprocessor.
In [20]:
print(preprocessor.context['input_shapes'])
We may use update
to load a preprocessor's context into a model's hyper-parameter table.
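What such an update amounts to can be sketched with plain dicts (not MatchZoo's actual ParamTable; all names and values here are illustrative): keys present in both the context and the parameter table are copied over, so input_shapes gets filled while unrelated context entries are ignored.

```python
# Toy parameter table with a missing value.
params = {'task': 'ranking', 'mlp_num_units': 3, 'input_shapes': None}

# Toy preprocessor context; only matching keys should be loaded.
context = {'input_shapes': [(30,), (30,)], 'vocab_size': 285}

params.update({k: v for k, v in context.items() if k in params})
```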
In [21]:
model.params.update(preprocessor.context)
Now we have a completed hyper-parameter table.
In [22]:
model.params.completed()
Out[22]:
With all parameters filled in, we can now build and compile the model.
In [23]:
model.build()
model.compile()
MatchZoo models wrap keras models, and the backend
property of a model gives you the actual keras model that was built.
In [24]:
model.backend.summary()
For more details about models, consult matchzoo/tutorials/models.ipynb
.
A DataPack
can unpack
itself into data that can be directly used to train a MatchZoo model.
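In spirit, unpacking joins the relation table against the left and right tables to produce model inputs `x` and labels `y` (a plain-Python sketch, not MatchZoo code; the toy texts are already index sequences):

```python
# Toy DataPack-style tables.
left = {'Q1': [1, 2, 3]}
right = {'D1': [4, 5], 'D2': [6, 7]}
relation = [('Q1', 'D1', 1.0), ('Q1', 'D2', 0.0)]

# Join relation rows against left/right to build aligned inputs and labels.
x = {
    'text_left': [left[lid] for lid, _, _ in relation],
    'text_right': [right[rid] for _, rid, _ in relation],
}
y = [label for _, _, label in relation]
```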
In [25]:
x, y = train_processed.unpack()
test_x, test_y = test_processed.unpack()
In [26]:
model.fit(x, y, batch_size=32, epochs=5)
Out[26]:
An alternative way to train a model is to use a DataGenerator
. This is useful for delaying expensive preprocessing steps or doing real-time data augmentation. For some models that need dynamic batch-wise information, using a DataGenerator
is required. For more details about DataGenerator
, consult matchzoo/tutorials/data_handling.ipynb
.
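The core of batch-wise data access can be sketched in the style of a keras Sequence (illustrative only; MatchZoo's DataGenerator does more, such as shuffling and model-specific batch construction):

```python
import math

class TinyGenerator:
    """Toy batch generator: __len__ reports batches per epoch and
    __getitem__ returns the i-th batch, keras-Sequence style."""

    def __init__(self, samples, batch_size):
        self.samples = samples
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch (last batch may be smaller)
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.samples[start:start + self.batch_size]
```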
In [27]:
data_generator = mz.DataGenerator(train_processed, batch_size=32)
In [28]:
model.fit_generator(data_generator, epochs=5, use_multiprocessing=True, workers=4)
Out[28]:
In [29]:
model.evaluate(test_x, test_y)
Out[29]:
In [30]:
model.predict(test_x)
Out[30]:
Since data preprocessing and model building are laborious, and the special setups of some models make this even worse, MatchZoo provides prepare
, a unified interface that handles interaction among data, model, and preprocessor automatically.
More specifically, prepare
does the following things:
- create a model instance and a fitted preprocessor
- fill in the model's hyper-parameters and build the model
- create an embedding matrix when the model needs one
- create a DataGeneratorBuilder
that will build a correctly formed DataGenerator
given a DataPack
It also performs special handling for specific models, but we will not go into the details of that here.
In [31]:
for model_class in mz.models.list_available():
    print(model_class)
    model, preprocessor, data_generator_builder, embedding_matrix = mz.auto.prepare(
        task=task,
        model_class=model_class,
        data_pack=train_raw,
    )
    train_processed = preprocessor.transform(train_raw, verbose=0)
    test_processed = preprocessor.transform(test_raw, verbose=0)
    train_gen = data_generator_builder.build(train_processed)
    test_gen = data_generator_builder.build(test_processed)
    model.fit_generator(train_gen, epochs=1)
    model.evaluate_generator(test_gen)
    print()
In [32]:
model.save('my-model')
loaded_model = mz.load_model('my-model')