In [1]:
import matchzoo as mz
print(mz.__version__)
There are two types of tasks available in MatchZoo: mz.tasks.Ranking and mz.tasks.Classification. We will use a ranking task for this demo.
In [2]:
task = mz.tasks.Ranking()
print(task)
In [3]:
train_raw = mz.datasets.toy.load_data(stage='train', task=task)
test_raw = mz.datasets.toy.load_data(stage='test', task=task)
In [4]:
type(train_raw)
Out[4]:
DataPack is a MatchZoo-native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three pandas.DataFrame objects:
In [5]:
train_raw.left.head()
Out[5]:
In [6]:
train_raw.right.head()
Out[6]:
In [7]:
train_raw.relation.head()
Out[7]:
It is also possible to convert a DataPack into a single pandas.DataFrame that holds all information.
In [8]:
train_raw.frame().head()
Out[8]:
However, such a flat pandas.DataFrame consumes much more memory when the texts contain many duplicates, and that is exactly why we use a DataPack instead. For more details about data handling, consult matchzoo/tutorials/data_handling.ipynb.
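To make the memory argument concrete, here is a minimal pandas sketch of the relational layout. The ids, texts, and column values below are invented for illustration; they are not DataPack's actual internals or the toy dataset's contents:

```python
import pandas as pd

# Toy stand-in for a DataPack's three tables.
left = pd.DataFrame(
    {'text_left': ['how are glacier caves formed?']},
    index=pd.Index(['Q1'], name='id_left'))
right = pd.DataFrame(
    {'text_right': ['ice formations ...', 'a glacier cave is ...']},
    index=pd.Index(['D1', 'D2'], name='id_right'))
relation = pd.DataFrame(
    {'id_left': ['Q1', 'Q1'], 'id_right': ['D1', 'D2'], 'label': [0, 1]})

# Joining into one flat frame duplicates 'text_left' once per pair --
# exactly the duplication the relational layout avoids.
flat = relation.join(left, on='id_left').join(right, on='id_right')
print(flat[['text_left', 'text_right', 'label']])
```

With thousands of documents per query, the flat frame stores the query text thousands of times, while the relational form stores it once.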
MatchZoo preprocessors convert a raw DataPack into a DataPack that is ready to be fed into a model.
In [9]:
preprocessor = mz.preprocessors.BasicPreprocessor()
There are two steps to using a preprocessor: first fit, then transform. fit only changes the preprocessor's inner state; it does not modify the input DataPack.
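This fit/transform split follows the familiar scikit-learn convention. A toy sketch of the pattern (the real BasicPreprocessor does tokenization, vocabulary building, padding, and more; this class is purely illustrative):

```python
class ToyPreprocessor:
    """Illustrative fit/transform: maps words to indices via a fitted vocabulary."""

    def __init__(self):
        self.context = {}  # filled by fit, read by transform

    def fit(self, texts):
        # Gather state (here: a vocabulary) without touching the input.
        vocab = sorted({w for t in texts for w in t.lower().split()})
        self.context['term_index'] = {w: i + 1 for i, w in enumerate(vocab)}
        return self

    def transform(self, texts):
        # Use the fitted state; modifies neither self.context nor the input.
        idx = self.context['term_index']
        return [[idx.get(w, 0) for w in t.lower().split()] for t in texts]

prep = ToyPreprocessor().fit(['Hello world', 'hello MatchZoo'])
print(prep.transform(['hello world']))
```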
In [10]:
preprocessor.fit(train_raw)
Out[10]:
fit gathers useful information into its context, which is later used by transform or to set a model's hyper-parameters.
In [11]:
preprocessor.context
Out[11]:
Once fit, the preprocessor has enough information to transform. transform changes neither the preprocessor's inner state nor the input DataPack; it returns a new, transformed DataPack.
In [12]:
train_processed = preprocessor.transform(train_raw)
test_processed = preprocessor.transform(test_raw)
In [13]:
train_processed.left.head()
Out[13]:
As we can see, text_left is already in the sequence form that neural networks love.
Just to make sure we have the correct sequences:
In [14]:
vocab_unit = preprocessor.context['vocab_unit']
sequence = train_processed.left.loc['Q1']['text_left']
print('Transformed Indices:', sequence)
print('Transformed Indices Meaning:',
      '_'.join([vocab_unit.state['index_term'][i] for i in sequence]))
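The index_term lookup used above is simply the inverse of the term_index mapping built during fit. A toy illustration (the terms and indices here are made up):

```python
term_index = {'glacier': 1, 'cave': 2, 'ice': 3}    # built during fit
index_term = {i: t for t, i in term_index.items()}  # the inverse lookup

sequence = [1, 2]  # a transformed text: indices instead of words
print('_'.join(index_term[i] for i in sequence))    # prints glacier_cave
```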
For more details about preprocessing, consult matchzoo/tutorials/data_handling.ipynb.
MatchZoo provides many built-in text matching models.
In [15]:
mz.models.list_available()
Out[15]:
Let's use mz.models.DenseBaseline for our demo.
In [16]:
model = mz.models.DenseBaseline()
The model is initialized with a hyper-parameter table in which values are partially filled. To view the parameters and their values, use print.
In [17]:
print(model.params)
to_frame gives you more information in addition to just names and values.
In [18]:
model.params.to_frame()[['Name', 'Description', 'Value']]
Out[18]:
To set a hyper-parameter:
In [19]:
model.params['task'] = task
model.params['mlp_num_units'] = 3
print(model.params)
Notice that we are still missing input_shapes, and that information is stored in the preprocessor.
In [20]:
print(preprocessor.context['input_shapes'])
We may use update to load a preprocessor's context into a model's hyper-parameter table.
In [21]:
model.params.update(preprocessor.context)
Now we have a complete hyper-parameter table.
In [22]:
model.params.completed()
Out[22]:
With all parameters filled in, we can now build and compile the model.
In [23]:
model.build()
model.compile()
MatchZoo models wrap Keras models, and a model's backend property gives you the actual Keras model that was built.
In [24]:
model.backend.summary()
For more details about models, consult matchzoo/tutorials/models.ipynb.
A DataPack can unpack itself into data that can be directly used to train a MatchZoo model.
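The shape of what unpack returns can be sketched with plain pandas: a dict of model inputs with one entry per relation row (texts looked up by id), plus the labels. This is a rough illustration of the idea under invented data, not the actual implementation:

```python
import pandas as pd

# Toy processed DataPack contents (invented ids and index sequences).
relation = pd.DataFrame({'id_left': ['Q1', 'Q1'],
                         'id_right': ['D1', 'D2'],
                         'label': [0, 1]})
left = pd.DataFrame({'text_left': [[1, 3, 2]]}, index=['Q1'])
right = pd.DataFrame({'text_right': [[4, 5], [6, 7, 8]]}, index=['D1', 'D2'])

# x: model inputs aligned with the relation rows; y: the labels.
x = {
    'text_left': left.loc[relation['id_left'], 'text_left'].tolist(),
    'text_right': right.loc[relation['id_right'], 'text_right'].tolist(),
}
y = relation['label'].tolist()
print(x['text_left'], y)
```

Note how the single query sequence is repeated once per relation row, which is what the model's fit call expects.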
In [25]:
x, y = train_processed.unpack()
test_x, test_y = test_processed.unpack()
In [26]:
model.fit(x, y, batch_size=32, epochs=5)
Out[26]:
An alternative way to train a model is to use a DataGenerator. This is useful for delaying expensive preprocessing steps or doing real-time data augmentation. For some models that need dynamic batch-wise information, using a DataGenerator is required. For more details about DataGenerator, consult matchzoo/tutorials/data_handling.ipynb.
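The essential behaviour of a data generator can be sketched in a few lines: batches are built lazily by index, in the same spirit as keras.utils.Sequence. This toy version skips all DataPack specifics:

```python
import math

class ToyDataGenerator:
    """Yields (x, y) batches by index, like keras.utils.Sequence."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.y) / self.batch_size)

    def __getitem__(self, i):
        # Build batch i on demand -- the hook where augmentation or
        # batch-wise resampling could happen.
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[s], self.y[s]

gen = ToyDataGenerator(list(range(10)), list(range(10)), batch_size=4)
print(len(gen), gen[2])  # 3 batches; the last one is short
```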
In [27]:
data_generator = mz.DataGenerator(train_processed, batch_size=32)
In [28]:
model.fit_generator(data_generator, epochs=5, use_multiprocessing=True, workers=4)
Out[28]:
In [29]:
model.evaluate(test_x, test_y)
Out[29]:
In [30]:
model.predict(test_x)
Out[30]:
Since data preprocessing and model building are laborious, and the special setups some models need make this even worse, MatchZoo provides prepare, a unified interface that automatically handles the interaction among data, model, and preprocessor.
More specifically, prepare does the following:
- creates a model instance and fills in its hyper-parameters
- fits a preprocessor on the raw data
- creates an embedding matrix
- creates a DataGeneratorBuilder that will build a correctly formed DataGenerator given a DataPack

It also does special handling for specific models, but we will not go into the details of that here.
In [31]:
for model_class in mz.models.list_available():
    print(model_class)
    model, preprocessor, data_generator_builder, embedding_matrix = mz.auto.prepare(
        task=task,
        model_class=model_class,
        data_pack=train_raw,
    )
    train_processed = preprocessor.transform(train_raw, verbose=0)
    test_processed = preprocessor.transform(test_raw, verbose=0)
    train_gen = data_generator_builder.build(train_processed)
    test_gen = data_generator_builder.build(test_processed)
    model.fit_generator(train_gen, epochs=1)
    model.evaluate_generator(test_gen)
    print()
In [32]:
model.save('my-model')
loaded_model = mz.load_model('my-model')