This example uses a Stack Overflow dataset: the inputs are post texts and the prediction is one of several possible tag classes. It is adapted from the Keras example available at https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568.
It shows how to use Keras to train both a tokenizer and a model, and how to build a custom wrapper for deploying this hybrid model.
This notebook has been tested with the following packages (you may need to change pip to pip3, depending on your own Python environment):
In [1]:
# Python 3.6
!pip install verta
!pip install wget
!pip install pandas
!pip install tensorflow==1.14.0
!pip install scikit-learn
!pip install lxml
!pip install beautifulsoup4
In [2]:
HOST = 'app.verta.ai'
PROJECT_NAME = 'Text Classification'
EXPERIMENT_NAME = 'basic-clf'
In [3]:
# import os
# os.environ['VERTA_EMAIL'] =
# os.environ['VERTA_DEV_KEY'] =
In [4]:
from verta import Client
from verta.utils import ModelAPI
client = Client(HOST, use_git=False)
proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)
run = client.set_experiment_run()
In [5]:
from __future__ import absolute_import, division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import os
import re
import wget
from tensorflow import keras
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
Download the dataset, then load it and reduce the number of examples so that the example runs faster.
In [6]:
if not os.path.exists('stack-overflow-data.csv'):
    wget.download('https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv')
In [7]:
df = pd.read_csv('stack-overflow-data.csv')
df = df[pd.notnull(df['tags'])]
df = df[:5000]
print(df.head(10))
Pre-process the data by removing unnecessary characters so that only the main text of each post remains.
In [8]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text  # HTML decoding
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols which are in BAD_SYMBOLS_RE from text
    return text
df['post'] = df['post'].apply(clean_text)
print(df.head(10))
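As a quick sanity check (purely illustrative, on a made-up snippet rather than a real post), clean_text should strip the HTML tags and special characters:

print(clean_text("<p>How do I use [brackets] in C#?</p>"))
# expected output, roughly: "how do i use  brackets  in c#" (extra whitespace is left in place)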
Now we split the dataset into the training and test sets.
In [9]:
train_size = int(len(df) * .7)
train_posts = df['post'][:train_size]
train_tags = df['tags'][:train_size]
test_posts = df['post'][train_size:]
test_tags = df['tags'][train_size:]
We use Keras for tokenization, training the tokenizer on the given corpus. It learns which words to keep based on their frequency.
In [10]:
max_words = 1000
tokenize = keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts) # only fit on train
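To see what the tokenizer learned, we can peek at its vocabulary. Word indices are assigned in order of descending frequency, and only the top max_words indices are used when building matrices. (This is just an illustrative check; the slice assumes Python 3.6+ dict ordering.)

print(len(tokenize.word_index))                 # size of the full learned vocabulary
print(list(tokenize.word_index.items())[:10])   # ten most frequent words and their indices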
Finally, transform the text input into a numeric array and encode the labels as one-hot.
In [11]:
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
num_classes = np.max(y_train) + 1
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
run.log_attribute('classes', encoder.classes_.tolist())
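An optional shape check (illustrative only): each input row has max_words columns, and each one-hot label row has num_classes columns.

print(x_train.shape, y_train.shape)  # e.g. (3500, 1000) and (3500, num_classes)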
Define and train the numeric Keras model.
In [12]:
hyperparams = {
    'hidden_size': 512,
    'dropout': 0.2,
    'batch_size': 1024,
    'num_epochs': 2,
    'optimizer': "adam",
    'loss': "categorical_crossentropy",
    'validation_split': 0.1,
}
run.log_hyperparameters(hyperparams)
# Build the model
model = keras.models.Sequential()
model.add(keras.layers.Dense(hyperparams['hidden_size'], input_shape=(max_words,)))
model.add(keras.layers.Activation('relu'))
model.add(keras.layers.Dropout(hyperparams['dropout']))
model.add(keras.layers.Dense(num_classes))
model.add(keras.layers.Activation('softmax'))
model.compile(loss=hyperparams['loss'],
              optimizer=hyperparams['optimizer'],
              metrics=['accuracy'])
# create a per-epoch callback for logging
def log_validation_callback(epoch, logs):  # Keras will call this each epoch
    run.log_observation("train_loss", float(logs['loss']))
    run.log_observation("train_acc", float(logs['acc']))
    run.log_observation("val_loss", float(logs['val_loss']))
    run.log_observation("val_acc", float(logs['val_acc']))
history = model.fit(x_train, y_train,
                    batch_size=hyperparams['batch_size'],
                    epochs=hyperparams['num_epochs'],
                    verbose=1,
                    validation_split=hyperparams['validation_split'],
                    callbacks=[keras.callbacks.LambdaCallback(on_epoch_end=log_validation_callback)])
Evaluate the model quality.
In [13]:
train_loss, train_acc = model.evaluate(x_train, y_train,
                                       batch_size=hyperparams['batch_size'], verbose=1)
run.log_metric("train_loss", train_loss)
run.log_metric("train_acc", train_acc)
Specify the package requirements needed at prediction time, then create the model API.
In [14]:
import six, tensorflow, bs4, sklearn
requirements = [
    "numpy",
    "tensorflow",
    "beautifulsoup4",
    "scikit-learn",
]
In [15]:
# ModelAPI infers the interface from sample inputs (a list of strings)
# and sample outputs (the one-hot label array)
model_api = ModelAPI(["blah", "blah", "blah"], y_test)
Let's verify that the model API looks like what we expect: a string as input and multiple numbers as the output.
In [16]:
model_api.to_dict()
Now we define the model wrapper used for prediction. It's more involved than a regular wrapper because Keras models can't be serialized with the rest of the class, so we save them as HDF5 and load them once at prediction time.
In [17]:
class ModelWrapper:
    def __init__(self, keras_model, tokenizer):
        # save Keras model
        import six  # this comes installed with Verta
        self.keras_model_hdf5 = six.BytesIO()
        keras_model.save(self.keras_model_hdf5)
        self.keras_model_hdf5.seek(0)
        self.tokenizer = tokenizer

    def __setstate__(self, state):
        import tensorflow
        # restore instance attributes
        self.__dict__.update(state)
        # load Keras model
        self.graph = tensorflow.Graph()
        with self.graph.as_default():
            self.session = tensorflow.Session()
            with self.session.as_default():
                self.keras_model = tensorflow.keras.models.load_model(state['keras_model_hdf5'])

    def predict(self, data):
        import numpy, tensorflow
        tokenized_input = self.tokenizer.texts_to_matrix(data)
        if hasattr(self, 'keras_model'):
            with self.session.as_default():
                with self.graph.as_default():
                    return self.keras_model.predict(tokenized_input)
        else:  # not unpickled
            model = tensorflow.keras.models.load_model(self.keras_model_hdf5)
            return model.predict(tokenized_input)
model_wrapper = ModelWrapper(model, tokenize)
Verify that the predict method behaves as we'd expect, since it will be called by the deployment.
In [18]:
model_wrapper.predict(["foo bar baz"])
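The output is a row of class probabilities per input string. As an optional sketch of how you might decode it, the highest-scoring column can be mapped back to a tag via the encoder fit earlier (the sample post below is made up):

probs = model_wrapper.predict(["how do i sort a list in python"])
print(encoder.classes_[np.argmax(probs, axis=1)])  # predicted tag for the sample post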
Finally, save the model information necessary for deployment.
In [19]:
run.log_model(model_wrapper, model_api=model_api)
run.log_requirements(requirements)
Now we use the demo utility to query the deployed model one example at a time.
In [20]:
from verta._demo_utils import DeployedModel
deployed_model = DeployedModel(HOST, run.id)
run
Deploy the model through the Web App, then make predictions through the deployment server. (The loop below cycles through the test posts indefinitely; interrupt the kernel to stop it.)
In [21]:
import itertools, time
for x in itertools.cycle(test_posts.tolist()):
    print(deployed_model.predict([x]))
    time.sleep(.5)