StackOverflow Question Multi-class Classification (Keras)

This example uses a dataset of StackOverflow posts: the input is the text of a post and the prediction is one of several possible tag classes. It is adapted from the Keras example available at https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568.

It shows how to use Keras to train a tokenizer and a model, and how to build a custom model wrapper to deploy this hybrid tokenizer-plus-model pipeline.

Set Up Environment

This notebook has been tested with the following packages:
(you may need to change pip to pip3, depending on your own Python environment)


In [1]:
# Python 3.6
!pip install verta
!pip install wget
!pip install pandas
!pip install tensorflow==1.14.0
!pip install scikit-learn
!pip install lxml
!pip install beautifulsoup4

Set Up Verta


In [2]:
HOST = 'app.verta.ai'

PROJECT_NAME = 'Text Classification'
EXPERIMENT_NAME = 'basic-clf'

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] =

In [4]:
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST, use_git=False)

proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)
run = client.set_experiment_run()

Imports


In [5]:
from __future__ import absolute_import, division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import os
import re

import wget

from tensorflow import keras
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

Data preparation

Download the dataset, load it, and reduce the number of examples so that the notebook runs faster.


In [6]:
if not os.path.exists('stack-overflow-data.csv'):
    wget.download('https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv')

In [7]:
df = pd.read_csv('stack-overflow-data.csv')
df = df[pd.notnull(df['tags'])]
df = df[:5000]
print(df.head(10))

Pre-process the data by removing HTML markup and unnecessary characters, leaving only the main text of each post.


In [8]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    return text
    
df['post'] = df['post'].apply(clean_text)
print(df.head(10))
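
As a quick sanity check, here is a toy call (the input string is invented for illustration) showing what clean_text does to a snippet of HTML:


In [ ]:
# hypothetical input, just to illustrate the cleaning steps
clean_text('<p>How do I read a CSV in C++? (see [docs])</p>')
# roughly: 'how do i read a csv in c++  see  docs'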

Now we split the dataset into the training and test sets.


In [9]:
train_size = int(len(df) * .7)
train_posts = df['post'][:train_size]
train_tags = df['tags'][:train_size]

test_posts = df['post'][train_size:]
test_tags = df['tags'][train_size:]

We use Keras for tokenization, training a tokenizer on the given corpus. It learns which words to keep in the vocabulary based on their frequency.


In [10]:
max_words = 1000
tokenize = keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts) # only fit on train
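
To make the tokenizer's behavior concrete, here is a minimal sketch on an invented toy corpus. texts_to_matrix returns one row per document; by default each column is a binary indicator for one of the num_words most frequent words.


In [ ]:
# toy corpus, invented for illustration
toy = keras.preprocessing.text.Tokenizer(num_words=5)
toy.fit_on_texts(['the cat sat', 'the cat ran', 'the dog ran'])
print(toy.word_index)                         # words ranked by corpus frequency
print(toy.texts_to_matrix(['the dog sat']))  # binary bag-of-words row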

Finally, transform the text input into a numeric array and encode the labels as one-hot.


In [11]:
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)

num_classes = np.max(y_train) + 1
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

run.log_attribute('classes', encoder.classes_.tolist())
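
For reference, here is a minimal sketch (with made-up tags) of what the encoding does: LabelEncoder maps each string label to an integer id, and to_categorical turns those ids into one-hot rows.


In [ ]:
# made-up tags, just to illustrate the label encoding
toy_encoder = LabelEncoder()
toy_ids = toy_encoder.fit_transform(['python', 'c++', 'python'])
print(toy_ids)                                 # integer ids, e.g. [1 0 1]
print(keras.utils.to_categorical(toy_ids, 2))  # one-hot rows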

Model training

Define and train the Keras model on the numeric bag-of-words input.


In [12]:
hyperparams = {
    'hidden_size': 512,
    'dropout': 0.2,
    'batch_size': 1024,
    'num_epochs': 2,
    'optimizer': "adam",
    'loss': "categorical_crossentropy",
    'validation_split': 0.1,
}
run.log_hyperparameters(hyperparams)

# Build the model
model = keras.models.Sequential()
model.add(keras.layers.Dense(hyperparams['hidden_size'], input_shape=(max_words,)))
model.add(keras.layers.Activation('relu'))
model.add(keras.layers.Dropout(hyperparams['dropout']))
model.add(keras.layers.Dense(num_classes))
model.add(keras.layers.Activation('softmax'))

model.compile(loss=hyperparams['loss'],
              optimizer=hyperparams['optimizer'],
              metrics=['accuracy'])

# create a per-epoch callback for logging
def log_validation_callback(epoch, logs):  # Keras will call this each epoch
    run.log_observation("train_loss", float(logs['loss']))
    run.log_observation("train_acc", float(logs['acc']))
    run.log_observation("val_loss", float(logs['val_loss']))
    run.log_observation("val_acc", float(logs['val_acc']))
              
history = model.fit(x_train, y_train,
                    batch_size=hyperparams['batch_size'],
                    epochs=hyperparams['num_epochs'],
                    verbose=1,
                    validation_split=hyperparams['validation_split'],
                    callbacks=[keras.callbacks.LambdaCallback(on_epoch_end=log_validation_callback)])
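
To double-check the architecture alongside the training logs, Keras can print a per-layer summary:


In [ ]:
model.summary()  # per-layer output shapes and parameter counts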

Evaluate the model quality on the training set.


In [13]:
train_loss, train_acc = model.evaluate(x_train, y_train,
                                       batch_size=hyperparams['batch_size'], verbose=1)
run.log_metric("train_loss", train_loss)
run.log_metric("train_acc", train_acc)

Deployment

Specify the packages required at prediction time, then create the model API.


In [14]:
import six, tensorflow, bs4, sklearn  # verify these packages are importable locally
requirements = [
    "numpy",
    "tensorflow",
    "beautifulsoup4",
    "scikit-learn",
]

In [15]:
model_api = ModelAPI(["blah", "blah", "blah"], y_test)  # sample inputs and outputs define the API shape

Let's verify that the model API looks like what we are expecting: a string as the input and multiple numbers as the output.


In [16]:
model_api.to_dict()

Now we define the model wrapper for prediction. It's more complicated than a regular wrapper because Keras models can't be pickled with the rest of the class, so we save them as HDF5 and load them back at prediction time.


In [17]:
class ModelWrapper:
    def __init__(self, keras_model, tokenizer):
        # save Keras model
        import six  # this comes installed with Verta
        self.keras_model_hdf5 = six.BytesIO()
        keras_model.save(self.keras_model_hdf5)
        self.keras_model_hdf5.seek(0)
        self.tokenizer = tokenizer

    def __setstate__(self, state):
        import tensorflow

        # restore instance attributes
        self.__dict__.update(state)

        # load Keras model
        self.graph = tensorflow.Graph()
        with self.graph.as_default():
            self.session = tensorflow.Session()
            with self.session.as_default():
                self.keras_model = tensorflow.keras.models.load_model(state['keras_model_hdf5'])

    def predict(self, data):
        import tensorflow
        tokenized_input = self.tokenizer.texts_to_matrix(data)

        if hasattr(self, 'keras_model'):  # model was restored by __setstate__ after unpickling
            with self.session.as_default():
                with self.graph.as_default():
                    return self.keras_model.predict(tokenized_input)
        else:  # never pickled; load the model from the in-memory HDF5 buffer
            model = tensorflow.keras.models.load_model(self.keras_model_hdf5)
            return model.predict(tokenized_input)


model_wrapper = ModelWrapper(model, tokenize)
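
Because the deployment will unpickle the wrapper, it can be worth exercising the __setstate__ path locally first. A minimal sketch, assuming the wrapper pickles cleanly in this environment:


In [ ]:
import pickle

# round-trip through pickle to exercise __setstate__, as the deployment will
restored_wrapper = pickle.loads(pickle.dumps(model_wrapper))
restored_wrapper.predict(["foo bar baz"])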

Verify that the predict method behaves as we'd expect, since it will be called by the deployment.


In [18]:
model_wrapper.predict(["foo bar baz"])
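
The predict call returns a row of class probabilities. As an aside, you can map that row back to a human-readable tag by indexing into the label encoder's classes:


In [ ]:
# map the probability row back to a tag name using the label encoder
probs = model_wrapper.predict(["foo bar baz"])
print(encoder.classes_[np.argmax(probs, axis=1)])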

Finally, save the model information necessary for deployment.


In [19]:
run.log_model(model_wrapper, model_api=model_api)
run.log_requirements(requirements)

Now we use Verta's demo utility to query the deployed model one example at a time.


In [20]:
from verta._demo_utils import DeployedModel

deployed_model = DeployedModel(HOST, run.id)
run  # display the run, which links to the Web App

Deploy the model through the Web App, then make predictions through the server.


In [21]:
import itertools, time
# cycle through the test posts indefinitely; interrupt the kernel to stop
for x in itertools.cycle(test_posts.tolist()):
    print(deployed_model.predict([x]))
    time.sleep(.5)