The Multilabel design pattern refers to models that can assign more than one label to a given input. This design requires changing the activation function used in the final output layer of your model and choosing how your application will parse the model's output. Note that this is different from multiclass classification, where a single input is assigned exactly one label from a group of many (> 1) possible classes.
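To make this concrete, here's a minimal sketch (with toy layer sizes, unrelated to the dataset below) contrasting the two output layers: multiclass models use a softmax output with categorical cross-entropy loss, while multilabel models use a sigmoid output with binary cross-entropy loss.
In [0]:
import tensorflow as tf

NUM_CLASSES = 5  # toy value for illustration

# Multiclass: exactly one label per input. Softmax forces the class
# probabilities to sum to 1, so we take the argmax at prediction time.
multiclass = tf.keras.Sequential([
    tf.keras.layers.Dense(16, input_shape=(10,), activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
multiclass.compile(loss='categorical_crossentropy', optimizer='adam')

# Multilabel: any number of labels per input. Sigmoid gives each class
# an independent probability, so we threshold each one at prediction time.
multilabel = tf.keras.Sequential([
    tf.keras.layers.Dense(16, input_shape=(10,), activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='sigmoid')
])
multilabel.compile(loss='binary_crossentropy', optimizer='adam')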
In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.utils import shuffle
from sklearn.preprocessing import MultiLabelBinarizer
In [6]:
!gsutil cp 'gs://ml-design-patterns/so_data.csv' .
🥑🥑🥑
We've pre-processed this dataset of Stack Overflow questions to remove any uses of a question's tags within the question itself, replacing each occurrence with the word "avocado". For example, the question "How do I feed a pandas dataframe to a keras model?" becomes "How do I feed a avocado dataframe to a avocado model?" This helps the model learn more nuanced patterns throughout the data, rather than simply learning to associate the presence of a tag's name with the tag itself.
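The substitution itself can be as simple as a word-boundary regex replacement. Here's a sketch of that preprocessing step (the helper name and tag list are illustrative, not the exact script we used):
In [0]:
import re

def replace_tags_with_avocado(question, tags):
    # Replace each whole-word occurrence of a tag name with 'avocado',
    # ignoring case so 'Pandas' and 'pandas' are both caught
    for tag in tags:
        question = re.sub(r'\b' + re.escape(tag) + r'\b', 'avocado',
                          question, flags=re.IGNORECASE)
    return question

print(replace_tags_with_avocado(
    'How do I feed a pandas dataframe to a keras model?', ['pandas', 'keras']))
# How do I feed a avocado dataframe to a avocado model?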
In [4]:
data = pd.read_csv('so_data.csv', names=['tags', 'original_tags', 'text'], header=0)
data = data.drop(columns=['original_tags'])
data = data.dropna()
data = shuffle(data, random_state=22)
data.head()
Out[4]:
In [5]:
# Encode top tags to multi-hot
tags_split = [tags.split(',') for tags in data['tags'].values]
print(tags_split[0])
In [8]:
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)
num_tags = len(tags_encoded[0])
print(data['text'].values[0][:110])
print(tag_encoder.classes_)
print(tags_encoded[0])
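If you haven't used MultiLabelBinarizer before, here's a toy example (made-up tag lists) showing the multi-hot encoding it produces; note that classes_ is sorted alphabetically.
In [0]:
toy_encoder = MultiLabelBinarizer()
toy_encoded = toy_encoder.fit_transform([['pandas', 'keras'], ['tensorflow']])
print(toy_encoder.classes_)  # ['keras' 'pandas' 'tensorflow']
print(toy_encoded)           # [[1 1 0]
                             #  [0 0 1]]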
In [6]:
# Split our data into train and test sets
train_size = int(len(data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(data) - train_size))
In [0]:
# Split our labels into train and test sets
train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]
In [0]:
train_qs = data['text'].values[:train_size]
test_qs = data['text'].values[train_size:]
In [0]:
from tensorflow.keras.preprocessing import text
VOCAB_SIZE = 400  # This is a hyperparameter; try out different values for your dataset
tokenizer = text.Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_qs)
body_train = tokenizer.texts_to_matrix(train_qs)
body_test = tokenizer.texts_to_matrix(test_qs)
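With the default mode='binary', texts_to_matrix produces a bag-of-words matrix: one row per question and one column per vocabulary word, with a 1 wherever that word occurs (column 0 is reserved by Keras and stays 0). A toy example with made-up sentences:
In [0]:
toy_tokenizer = text.Tokenizer(num_words=10)
toy_tokenizer.fit_on_texts(['feed a dataframe to a model', 'train a model'])

# Words are indexed by frequency: {'a': 1, 'model': 2, 'feed': 3, ...}
print(toy_tokenizer.texts_to_matrix(['train a model']))
# [[0. 1. 1. 0. 0. 0. 1. 0. 0. 0.]]  -> columns for 'a', 'model', 'train'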
In [0]:
# Note we're using sigmoid output with binary_crossentropy loss
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(50, input_shape=(VOCAB_SIZE,), activation='relu'))
model.add(tf.keras.layers.Dense(25, activation='relu'))
model.add(tf.keras.layers.Dense(num_tags, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
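Why this pairing? With sigmoid output, each tag is treated as its own independent yes/no prediction, and binary cross-entropy averages the per-tag cross-entropies. A quick sanity check with toy numbers:
In [0]:
# A toy prediction for one example with 3 possible tags
y_true = np.array([[1., 0., 1.]])
y_pred = np.array([[0.9, 0.2, 0.7]])

# Average the per-tag binary cross-entropies by hand...
manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# ...and compare against the Keras implementation
keras_bce = tf.keras.losses.binary_crossentropy(y_true, y_pred).numpy()
print(manual, keras_bce)  # both ≈ 0.228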
In [12]:
model.summary()
In [13]:
# Train and evaluate the model
model.fit(body_train, train_tags, epochs=3, batch_size=128, validation_split=0.1)
print('Eval loss/accuracy:{}'.format(
    model.evaluate(body_test, test_tags, batch_size=128)))
Unlike with softmax output, we can't simply take the argmax of the output probability array; we need to apply a threshold to each class's probability. In this case, we'll say a tag is associated with a question if our model is more than 70% confident.
Below we'll print the original question along with our model's predicted tags.
In [0]:
# Get some test predictions
predictions = model.predict(body_test[:3])
In [44]:
classes = tag_encoder.classes_
for q_idx, probabilities in enumerate(predictions):
    print(test_qs[q_idx])
    for idx, tag_prob in enumerate(probabilities):
        if tag_prob > 0.7:
            print(classes[idx], round(tag_prob * 100, 2), '%')
    print('')
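The 70% threshold above is a single global choice, but because the sigmoid outputs are independent you can tune a separate threshold per tag. Here's one possible approach, sketched with scikit-learn's precision_recall_curve: pick, for each tag, the threshold that maximizes F1 score (in practice you'd tune on a validation split rather than the test set).
In [0]:
from sklearn.metrics import precision_recall_curve

all_preds = model.predict(body_test)

best_thresholds = []
for tag_idx in range(num_tags):
    precision, recall, thresholds = precision_recall_curve(
        test_tags[:, tag_idx], all_preds[:, tag_idx])
    # f1 has one more element than thresholds; drop the sentinel last point
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    best_thresholds.append(thresholds[np.argmax(f1[:-1])])

for tag, threshold in zip(tag_encoder.classes_, best_thresholds):
    print(tag, round(float(threshold), 3))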
Binary classification is the one single-label case (each input gets exactly one of two classes) where you'd still want to use sigmoid output. A 2-element softmax output is redundant here, since the second probability is just one minus the first, and it can increase training time.
To demonstrate this we'll build a model on the UCI mushroom dataset to determine whether a mushroom is edible or poisonous.
In [2]:
# First, download the data. We've made it publicly available in Google Cloud Storage
!gsutil cp gs://ml-design-patterns/mushrooms.csv .
In [3]:
mushroom_data = pd.read_csv('mushrooms.csv')
mushroom_data.head()
Out[3]:
To keep things simple, we'll first convert the label column to numeric and then use pd.get_dummies() to one-hot encode the categorical feature columns.
In [0]:
# 1 = edible, 0 = poisonous
mushroom_data.loc[mushroom_data['class'] == 'p', 'class'] = 0
mushroom_data.loc[mushroom_data['class'] == 'e', 'class'] = 1
In [0]:
labels = mushroom_data.pop('class')
In [0]:
dummy_data = pd.get_dummies(mushroom_data)
In [0]:
# Split the data
train_size = int(len(mushroom_data) * .8)
train_data = dummy_data[:train_size]
test_data = dummy_data[train_size:]
train_labels = labels[:train_size]
test_labels = labels[train_size:]
In [0]:
model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(len(dummy_data.iloc[0]),), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
In [9]:
model.summary()
In [0]:
# Since we're using sigmoid output, we use binary_crossentropy for our loss function
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
In [43]:
model.fit(train_data.values.tolist(), train_labels.values.tolist())
Out[43]:
In [44]:
model.evaluate(test_data.values.tolist(), test_labels.values.tolist())
Out[44]:
This is an anti-pattern: for binary classification, it's better to use a single sigmoid output. To demonstrate why, we'll next train the same model with a 2-element softmax output instead. Note the increased number of trainable parameters in the output layer: the sigmoid head has 8 × 1 weights + 1 bias = 9 parameters, while the softmax head below has 8 × 2 weights + 2 biases = 18, twice as many. You can imagine how this could increase training time for larger models.
In [0]:
# First, transform the label column to one-hot: 0 -> [1, 0], 1 -> [0, 1]
def to_one_hot(data):
    if data == 0:
        return [1, 0]
    else:
        return [0, 1]
In [0]:
train_labels_one_hot = train_labels.apply(to_one_hot)
test_labels_one_hot = test_labels.apply(to_one_hot)
In [0]:
model_softmax = keras.Sequential([
    keras.layers.Dense(32, input_shape=(len(dummy_data.iloc[0]),), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(2, activation='softmax')
])
In [35]:
model_softmax.summary()
In [0]:
model_softmax.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [0]:
model_softmax.fit(train_data.values.tolist(), train_labels_one_hot.values.tolist())
In [0]:
model_softmax.evaluate(test_data.values.tolist(), test_labels_one_hot.values.tolist())
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License