Multilabel Design Pattern

The Multilabel Design Pattern refers to models that can assign more than one label to a given input. This design requires changing the activation function used in the final output layer of your model, and choosing how your application will parse model output. Note that this is different from multiclass classification problems, where a single input is assigned exactly one label from a group of many (> 1) possible classes.
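
To make the distinction concrete, here's a minimal sketch of the two output-layer setups in Keras (NUM_FEATURES and NUM_CLASSES are hypothetical placeholders): softmax with categorical cross-entropy forces the class probabilities to sum to 1, while sigmoid with binary cross-entropy treats each class as an independent yes/no decision.

In [0]:
# Minimal sketch: the structural difference between a multiclass and a
# multilabel model is the final activation and the loss.
# NUM_FEATURES and NUM_CLASSES are hypothetical placeholders.
import tensorflow as tf

NUM_FEATURES, NUM_CLASSES = 100, 5

# Multiclass: exactly one label per input.
multiclass = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(NUM_FEATURES,), activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
multiclass.compile(loss='categorical_crossentropy', optimizer='adam')

# Multilabel: zero or more labels per input; each output is independent.
multilabel = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(NUM_FEATURES,), activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='sigmoid')
])
multilabel.compile(loss='binary_crossentropy', optimizer='adam')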


In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Embedding, Input, Flatten, Conv2D, MaxPooling2D

from sklearn.utils import shuffle
from sklearn.preprocessing import MultiLabelBinarizer

Building a multilabel model with sigmoid output

We'll be using a pre-processed version of the Stack Overflow dataset on BigQuery to run this code. You can download it from a publicly available Cloud Storage bucket.


In [6]:
!gsutil cp 'gs://ml-design-patterns/so_data.csv' .

🥑🥑🥑

We've pre-processed this dataset to replace any occurrence of a tag within a question with the word "avocado". For example, the question "How do I feed a pandas dataframe to a keras model?" becomes "How do I feed a avocado dataframe to a avocado model?" This keeps the model from simply learning to associate the literal appearance of a tag's name in a question with that tag, and instead forces it to learn more nuanced patterns in the question text.
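
The preprocessing itself happened upstream of this notebook, but here's a minimal sketch of the masking idea; the function and regex below are our own illustration, not the exact pipeline used.

In [0]:
# Hypothetical sketch of the tag-masking step; the real preprocessing ran
# upstream, this just illustrates the idea.
import re

def mask_tags(text, tags):
    # Replace each whole-word occurrence of a tag with 'avocado',
    # case-insensitively.
    for tag in tags:
        text = re.sub(r'\b' + re.escape(tag) + r'\b', 'avocado', text,
                      flags=re.IGNORECASE)
    return text

mask_tags('How do I feed a pandas dataframe to a keras model?',
          ['pandas', 'keras'])
# -> 'How do I feed a avocado dataframe to a avocado model?'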


In [4]:
data = pd.read_csv('so_data.csv', names=['tags', 'original_tags', 'text'], header=0)
data = data.drop(columns=['original_tags'])
data = data.dropna()

data = shuffle(data, random_state=22)
data.head()


Out[4]:
                    tags                                               text
182914  tensorflow,keras  avocado image captioning model not compiling b...
 48361            pandas  return excel file from avocado with flask in f...
181447  tensorflow,keras  validating with generator (avocado) i'm trying...
 66307            pandas  avocado multiindex dataframe selecting data gi...
 11283            pandas  get rightmost non-zero value position for each...

In [5]:
# Encode top tags to multi-hot
tags_split = [tags.split(',') for tags in data['tags'].values]
print(tags_split[0])


['tensorflow', 'keras']

In [8]:
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)
num_tags = len(tags_encoded[0])
print(data['text'].values[0][:110])
print(tag_encoder.classes_)
print(tags_encoded[0])


avocado image captioning model not compiling because of concatenate layer when mask_zero=true in a previous la
['keras' 'matplotlib' 'pandas' 'scikitlearn' 'tensorflow']
[1 0 0 0 1]
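
As a quick sanity check, MultiLabelBinarizer can also map a multi-hot vector back to tag names via inverse_transform:

In [0]:
# Decode the first multi-hot label vector back into tag names
print(tag_encoder.inverse_transform(tags_encoded[:1]))
# [('keras', 'tensorflow')]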

In [6]:
# Split our data into train and test sets
train_size = int(len(data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(data) - train_size))


Train size: 150559
Test size: 37640

In [0]:
# Split our labels into train and test sets
train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]

In [0]:
train_qs = data['text'].values[:train_size]
test_qs = data['text'].values[train_size:]

In [0]:
from tensorflow.keras.preprocessing import text

VOCAB_SIZE = 400  # This is a hyperparameter; try out different values for your dataset

tokenizer = text.Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_qs)

body_train = tokenizer.texts_to_matrix(train_qs)
body_test = tokenizer.texts_to_matrix(test_qs)

In [0]:
# Note we're using sigmoid output with binary_crossentropy loss
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(50, input_shape=(VOCAB_SIZE,), activation='relu'))
model.add(tf.keras.layers.Dense(25, activation='relu'))
model.add(tf.keras.layers.Dense(num_tags, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [12]:
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 50)                20050     
_________________________________________________________________
dense_1 (Dense)              (None, 25)                1275      
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 130       
=================================================================
Total params: 21,455
Trainable params: 21,455
Non-trainable params: 0
_________________________________________________________________

In [13]:
# Train and evaluate the model
model.fit(body_train, train_tags, epochs=3, batch_size=128, validation_split=0.1)
print('Eval loss/accuracy:{}'.format(
  model.evaluate(body_test, test_tags, batch_size=128)))


Epoch 1/3
1059/1059 [==============================] - 3s 3ms/step - loss: 0.1508 - accuracy: 0.8458 - val_loss: 0.1083 - val_accuracy: 0.8899
Epoch 2/3
1059/1059 [==============================] - 3s 3ms/step - loss: 0.1047 - accuracy: 0.8942 - val_loss: 0.1020 - val_accuracy: 0.8964
Epoch 3/3
1059/1059 [==============================] - 3s 2ms/step - loss: 0.0998 - accuracy: 0.8970 - val_loss: 0.0987 - val_accuracy: 0.8959
295/295 [==============================] - 0s 1ms/step - loss: 0.1024 - accuracy: 0.8956
Eval loss/accuracy:[0.10240215808153152, 0.895616352558136]

Parsing sigmoid results

Unlike with softmax output, we can't simply take the argmax of the output probability array: sigmoid probabilities are independent, so we need to apply a threshold to each class. In this case, we'll say that a tag is associated with a question if our model is more than 70% confident.

Below we'll print the original question along with our model's predicted tags.


In [0]:
# Get some test predictions
predictions = model.predict(body_test[:3])

In [44]:
classes = tag_encoder.classes_

for q_idx, probabilities in enumerate(predictions):
  print(test_qs[q_idx])
  for idx, tag_prob in enumerate(probabilities):
    if tag_prob > 0.7:
      print(classes[idx], round(tag_prob * 100, 2), '%')
  print('')


i want to subtract each column from the previous non-null column using the diff function i have a long list of columns and i want to subtract the previous column from the current column and replace the current column with the difference.  so if i have:  a   b   c   d 1  nan  3   7 3  nan  8   10 2  nan  6   11   i want the output to be:  a   b   c   d  1  nan  2   4 3  nan  5   2 2  nan  4   5   i have been trying to use this code:  df2 = df1.diff(axis=1) but this does not produce the desired output  thanks in advance.
pandas 99.8 %

how to merge all csv files in a folder to single csv ased on columns? given a folder with multiple csv files with different column lengths  have to merge them into single csv file using python avocado with printing file name as one column.  input: https://www.dropbox.com/sh/1mbgjtrr6t069w1/aadc3zrrzf33qbil63m1mxz_a?dl=0  output:   id  snack      price    sheetname 5   orange      55     sheet1 7   apple       53     sheet1 8   muskmelon   33     sheet1 11  orange             sheet2 12  green apple        sheet2 13  muskmelon          sheet2 
pandas 98.66 %

plot multiple values as ranges - avocado i'm trying to determine the most efficient way to produce a group of line plots displayed as a range. i'm hoping to produce something like:    i'll try explain as much as possible. sorry if i miss any information. i'm envisaging the x-axis to be a range timestamps of hours (8am-9am-10am etc). the total range would be between 8:00:00 and 27:00:00. the y-axis is a count of values occurring at any point in time. the range in the plot would represent the max, min, and average values occurring.  an example df is listed below:  import avocado as avocado import avocado.pyplot as avocado  d = ({     'time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],                      'occurring1' : ['1','2','3','4','5','5','6','6','7'],                'time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],                      'occurring2' : ['1','2','2','3','4','5','5','6','7'],      'time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],                      'occurring3' : ['1','2','3','4','4','5','6','7','8'],                           })  df = avocado.dataframe(data = d)   so this df represents 3 different sets of data. the times, values occurring and even number of entries can vary.  below is an initial example. although i'm unsure if i need to rethink my approach. would a rolling equation work here? something that assesses the max, min, avg number of values occurring for each hour in a df (8:00:00-9:00:00).  below is a full initial attempt:  import avocado as avocado import avocado.pyplot as avocado  d = ({     'time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],                      'occurring1' : ['1','2','3','4','5','5','6','6','7'],                'time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],                      'occurring2' : ['1','2','2','3','4','5','5','6','7'],      'time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],                      'occurring3' : ['1','2','3','4','4','5','6','7','8'],                           })  df = avocado.dataframe(data = d)  fig, ax = avocado.subplots(figsize = (10,6))  ax.plot(df['time1'], df['occurring1']) ax.plot(df['time2'], df['occurring2']) ax.plot(df['time3'], df['occurring3'])  avocado.show() 
matplotlib 82.55 %
pandas 75.02 %
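
Because sigmoid probabilities are independent, the threshold doesn't have to be the same for every tag. Here's a minimal sketch of per-class thresholds; the values below are made up for illustration, and in practice you'd tune them on a validation set.

In [0]:
# Hypothetical per-class thresholds, in the same order as
# tag_encoder.classes_ (['keras' 'matplotlib' 'pandas' 'scikitlearn'
# 'tensorflow']). These values are illustrative, not tuned.
thresholds = np.array([0.7, 0.8, 0.6, 0.8, 0.7])

for q_idx, probabilities in enumerate(predictions):
  predicted_tags = classes[probabilities > thresholds]
  print(test_qs[q_idx][:60], '->', list(predicted_tags))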

Sigmoid output for binary classification

Binary classification is the one single-label task where you'd still want to use sigmoid output: with only two classes, a single sigmoid unit already gives you the probability of one class (and, by complement, the other). A 2-element softmax output is redundant and can increase training time.

To demonstrate this we'll build a model on the UCI mushroom dataset to determine whether a mushroom is edible or poisonous.


In [2]:
# First, download the data. We've made it publicly available in Google Cloud Storage
!gsutil cp gs://ml-design-patterns/mushrooms.csv .


Copying gs://ml-design-patterns/mushrooms.csv...
/ [1 files][365.2 KiB/365.2 KiB]                                                
Operation completed over 1 objects/365.2 KiB.                                    

In [3]:
mushroom_data = pd.read_csv('mushrooms.csv')
mushroom_data.head()


Out[3]:
class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color stalk-shape stalk-root stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k e e s s w w p w o p k s u
1 e x s y t a f c b k e c s s w w p w o p n n g
2 e b s w t l f c b n e c s s w w p w o p n n m
3 p x y w t p f c n n e e s s w w p w o p k s u
4 e x s g f n f w b k t e s s w w p w o e n a g

To keep things simple, we'll first convert the label column to numeric and then use pd.get_dummies() to convert the categorical feature columns to one-hot numeric columns.


In [0]:
# 1 = edible, 0 = poisonous
mushroom_data.loc[mushroom_data['class'] == 'p', 'class'] = 0
mushroom_data.loc[mushroom_data['class'] == 'e', 'class'] = 1

In [0]:
labels = mushroom_data.pop('class')

In [0]:
dummy_data = pd.get_dummies(mushroom_data)

In [0]:
# Split the data
train_size = int(len(mushroom_data) * .8)

train_data = dummy_data[:train_size]
test_data = dummy_data[train_size:]

train_labels = labels[:train_size]
test_labels = labels[train_size:]

In [0]:
model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(len(dummy_data.iloc[0]),), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

In [9]:
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 32)                3776      
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 264       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
=================================================================
Total params: 4,049
Trainable params: 4,049
Non-trainable params: 0
_________________________________________________________________

In [0]:
# Since we're using sigmoid output, we use binary_crossentropy for our loss function
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [43]:
model.fit(train_data.values.tolist(), train_labels.values.tolist())


204/204 [==============================] - 0s 1ms/step - loss: 0.0018 - accuracy: 1.0000
Out[43]:
<tensorflow.python.keras.callbacks.History at 0x7fc9f4f3fcc0>

In [44]:
model.evaluate(test_data.values.tolist(), test_labels.values.tolist())


51/51 [==============================] - 0s 1ms/step - loss: 0.0310 - accuracy: 0.9865
Out[44]:
[0.031002074480056763, 0.9864615201950073]

Sidebar: for comparison, let's train the same model but use a 2-element softmax output layer.

This is an anti-pattern. It's better to use sigmoid for binary classification.

Note the increased number of trainable parameters in the output layer of the softmax model. You can imagine how this could increase training time for larger models.
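
The arithmetic: with an 8-unit penultimate layer, the sigmoid head needs 8 × 1 weights + 1 bias = 9 parameters, while the 2-element softmax head needs 8 × 2 weights + 2 biases = 18, double the parameters for no additional modeling power.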


In [0]:
# First, transform the label column to one-hot
def to_one_hot(data):
  if data == 0:
    return [1, 0]
  else:
    return [0, 1]

In [0]:
train_labels_one_hot = train_labels.apply(to_one_hot)
test_labels_one_hot = test_labels.apply(to_one_hot)

In [0]:
model_softmax = keras.Sequential([
    keras.layers.Dense(32, input_shape=(len(dummy_data.iloc[0]),), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(2, activation='softmax')
])

In [35]:
model_softmax.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_3 (Dense)              (None, 32)                3776      
_________________________________________________________________
dense_4 (Dense)              (None, 8)                 264       
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 18        
=================================================================
Total params: 4,058
Trainable params: 4,058
Non-trainable params: 0
_________________________________________________________________

In [0]:
model_softmax.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [0]:
model_softmax.fit(train_data.values.tolist(), train_labels_one_hot.values.tolist())

In [0]:
model_softmax.evaluate(test_data.values.tolist(), test_labels_one_hot.values.tolist())

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License