Pipeline
- Data gathering
- Name, Manufacturer, Description, Label
- Feature extraction
- Text cleaning (stopword removal, lemmatisation)
- Unigram and bigram features (sketch after this list)
- Latent Dirichlet allocation (LDA) Topic Features
- Run the topic model over each product, unsupervised.
- In production they use 50 topics.
- gensim, a topic-modelling library in Python.
- Not the focus of the talk.
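A minimal sketch of the feature extraction, assuming gensim's standard LdaModel API; the toy descriptions, stopword list, and bigram scheme are illustrative, not their actual pipeline:

from gensim import corpora, models

STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'with'}  # toy stopword list

def tokens(text):
    # Cleaned unigrams plus adjacent-word bigrams.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return words + ['%s_%s' % (a, b) for a, b in zip(words, words[1:])]

descriptions = ['sony fm radio with alarm clock',
                'canon digital camera and case']  # toy product descriptions
docs = [tokens(d) for d in descriptions]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50)  # 50 topics in production
topic_features = lda[corpus[0]]  # sparse (topic_id, weight) pairs for one product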
Training, Testing, and Labelling
Hierarchical Classification
- One approach: take the multi-level hierarchy and flatten it.
- Root -> [A, B, C], B -> [D, E]
- Flatten to a four-way classifier over the leaves (A, C, D, E); B is an internal node.
- Take your favourite classifier, done.
- But with 4,000 classes this doesn't really scale.
- Alternative: create a classifier for every internal node.
- At the root, classify into [A, C] or [B].
- If B, then classify into [D, E].
- Hence two classifiers here: one 3-way and one 2-way multiclass classification (sketch below).
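A sketch of the per-node approach on the toy hierarchy above; `classifiers` is a hypothetical dict of trained models, one per internal node:

children = {'ROOT': ['A', 'B', 'C'], 'B': ['D', 'E']}  # leaves: A, C, D, E

def classify(x, classifiers):
    # Walk down from ROOT, letting each internal node's model pick a child.
    node = 'ROOT'
    while node in children:
        node = classifiers[node].predict(x)  # returns one of children[node]
    return node  # a leaf category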
What classifier to use?
- Want to extract value from all the feature engineering they did.
- Want classifier that supports multiclass classification.
- Ended up choosing logistic regression: it's easy.
- Bag of words: each word gets a weight per classification label.
- Need a probability output, normalised to [0.0, 1.0] (demo below).
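A minimal scikit-learn sketch (their stack used Wapiti, below) of multiclass logistic regression with a normalised probability output:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 20)        # toy feature vectors
y = np.random.randint(0, 3, 100)   # three toy classes
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X[:1])[0]
assert abs(probs.sum() - 1.0) < 1e-6  # a proper distribution over the classes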
How to Train Logistic Regression
$ \min_{\beta} -\sum_n \log p(y_n \mid X_n, \beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 $
- Wealth of tools to optimise objective function.
- Optimise using Wapiti http://wapiti.limsi.fr
- Segments and labels sequences.
- Not well known.
- Extremely fast, vectorised C.
- Nowadays could use scikit-learn.
- The lambdas control regularisation: they assign a cost to extra (non-zero) parameters.
- Just try different hyperparameter values (lambdas) using grid search (sketch below).
- In production they try 20 hyperparameter values.
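A hedged sketch of the grid search in scikit-learn; sklearn parametrises regularisation as C (inverse strength) plus an L1/L2 mix via l1_ratio, so the 10 x 2 grid below stands in for their 20 lambda settings:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 20)                 # toy data
y = np.random.randint(0, 3, 200)
grid = {'C': np.logspace(-3, 2, 10),        # 10 strengths x 2 mixes = 20 settings
        'l1_ratio': [0.2, 0.8]}
search = GridSearchCV(
    LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000),
    grid, cv=5)                             # 5-fold cross-validation, as in the talk
search.fit(X, y)
print(search.best_params_)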
What to train
- One classifier for every internal node; the ROOT node is an internal node.
- Each data point is copied to every internal node on the path from ROOT to its leaf category (helper sketched below).
- e.g. a radio is training data for both ROOT and Electronics.
- A five-level path implies five copies.
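A sketch of the copy-out with a toy parents map; this hypothetical `lineage` helper reappears in the implementation mapper below:

parents = {'Radio': 'Electronics', 'Electronics': 'ROOT'}  # toy two-level path

def lineage(category):
    # Every internal node on the path from ROOT down to this leaf.
    path, node = [], category
    while node in parents:
        node = parents[node]
        path.append(node)
    return path[::-1]  # lineage('Radio') -> ['ROOT', 'Electronics']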
How to train
- Two stages
- Cross-validation.
- Estimate classifier errors.
- Do not test on training data.
- Have three sets of data: training, cross-validation, testing.
- They split the training set into 5 chunks: 5-fold cross-validation.
- Calibration
- Are my probability estimates correct?
- Make sure that of the labels predicted with 90% confidence, about 90% are correct (check sketched below).
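A minimal calibration check, bucketing held-out predictions by confidence; the names are illustrative:

import numpy as np

def calibration_report(confidence, correct, n_bins=10):
    # In each bucket, mean confidence should match observed accuracy.
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            print('confidence ~%.2f -> accuracy %.2f'
                  % (confidence[mask].mean(), correct[mask].mean()))

Well calibrated: items predicted with ~0.9 confidence come out right ~90% of the time.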
How to use model
Chain the per-node classifiers using the chain rule of probability.
p(ROOT, Electronics, ... | X) = p(ROOT | X) p(Electronics | ROOT, X) ...
Use a greedy algorithm to pick the most probable path rather than scoring every path (sketch below).
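A sketch of the greedy descent chaining per-node probabilities; `classifiers` and `children` are hypothetical stand-ins for the trained models and the category tree:

def greedy_path(x, classifiers, children):
    # Follow the most probable child at each internal node.
    node, path, prob = 'ROOT', ['ROOT'], 1.0
    while node in classifiers:
        p = classifiers[node].predict_proba(x)[0]  # distribution over children
        best = p.argmax()
        prob *= p[best]          # p(path | X) = product of the per-node terms
        node = children[node][best]
        path.append(node)
    return path, prob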
How to re-use human knowledge
- Active learning
- Some data is labelled, some data isn't.
- Especially helpful for novel data, e.g. a vuvuzela, completely unseen before.
- For unknown items, or decisions close to the decision boundary, send the data to humans, in practice Amazon Mechanical Turk (filter sketched below).
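A sketch of the uncertainty filter; the confidence and margin thresholds are hypothetical:

def needs_human(probs, min_conf=0.5, min_margin=0.1):
    # Low confidence, or too close to the decision boundary: route to Mechanical Turk.
    top = sorted(probs, reverse=True)
    return top[0] < min_conf or top[0] - top[1] < min_margin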
Implementation
Simple MapReduce tasks for data cleaning, feature extraction, and training. A reconstruction of the training fan-out, with Dumbo-style key/value signatures; the key layout and the wapiti call are schematic:

def mapper(key, value):
    product_id, category, features = value
    fold = hash(product_id) % num_folds  # assign each product to a CV fold
    # Fan the product out to every internal node on its path (lineage, above),
    # crossed with every hyperparameter value and every held-out fold.
    for subcat in lineage(category):
        for hyper in hypers:
            for heldout in range(num_folds):
                yield (subcat, hyper, heldout), (product_id, category, features, fold)

def reducer(key, data):
    subcat, hyper, heldout = key
    # Train on everything outside the held-out fold.
    train = [d for d in data if d[3] != heldout]
    model = wapiti.train(train, hyper)
    ...
Use Dumbo on Hadoop
Thoughts
- Most thought went into feature engineering.
- e.g. the LDA topic model and how to clean the text.