What Rangespan does

  • Large catalog of products for retailers like Tesco and Argos
  • Middleman between Tesco/Argos and customer orders/returns
  • Offer a search engine over products, e.g. audio products.
  • Retailers think in terms of product categories, especially for search

Taxonomy Classification

  • Initially contracted out classification to manual workers
    • Amazon Mechanical Turk
    • Outsourced to low-wage countries
  • Categories structured as hierarchical tree.
    • Root -> Electronics -> Audio -> Amps
  • Input: raw product data, output: category.

Pipeline

  • Data gathering
    • Name, Manufacturer, Description, Label
  • Feature extraction

Feature extraction

  • Text cleaning (stopword removal, lemmatisation)
  • Unigram and Bigram Features
  • Latent Dirichlet allocation (LDA) Topic Features
    • Run a topic model over each product's text, unsupervised (see the sketch after this list).
    • In production the model uses 50 topics.
    • Built with gensim, a topic-modelling library in Python.
  • Not the focus of the talk.
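
A minimal sketch of the LDA topic-feature step, using gensim as the talk mentions; the toy texts, tokenisation, and the num_topics value below are illustrative assumptions, not Rangespan's production setup (which used 50 topics).

    from gensim import corpora, models

    # Toy product texts; in production these would be the cleaned product fields.
    texts = [doc.lower().split() for doc in [
        "portable fm radio with speaker",
        "stereo amplifier audio amp",
        "kitchen blender with glass jug",
    ]]
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Unsupervised topic model; num_topics=5 is a toy value, the talk quotes 50.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=5)

    # Topic features for a new product: a sparse list of (topic id, weight) pairs.
    print(lda[dictionary.doc2bow("digital dab radio".split())])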

Training, Testing, and Labelling

Hierarchical Classification

  • One approach: take the multi-level hierarchy and flatten it.
    • Root -> [A, B, C], B -> [D, E]
    • Flatten to a four-way classifier over the leaves (A, C, D, E); B is an internal node, not a target class.
    • Take your favourite classifier, done.
  • But with 4000 classes, doesn't really scale.
  • Alternative: create one classifier for every internal node (see the sketch after this list).
    • At the root, classify into [A], [B] or [C].
    • If B, then classify into [D] or [E].
    • Hence two classifiers.
    • A 3-way classification at the root and a 2-way one at B.
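
A minimal sketch of the one-classifier-per-internal-node setup; the tiny tree, the toy features, and the scikit-learn choice are illustrative assumptions (the classifier the talk actually settled on is discussed below).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Root -> [A, B, C] and B -> [D, E]: two internal nodes, hence two classifiers.
    children = {"ROOT": ["A", "B", "C"], "B": ["D", "E"]}

    # One multiclass classifier per internal node. Each is trained only on the
    # products whose leaf category sits below that node, with the child of the
    # node on the path to the leaf as the target label.
    node_classifiers = {node: LogisticRegression() for node in children}

    # e.g. the ROOT model is a 3-way classifier over A / B / C (toy features):
    X_root = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8]])
    y_root = ["A", "B", "C", "B"]
    node_classifiers["ROOT"].fit(X_root, y_root)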

What classifier to use?

  • Want to extract value from all the feature engineering they did.
  • Want classifier that supports multiclass classification.
  • Ended up choosing logistic regression: simple and easy to work with.
    • Bag of words, with a weight on each word for a given classification label.
  • Need a probability output, normalised to [0.0, 1.0].

How to Train Logistic Regression

$ \min_{\beta} \; -\sum_n \log p(y_n \mid X_n, \beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 $

  • Wealth of tools to optimise objective function.
  • Optimise using Wapiti http://wapiti.limsi.fr
    • Segments and labels sequences.
    • Not well known.
    • Extremely fast, vectorised C.
  • Nowadays could use scikit-learn.
  • The lambdas are regularisation weights: they assign a cost to extra (non-zero) parameters.
    • Just try different hyperparameter values (lambdas) using grid search (see the sketch after this list).
    • In production they try 20 hyperparameter values.
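
A hedged sketch of the lambda grid search using scikit-learn, as the talk notes one could use nowadays (production used Wapiti); the synthetic data, the grid values, and the elastic-net solver settings are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in for the product feature matrix and category labels.
    X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                               n_informative=5, random_state=0)

    # The elastic-net penalty mirrors the lambda_1 / lambda_2 terms above:
    # C is the inverse regularisation strength, l1_ratio mixes L1 and L2.
    grid = GridSearchCV(
        LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0], "l1_ratio": [0.0, 0.5, 1.0]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_)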

What to train

  • One classifier for every internal node. ROOT node is an internal node.
  • Note that each data point is copied to every internal node on the path from ROOT to its leaf category (see the lineage sketch after this list).
    • e.g. a radio appears in the training data for ROOT and Electronics.
    • five levels implies five copies.
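
A small illustrative helper (hypothetical, not Rangespan's code) showing how one labelled product fans out to every internal node above it; the parent map is a toy version of the category tree. The same idea appears as lineage() in the MapReduce sketch later.

    # Toy parent pointers for the path Root -> Electronics -> Audio -> Amps.
    parent = {"Amps": "Audio", "Audio": "Electronics", "Electronics": "ROOT"}

    def lineage(category):
        """Return the internal nodes on the path from ROOT down to `category`."""
        path, node = [], category
        while node in parent:            # stop once we have stepped past ROOT
            node = parent[node]
            path.append(node)
        return list(reversed(path))

    print(lineage("Amps"))               # ['ROOT', 'Electronics', 'Audio']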

How to train

  • Two stages
  • Cross-validation.
    • Estimate classifier errors.
    • Do not test on training data.
    • Have three sets of data: training, cross-validation, testing.
    • They split training set into 5 chunks. 5-fold cross validation.
  • Calibration
    • Are the probability estimates correct?
    • e.g. when the classifier reports 90% confidence, roughly 90% of those labels should actually be correct (see the sketch after this list).
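
A hedged sketch of the calibration check on held-out predictions: bucket the classifier's confidence and compare each bucket's mean confidence with the fraction of predictions that were actually right. The arrays and bucket edges below are made-up examples.

    import numpy as np

    # Held-out predictions: the model's confidence and whether it was correct.
    confidence = np.array([0.95, 0.90, 0.88, 0.72, 0.70, 0.55, 0.91, 0.60])
    correct = np.array([True, True, True, True, False, False, True, True])

    # Well calibrated: within each bucket, observed accuracy should sit close
    # to the mean predicted confidence (e.g. roughly 90% right around 0.9).
    for lo, hi in [(0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]:
        mask = (confidence >= lo) & (confidence < hi)
        if mask.any():
            print(lo, hi, confidence[mask].mean(), correct[mask].mean())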

How to use the model

  • Chain the per-node classifiers together using the chain rule of probability.

    p(ROOT, Electronics, ... | X) = p(ROOT | X) p(Electronics | ROOT, X) ...

  • Traverse the tree greedily, following the most probable child at each node (see the sketch below).
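
A minimal sketch of the greedy descent under the chain rule above; the tiny tree, the stand-in predict_proba, and the example product are illustrative assumptions, not the production code.

    children = {"ROOT": ["Electronics", "Home"], "Electronics": ["Audio", "TV"]}

    def predict_proba(node, x):
        # Stand-in for the per-node logistic regression probability output.
        fake = {"ROOT": {"Electronics": 0.8, "Home": 0.2},
                "Electronics": {"Audio": 0.6, "TV": 0.4}}
        return fake[node]

    def classify(x):
        node, prob, path = "ROOT", 1.0, []
        while node in children:                  # stop once we reach a leaf
            scores = predict_proba(node, x)
            node = max(scores, key=scores.get)   # greedy: follow the best child
            prob *= scores[node]                 # chain the path probability
            path.append(node)
        return path, prob

    print(classify({"name": "dab radio"}))       # (['Electronics', 'Audio'], ~0.48)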

How to re-use human knowledge

  • Active learning
  • Some data is labelled, some data isn't.
  • Especially helpful for novel data, e.g. a vuvuzela, completely unseen before.
  • For unknown items, or decisions close to the decision boundary, send the data to humans via Amazon Mechanical Turk (see the sketch after this list).
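
A hedged sketch of the routing rule: items whose best predicted probability is low (unfamiliar products, or decisions near the boundary) go to human labellers. The threshold and the probability rows are made-up examples.

    import numpy as np

    CONFIDENCE_THRESHOLD = 0.7   # illustrative cut-off, not the production value

    # Each row: predicted class probabilities for one unlabelled product.
    probs = np.array([
        [0.96, 0.02, 0.02],      # confident: keep the model's label
        [0.40, 0.35, 0.25],      # uncertain: route to Mechanical Turk
    ])

    needs_human = probs.max(axis=1) < CONFIDENCE_THRESHOLD
    print(needs_human)           # [False  True]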

Implementation

  • Simple MapReduce jobs handle data cleaning, feature extraction, and model training; the training mapper and reducer are sketched below.

    def mapper(key, value):
        # value: (product id, labelled leaf category, extracted features).
        id, category, features = value
        # Illustrative fold assignment so the reducer can hold one fold out.
        record_fold = hash(id) % num_folds
        # Fan the record out to every internal node on the path to its leaf
        # category, for every hyperparameter setting and every held-out fold.
        for subcat in lineage(category):
            for hyper in hyper_params:           # the ~20 lambda settings
                for fold in range(num_folds):
                    yield (
                        (fold, subcat, hyper),
                        (record_fold, id, category, features),
                    )


    def reducer(key, data):
        fold, subcat, hyper = key
        # Train on every record that is *not* in the held-out fold.
        model = wapiti.train(
            [d for d in data if d[0] != fold]
        )
        ...
  • Use Dumbo on Hadoop (launch command sketched below).
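
For context, a Dumbo job of this shape is launched from the command line roughly as below; the script name, Hadoop path, and input/output directories are placeholders.

    $ dumbo start train_classifiers.py \
          -hadoop /usr/lib/hadoop \
          -input features/ -output models/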

Thoughts

  • Most thought went into feature engineering
    • The LDA topic model, and how to clean the text.
