Pipeline
- Data gathering
- Name, Manufacturer, Description, Label
- Feature extraction
- Text cleaning (stopword removal, lemmatisation)
- Unigram and bigram features (sketch after this list)
- Latent Dirichlet allocation (LDA) Topic Features
- Run the topic model over each product, unsupervised.
- In production they use 50 topics.
- gensim, a topic-modelling library in Python.
- Not the focus of the talk.
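A minimal sketch of the feature extraction, assuming gensim's standard LdaModel API; the toy descriptions, stopword list, and bigram scheme are illustrative, not their actual pipeline:

from gensim import corpora, models

STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'with'}  # toy stopword list

def tokens(text):
    # Cleaned unigrams plus adjacent-word bigrams.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return words + ['%s_%s' % (a, b) for a, b in zip(words, words[1:])]

descriptions = ['sony fm radio with alarm clock',
                'canon digital camera and case']  # toy product descriptions
docs = [tokens(d) for d in descriptions]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50)  # 50 topics in production
topic_features = lda[corpus[0]]  # sparse (topic_id, weight) pairs for one product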
Training, Testing, and Labelling
Hierarchical Classification
- One approach: take the multi-level hierarchy and flatten it.
- Root -> [A, B, C], B -> [D, E]
- Flatten to a four-way classifier over the leaves (A, C, D, E); B is an internal node.
- Take your favourite classifier, done.
- But with 4,000 classes this doesn't really scale.
- Alternative: create a classifier for every internal node.
- At the root, classify into [A, C] or [B].
- If B, then classify into [D, E].
- Hence two classifiers here: one 3-way and one 2-way multiclass classification (sketch below).
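A sketch of the per-node approach on the toy hierarchy above; `classifiers` is a hypothetical dict of trained models, one per internal node:

children = {'ROOT': ['A', 'B', 'C'], 'B': ['D', 'E']}  # leaves: A, C, D, E

def classify(x, classifiers):
    # Walk down from ROOT, letting each internal node's model pick a child.
    node = 'ROOT'
    while node in children:
        node = classifiers[node].predict(x)  # returns one of children[node]
    return node  # a leaf category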
What classifier to use?
- Want to extract value from all the feature engineering they did.
- Want classifier that supports multiclass classification.
- Ended up choosing logistic regression: it's easy.
- Bag of words: each word gets a weight per classification label.
- Need a probability output, normalised to [0.0, 1.0] (demo below).
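A minimal scikit-learn sketch (their stack used Wapiti, below) of multiclass logistic regression with a normalised probability output:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 20)        # toy feature vectors
y = np.random.randint(0, 3, 100)   # three toy classes
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X[:1])[0]
assert abs(probs.sum() - 1.0) < 1e-6  # a proper distribution over the classes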
How to Train Logistic Regression
$ \min_{\beta} -\sum_n \log p(y_n \mid X_n, \beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 $
- Wealth of tools to optimise objective function.
- Optimise using Wapiti http://wapiti.limsi.fr
- Segments and labels sequences.
- Not well known.
- Extremely fast, vectorised C.
- Nowadays could use scikit-learn.
- The lambdas control regularisation: they assign a cost to extra (non-zero) parameters.
- Just try different hyperparameter values (lambdas) using grid search (sketch below).
- In production they try 20 hyperparameter values.
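A hedged sketch of the grid search in scikit-learn; sklearn parametrises regularisation as C (inverse strength) plus an L1/L2 mix via l1_ratio, so the 10 x 2 grid below stands in for their 20 lambda settings:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 20)                 # toy data
y = np.random.randint(0, 3, 200)
grid = {'C': np.logspace(-3, 2, 10),        # 10 strengths x 2 mixes = 20 settings
        'l1_ratio': [0.2, 0.8]}
search = GridSearchCV(
    LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000),
    grid, cv=5)                             # 5-fold cross-validation, as in the talk
search.fit(X, y)
print(search.best_params_)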
What to train
- One classifier for every internal node; the ROOT node is an internal node.
- Each data point is copied to every internal node on the path from ROOT to its leaf category (helper sketched below).
- e.g. a radio is training data for both ROOT and Electronics.
- A five-level path implies five copies.
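A sketch of the copy-out with a toy parents map; this hypothetical `lineage` helper reappears in the implementation mapper below:

parents = {'Radio': 'Electronics', 'Electronics': 'ROOT'}  # toy two-level path

def lineage(category):
    # Every internal node on the path from ROOT down to this leaf.
    path, node = [], category
    while node in parents:
        node = parents[node]
        path.append(node)
    return path[::-1]  # lineage('Radio') -> ['ROOT', 'Electronics']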
How to train
- Two stages
- Cross-validation.
- Estimate classifier errors.
- Do not test on training data.
- Have three sets of data: training, cross-validation, testing.
- They split the training set into 5 chunks: 5-fold cross-validation.
- Calibration
- Are my probability estimates correct?
- Make sure that of the labels predicted with 90% confidence, about 90% are correct (check sketched below).
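A minimal calibration check, bucketing held-out predictions by confidence; the names are illustrative:

import numpy as np

def calibration_report(confidence, correct, n_bins=10):
    # In each bucket, mean confidence should match observed accuracy.
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            print('confidence ~%.2f -> accuracy %.2f'
                  % (confidence[mask].mean(), correct[mask].mean()))

Well calibrated: items predicted with ~0.9 confidence come out right ~90% of the time.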
How to use model
Chain the per-node classifiers using the chain rule of probability.
p(ROOT, Electronics, ... | X) = p(ROOT | X) p(Electronics | ROOT, X) ...
Use a greedy algorithm to pick the most probable path rather than scoring every path (sketch below).
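A sketch of the greedy descent chaining per-node probabilities; `classifiers` and `children` are hypothetical stand-ins for the trained models and the category tree:

def greedy_path(x, classifiers, children):
    # Follow the most probable child at each internal node.
    node, path, prob = 'ROOT', ['ROOT'], 1.0
    while node in classifiers:
        p = classifiers[node].predict_proba(x)[0]  # distribution over children
        best = p.argmax()
        prob *= p[best]          # p(path | X) = product of the per-node terms
        node = children[node][best]
        path.append(node)
    return path, prob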
How to re-use human knowledge
- Active learning
- Some data is labelled, some data isn't.
- Especially helpful for novel data, e.g. a vuvuzela, completely unseen before.
- For unknown items, or decisions close to the decision boundary, send the data to humans, in practice Amazon Mechanical Turk (filter sketched below).
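A sketch of the uncertainty filter; the confidence and margin thresholds are hypothetical:

def needs_human(probs, min_conf=0.5, min_margin=0.1):
    # Low confidence, or too close to the decision boundary: route to Mechanical Turk.
    top = sorted(probs, reverse=True)
    return top[0] < min_conf or top[0] - top[1] < min_margin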
Implementation
Simple MapReduce tasks for data cleaning, feature extraction, and training. A reconstruction of the training fan-out, with Dumbo-style key/value signatures; the key layout and the wapiti call are schematic:

def mapper(key, value):
    product_id, category, features = value
    fold = hash(product_id) % num_folds  # assign each product to a CV fold
    # Fan the product out to every internal node on its path (lineage, above),
    # crossed with every hyperparameter value and every held-out fold.
    for subcat in lineage(category):
        for hyper in hypers:
            for heldout in range(num_folds):
                yield (subcat, hyper, heldout), (product_id, category, features, fold)

def reducer(key, data):
    subcat, hyper, heldout = key
    # Train on everything outside the held-out fold.
    train = [d for d in data if d[3] != heldout]
    model = wapiti.train(train, hyper)
    ...
Use Dumbo on Hadoop
Thoughts
- Most thought went into feature engineering.
- e.g. the LDA topic model and how to clean the text.