Document classification

  • data is stored in textual format
  • applications in information management and in document retrieval, routing, and filtering systems
    • document indexing based on a controlled vocabulary
    • document filtering: relevant vs irrelevant
    • automated metadata generation
    • word sense disambiguation
    • population of hierarchical catalogues of web resources
    • speech categorization
    • multimedia doc categorization
    • author identification
    • language identification
    • grouping of patients into categories
    • grouping of conference papers
  • Advantages of ML-based classification:
    • savings in expert labour
    • portability across the many possible application domains
    • document volumes are far too large for manual classification to be plausible
  • challenges:
    • semantics (e.g. homonyms)
    • the high-dimensionality problem
    • document representation, classifier construction, and the classification step itself
  • single-label vs multilabel
    • single-label (binary) is the more general case: a multilabel problem can be decomposed into independent binary problems, one per category
  • category-pivoted vs document-pivoted text classification
    • DP: given a document, find all categories it belongs to
    • CP: given a category, find all documents that belong to it
  • hard vs ranking classification
    • a hard binary decision vs a ranked output cut by a threshold
    • threshold obtained analytically vs experimentally

Representation

  • Transform the full text into a document vector
    • vector space model: each document is represented by a vector of words
    • Bag of words / vector space model
      • Disadvantages: high dimensionality & loss of information (correlations with adjacent words, semantic relationships)
      • partly mitigated by term weighting
  • types of document representation
    • N-gram
    • single terms
    • phrases
    • RDR (logic-based document representation)
  • Term weighting metrics (see the TF-IDF sketch after this list):

    • Boolean weighting
    • word frequency weighting
    • TF-IDF
    • entropy
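
A minimal TF-IDF sketch on a toy corpus, using scikit-learn's TfidfVectorizer; the corpus and the default settings here are illustrative, not from the notes:

```python
# TF-IDF: term frequency scaled by inverse document frequency, so terms that
# occur in few documents get higher weight than corpus-wide common terms.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()        # default smoothed idf: log((1+n)/(1+df)) + 1
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

for term, col in sorted(vectorizer.vocabulary_.items()):
    print(f"{term:>5}  idf = {vectorizer.idf_[col]:.3f}")
```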
  • semantic and ontology-based document representation

    • Advantages:
      • better classification when semantics are considered (via an ontology)
      • an ontology can provide expert, background knowledge
    • Disadvantages:
      • need for a large number of training text terms
      • translatability between languages
      • semantically representing text is itself a challenge
      • polysemy, synonymy
  • indexing: procedure that maps a text document into a compact representation of its content

    • lexical semantics: the choice of meaningful units
    • compositional semantics: how to combine those units
      • usually disregarded in TC; vector weights (e.g. TF-IDF) are used instead
  • Darmstadt indexing approach (DIA) vs bag-of-words approach
    • DIA: uses a wider set of features, e.g. the location of a term within the document

Pre-processing

  • Tokenization
  • Removing stop words
  • Stemming
    • morphological root
    • sometimes reported to hurt effectiveness
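
A minimal preprocessing sketch, pure Python plus NLTK's Porter stemmer; the stop-word set here is a tiny illustrative stand-in for a real list:

```python
# Tokenize, drop stop words, stem each remaining token to its morphological root.
import re
from nltk.stem import PorterStemmer  # pip install nltk

STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to", "in"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("The classifiers are trained on stemmed documents"))
# -> something like ['classifi', 'train', 'stem', 'document']
```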

Feature selection

  • dimensionality reduction: omitting unimportant features
    • feature extraction vs feature selection methods
    • prevent overfitting
  • Select a subset of features by some criterion; keep the words with the highest score according to a predetermined measure of importance (see the chi-square sketch after this list):
    • Information gain*
    • Term frequency
    • Chi-square*
    • Expected cross entropy
    • Odds Ratio
    • weight of evidence for text
    • Mutual information
    • Gini index
    • genetic algorithm (GA) optimization (?)
    • Gain ratio
    • Conditional mutual information
    • Doc frequency
    • Inverse doc frequency
    • Term
    • Weighted Ratio
    • DIA association factor
    • NGL coefficient
    • relevancy score
    • GSS coefficient

* denotes most commonly used
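
A filter-style feature-selection sketch using the chi-square criterion via scikit-learn's SelectKBest; the toy spam/ham data and k are illustrative:

```python
# Keep the k terms whose occurrence is most dependent on the class label,
# as measured by chi-square (a filter method: no classifier in the loop).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
kept = selector.get_support(indices=True)
print(vec.get_feature_names_out()[kept])  # the 3 most class-dependent terms
```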

  • wrapper vs filter methods
    • wrappers: use the classification accuracy of the learning algorithm itself as the evaluation function
      • not suitable for text classification because of the high dimensionality
    • filters: independent of the learning algorithm; use a separate evaluation metric
  • Other approaches
    • reduce the term set by a factor of 10
    • remove all terms occurring in at most x training documents
  • feature selection should consider domain and algorithm characteristics

Feature extraction

  • clustering
  • Latent semantic indexing (see the sketch after this list)
    • lower-dimensional space obtained by looking at patterns of term co-occurrence in the original dimensions
    • computed via SVD
    • brings out the latent semantics of the vocabulary
    • LSI vs $\chi^2$
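
An LSI sketch via truncated SVD of the TF-IDF matrix, using scikit-learn's TruncatedSVD; the corpus and the number of latent dimensions are illustrative:

```python
# Project documents from the term space into a low-dimensional latent space
# derived from term co-occurrence patterns (truncated SVD of TF-IDF matrix).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stocks fell on wall street", "shares and stocks declined",
        "the team won the match", "players celebrated the win"]

X = TfidfVectorizer().fit_transform(docs)  # documents x terms
lsi = TruncatedSVD(n_components=2)         # keep 2 latent dimensions
Z = lsi.fit_transform(X)                   # documents x 2
print(Z.round(2))  # related documents end up close in the latent space
```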
  • learning process: finding the attributes of examples that distinguish objects of separate classes
    • avoid overfitting; use cross-validation

Classifiers

Linear Least Square Fit

  • mapping approach: learns a linear mapping from input (document) vectors to output (category) vectors
  • trained on input/output vector pairs
  • a regression method
  • Disadvantage:
    • computational cost
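
A minimal LLSF sketch, reading it as multivariate linear regression: learn a matrix W mapping document vectors to category indicator vectors by least squares. The tiny dense matrices are illustrative:

```python
# Solve min_W ||X W - Y||^2: X holds document (term) vectors, Y holds
# one-hot category vectors; W then maps any new document to category scores.
import numpy as np

X = np.array([[2., 0., 1.],   # documents x terms (toy term counts)
              [0., 3., 1.],
              [1., 0., 2.]])
Y = np.array([[1., 0.],       # documents x categories (one-hot)
              [0., 1.],
              [1., 0.]])

W, *_ = np.linalg.lstsq(X, Y, rcond=None)
new_doc = np.array([1., 0., 1.])
print((new_doc @ W).round(2))  # category scores; take argmax or threshold
```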

Rocchio's Algorithm

  • originally developed for relevance feedback in IR
  • vector space method
  • each category is represented by its centroid (average vector)
  • inductive learning process
  • Advantage:
    • easy to implement
    • low computational cost/efficient
    • fast learner
    • more readily understandable by humans than a neural network
  • Disadvantage:
    • low classification accuracy when categories tend to occur in disjoint clusters
    • linear classification is too simple
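
A Rocchio-style sketch in numpy: classify by similarity to per-category centroids. Note this is the common centroid simplification; full Rocchio also subtracts a γ-weighted negative prototype built from non-member documents:

```python
# Each category's prototype is the mean of its training vectors; a new
# document gets the label of the most cosine-similar centroid.
import numpy as np

def train_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, centroids):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(x, centroids[c]))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy doc vectors
y = np.array(["sports", "sports", "finance", "finance"])

centroids = train_centroids(X, y)
print(classify(np.array([0.8, 0.2]), centroids))  # -> "sports"
```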

KNN

  • case-based / instance-based learning algorithm
  • example-based classifier; a lazy learner
  • based on closest feature space
  • Euclidean distance / Cosine similarity
  • majority voting
  • Advantage:
    • nonparametric
    • simple and easy to implement
    • one of the fastest ML algorithms to train (lazy: there is no explicit training phase)
    • does not divide the space linearly the way Rocchio's algorithm does
  • Disadvantage:
    • classification time is long
    • choosing the optimal k
    • uses all features: distances must be computed over every document feature
    • not robust to noise and irrelevant features
    • computationally expensive
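
A kNN sketch with cosine distance on TF-IDF vectors, using scikit-learn; k and the toy data are illustrative:

```python
# Lazy learner: "training" just stores the vectors; a new document is
# labelled by majority vote among its k nearest neighbours.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)
print(knn.predict(vec.transform(["the team scored a late goal"])))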

Bayesian classifier/Naive Bayes

  • probabilistic classifier
  • multivariate Bernoulli and multinomial model
  • spam filtering & email categorization
  • Advantage:
    • easy implementation & computation
    • needs only a small amount of training data
    • gives good results as long as the correct category is more probable than the other categories
  • Disadvantage:
    • poor performance when features are correlated (the conditional-independence assumption is violated)
    • relatively low classification performance compared to discriminative algorithms (e.g. SVM)
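
A multinomial Naive Bayes sketch on raw term counts with scikit-learn; swapping in BernoulliNB gives the multivariate Bernoulli event model. The toy data are illustrative:

```python
# P(category | document) is proportional to P(category) * product of
# P(term | category); the "naive" part is the term-independence assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB()  # BernoulliNB() would model term presence/absence instead
nb.fit(X, labels)
print(nb.predict(vec.transform(["cheap meeting pills"])))
```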

Decision Rule

  • rule-based inference
  • DNF model: the classifier is a disjunction of conjunctive clauses, e.g. if (wheat & farm) or (wheat & export) then WHEAT
  • inductive rule learning
  • heuristics to reduce number of features
  • Bottom up
  • Advantage:
    • ability to create local dictionary for each separate category
    • can give good effectiveness results
    • tend to be more compact classifiers than decision trees
  • Disadvantage:
    • inability to assign a document to a single category
    • need help from human experts
    • do not work well with large feature sets
    • knowledge acquisition bottleneck
    • rules must be manually defined/updated by knowledge engineer

Decision Tree

  • entropy criterion
  • Advantage:
    • easy to understand, interpret, and replicate
  • Disadvantage:
    • overfitting
  • Examples: ID3, C4.5, C5
  • top-down; divide and conquer
  • splits chosen by information gain (entropy)
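
A decision-tree sketch with the entropy (information gain) splitting criterion; ID3/C4.5 themselves are not in scikit-learn, so this uses its CART implementation with entropy, and the toy data and depth cap are illustrative:

```python
# Top-down divide and conquer: at each node, split on the feature that
# maximizes information gain (criterion="entropy").
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # depth cap limits overfitting
tree.fit(X, labels)
print(tree.predict(vec.transform(["cheap online offer"])))
```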

SVM

  • need both positive and negative training set
  • decision surface, hyper plane, support vector
  • generally binary
  • discriminative classification method
  • Advantage:
    • handles spaces with very large numbers of dimensions, giving the least important features little influence
    • term selection is often not needed
    • no parameter tuning on a validation set
  • Disadvantage:
    • high complexity of training and categorization algorithm
    • parameter optimization
    • kernel selection
    • need for both positive and negative training data
    • high time and memory consumption during training
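
A linear SVM sketch with scikit-learn's LinearSVC; a linear kernel is the usual choice for text, where documents are often near linearly separable in the high-dimensional term space. The toy data and C value are illustrative:

```python
# Discriminative classifier: finds the maximum-margin hyperplane separating
# the positive from the negative training documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

svm = LinearSVC(C=1.0)  # C trades margin width against training errors
svm.fit(X, labels)
print(svm.predict(vec.transform(["the market rallied"])))
```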

Neural Networks

  • input units correspond to terms (their number can be very large); output units correspond to categories
  • term weights assigned to input units, activation propagated forward
  • single-layer perceptron, multi-layer perceptron, BPNN, MBPNN
  • Advantage:
    • ability to work with large sets of features
    • able to classify correctly in the presence of noise
    • suitable for both discrete and continuous data
  • Disadvantage:
    • large computational cost
    • mysterious for typical user
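
A multi-layer perceptron sketch (scikit-learn's MLPClassifier, trained by backpropagation, as a stand-in for the BPNN variants above); the hidden-layer size and iteration count are arbitrary illustrative choices:

```python
# Term weights enter at the input units and activation propagates forward
# through a hidden layer to one output per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, labels)
print(mlp.predict(vec.transform(["the team won"])))
```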

Classifier Committee/Ensemble/Voting

  • idea: k experts may be better than one
  • combining the experts:
    • majority voting
    • weighted majority voting
    • dynamic classifier selection
  • boosting: train weak learners sequentially, each iteration focusing on the examples the previous learners handled poorly
  • Advantage:
    • profit from complementary strength
  • Disadvantages:
    • choice of the k classifiers
    • choice of combination function:
      • majority voting
      • weighted linear combination
      • dynamic classifier selection (use the classifier that did best on the validation set)
      • adaptive classifier combination (sum the judgements of all classifiers, weighted by their effectiveness on validation examples)
    • computationally expensive
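
A majority-voting committee sketch with scikit-learn's VotingClassifier over three of the classifiers above; the choice of members and the toy data are illustrative:

```python
# k experts, one vote each: the committee's label is the majority label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

committee = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("knn", KNeighborsClassifier(n_neighbors=3, metric="cosine")),
                ("svm", LinearSVC())],
    voting="hard",  # plain majority voting over the members' predictions
)
committee.fit(X, labels)
print(committee.predict(vec.transform(["shares closed lower"])))
```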

Other

Latent Semantic Indexing

Fuzzy correlation

  • can deal with fuzzy information or incomplete data
  • fuzzy SVM, fuzzy kNN, fuzzy rules

Genetic algorithm

  • aims to find optimal characteristic parameters
  • an adaptive, probabilistic global optimization algorithm

Evaluation

  • Precision & Recall: $Precision_i = \frac{TP_i}{TP_i + FP_i}$; $Recall_i = \frac{TP_i}{TP_i + FN_i}$
  • Micro-averaging vs Macro-averaging (see the sketch at the end of this section)

  • Combined effectiveness measures

    • eleven-point average precision
    • break-even point (where precision = recall)
    • $F_\beta$ measure: $F_\beta = \frac{(\beta^2 + 1) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$
  • Fallout = $\frac{FP_i}{FP_i + TN_i}$

  • Error = $\frac{FN_i + FP_i}{TP_i + FN_i + FP_i + TN_i}$
  • Accuracy = $\frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}$
    • not a good metric for TC: the trivial rejector scores well because $TN_i$ dominates when a category applies to few documents
  • Alternative measures to effectiveness

    • efficiency
    • utility
  • train-and-test

  • k-fold CV
  • test and validation set must be kept separate
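
A micro- vs macro-averaging sketch with scikit-learn metrics (the label vectors are made up): micro-averaging pools the per-category contingency tables before computing the measure, while macro-averaging averages the per-category measures.

```python
# Micro-averaging favours frequent categories; macro-averaging weights all
# categories equally, so rare categories count as much as common ones.
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]

for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = fbeta_score(y_true, y_pred, beta=1.0, average=avg)  # F_beta, beta = 1
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```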

Document collection

  • Reuters-21578
  • Reuters Corpus Version 1
  • 20 Newsgroups data
  • OHSUMED
  • AP collection

Memo

  • see Korde 2012 for table of classifier advantages & disadvantages
  • see Bilski 2011 for mini blurbs about description, advantages & disadvantages; like Khan et al 2010
  • see Khan et al 2010 for more embedded description, advantages & disadvantages
  • 6000+ citations for Sebastiani 2002