Document classification
- data is stored in textual format
- applications in information management: document retrieval, routing, and filtering systems
- document indexing based on a controlled vocabulary
- document filtering: relevant vs irrelevant
- automated metadata generation
- word sense disambiguation
- population of hierarchical catalogues of web resources
- speech categorization
- multimedia doc categorization
- author identification
- language identification
- sorting patients into categories
- grouping of conference papers
- Advantages of ML-based classification:
- savings in expert labour
- portability across domains (there are many)
- far too many documents for manual classification to be plausible
- challenges:
- semantic ambiguity (e.g. homonyms)
- high-dimensionality problem
- document representation, construction, classification
- single vs multilabel
- category-pivoted vs document-pivoted text classification
- DP: given a document, find all categories it belongs to
- CP: given a category, find all documents that belong to it
- hard vs ranking classification
- binary decision vs thresholding a ranking score
- threshold obtained analytically vs experimentally
Representation
Pre-processing
- Tokenization
- Removing stop words
- Stemming (see the sketch after this list)
- reduces words to their morphological root
- sometimes reported to hurt effectiveness
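A minimal pre-processing sketch in Python (assumes NLTK for the Porter stemmer; the stop-word list is a tiny illustrative subset, not a standard one):

```python
# Pre-processing: tokenize, drop stop words, stem to morphological roots.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "by", "their"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # stemming

print(preprocess("The documents are classified by their topics"))
# e.g. ['document', 'classifi', 'topic']
```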
Feature selection
- dimensionality reduction: omitting unimportant features
- feature extraction vs feature selection methods
- prevent overfitting
- select a subset of features by some criterion: keep the words with the highest scores according to a predetermined measure of importance (a chi-square sketch follows this list)
- Information gain*
- Term frequency
- Chi-square*
- Expected cross entropy
- Odds Ratio
- weight of evidence for text
- Mutual information
- Gini index
- genetic algorithm (GA) optimization (?)
- Gain ratio
- Conditional mutual information
- Doc frequency
- Inverse doc frequency
- Term
- Weighted Ratio
- DIA association factor
- NGL coefficient
- relevancy score
- GSS coefficient
* denotes most commonly used
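A filter-style selection sketch using the chi-square measure (assumes scikit-learn; the toy corpus, labels, and k are illustrative):

```python
# Score every term with chi-square against the labels and keep the top k.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap meds now", "project meeting tomorrow"]
labels = [1, 0, 1, 0]                              # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)                        # bag-of-words counts
selector = SelectKBest(chi2, k=4).fit(X, labels)   # keep 4 highest-scoring terms
kept = selector.get_support(indices=True)
print(vec.get_feature_names_out()[kept])
```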
- wrapper vs filter methods
- wrappers: use the classification accuracy of the learning algorithm as the evaluation function
- not suitable for text classification: retraining the classifier for each candidate feature subset is too expensive in high-dimensional spaces
- filters: independent of the learning algorithm; score features with a separate evaluation metric
- Other approaches
- reduce term set by factor of 10
- remove all terms occurring in at most x training documents
- feature selection should consider domain and algorithm characteristics
- clustering
- Latent semantic indexing
- lower-dimensional space obtained from patterns of term co-occurrence in the original dimensions
- computed via SVD (singular value decomposition)
- brings out the latent semantics of the vocabulary
- LSI vs $\chi^2$ (see the sketch below)
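A minimal LSI sketch via truncated SVD (assumes scikit-learn; the corpus and the number of latent dimensions are illustrative):

```python
# Project tf-idf vectors onto a low-rank "latent topic" space with SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["car engine repair", "automobile motor service",
        "stock market prices", "share price index"]
X = TfidfVectorizer().fit_transform(docs)   # original high-dimensional space
lsi = TruncatedSVD(n_components=2)          # keep the top-2 latent dimensions
Z = lsi.fit_transform(X)                    # co-occurrence patterns -> latent space
print(Z.shape)                              # (4, 2)
```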
- learning process: finding the attributes of the examples that distinguish objects of separate classes
- avoid overfitting; do cross-validation
Classifiers
Linear Least Square Fit
- mapping approach: learns a linear mapping from input (term) vectors to output (category) vectors
- trained on input/output vector pairs
- regression method (see the sketch below)
- Disadvantage:
- high cost of computing the mapping matrix (typically via SVD)
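A minimal LLSF sketch (assumes NumPy; the term and category matrices are toy stand-ins): learn a matrix W minimizing $\|XW - Y\|_F$, then classify by the largest output component.

```python
# LLSF: least-squares fit of a linear map from term space to category space.
import numpy as np

X = np.array([[2., 0., 1.],    # term-frequency vectors (docs x terms)
              [0., 3., 1.],
              [1., 0., 2.]])
Y = np.array([[1., 0.],        # one-hot category vectors (docs x categories)
              [0., 1.],
              [1., 0.]])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solve the regression
scores = np.array([1., 0., 1.]) @ W        # map a new document to category space
print(scores.argmax())                     # index of the predicted category
```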
Rocchio's Algorithm
- for use in relevance feedback in IR
- vector space method
- centroid; average vector
- inductive learning process
- Advantage:
- easy to implement
- low computational cost/efficient
- fast learner
- more readily understandable by humans than a neural network
- Disadvantage:
- low classification accuracy when categories tend to occur in disjoint clusters
- linear classification is too simple for such cases (a centroid sketch follows this list)
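A Rocchio-style centroid sketch (assumes NumPy; category names and vectors are toy data):

```python
# One average ("centroid") vector per category; classify by cosine similarity.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

train = {"sports":  np.array([[3., 0., 1.], [2., 1., 0.]]),
         "finance": np.array([[0., 2., 3.], [1., 3., 2.]])}
centroids = {c: vecs.mean(axis=0) for c, vecs in train.items()}

doc = np.array([2., 0., 0.])
print(max(centroids, key=lambda c: cosine(doc, centroids[c])))  # -> 'sports'
```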
KNN
- case-based / instance-based learning algorithm
- example-based classifier; a "lazy learner"
- based on closest feature space
- Euclidean distance / Cosine similarity
- majority voting
- Advantage:
- nonparametric
- simple and easy to implement
- one of the fastest ML algorithms to train (no explicit training phase)
- does not divide space linearly like Rocchio's
- Disadvantage:
- classification time is long
- optimal k?
- uses all features
- need to compute distances using all document features
- not robust to noise and irrelevant features
- computationally expensive at classification time (see the sketch below)
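A kNN sketch with cosine distance and majority voting (assumes scikit-learn; corpus, labels, and k = 3 are illustrative):

```python
# Lazy learner: no training beyond storing vectors; vote among the 3 nearest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["goal match team", "striker scores goal", "budget deficit tax",
        "tax policy vote", "team wins match"]
labels = ["sports", "sports", "politics", "politics", "sports"]

vec = TfidfVectorizer()
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(vec.fit_transform(docs), labels)
print(knn.predict(vec.transform(["the team scores a goal"])))  # likely ['sports']
```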
Bayesian classifier/Naive Bayes
- probabilistic classifier
- multivariate Bernoulli and multinomial model
- spam filtering & email categorization
- Advantage:
- easy implementation & computation
- need small amount of training data
- gives good results as long as correct category is more probable than other categories
- Disadvantage:
- poor performance when features are correlated (the independence assumption is violated)
- relatively low classification performance compared to discriminative algorithms such as SVM (see the sketch below)
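A multinomial Naive Bayes sketch in the spam-filtering setting (assumes scikit-learn; toy corpus; BernoulliNB would give the multivariate Bernoulli variant):

```python
# Multinomial model over term counts; swap in BernoulliNB for the Bernoulli model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "free money offer", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(docs), labels)
print(nb.predict(vec.transform(["free money now"])))  # likely ['spam']
```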
Decision Rule
- rule-based inference
- DNF (disjunctive normal form) model
- inductive rule learning
- heuristics to reduce number of features
- Bottom up
- Advantage:
- ability to create local dictionary for each separate category
- can give good effectiveness results
- tend to be more compact classifiers than decision trees
- Disadvantage:
- inability to assign a document to a single category
- need help from human experts
- do not scale well to large feature sets
- knowledge acquisition bottleneck
- rules must be manually defined/updated by a knowledge engineer (a toy DNF rule follows this list)
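A toy DNF rule (the WHEAT rule below is purely illustrative):

```python
# A category is assigned if any conjunction of term tests fires (DNF).
def is_wheat(terms: set) -> bool:
    # (wheat AND farm) OR (wheat AND commodity) -> WHEAT
    return ("wheat" in terms and "farm" in terms) or \
           ("wheat" in terms and "commodity" in terms)

print(is_wheat({"wheat", "farm", "price"}))  # True
print(is_wheat({"corn", "commodity"}))       # False
```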
Decision Tree
- entropy criterion
- Advantage:
- easy to understand, interpret, and replicate
- Examples: ID3, C4.5, C5
- top-down; divide and conquer
- splits chosen by information gain or entropy (see the sketch below)
- Disadvantage:
- prone to overfitting without pruning
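A decision-tree sketch with the entropy criterion (assumes scikit-learn; toy data; ID3/C4.5/C5 are separate implementations, but criterion="entropy" uses the same information-gain idea):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["ball game score", "election vote poll", "score update game", "vote count poll"]
labels = ["sports", "politics", "sports", "politics"]

vec = CountVectorizer()
tree = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
tree.fit(vec.fit_transform(docs), labels)
print(tree.predict(vec.transform(["final game score"])))  # likely ['sports']
```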
SVM
- decision surface: a hyperplane defined by the support vectors
- generally binary
- discriminative classification method
- Advantage:
- copes well with spaces of very many dimensions; effectively discounts the least important features
- term selection is often not needed
- no parameter tuning on validation set
- Disadvantage:
- high complexity of training and categorization algorithm
- parameter optimization
- kernel selection
- need for both positive and negative training data
- high time and memory consumption during training (see the sketch below)
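A linear SVM sketch (assumes scikit-learn; toy corpus): LinearSVC fits the separating hyperplane, and note it still needs examples of both classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["stock prices fell", "team won the match",
        "market shares rose", "player scored twice"]
labels = ["finance", "sports", "finance", "sports"]

vec = TfidfVectorizer()
svm = LinearSVC().fit(vec.fit_transform(docs), labels)    # find the hyperplane
print(svm.predict(vec.transform(["shares and prices"])))  # likely ['finance']
```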
Neural Networks
- inputs = terms (there can be very many); outputs = categories
- term weights assigned to input units, activation propagated forward
- single-layer perceptron, multi-layer perceptron, BPNN, MBPNN
- Advantage:
- ability to work with large sets of features
- can still classify correctly in the presence of noise
- suitable for both discrete and continuous data
- Disadvantage:
- large computational cost
- a black box, mysterious to the typical user (see the sketch below)
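A multi-layer perceptron sketch (assumes scikit-learn; the architecture and toy data are illustrative; inputs are term weights, outputs are categories):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = ["rain storm wind", "sunny warm clear", "storm flood rain", "clear warm sky"]
labels = ["bad_weather", "good_weather", "bad_weather", "good_weather"]

vec = TfidfVectorizer()
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(vec.fit_transform(docs), labels)              # backpropagation training
print(mlp.predict(vec.transform(["wind and rain"])))  # likely ['bad_weather']
```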
Classifier Committee/Ensemble/Voting
- idea: k experts may be better than one
- combining the experts:
- majority voting
- weighted majority voting
- dynamic classifier selection
- boosting: train weak learners sequentially, each benefiting from the previous iteration
- Advantage:
- profits from the complementary strengths of the member classifiers
- Disadvantages:
- choice of the k classifiers
- choice of the combination function:
- majority voting
- weighted linear combination
- dynamic classifier selection (choose the classifier that did best on the validation set)
- adaptive classifier combination (sum the judgements of all classifiers, weighted by their effectiveness on validation examples)
- computationally expensive (a voting sketch follows this list)
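A hard-voting committee sketch combining three different classifiers (assumes scikit-learn; the toy corpus and member choice are illustrative):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

docs = ["goal team win", "tax vote law", "match score goal", "law policy tax"]
labels = ["sports", "politics", "sports", "politics"]

vec = TfidfVectorizer()
committee = VotingClassifier([("nb", MultinomialNB()),
                              ("knn", KNeighborsClassifier(n_neighbors=1)),
                              ("lr", LogisticRegression())],
                             voting="hard")   # majority voting
committee.fit(vec.fit_transform(docs), labels)
print(committee.predict(vec.transform(["team wins the match"])))  # likely ['sports']
```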
Other
Latent Semantic Indexing
Fuzzy correlation
- can deal with fuzzy information or incomplete data
- fuzzy SVM, fuzzy kNN, fuzzy rules
Genetic algorithm
- aims to find optimal characteristic parameters
- an adaptive, probabilistic, global optimization algorithm (toy sketch below)
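A toy GA sketch for feature-subset selection (everything here, including the stand-in fitness function, is an illustrative assumption, not a standard API):

```python
# GA over bit masks of kept features: selection + crossover + mutation.
import random
random.seed(0)

N_FEATURES, POP, GENS = 10, 8, 20

def fitness(mask):
    # Hypothetical stand-in: reward keeping "useful" features 0-4, penalize size.
    return sum(mask[:5]) - 0.2 * sum(mask)

def mutate(mask):
    i = random.randrange(N_FEATURES)
    return mask[:i] + [1 - mask[i]] + mask[i + 1:]

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                         # selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]  # crossover + mutation
    pop = parents + children
print(max(pop, key=fitness))                         # best feature mask found
```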
Evaluation
- Precision & Recall
Micro-averaging vs Macro-averaging
Combined effectiveness measures
- eleven-point average precision
- break-even point
- $F_\beta$ measure
- Fallout = $\frac{FP_i}{FP_i + TN_i}$
- Error = $\frac{FN_i + FP_i}{TP_i + FN_i + FP_i + TN_i}$
- Accuracy = $\frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}$
- not a good metric: a trivial rejector does well because most documents belong to few categories (see the sketch below)
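An evaluation sketch showing precision/recall, micro vs macro averaging, and $F_\beta$ (assumes scikit-learn; labels are illustrative):

```python
from sklearn.metrics import fbeta_score, precision_recall_fscore_support

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

# micro pools all decisions; macro averages per-category scores
for avg in ("micro", "macro"):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(avg, round(p, 2), round(r, 2), round(f, 2))

# F_beta weights recall beta times as much as precision (beta = 1: harmonic mean)
print(fbeta_score(y_true, y_pred, beta=1, average="macro"))
```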
Alternative measures to effectiveness
train-and-test
- k-fold CV
- test and validation sets must be kept separate (see the sketch below)
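A k-fold cross-validation sketch (assumes scikit-learn; 5 folds on a toy set; the pipeline re-fits the vectorizer inside each fold so test data never leaks into training):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["goal match", "tax law", "team score", "vote poll", "win game",
        "policy debate", "final score", "election law", "match highlights",
        "budget vote"]
labels = ["s", "p", "s", "p", "s", "p", "s", "p", "s", "p"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
print(cross_val_score(model, docs, labels, cv=5))  # one accuracy per fold
```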
Document collection
- Reuters-21578
- Reuters Corpus Version 1
- 20 Newsgroups data
- OHSUMED
- AP collection
Memo
- see Korde 2012 for table of classifier advantages & disadvantages
- see Bilski 2011 for mini blurbs about description, advantages & disadvantages; like Khan et al 2010
- see Khan et al 2010 for more embedded description, advantages & disadvantages
- Sebastiani 2002 alone has 6000+ citations