Document classification

  • data is stored in textual format
  • applications in information management and in document retrieval, routing, and filtering systems
    • document indexing based on a controlled vocabulary
    • document filtering: relevant vs irrelevant
    • automated metadata generation
    • word sense disambiguation
    • population of hierarchical catalogues of web resources
    • speech categorization
    • multimedia doc categorization
    • author identification
    • language identification
    • grouping of patients into categories
    • grouping of conference papers
  • Advantages of ML-based classification:
    • savings in expert labour
    • portability across the many possible application domains
    • document volumes are far too large for manual classification to be plausible
  • challenges:
    • semantics (e.g. homonyms)
    • the high-dimensionality problem
    • document representation, classifier construction, and the classification step itself
  • single-label vs multilabel
    • single-label (binary) is the more general case: a multilabel problem can be decomposed into independent binary problems, one per category
  • category-pivoted vs document-pivoted text classification
    • DP: given a document, find all categories it belongs to
    • CP: given a category, find all documents that belong to it
  • hard vs ranking classification
    • a hard binary decision vs a ranked output cut by a threshold
    • threshold obtained analytically vs experimentally

Representation

  • Transform the full text into a document vector
    • vector space model: each document is represented by a vector of words
    • Bag of words / vector space model
      • Disadvantages: high dimensionality & loss of information (correlations with adjacent words, semantic relationships)
      • partly mitigated by term weighting
  • types of document representation
    • N-gram
    • single terms
    • phrases
    • RDR (logic-based document representation)
  • Term weighting metrics (see the TF-IDF sketch after this list):

    • Boolean weighting
    • word frequency weighting
    • TF-IDF
    • entropy
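
A minimal TF-IDF sketch on a toy corpus, using scikit-learn's TfidfVectorizer; the corpus and the default settings here are illustrative, not from the notes:

```python
# TF-IDF: term frequency scaled by inverse document frequency, so terms that
# occur in few documents get higher weight than corpus-wide common terms.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()        # default smoothed idf: log((1+n)/(1+df)) + 1
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

for term, col in sorted(vectorizer.vocabulary_.items()):
    print(f"{term:>5}  idf = {vectorizer.idf_[col]:.3f}")
```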
  • semantic and ontology-based document representation

    • Advantages:
      • better classification when semantics are considered (via an ontology)
      • an ontology can provide expert, background knowledge
    • Disadvantages:
      • need for a large number of training text terms
      • translatability between languages
      • semantically representing text is itself a challenge
      • polysemy, synonymy
  • indexing: procedure that maps a text document into a compact representation of its content

    • lexical semantics: the choice of meaningful units
    • compositional semantics: how to combine those units
      • usually disregarded in TC; vector weights (e.g. TF-IDF) are used instead
  • Darmstadt indexing approach (DIA) vs bag-of-words approach
    • DIA: uses a wider set of features, e.g. the location of a term within the document

Pre-processing

  • Tokenization
  • Removing stop words
  • Stemming
    • morphological root
    • sometimes reported to hurt effectiveness
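
A minimal preprocessing sketch, pure Python plus NLTK's Porter stemmer; the stop-word set here is a tiny illustrative stand-in for a real list:

```python
# Tokenize, drop stop words, stem each remaining token to its morphological root.
import re
from nltk.stem import PorterStemmer  # pip install nltk

STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to", "in"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("The classifiers are trained on stemmed documents"))
# -> something like ['classifi', 'train', 'stem', 'document']
```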

Feature selection

  • dimensionality reduction: omitting unimportant features
    • feature extraction vs feature selection methods
    • prevent overfitting
  • Select a subset of features by some criterion; keep the words with the highest score according to a predetermined measure of importance (see the chi-square sketch after this list):
    • Information gain*
    • Term frequency
    • Chi-square*
    • Expected cross entropy
    • Odds Ratio
    • weight of evidence for text
    • Mutual information
    • Gini index
    • genetic algorithm (GA) optimization (?)
    • Gain ratio
    • Conditional mutual information
    • Doc frequency
    • Inverse doc frequency
    • Term
    • Weighted Ratio
    • DIA association factor
    • NGL coefficient
    • relevancy score
    • GSS coefficient

* denotes most commonly used
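
A filter-style feature-selection sketch using the chi-square criterion via scikit-learn's SelectKBest; the toy spam/ham data and k are illustrative:

```python
# Keep the k terms whose occurrence is most dependent on the class label,
# as measured by chi-square (a filter method: no classifier in the loop).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
kept = selector.get_support(indices=True)
print(vec.get_feature_names_out()[kept])  # the 3 most class-dependent terms
```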

  • wrapper vs filter methods
    • wrappers: use the classification accuracy of the learning algorithm itself as the evaluation function
      • not suitable for text classification because of the high dimensionality
    • filters: independent of the learning algorithm; use a separate evaluation metric
  • Other approaches
    • reduce the term set by a factor of 10
    • remove all terms occurring in at most x training documents
  • feature selection should consider domain and algorithm characteristics

Feature extraction

  • clustering
  • Latent semantic indexing (see the sketch after this list)
    • lower-dimensional space obtained by looking at patterns of term co-occurrence in the original dimensions
    • computed via SVD
    • brings out the latent semantics of the vocabulary
    • LSI vs $\chi^2$
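
An LSI sketch via truncated SVD of the TF-IDF matrix, using scikit-learn's TruncatedSVD; the corpus and the number of latent dimensions are illustrative:

```python
# Project documents from the term space into a low-dimensional latent space
# derived from term co-occurrence patterns (truncated SVD of TF-IDF matrix).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stocks fell on wall street", "shares and stocks declined",
        "the team won the match", "players celebrated the win"]

X = TfidfVectorizer().fit_transform(docs)  # documents x terms
lsi = TruncatedSVD(n_components=2)         # keep 2 latent dimensions
Z = lsi.fit_transform(X)                   # documents x 2
print(Z.round(2))  # related documents end up close in the latent space
```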
  • learning process: finding the attributes of examples that distinguish objects of separate classes
    • avoid overfitting; use cross-validation

Classifiers

Linear Least Square Fit

  • mapping approach: learns a linear mapping from input (document) vectors to output (category) vectors
  • trained on input/output vector pairs
  • a regression method
  • Disadvantage:
    • computational cost
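
A minimal LLSF sketch, reading it as multivariate linear regression: learn a matrix W mapping document vectors to category indicator vectors by least squares. The tiny dense matrices are illustrative:

```python
# Solve min_W ||X W - Y||^2: X holds document (term) vectors, Y holds
# one-hot category vectors; W then maps any new document to category scores.
import numpy as np

X = np.array([[2., 0., 1.],   # documents x terms (toy term counts)
              [0., 3., 1.],
              [1., 0., 2.]])
Y = np.array([[1., 0.],       # documents x categories (one-hot)
              [0., 1.],
              [1., 0.]])

W, *_ = np.linalg.lstsq(X, Y, rcond=None)
new_doc = np.array([1., 0., 1.])
print((new_doc @ W).round(2))  # category scores; take argmax or threshold
```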

Rocchio's Algorithm

  • originally developed for relevance feedback in IR
  • vector space method
  • each category is represented by its centroid (average vector)
  • inductive learning process
  • Advantage:
    • easy to implement
    • low computational cost/efficient
    • fast learner
    • more readily understandable by humans than a neural network
  • Disadvantage:
    • low classification accuracy when categories tend to occur in disjoint clusters
    • linear classification is too simple
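
A Rocchio-style sketch in numpy: classify by similarity to per-category centroids. Note this is the common centroid simplification; full Rocchio also subtracts a γ-weighted negative prototype built from non-member documents:

```python
# Each category's prototype is the mean of its training vectors; a new
# document gets the label of the most cosine-similar centroid.
import numpy as np

def train_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, centroids):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(x, centroids[c]))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy doc vectors
y = np.array(["sports", "sports", "finance", "finance"])

centroids = train_centroids(X, y)
print(classify(np.array([0.8, 0.2]), centroids))  # -> "sports"
```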

KNN

  • case-based / instance-based learning algorithm
  • example-based classifier; a lazy learner
  • based on closest feature space
  • Euclidean distance / Cosine similarity
  • majority voting
  • Advantage:
    • nonparametric
    • simple and easy to implement
    • one of the fastest ML algorithms to train (lazy: there is no explicit training phase)
    • does not divide the space linearly the way Rocchio's algorithm does
  • Disadvantage:
    • classification time is long
    • choosing the optimal k
    • uses all features: distances must be computed over every document feature
    • not robust to noise and irrelevant features
    • computationally expensive
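
A kNN sketch with cosine distance on TF-IDF vectors, using scikit-learn; k and the toy data are illustrative:

```python
# Lazy learner: "training" just stores the vectors; a new document is
# labelled by majority vote among its k nearest neighbours.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)
print(knn.predict(vec.transform(["the team scored a late goal"])))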

Bayesian classifier/Naive Bayes

  • probabilistic classifier
  • multivariate Bernoulli and multinomial model
  • spam filtering & email categorization
  • Advantage:
    • easy implementation & computation
    • needs only a small amount of training data
    • gives good results as long as the correct category is more probable than the other categories
  • Disadvantage:
    • poor performance when features are correlated (the conditional-independence assumption is violated)
    • relatively low classification performance compared to discriminative algorithms (e.g. SVM)
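
A multinomial Naive Bayes sketch on raw term counts with scikit-learn; swapping in BernoulliNB gives the multivariate Bernoulli event model. The toy data are illustrative:

```python
# P(category | document) is proportional to P(category) * product of
# P(term | category); the "naive" part is the term-independence assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB()  # BernoulliNB() would model term presence/absence instead
nb.fit(X, labels)
print(nb.predict(vec.transform(["cheap meeting pills"])))
```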

Decision Rule

  • rule-based inference
  • DNF model: the classifier is a disjunction of conjunctive clauses, e.g. if (wheat & farm) or (wheat & export) then WHEAT
  • inductive rule learning
  • heuristics to reduce number of features
  • Bottom up
  • Advantage:
    • ability to create local dictionary for each separate category
    • can give good effectiveness results
    • tend to be more compact classifiers than decision trees
  • Disadvantage:
    • inability to assign a document to a single category
    • need help from human experts
    • do not work well with large feature sets
    • knowledge acquisition bottleneck
    • rules must be manually defined/updated by knowledge engineer

Decision Tree

  • entropy criterion
  • Advantage:
    • easy to understand, interpret, and replicate
  • Disadvantage:
    • overfitting
  • Examples: ID3, C4.5, C5
  • top-down; divide and conquer
  • splits chosen by information gain (entropy)
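
A decision-tree sketch with the entropy (information gain) splitting criterion; ID3/C4.5 themselves are not in scikit-learn, so this uses its CART implementation with entropy, and the toy data and depth cap are illustrative:

```python
# Top-down divide and conquer: at each node, split on the feature that
# maximizes information gain (criterion="entropy").
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # depth cap limits overfitting
tree.fit(X, labels)
print(tree.predict(vec.transform(["cheap online offer"])))
```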

SVM

  • need both positive and negative training set
  • decision surface, hyper plane, support vector
  • generally binary
  • discriminative classification method
  • Advantage:
    • handles spaces with very large numbers of dimensions, giving the least important features little influence
    • term selection is often not needed
    • no parameter tuning on a validation set
  • Disadvantage:
    • high complexity of training and categorization algorithm
    • parameter optimization
    • kernel selection
    • need for both positive and negative training data
    • high time and memory consumption during training
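
A linear SVM sketch with scikit-learn's LinearSVC; a linear kernel is the usual choice for text, where documents are often near linearly separable in the high-dimensional term space. The toy data and C value are illustrative:

```python
# Discriminative classifier: finds the maximum-margin hyperplane separating
# the positive from the negative training documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

svm = LinearSVC(C=1.0)  # C trades margin width against training errors
svm.fit(X, labels)
print(svm.predict(vec.transform(["the market rallied"])))
```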

Neural Networks

  • input units correspond to terms (their number can be very large); output units correspond to categories
  • term weights assigned to input units, activation propagated forward
  • single-layer perceptron, multi-layer perceptron, BPNN, MBPNN
  • Advantage:
    • ability to work with large sets of features
    • able to classify correctly in the presence of noise
    • suitable for both discrete and continuous data
  • Disadvantage:
    • large computational cost
    • mysterious for typical user
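
A multi-layer perceptron sketch (scikit-learn's MLPClassifier, trained by backpropagation, as a stand-in for the BPNN variants above); the hidden-layer size and iteration count are arbitrary illustrative choices:

```python
# Term weights enter at the input units and activation propagates forward
# through a hidden layer to one output per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, labels)
print(mlp.predict(vec.transform(["the team won"])))
```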

Classifier Committee/Ensemble/Voting

  • idea: k experts may be better than one
  • combining the experts:
    • majority voting
    • weighted majority voting
    • dynamic classifier selection
  • boosting: train weak learners sequentially, each iteration focusing on the examples the previous learners handled poorly
  • Advantage:
    • profit from complementary strength
  • Disadvantages:
    • choice of the k classifiers
    • choice of combination function:
      • majority voting
      • weighted linear combination
      • dynamic classifier selection (use the classifier that did best on the validation set)
      • adaptive classifier combination (sum the judgements of all classifiers, weighted by their effectiveness on validation examples)
    • computationally expensive
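
A majority-voting committee sketch with scikit-learn's VotingClassifier over three of the classifiers above; the choice of members and the toy data are illustrative:

```python
# k experts, one vote each: the committee's label is the majority label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

docs = ["goal scored in the final", "team wins the league",
        "shares fell sharply", "market closed lower"]
labels = ["sport", "sport", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

committee = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("knn", KNeighborsClassifier(n_neighbors=3, metric="cosine")),
                ("svm", LinearSVC())],
    voting="hard",  # plain majority voting over the members' predictions
)
committee.fit(X, labels)
print(committee.predict(vec.transform(["shares closed lower"])))
```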

Other

Latent Semantic Indexing

Fuzzy correlation

  • can deal with fuzzy information or incomplete data
  • fuzzy SVM, fuzzy kNN, fuzzy rules

Genetic algorithm

  • aims to find optimal characteristic parameters
  • an adaptive, probabilistic global optimization algorithm

Evaluation

  • Precision & Recall: $Precision_i = \frac{TP_i}{TP_i + FP_i}$; $Recall_i = \frac{TP_i}{TP_i + FN_i}$
  • Micro-averaging vs Macro-averaging (see the sketch at the end of this section)

  • Combined effectiveness measures

    • eleven-point average precision
    • break-even point (where precision = recall)
    • $F_\beta$ measure: $F_\beta = \frac{(\beta^2 + 1) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$
  • Fallout = $\frac{FP_i}{FP_i + TN_i}$

  • Error = $\frac{FN_i + FP_i}{TP_i + FN_i + FP_i + TN_i}$
  • Accuracy = $\frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}$
    • not a good metric for TC: the trivial rejector scores well because $TN_i$ dominates when a category applies to few documents
  • Alternative measures to effectiveness

    • efficiency
    • utility
  • train-and-test

  • k-fold CV
  • test and validation set must be kept separate
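
A micro- vs macro-averaging sketch with scikit-learn metrics (the label vectors are made up): micro-averaging pools the per-category contingency tables before computing the measure, while macro-averaging averages the per-category measures.

```python
# Micro-averaging favours frequent categories; macro-averaging weights all
# categories equally, so rare categories count as much as common ones.
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]

for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = fbeta_score(y_true, y_pred, beta=1.0, average=avg)  # F_beta, beta = 1
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```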

Document collection

  • Reuters-21578
  • Reuters Corpus Version 1
  • 20 Newsgroups data
  • OHSUMED
  • AP collection

Memo

  • see Korde 2012 for table of classifier advantages & disadvantages
  • see Bilski 2011 for mini blurbs about description, advantages & disadvantages; like Khan et al 2010
  • see Khan et al 2010 for more embedded description, advantages & disadvantages
  • 6000+ citations for Sebastiani 2002