Advanced normalization
- CollateX default matching
- Why you may want to override it
- How to override it
CollateX default matching
- Exact string matching (near matching is a separate option)
- Tokenize by splitting on white space
- Punctuation marks are individual tokens
- No case normalization
- No Unicode normalization
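The default behavior listed above can be approximated in a few lines of Python. This is a sketch only: the regular expression and the function names `tokenize` and `exact_match` are assumptions made for illustration, not CollateX's own code.

```python
import re

# Assumption: this pattern only approximates the behavior described above
# (split on white space, each punctuation mark is its own token).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return word tokens and single punctuation marks, dropping white space."""
    return TOKEN_RE.findall(text)


def exact_match(a, b):
    """Tokens match only if the two strings are identical."""
    return a == b


print(tokenize("The cat, the Cat."))
# ['The', 'cat', ',', 'the', 'Cat', '.']

# No case normalization: these do not match.
print(exact_match("The", "the"))             # False
# No Unicode normalization: precomposed vs. base + combining accent differ.
print(exact_match("\u00e9", "e\u0301"))      # False
```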
Sample normalization overrides
- Case folding
- Unicode normalization (precomposed characters)
- Strip punctuation
- Strip markup
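The four overrides above might be sketched as small Python functions that are chained into one normalizer. The function names are hypothetical, and the regex-based markup stripping is a deliberate simplification that assumes simple inline tags rather than full XML processing.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # assumption: simple, well-formed inline tags


def strip_markup(token):
    """Remove anything that looks like an XML/HTML tag."""
    return TAG_RE.sub("", token)


def strip_punctuation(token):
    """Drop characters whose Unicode category is a punctuation class (P*)."""
    return "".join(
        ch for ch in token if not unicodedata.category(ch).startswith("P")
    )


def to_precomposed(token):
    """Unicode normalization to precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", token)


def case_fold(token):
    """Case folding for caseless comparison."""
    return token.casefold()


def normalize(token):
    """Apply all four overrides; markup goes first so tags are still intact."""
    token = strip_markup(token)
    token = strip_punctuation(token)
    token = to_precomposed(token)
    return case_fold(token)


print(normalize("<hi>Caf</hi>e\u0301,"))  # café (with precomposed é)
```

Stripping markup before punctuation matters here, because a closing tag contains a slash that the punctuation step would otherwise remove.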
Soundex
- Devised for English-language surnames; patented in 1918
- Algorithm (simplified)
- Retain first letter
- Delete other vowels
- Degeminate
- Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
- Truncate or zero-pad to four characters
- Examples
- Birnbaum B-651 (also ✓Barenboim; also ✗Brumble)
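The simplified algorithm above can be sketched in Python as follows. Two details are assumptions that go beyond the outline: h and w are treated as transparent and y as a vowel, following common American Soundex practice, and the code is returned without the hyphen (B651 rather than B-651).

```python
# Letters grouped by phonetic similarity; vowels, h, w, and y have no code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code such as 'B651'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Degeminate: skip a code that repeats the immediately preceding one.
        if code and code != previous:
            digits.append(code)
        # Assumption: h and w do not separate repeated codes; vowels do.
        if ch not in "hw":
            previous = code
    # Retain the first letter, then truncate or zero-pad to four characters.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ("Birnbaum", "Barenboim", "Brumble"):
    print(name, soundex(name))  # all three yield B651
```

The Brumble result shows the cost of truncation: its fourth consonant code is cut off, so it lands in the same bucket as the two genuinely similar names.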
Soundex assumptions
- More nuanced than generic edit distance
- Edit distance (Levenshtein distance): deletion, insertion, substitution (Damerau-Levenshtein: transposition)
- Character differences are not all equivalent with respect to information load
- Consonants carry more information than vowels
- Information load may be sensitive to position
- Beginning of word carries more information than end
- Especially true for lexical (not morphological) searching in inflected languages
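To make the contrast with generic edit distance concrete, the following sketch computes Levenshtein distance and optionally weights substitutions. The `vowel_light` weighting (vowel-for-vowel substitutions cost half) is a hypothetical illustration of unequal information load, not a recommended value, and transposition (Damerau-Levenshtein) and position sensitivity are not modeled.

```python
VOWELS = set("aeiou")


def edit_distance(a, b, sub_cost=None):
    """Levenshtein distance; sub_cost(x, y) optionally weights substitutions."""
    if sub_cost is None:
        def sub_cost(x, y):
            return 1.0
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            cost = 0.0 if ca == cb else sub_cost(ca, cb)
            current.append(min(
                previous[j] + 1.0,        # deletion
                current[j - 1] + 1.0,     # insertion
                previous[j - 1] + cost,   # substitution or match
            ))
        previous = current
    return previous[-1]


def vowel_light(x, y):
    """Hypothetical weighting: swapping one vowel for another costs half."""
    return 0.5 if x in VOWELS and y in VOWELS else 1.0


# Generic edit distance treats every difference alike.
print(edit_distance("bat", "bet"), edit_distance("bat", "cat"))  # 1.0 1.0
# With weighting, the consonant difference counts for more.
print(edit_distance("bat", "bet", vowel_light),
      edit_distance("bat", "cat", vowel_light))                  # 0.5 1.0
```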
Adapting Soundex to Church Slavonic
- Neutralize variant spellings of initial vowel
- Casefold, neutralize consonantal variants
- Not always one-to-one, e.g., щ = шт
- Degeminate, delete other vowels, delete diacritics
- Keep two letters of two-letter words
- Higher information load
- Other conflations?
- Knowledge-based vs. machine learning
- Expand abbreviations? E.g., б҃га, бг҃а, б҃а = бога (бг)
- Truncate
- Zero-pad
- To what length?
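The steps above might be combined along the following lines. Everything in this sketch is an assumption made for illustration: the function name `slavonic_key`, the contents of the three lookup tables, the treatment of the jers as vowels, and the default length of four. A real adaptation would need tables built from the texts at hand.

```python
import unicodedata

# Illustrative tables only; none of these is an established standard.
ABBREVIATIONS = {"бга": "бога", "ба": "бога"}  # keyed without diacritics
CONSONANT_VARIANTS = {"щ": "шт", "ѕ": "з", "ѳ": "ф", "ѯ": "кс", "ѱ": "пс"}
INITIAL_VOWEL_VARIANTS = {
    "ѡ": "о", "ѻ": "о", "ѥ": "е", "є": "е",
    "ꙗ": "я", "ѧ": "я", "ѹ": "у", "ꙋ": "у",
}
VOWELS = set("аеиоуыэюяѣѧѩѫѭѥєꙗѡѻѹꙋіїъь")  # assumption: jers count as vowels


def strip_diacritics(word):
    """Compose what can be composed (so й survives), then drop combining marks."""
    composed = unicodedata.normalize("NFC", word)
    return "".join(ch for ch in composed if not unicodedata.combining(ch))


def slavonic_key(word, length=4):
    """Return a Soundex-like key for a Church Slavonic word (sketch)."""
    word = strip_diacritics(word.casefold())
    word = ABBREVIATIONS.get(word, word)          # expand known abbreviations
    if not word:
        return ""
    first = INITIAL_VOWEL_VARIANTS.get(word[0], word[0])
    if len(word) == 2:
        kept = first + word[1]                    # keep both letters
    else:
        kept = first
        previous = word[0]
        for ch in word[1:]:
            # Degeminate and delete non-initial vowels.
            if ch != previous and ch not in VOWELS:
                kept += ch
            previous = ch
    # Consonantal variants are not always one-to-one (щ becomes шт).
    kept = "".join(CONSONANT_VARIANTS.get(ch, ch) for ch in kept)
    return (kept + "0" * length)[:length]         # truncate or zero-pad


for form in ("б\u0483га", "бг\u0483а", "б\u0483а", "бога"):
    print(form, slavonic_key(form))  # each yields бг00
```

The `length` parameter is left open on purpose, since the right key length is one of the questions the outline raises.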
Two types of normalization
Collation
- Find alignment points
- Coarse adjustments
- No harm in conflating, e.g., imperfect and aorist or infinitive and supine
Evaluation
- Alignment points are already known
- Finer comparisons
- May need to distinguish on the basis of small details
Collation after Soundex
- Greatly improved results
- Utilize forced matches
- Misses
- Gap in alignment (no forced match)
- Imperfect match
- CollateX recognizes only perfect matches
- Unable to recognize closest match (but see near matching)
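One way to put a Soundex-style key to work is CollateX's pre-tokenized JSON input, in which each token carries the original reading as `t` and a normalized form as `n` that is used for matching. The sketch below only builds that structure; `crude_key` is a hypothetical stand-in for a real Soundex adaptation, and the witness texts are invented for illustration.

```python
import json

VOWELS = set("aeiou")


def crude_key(word, length=4):
    """Hypothetical stand-in key: keep first letter, drop later vowels, pad."""
    letters = [ch for ch in word.casefold() if ch.isalpha()]
    if not letters:
        return word  # e.g. a punctuation token is left unchanged
    kept = letters[0] + "".join(ch for ch in letters[1:] if ch not in VOWELS)
    return (kept + "0" * length)[:length]


def witness(siglum, text, key=crude_key):
    """Build one witness: 't' keeps the reading, 'n' holds the matching key."""
    return {
        "id": siglum,
        "tokens": [{"t": word, "n": key(word)} for word in text.split()],
    }


collation_input = {
    "witnesses": [
        witness("A", "Birnbaum played"),
        witness("B", "Barenboim played"),
    ]
}
print(json.dumps(collation_input, indent=2))

# Assumption: with the collatex Python package installed, a dictionary of
# this shape can be handed to collate(); check the documentation of your
# version for the exact options, including near matching.
```

Because the two surnames share the key `brnb`, they count as a perfect match on `n` for alignment, while the differing `t` values remain available for the finer comparison needed in evaluation.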