
Advanced normalization

CollateX default matching
Why you may want to override it
How to override it

CollateX default matching

Exact string matching (near matching is optional and off by default)
Tokenize by splitting on white space
Punctuation marks are individual tokens
No case normalization
No Unicode normalization
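
The default behavior listed above can be approximated with a short sketch. The helper name `default_like_tokenize` is hypothetical and the regular expression is an assumption made for illustration; it is not CollateX's own tokenizer code.

```python
import re

def default_like_tokenize(text):
    """Split on white space and treat each punctuation mark as its own token.

    Hypothetical helper for illustration only. It applies no case
    normalization and no Unicode normalization.
    """
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # character that is neither a word character nor white space.
    return re.findall(r"\w+|[^\w\s]", text)

print(default_like_tokenize("Hello, World!"))
```

Because nothing is normalized, `The` and `the` stay distinct tokens, and exact string matching will not align them.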

Sample normalization overrides

Case folding
Unicode normalization (precomposed characters)
Strip punctuation
Strip markup
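
A minimal sketch of the four overrides listed above, using only the Python standard library. The function name `normalize` is hypothetical, the markup pattern assumes simple angle-bracket tags, and only ASCII punctuation is stripped; a real project would adjust all three.

```python
import re
import string
import unicodedata

# Translation table that deletes ASCII punctuation (an assumption; extend as needed).
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def normalize(token):
    """Return a normalized shadow copy of a token (hypothetical helper)."""
    token = re.sub(r"<[^>]+>", "", token)        # strip markup before punctuation
    token = unicodedata.normalize("NFC", token)  # prefer precomposed characters
    token = token.casefold()                     # case folding
    return token.translate(PUNCTUATION_TABLE)    # strip punctuation

print(normalize("<hi>Cafe\u0301,</hi>"))
```

Markup is removed first so that the angle brackets are still present when the tag pattern is applied.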

Soundex

Designed for English-language surnames (patented 1918)
Algorithm (simplified)
1. Retain first letter
2. Delete other vowels
3. Degeminate
4. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
5. Truncate or zero-pad to four characters
Examples
- Birnbaum B-651 (also ✓Barenboim; also ✗Brumble)
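
The simplified algorithm above can be sketched as follows. This is an illustrative implementation rather than a reference one: the name `soundex` and the `length` parameter are choices made here, non-coded letters are all treated like vowels, and the result is written without the hyphen (B651 rather than B-651).

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name, length=4):
    """Return a simplified Soundex key such as 'B651'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()           # 1. retain first letter
    digits = []
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")          # vowels and other letters have no code
        if code and code != previous:    # skip repeats of the same code
            digits.append(code)
        if c not in "hw":                # h and w do not separate repeated codes
            previous = code
    # 5. truncate or zero-pad to the requested length
    return (first + "".join(digits))[:length].ljust(length, "0")

for surname in ("Birnbaum", "Barenboim", "Brumble"):
    print(surname, soundex(surname))
```

All three surnames receive the same key, which shows both the intended conflation (Barenboim) and the false positive (Brumble).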

Soundex assumptions

More nuanced than generic edit distance
- Edit distance (Levenshtein distance) counts deletions, insertions, and substitutions (Damerau-Levenshtein also counts transpositions)
Character differences are not all equivalent with respect to information load
- Consonants carry more information than vowels
Information load may be sensitive to position
- Beginning of word carries more information than end
- Especially true for lexical (not morphological) searching in inflected languages
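
For contrast, here is a sketch of generic Levenshtein distance using the standard dynamic-programming approach. The function name `levenshtein` is chosen here for illustration. Every edit costs 1 regardless of which character is affected or where it occurs, which is exactly the uniformity that the Soundex assumptions reject.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# A vowel change and a consonant change are weighted identically.
print(levenshtein("cat", "cot"), levenshtein("cat", "bat"))
```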

Adapting Soundex to Church Slavonic

Neutralize variant spellings of initial vowel
- оу,у,ꙋ=у
- ѡ,ꙍ,ѻ,о=о
Casefold, neutralize consonantal variants
- Not always one-to-one, e.g., щ = шт
Degeminate, delete other vowels, delete diacritics
- Keep two letters of two-letter words
- Higher information load
Other conflations?
- Knowledge based vs machine learning
Expand abbreviations?
- б҃га, бг҃а, б҃а = бога (бг)
Truncate or zero-pad?
- To what length?
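
One way to sketch the adaptation described above. Everything here is an assumption made for illustration: the function name `slavonic_key`, the variant and vowel tables (which are not exhaustive), the order of the steps, and the default length of 4. Abbreviation expansion and further consonant conflation are left out, so a form such as б҃а is not resolved.

```python
import re
import unicodedata

# Assumed, illustrative tables; a real project would need fuller ones.
VARIANTS = [
    ("оу", "у"),
    ("\ua64b", "у"),   # monograph uk
    ("\u0461", "о"),   # omega
    ("\ua64d", "о"),   # broad omega
    ("\u047b", "о"),   # round omega
    ("щ", "шт"),       # not one-to-one
]
VOWELS = set("аеиіоуыѣюяѥѧѩѫѭꙗєѵъь")

def slavonic_key(word, length=4):
    """Hypothetical Soundex-like key for a Church Slavonic word."""
    # Casefold, then decompose so diacritics become separate combining marks.
    word = unicodedata.normalize("NFD", word.casefold())
    # Delete all combining marks (titlo, breathings, accents).
    # Side effect: й decomposes to и plus breve, so й and и are conflated.
    word = "".join(c for c in word
                   if not unicodedata.category(c).startswith("M"))
    for variant, target in VARIANTS:
        word = word.replace(variant, target)
    word = re.sub(r"(.)\1+", r"\1", word)   # degeminate
    if len(word) > 2:                       # keep both letters of two-letter words
        word = word[0] + "".join(c for c in word[1:] if c not in VOWELS)
    return word[:length].ljust(length, "0")  # truncate or zero-pad

print(slavonic_key("б\u0483га"), slavonic_key("бога"))
```

Variant spellings are neutralized everywhere rather than only word-initially; this is harmless in the sketch because non-initial vowels are deleted afterward.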

Two types of normalization

Collation

Find alignment points
Coarse adjustments
No harm in conflating, e.g., imperfect and aorist or infinitive and supine

Evaluation

Alignment points are already known
Finer comparisons
May need to distinguish on the basis of small details
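
The two levels can be carried side by side in pretokenized input. The sketch below builds a structure in the shape of CollateX's pretokenized JSON input as I understand it, where each token has a `t` property (the reading as written) and an optional `n` property (the normalized form used for alignment). The helpers `coarse` and `to_witness` are hypothetical, and the call to CollateX itself is omitted so that the example runs without the library installed.

```python
import json

def coarse(token):
    """Hypothetical collation-level normalization: coarse on purpose."""
    return token.casefold()

def to_witness(siglum, text):
    """Build one witness with both the original reading and its shadow copy."""
    tokens = [{"t": token, "n": coarse(token)} for token in text.split()]
    return {"id": siglum, "tokens": tokens}

collation_input = {
    "witnesses": [
        to_witness("A", "The Cat sat"),
        to_witness("B", "the cat SAT"),
    ]
}

print(json.dumps(collation_input, indent=2))
```

Alignment would use the coarse `n` values, while the untouched `t` values remain available for finer comparison once the alignment points are known.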

Collation after Soundex

Greatly improved results
Utilize forced matches (B and D are aligned because A and C match on both sides)
- A B C
- A D C
Misses
- Gap in alignment (no forced match)
- Imperfect match
  - фраки ~ фраци
- CollateX recognizes only perfect matches
- Unable to recognize closest match (but see near matching)
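
To illustrate what recognizing the closest match could look like for a pair such as фраки ~ фраци, here is a sketch using Python's standard `difflib` module. It is not CollateX's near matching; the helper name `closest_match` and the cutoff of 0.6 are assumptions made for the example.

```python
import difflib

def closest_match(token, candidates, cutoff=0.6):
    """Return the candidate most similar to token, or None if none is close enough."""
    matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_match("фраки", ["слово", "фраци"]))
```

The similarity ratio for фраки and фраци is high because four of five characters line up, so the imperfect match is found even though the strings are not identical.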