CER calculation
Mike Gerber edited this page Nov 12, 2020 · 1 revision
- We treat grapheme clusters as characters
  - Reasoning: This is what users commonly perceive as characters. We cannot simply count code points because some grapheme clusters, such as LATIN SMALL LETTER M followed by COMBINING TILDE, cannot be represented by a single code point.
- We count whitespace
  - Reasoning: A missing space is an error
- We count punctuation
  - Reasoning: A missing period (.) or wrong hyphen (-) is an error
- Normalization
  - We normalize MUFI PUA characters to their canonically equivalent Unicode representations
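
The rules above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated on this page: CER is taken to be the Levenshtein distance between the two cluster sequences divided by the number of clusters in the reference; grapheme clusters are only approximated as a base code point plus any following combining marks (a real implementation would follow the full Unicode segmentation rules); and NFC is used as a stand-in for normalization, since the MUFI PUA mapping is not reproduced here. The function names `clusters`, `edit_distance` and `cer` are hypothetical.

```python
import unicodedata


def clusters(text):
    """Split text into approximate grapheme clusters.

    Simplification (assumption): a cluster is one base code point plus
    any following combining marks (Unicode categories Mn, Mc, Me).
    This is not the full Unicode grapheme cluster algorithm.
    """
    result = []
    # NFC is an assumed stand-in; MUFI PUA handling is not shown here.
    for ch in unicodedata.normalize("NFC", text):
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch  # attach the combining mark to its base
        else:
            result.append(ch)
    return result


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, counted in grapheme clusters.

    Whitespace and punctuation are ordinary clusters, so a missing
    space or a wrong hyphen counts as one error each.
    """
    ref = clusters(reference)
    hyp = clusters(hypothesis)
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(ref, hyp) / len(ref)
```

With these definitions, `clusters("m\u0303")` yields a single cluster even though the string holds two code points, and `cer("a b.", "ab.")` counts the missing space as one error out of four reference characters.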