fix(dict): Remove unsure corrections
The typo dictionary words.csv previously contained
several problematic entries, such as:

    abouta,about
    algorithmi,algorithm
    attachen,attach
    shouldbe,should

These resulted in wrong corrections whenever one of the following
spaces (indicated by ␣) was accidentally omitted:

    about␣a
    algorithm␣i developed
    attach␣en masse
    should␣be

Many of these entries were introduced by taking entries from the
codespell-dict and removing corrections containing spaces (since typos
currently doesn't support them), e.g. the codespell dictionary contains:

    abouta->about a, about,
    shouldbe->should, should be,

This commit updates `tests/verify.rs` to automatically remove
entries in the form of `{correction}{common_word},{correction}`,
where `{common_word}` is one of the 1000 most frequent English words.
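
The rule above could be sketched roughly as follows. This is only an
illustration, not the code in `tests/verify.rs`: the function names
`is_missed_space` and `remove_unsure` are hypothetical, and the sketch
assumes the dictionary has already been parsed into `(typo, correction)`
pairs and the frequent-word list into a set.

```rust
use std::collections::HashSet;

/// Returns true when `typo` is exactly `correction` followed by a common
/// word, e.g. "abouta" = "about" + "a". (Hypothetical helper.)
fn is_missed_space(typo: &str, correction: &str, common_words: &HashSet<&str>) -> bool {
    typo.strip_prefix(correction)
        .map_or(false, |rest| common_words.contains(rest))
}

/// Keeps only the `(typo, correction)` pairs that do not look like a
/// missed space between the correction and a common word.
/// (Hypothetical helper; assumes entries are already parsed.)
fn remove_unsure<'a>(
    entries: &[(&'a str, &'a str)],
    common_words: &HashSet<&str>,
) -> Vec<(&'a str, &'a str)> {
    entries
        .iter()
        .copied()
        .filter(|(typo, correction)| !is_missed_space(typo, correction, common_words))
        .collect()
}
```

With a word set containing `a` and `be`, `remove_unsure` would drop
`abouta,about` and `shouldbe,should`, while an entry whose typo does not
start with its correction is kept.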

The top-1000-most-frequent-words.csv file was generated by running:

    curl https://norvig.com/ngrams/count_1w.txt \
      | head -n1024 \
      | awk '{print $1;}' \
      | grep -vE '^([^ia]|al|re)$' \
      > top-1000-most-frequent-words.csv
not-my-profile committed Aug 7, 2023
1 parent 41ce6be commit ec32cf5
Showing 4 changed files with 1,122 additions and 369 deletions.