[pt] Improve degree sign and ordinal indicator detection rules #9975
Probably still not perfect, but first let's get rid of the low-hanging-fruit false positives (FPs):
- make a systematic distinction between temperature scales (which require a space before the degree sign) and cardinal points (which require the space between the degree sign and the abbreviation): `90 °F` but `90° N`, 'noventa graus Fahrenheit' but 'noventa graus ao norte';
- improve, somewhat, the recognition of coordinates like `90° 45′ 22″`, using the prime/double prime characters for angular minutes/seconds, but this may require some more tinkering (and potentially also different tokenisation, which I'm not keen on...);
- crucially, I've added `o` to the ordinal/degree detection rule – the nightly corpus reveals many instances of `100o` clearly indicating degrees; once this is working, the `ORDINAL_ABBREVIATION` rules should be easier to work with, as we will have eliminated many FPs (and these rules are, crucially, the last ones of this group to run!).

As a small aside, users' habit of replacing the degree sign with lowercase `o` may come from some kind of automatic MS Word rule that converts it to the ordinal indicator after `\d`, since they, uh, look similar. It would be lovely to eliminate this habit somehow, and the first results on the live degree/ordinal rules show very high acceptance rates.
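
For readers who want to see the spacing conventions above in executable form, here is a rough Python sketch. It is not the rule code from this PR (the actual rules live in the project's rule files): the function names, the regexes, and the letter sets for scales (`C`, `F`) and cardinal points (`N`, `S`, `E`, `W`, `O`, `L`) are all assumptions made for illustration.

```python
import re

# Assumption: a "degree mark" is either the real degree sign (optionally
# preceded by whitespace) or a lowercase "o" glued directly to the number.
_DEGREE_MARK = r"(?:\s*°|o)"

# Temperature scales: the space goes BEFORE the degree sign (90 °F).
_TEMPERATURE = re.compile(rf"(\d+){_DEGREE_MARK}\s*([CF])\b")

# Cardinal points: the space goes AFTER the degree sign (90° N).
# The letter set is a guess covering Portuguese and English abbreviations.
_CARDINAL = re.compile(rf"(\d+){_DEGREE_MARK}\s*([NSEWOL])\b")

# A number immediately followed by a bare "o": may be a degree sign or an
# ordinal indicator, so this sketch only flags it and does not decide.
_BARE_O = re.compile(r"\b\d+o\b")

# Coordinates written with prime / double prime for minutes and seconds.
_DMS = re.compile(r"\d{1,3}°\s?\d{1,2}′\s?\d{1,2}(?:[.,]\d+)?″")


def fix_degree_spacing(text: str) -> str:
    """Normalise spacing around the degree sign for scales and cardinal points."""
    text = _TEMPERATURE.sub(r"\1 °\2", text)
    text = _CARDINAL.sub(r"\1° \2", text)
    return text


def find_bare_o(text: str) -> list[str]:
    """Return number+'o' tokens that a degree/ordinal rule would need to inspect."""
    return _BARE_O.findall(text)


def looks_like_dms(text: str) -> bool:
    """True if the whole string is a degrees-minutes-seconds coordinate."""
    return _DMS.fullmatch(text) is not None


if __name__ == "__main__":
    print(fix_degree_spacing("Fez 30°C em 45 °S"))
    print(find_bare_o("A água ferve a 100o"))
    print(looks_like_dms("90° 45′ 22″"))
```

The two substitutions do not interfere with each other because the scale letters and the cardinal letters are disjoint in this sketch. Real rules would work on tokens rather than raw strings, which is where the tokenisation caveat mentioned above comes in.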