Tokenizer treats an alphabetic character as a word-delimiter #300

martindholmes · 2024-05-01T03:43:29Z

The codepoint U+A78F:

https://util.unicode.org/UnicodeJsps/character.jsp?a=A78F

(LATIN LETTER SINOLOGICAL DOT) is in the Latin Extended-D block, is Alphabetic, and has the general category Other_Letter (Lo). Wikipedia explains: "A middot may be used as a consonant or modifier letter, rather than as punctuation, in transcription systems and in language orthographies. For such uses Unicode provides the code point U+A78F ꞏ LATIN LETTER SINOLOGICAL DOT."

It's being proposed for use in this way (as a consonant to signal length) in Wendat orthography. However, our tokenizer currently treats it as a word-break character; I think this is a bug. The cause could be the regex in the tokenizer or Java's Unicode regex handling; the character was only added in Unicode 8.0 (2015), so the problem may just be that the code hasn't caught up. If so, I think we should special-case it.
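For anyone who wants to double-check the character's properties independently of Saxon and Java, here is a minimal Python sketch. It relies only on the standard-library `unicodedata` module and assumes a Python build whose Unicode tables are at version 8.0 or later; the variable names are just for illustration.

```python
import unicodedata

# U+A78F LATIN LETTER SINOLOGICAL DOT, written as an escape so the
# source does not depend on the editor's handling of the glyph.
DOT = "\ua78f"

name = unicodedata.name(DOT)          # official character name
category = unicodedata.category(DOT)  # two-letter general category
is_alpha = DOT.isalpha()              # Python's own "is this a letter" test

print(unicodedata.unidata_version, name, category, is_alpha)
```

If the tables are recent enough, the category reported should be `Lo`, which is what a `\p{L}` test ought to accept.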

martindholmes · 2024-05-06T15:48:04Z

This seems to be a bug in Saxon or Java, because both of these test false:

matches('ꞏ', '\p{L}')
matches('ꞏ', '\p{L}')

I think the best thing to do for now is to add this character explicitly to the regex for alphanumerics.
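As a rough illustration of the workaround (not the actual tokenizer regex, which lives in the XSLT), the sketch below uses Python. The ASCII-only character class is a hypothetical stand-in for "an alphanumeric class whose tables don't know U+A78F", and the sample string is an arbitrary test value, not real Wendat.

```python
import re

SINOLOGICAL_DOT = "\ua78f"  # U+A78F

# Hypothetical stand-in for a word-character class that does not
# recognise U+A78F as a letter (ASCII letters and digits only).
STALE_WORD = re.compile(r"[A-Za-z0-9]+")

# The same class with the character added explicitly.
PATCHED_WORD = re.compile("[A-Za-z0-9" + SINOLOGICAL_DOT + "]+")


def tokenize(text, word_pattern):
    """Return the runs of word characters found in text."""
    return word_pattern.findall(text)


sample = "a\ua78fb cd"  # arbitrary test string containing the dot

stale_tokens = tokenize(sample, STALE_WORD)      # dot acts as a delimiter
patched_tokens = tokenize(sample, PATCHED_WORD)  # dot stays inside the word

print(stale_tokens, patched_tokens)
```

With the unpatched class the first word is split in two at the dot; with the patched class it stays whole, which is the behaviour the fix is after.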

martindholmes · 2024-05-06T16:06:21Z

Fix and test for it committed in branch iss-300-sindot. PR #301 created.

Fix with accompanying test for issue #300, mishandling of U_A78F.

martindholmes · 2024-05-06T19:53:58Z

Martin Honnen pointed me at the Saxon documentation which says that it's still using Unicode 6 tables:

https://www.saxonica.com/html/documentation12/conformance/xpath31.html

So that would explain it, if the documentation is up to date.
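The effect of stale tables can be seen by analogy in Python, which happens to ship a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) alongside its current one. This is only an analogy: Saxon's tables are reported as Unicode 6, not 3.2, but both predate the addition of U+A78F in Unicode 8.0.

```python
import unicodedata

DOT = "\ua78f"  # U+A78F LATIN LETTER SINOLOGICAL DOT

# Category according to the frozen Unicode 3.2 tables bundled with Python.
old_category = unicodedata.ucd_3_2_0.category(DOT)

# Category according to the interpreter's current tables.
new_category = unicodedata.category(DOT)

print(unicodedata.ucd_3_2_0.unidata_version, old_category)
print(unicodedata.unidata_version, new_category)
```

Under the old tables the code point is simply unassigned, so no letter test can succeed against it; under current tables it is a letter.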

martindholmes · 2024-05-21T22:12:57Z

I think this issue is complete, but only through the ad-hoc hack of adding the specific character concerned into the regex. Somehow or other, we should keep this around to remind ourselves that when Saxon 12.5 comes out, we need to move to it, and remove the hack.

martindholmes · 2024-09-05T22:41:50Z

Note: Saxon 12.5 was released in July, so I'll add a ticket for upgrading to it, and link it to this ticket. If the upgrade goes smoothly we should be able to test the removal of this hack.

martindholmes · 2024-09-23T23:19:00Z

Saxon 12.5 now merged, so this can be tested and the hack removed if no longer required.

martindholmes added the "bug" label (Something isn't working) May 1, 2024

martindholmes self-assigned this May 1, 2024

joeytakeda added a commit that referenced this issue May 6, 2024

Merge pull request #301 from projectEndings/iss-300-sindot

9c141e9

Fix with accompanying test for issue #300, mishandling of U_A78F.

martindholmes added the "fix committed" label (A fix has been made, so if no problems emerge the issue can be closed.) May 6, 2024

martindholmes mentioned this issue May 8, 2024

Consider making the tokenization regex configurable #303

Open

martindholmes added a commit that referenced this issue May 8, 2024

Port of the fix for issue #300 into release branch.

4a62d9a

martindholmes added this to the Release 1.4.x milestone May 8, 2024

martindholmes mentioned this issue Sep 5, 2024

Move to Saxon 12.5 #312

Closed
