Tokenizer treats an alphabetic character as a word-delimiter #300

Open
martindholmes opened this issue May 1, 2024 · 6 comments
Labels
bug Something isn't working fix committed A fix has been made, so if no problems emerge the issue can be closed.

@martindholmes
Collaborator

The codepoint U+A78F:

https://util.unicode.org/UnicodeJsps/character.jsp?a=A78F

(LATIN LETTER SINOLOGICAL DOT) is in the Latin Extended-D block, is Alphabetic, and has the general category Other_Letter (Lo). Wikipedia explains: "A middot may be used as a consonant or modifier letter, rather than as punctuation, in transcription systems and in language orthographies. For such uses Unicode provides the code point U+A78F ꞏ LATIN LETTER SINOLOGICAL DOT."

It's being proposed for use in this way (as a consonant to signal length) in Wendat orthography. However, our tokenizer currently treats it as a word-break character, which I think is a bug. The bug could be in the tokenizer's regex or in Java's Unicode regex handling. The character is recent enough in Unicode (2015) that the code may simply not have caught up. If so, I think we should special-case it.
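One way to narrow this down is to ask the running JDK directly how it classifies the code point, independently of the tokenizer. The following is only a diagnostic sketch: the class and method names are made up for illustration, and the result depends on the Unicode version bundled with the JDK in use, so no expected output is shown. A JDK whose Unicode data includes the character should report it as a letter (type `Character.OTHER_LETTER`).

```java
// Hypothetical diagnostic; class and method names are illustrative only.
final class SinologicalDotCheck {

    /** Describes how the running JDK classifies a single code point. */
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return String.format(
            "U+%04X type=%d letter=%b alphabetic=%b regexL=%b",
            codePoint,
            Character.getType(codePoint),       // numeric general category
            Character.isLetter(codePoint),
            Character.isAlphabetic(codePoint),
            s.matches("\\p{L}"));               // java.util.regex letter class
    }

    public static void main(String[] args) {
        // LATIN LETTER SINOLOGICAL DOT
        System.out.println(describe(0xA78F));
    }
}
```

If the JDK reports the character as a letter but the XPath `matches()` call does not, the regex engine used by the XSLT processor is the more likely culprit than Java itself.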

@martindholmes martindholmes added the bug Something isn't working label May 1, 2024
@martindholmes martindholmes self-assigned this May 1, 2024
@martindholmes
Collaborator Author

This seems to be a bug in Saxon or Java, because both of these test false:

matches('ꞏ', '\p{L}')
matches('ꞏ', '\p{L}')

I think the best thing to do for now is to add this character explicitly to the regex for alphanumerics.
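As a rough illustration of that workaround (not the project's actual tokenizer, which is written against XPath regexes), here is a minimal Java sketch. `WordTokenizer` and its pattern are assumptions for the example, and `java.util.regex` stands in for the XPath regex engine. The idea is simply to list U+A78F in the word-character class alongside `\p{L}` and `\p{N}`, so it counts as part of a word whatever the engine's Unicode tables say.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the project's actual tokenizer.
final class WordTokenizer {

    // Letters and digits, plus U+A78F listed explicitly so that it counts
    // as a word character even if the engine's Unicode tables predate it.
    private static final Pattern WORD =
        Pattern.compile("[\\p{L}\\p{N}\\uA78F]+");

    /** Returns the maximal runs of word characters in the text. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The dot sits between two letters and must not split the word.
        System.out.println(tokenize("a\uA78Fb cd"));
    }
}
```

With the explicit entry in the class, `a`, U+A78F and `b` stay together as one token, and `cd` is the second token.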

@martindholmes
Collaborator Author

martindholmes commented May 6, 2024

Fix and test for it committed in branch iss-300-sindot. PR #301 created.

joeytakeda added a commit that referenced this issue May 6, 2024
Fix with accompanying test for issue #300, mishandling of U_A78F.
@martindholmes martindholmes added the fix committed A fix has been made, so if no problems emerge the issue can be closed. label May 6, 2024
@martindholmes
Collaborator Author

Martin Honnen pointed me at the Saxon documentation which says that it's still using Unicode 6 tables:

https://www.saxonica.com/html/documentation12/conformance/xpath31.html

So that would explain it, if the documentation is up to date.

@martindholmes
Collaborator Author

I think this issue is complete, but only through the ad-hoc hack of adding the specific character concerned into the regex. Somehow or other, we should keep this around to remind ourselves that when Saxon 12.5 comes out, we need to move to it, and remove the hack.

@martindholmes
Collaborator Author

Note: Saxon 12.5 was released in July, so I'll add a ticket for upgrading to it, and link it to this ticket. If the upgrade goes smoothly we should be able to test the removal of this hack.

@martindholmes
Collaborator Author

Saxon 12.5 now merged, so this can be tested and the hack removed if no longer required.
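A sketch of how such a check could look: split a sample on non-word characters with and without the special case, and compare the results. If they are identical for text containing U+A78F, the special case is no longer doing anything. This is an assumption-laden illustration: `HackRedundancyCheck` is a hypothetical name, and `java.util.regex` stands in for the engine under test, so the real check would need to run the equivalent comparison through Saxon.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical check; java.util.regex stands in for the engine under test.
final class HackRedundancyCheck {

    // Delimiter pattern with U+A78F explicitly excluded from delimiters.
    private static final Pattern WITH_HACK =
        Pattern.compile("[^\\p{L}\\p{N}\\uA78F]+");

    // Delimiter pattern relying on the engine's Unicode tables alone.
    private static final Pattern WITHOUT_HACK =
        Pattern.compile("[^\\p{L}\\p{N}]+");

    /** True when both delimiter patterns split the sample identically. */
    static boolean hackIsRedundantFor(String sample) {
        return Arrays.equals(WITH_HACK.split(sample),
                             WITHOUT_HACK.split(sample));
    }

    public static void main(String[] args) {
        // Result depends on the Unicode tables of the engine in use.
        System.out.println(hackIsRedundantFor("a\uA78Fb cd"));
    }
}
```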
