Text.tokenize: add Unicode support, fix bug #1248

Open
wants to merge 1 commit into
base: titan10

Conversation

HonestManXin

When the string argument passed to Text.tokenize has length 1, the returned token list is empty.

@titan-cla

Hi @HonestManXin, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

@titan-cla

You did it @HonestManXin!

Thank you for signing the Contribution License Agreement.

@graben1437
Contributor

Hi,
This is a nice fix.
I pulled it into my https://github.com/graben1437/titan1withtp3.1.git build. Could you also provide or suggest a new test case that fails with the old code but passes with the new code?

@HonestManXin
Author

I just found this in the Java Character API documentation:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
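
The difference described in that documentation can be seen in a small standalone sketch. The class and method names here (CodePointDemo, countLettersByChar, countLettersByCodePoint) are hypothetical and exist only for illustration; the code point 0x2F81A is the one the Javadoc itself cites.

```java
final class CodePointDemo {

    // Char-based scan: every UTF-16 code unit is tested on its own,
    // so the two halves of a surrogate pair are never seen as a letter.
    static int countLettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) count++;
        }
        return count;
    }

    // Code-point scan: a surrogate pair is read as one character
    // and the index advances by one or two chars accordingly.
    static int countLettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) count++;
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+2F81A as a String: one code point stored as two chars.
        String ideograph = new String(Character.toChars(0x2F81A));
        System.out.println(ideograph.length());                 // 2
        System.out.println(countLettersByChar(ideograph));      // 0
        System.out.println(countLettersByCodePoint(ideograph)); // 1
    }
}
```

A tokenizer that tests `str.charAt(p)` would therefore treat both halves of such a character as separators, which is why scanning by code point matters here.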

I'm sorry, my change still has some problems. The proper implementation would be the one below.

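// Needs java.util.ArrayList and java.util.List; MIN_TOKEN_LENGTH is a constant not shown in this snippet.
public static List<String> tokenize(String str) {
    ArrayList<String> tokens = new ArrayList<String>();
    int previous = 0; // start index of the token currently being scanned
    int codePoint;
    // Advance by whole code points so a surrogate pair is never split.
    for (int p = 0; p < str.length(); p += Character.charCount(codePoint)) {
        codePoint = str.codePointAt(p);
        if (!Character.isLetterOrDigit(codePoint)) {
            // Separator found: emit the pending token if it is long enough.
            if (p > previous + MIN_TOKEN_LENGTH) tokens.add(str.substring(previous, p));
            previous = p + Character.charCount(codePoint);
        }
    }
    // Emit the trailing token; <= lets a string of exactly MIN_TOKEN_LENGTH chars through.
    if (previous + MIN_TOKEN_LENGTH <= str.length()) tokens.add(str.substring(previous));
    return tokens;
}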
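
As a rough sketch of the kind of test case requested earlier in the thread, the proposed method can be exercised in isolation. The wrapper class TokenizeSketch is hypothetical, and MIN_TOKEN_LENGTH = 1 is an assumption, since the constant's real value is not shown in this conversation; the expected outputs in the comments hold only under that assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeSketch {
    // Assumption: the real constant lives in Text; its value is not shown in this thread.
    static final int MIN_TOKEN_LENGTH = 1;

    // Same logic as the tokenize method proposed in this comment.
    static List<String> tokenize(String str) {
        ArrayList<String> tokens = new ArrayList<String>();
        int previous = 0;
        int codePoint;
        for (int p = 0; p < str.length(); p += Character.charCount(codePoint)) {
            codePoint = str.codePointAt(p);
            if (!Character.isLetterOrDigit(codePoint)) {
                if (p > previous + MIN_TOKEN_LENGTH) tokens.add(str.substring(previous, p));
                previous = p + Character.charCount(codePoint);
            }
        }
        if (previous + MIN_TOKEN_LENGTH <= str.length()) tokens.add(str.substring(previous));
        return tokens;
    }

    public static void main(String[] args) {
        // Supplementary CJK ideograph U+2F81A, stored as a surrogate pair.
        String ideograph = new String(Character.toChars(0x2F81A));

        System.out.println(tokenize("a"));                  // [a]
        System.out.println(tokenize("hello world"));        // [hello, world]
        System.out.println(tokenize(ideograph).size());     // 1
        System.out.println(tokenize("a b"));                // [b]
    }
}
```

The first and third inputs are the ones a regression test would most plausibly target: the length-1 string from the PR description, and a single supplementary character whose surrogate halves a char-based check would not recognize as letters. The last input shows that, under the assumed constant, the in-loop check `p > previous + MIN_TOKEN_LENGTH` still drops a one-character token that is followed by a separator; whether that is intended is not stated in the thread.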
