Text.tokenize: add Unicode support, fix bug #1248

HonestManXin · 2016-01-22T14:03:28Z

When the string passed to Text.tokenize has length 1, the returned token list is empty.

… return tokens is empty list

titan-cla · 2016-01-22T14:03:39Z

Hi @HonestManXin, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

titan-cla · 2016-01-22T14:15:07Z

You did it @HonestManXin!

Thank you for signing the Contribution License Agreement.

graben1437 · 2016-02-11T14:56:39Z

Hi,
This is a nice suggested fix.
I pulled this into my https://github.com/graben1437/titan1withtp3.1.git build, but I am wondering whether you could also suggest a new test case that "breaks" with the old code but works with the new code?

HonestManXin · 2016-02-11T15:45:25Z

I just found this in the Java Character API documentation:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
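The difference described in the quoted documentation can be sketched as follows. This is only an illustration: the class and helper names are made up for this example and are not part of Titan or of the patch.

```java
final class SurrogateDemo {
    // U+2F81A, the CJK ideograph mentioned in the quoted documentation.
    // It lies outside the Basic Multilingual Plane, so it occupies two chars.
    static final String IDEOGRAPH = new String(Character.toChars(0x2F81A));

    // char-based check: only sees the first UTF-16 code unit (a high surrogate).
    static boolean firstCharIsLetter(String s) {
        return Character.isLetter(s.charAt(0));
    }

    // int-based check: sees the whole code point.
    static boolean firstCodePointIsLetter(String s) {
        return Character.isLetter(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(IDEOGRAPH.length());                // 2
        System.out.println(firstCharIsLetter(IDEOGRAPH));      // false
        System.out.println(firstCodePointIsLetter(IDEOGRAPH)); // true
    }
}
```

A tokenizer that walks the string with charAt would therefore treat both halves of the surrogate pair as non-letters.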

I'm sorry, but the patch still has some problems. A proper implementation would be:

public static List<String> tokenize(String str) {
    ArrayList<String> tokens = new ArrayList<String>();
    int previous = 0; // start index of the current candidate token
    int codePoint;
    // Advance by code point rather than by char, so a surrogate pair is read as one character.
    for (int p = 0; p < str.length(); p += Character.charCount(codePoint)) {
        codePoint = str.codePointAt(p);
        if (!Character.isLetterOrDigit(codePoint)) {
            // A non-alphanumeric code point ends the current token.
            if (p > previous + MIN_TOKEN_LENGTH) tokens.add(str.substring(previous, p));
            previous = p + Character.charCount(codePoint);
        }
    }
    // Emit the trailing token, if there is one.
    if (previous + MIN_TOKEN_LENGTH <= str.length()) tokens.add(str.substring(previous, str.length()));
    return tokens;
}
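As a possible starting point for the requested test case, here is a minimal self-contained check of the method above. It is a sketch under assumptions: the class name TokenizeCheck is hypothetical, the method body is copied from this comment rather than from Titan's Text class, and MIN_TOKEN_LENGTH = 1 is assumed here, so the real constant should be checked before relying on the expected values.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeCheck {
    // Assumption: the real constant is defined in Titan's Text class; 1 is assumed here.
    static final int MIN_TOKEN_LENGTH = 1;

    // Copied from the comment above so that this sketch compiles on its own.
    static List<String> tokenize(String str) {
        ArrayList<String> tokens = new ArrayList<String>();
        int previous = 0;
        int codePoint;
        for (int p = 0; p < str.length(); p += Character.charCount(codePoint)) {
            codePoint = str.codePointAt(p);
            if (!Character.isLetterOrDigit(codePoint)) {
                if (p > previous + MIN_TOKEN_LENGTH) tokens.add(str.substring(previous, p));
                previous = p + Character.charCount(codePoint);
            }
        }
        if (previous + MIN_TOKEN_LENGTH <= str.length()) tokens.add(str.substring(previous, str.length()));
        return tokens;
    }

    public static void main(String[] args) {
        // U+2F81A is a supplementary CJK ideograph (two chars in UTF-16).
        String ideograph = new String(Character.toChars(0x2F81A));

        System.out.println(tokenize("a"));                             // [a]
        System.out.println(tokenize("hello world"));                   // [hello, world]
        System.out.println(tokenize("ab" + ideograph + "cd").size());  // 1
    }
}
```

The first input covers the length-1 case from the PR description. The last one covers supplementary characters: a loop that inspects one char at a time would see the two surrogates as separators and split the string around the ideograph.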

Text.tokenize add unicode support ,fix Bug when str.length() == 1 the…

22dd68c

… return tokens is empty list

titan-cla added the cla-missing label Jan 22, 2016

titan-cla removed the cla-missing label Jan 22, 2016
