GB18030 false positive with WINDOWS-1252 data set #11

Open
GoogleCodeExporter opened this issue Jan 25, 2016 · 4 comments

Comments

@GoogleCodeExporter

What steps will reproduce the problem?
1. Pass UniversalDetector a WINDOWS-1252 byte buffer containing a series of
degree symbols alternating with letters and digits,
 e.g. {91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84}
2. Call UniversalDetector#getDetectedCharset(). It should return WINDOWS-1252,
but it returns GB18030 instead.
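
To make the sample buffer easier to read, here is a minimal sketch that decodes it as WINDOWS-1252 using only the JDK and counts the 0xB0 (degree sign) bytes. The class and method names are made up for illustration and are not part of the detector library; the detector call itself is left out because it needs the library jar.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class DegreeSample {
    // Byte buffer copied from the report (signed Java bytes; -80 is 0xB0).
    static final byte[] SAMPLE = {
        91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84
    };

    // Decode the buffer the way the reporter expects it to be read.
    static String decodeAsWindows1252(byte[] buf) {
        return new String(buf, Charset.forName("windows-1252"));
    }

    // Count the bytes equal to 0xB0, the degree sign in WINDOWS-1252.
    static int countDegreeBytes(byte[] buf) {
        int count = 0;
        for (byte b : buf) {
            if (b == (byte) 0xB0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(decodeAsWindows1252(SAMPLE));
        System.out.println(countDegreeBytes(SAMPLE));
    }
}
```

Every second byte in the sample is 0xB0, which is a printable character in WINDOWS-1252 but also falls inside the lead-byte range (0x81 to 0xFE) of GB18030 multi-byte sequences, so the same bytes are plausible input for both encodings.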

See attached unit test for minimal reproduction test case.

What is the expected output? What do you see instead?
Expected output from UniversalDetector#getDetectedCharset() is "WINDOWS-1252",
but the actual output is "GB18030".

What version of the product are you using? On what operating system?
I'm using version 1.0.3 on 64-bit Ubuntu 11.04 (Natty) with the default kernel, 2.6.38-10-generic. The JDK I'm currently running is 1.6.0_23-x64.

Original issue reported on code.google.com by [email protected] on 13 Jul 2011 at 4:34

@GoogleCodeExporter

Unit test attached

Original comment by [email protected] on 13 Jul 2011 at 4:41

Attachments:

@GoogleCodeExporter

I experienced the same issue. Changing the buffer size for reading the input
stream from 4096 to 128 solved the problem. The error occurred with buffer
sizes of 253 and above.

Original comment by [email protected] on 28 Feb 2012 at 9:39

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

Changing the buffer size did not solve the issue on real files.

The workaround I am currently using is to check whether one or more degree
characters (°) are present in the byte stream (buf[i] == (byte) 0xB0). If so,
and if the detector returns "GB18030", I use "WINDOWS-1252" instead.
This gives good results, as long as you do not have to detect files that
really are GB18030 encoded.
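
The workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and `detected` stands for whatever string UniversalDetector#getDetectedCharset() returned for the same bytes.

```java
// Hypothetical helper class, for illustration only.
class CharsetFallback {
    // True if any of the first `length` bytes is 0xB0, the degree sign
    // in WINDOWS-1252.
    static boolean containsDegreeByte(byte[] buf, int length) {
        for (int i = 0; i < length; i++) {
            if (buf[i] == (byte) 0xB0) {
                return true;
            }
        }
        return false;
    }

    // Override a "GB18030" result with "WINDOWS-1252" when the data
    // contains at least one 0xB0 byte; otherwise keep the detector's
    // answer (including null, meaning nothing was detected).
    static String applyWorkaround(String detected, byte[] buf, int length) {
        if ("GB18030".equals(detected) && containsDegreeByte(buf, length)) {
            return "WINDOWS-1252";
        }
        return detected;
    }
}
```

As noted above, this trades one error for another: genuine GB18030 data that happens to contain a 0xB0 byte would be mislabelled as WINDOWS-1252.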

Original comment by [email protected] on 29 Jan 2015 at 10:20
