Always detecting US-ASCII for UTF-8 encoded files #35
Comments
On small files, if all characters are ASCII, the default (now) is to return US_ASCII as the encoding.
UTF-8 is a superset of ASCII, so if a file doesn't have any characters outside of ASCII, then I don't think there's a meaningful way to identify it as UTF-8.
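To illustrate the superset point, here is a minimal sketch using only the JDK (the class name `AsciiSubsetDemo` and the helper `isPureAscii` are made up for this example and are not part of juniversalchardet). It shows that ASCII-only text produces the same bytes under US-ASCII and UTF-8, so a detector has nothing to tell them apart:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubsetDemo {
    /** Hypothetical helper: true if every byte is in the 7-bit ASCII range. */
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            // Java bytes are signed, so values 0x80..0xFF show up as negative.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String plain = "plain ASCII text";
        byte[] asAscii = plain.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8 = plain.getBytes(StandardCharsets.UTF_8);

        // Identical byte sequences: nothing distinguishes the two encodings.
        System.out.println(Arrays.equals(asAscii, asUtf8)); // true
        System.out.println(isPureAscii(asUtf8));            // true

        // A single accented character introduces bytes above 0x7F in UTF-8.
        byte[] accented = "m\u00e1s".getBytes(StandardCharsets.UTF_8);
        System.out.println(isPureAscii(accented));          // false
    }
}
```

Since any pure-ASCII file is also a valid UTF-8 file, a caller that prefers UTF-8 can usually treat a US-ASCII result as UTF-8 without changing how the bytes decode.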
It doesn't work when the text is too short. How can I solve it?
The following unit test has a string that will be detected as TIS-620, where UTF-8 would be preferred:

    import org.junit.jupiter.api.Test;
    import org.mozilla.universalchardet.UniversalDetector;

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertNotNull;

    public class EncodingTest {
      @Test
      public void test_Encoding_UTF8_UTF8() {
        final var bytes = testBytes();
        final var detector = new UniversalDetector( null );

        // Feed the whole buffer, then signal end of input so a result is produced.
        detector.handleData( bytes, 0, bytes.length );
        detector.dataEnd();

        final var expectedCharset = StandardCharsets.UTF_8;
        final var detectedCharset = detector.getDetectedCharset();

        assertNotNull( detectedCharset );

        final var actualCharset = Charset.forName( detectedCharset );

        assertEquals( expectedCharset, actualCharset );
      }

      private static byte[] testBytes() {
        // Encode explicitly as UTF-8; a bare getBytes() uses the platform default charset.
        return
          "One humid afternoon during the harrowing heatwave of 2060, Renato Salvatierra, a man with blood sausage fingers and a footfall that silenced rooms, received a box at his police station. Taped to the box was a ransom note; within were his wife's eyes. By year's end, a supermax prison overflowed with felons, owing to Salvatierra's efforts to find his beloved. Soon after, he flipped profession into an entry-level land management position that, his wife insisted, would be, in her words, *infinitamente más relajante*---infinitely more relaxing."
            .getBytes( StandardCharsets.UTF_8 );
      }
    }

Reports:
A similar scenario also caused US-ASCII to be detected, despite the text containing a diacritic.
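One possible workaround for mostly-ASCII input like the cases above, sketched here with only the JDK and not taken from juniversalchardet itself: check whether the bytes are strictly valid UTF-8 first, and only fall back to the detector's guess when they are not. The class `Utf8FirstCheck` and the helpers `isValidUtf8` and `chooseCharset` are hypothetical names for this example; `detectorGuess` stands in for whatever `getDetectedCharset()` returned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8FirstCheck {
    /** Hypothetical helper: true if the bytes decode as UTF-8 with no malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Hypothetical helper: prefer UTF-8 when the bytes are valid UTF-8, else keep the guess. */
    static String chooseCharset(byte[] bytes, String detectorGuess) {
        return isValidUtf8(bytes) ? "UTF-8" : detectorGuess;
    }

    public static void main(String[] args) {
        byte[] utf8 = "infinitamente m\u00e1s relajante".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "infinitamente m\u00e1s relajante".getBytes(StandardCharsets.ISO_8859_1);

        // Valid UTF-8 overrides a mistaken guess such as TIS-620.
        System.out.println(chooseCharset(utf8, "TIS-620"));
        // A lone 0xE1 byte is not valid UTF-8, so the detector's guess is kept.
        System.out.println(chooseCharset(latin1, "ISO-8859-1"));
    }
}
```

The trade-off is that this check biases every result toward UTF-8: pure-ASCII input always passes, and text in a legacy encoding could in rare cases also happen to be valid UTF-8, so it only suits callers who already expect UTF-8 to be the common case.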
I tried
It shows US-ASCII