Always detecting US-ASCII for UTF-8 encoded files #35

neerajjain92 · 2020-07-15T06:38:09Z

I tried:

UniversalDetector detector = new UniversalDetector();
byte[] buf = new byte[4096];
int nread;
// Feed the file to the detector until it is confident or the file ends.
try (FileInputStream fis = new FileInputStream(file)) {
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
// Signal end of input so the detector can make its final decision.
detector.dataEnd();

// Null if no encoding could be determined.
String encoding = detector.getDetectedCharset();
System.out.println(encoding);

It shows US-ASCII.


albfernandez · 2020-08-22T11:39:49Z

On small files, if all characters are ASCII, the default (now) is to return US-ASCII as the encoding.
I need to see some sample if it is not the case.

amake · 2020-09-23T08:05:45Z

UTF-8 is a superset of ASCII so if a file doesn't have any characters outside of ASCII then I don't think there's a meaningful way to identify it as UTF-8.
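To illustrate the point above, here is a minimal sketch using only the JDK. `AsciiSubsetDemo` and `isPureAscii` are hypothetical names chosen for this example. It shows that ASCII-only text produces the same bytes under US-ASCII and UTF-8, so the bytes alone cannot tell the two apart, while a single accented character introduces bytes outside the ASCII range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiSubsetDemo {
    /** Returns true if every byte is in the 7-bit ASCII range (0x00 to 0x7F). */
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // Java bytes are signed, so values >= 0x80 are negative
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String plain = "relajante";
        byte[] asAscii = plain.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8 = plain.getBytes(StandardCharsets.UTF_8);

        // ASCII-only text: both encodings yield the same byte sequence.
        System.out.println(Arrays.equals(asAscii, asUtf8));
        System.out.println(isPureAscii(asUtf8));

        // "más" written with a Unicode escape so the source file encoding does not matter.
        byte[] accented = "m\u00e1s".getBytes(StandardCharsets.UTF_8);
        System.out.println(isPureAscii(accented));
    }
}
```

For such input, any detector has to pick one of the two labels by convention rather than by evidence.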

yangsichen · 2020-11-03T08:14:10Z

It doesn't work when the input is too short. How can I solve this?
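One possible workaround for short or all-ASCII input, sketched under assumptions: treat a US-ASCII or missing result as UTF-8, since ASCII text decodes identically either way. `CharsetFallback` and `resolve` are hypothetical names, not part of juniversalchardet; the `detected` argument stands for whatever `detector.getDetectedCharset()` returned.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetFallback {
    /**
     * Hypothetical helper: maps a detected charset name to a Charset,
     * falling back to UTF-8 when the result is null, US-ASCII, or unknown to the JVM.
     */
    static Charset resolve(String detected) {
        if (detected == null || "US-ASCII".equalsIgnoreCase(detected)) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(detected);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("US-ASCII"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```

This only relabels the result; it does not make detection on very short non-ASCII input any more reliable.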

DarkTyger · 2024-04-07T20:37:08Z

> I need to see some sample if it is not the case.

The following unit test has a string that will be detected as TIS-620, where UTF-8 would be preferred:

import org.junit.jupiter.api.Test;
import org.mozilla.universalchardet.UniversalDetector;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

public class EncodingTest {
  @Test
  public void test_Encoding_UTF8_UTF8() {
    final var bytes = testBytes();

    final var detector = new UniversalDetector( null );
    detector.handleData( bytes, 0, bytes.length );
    detector.dataEnd();

    final var expectedCharset = StandardCharsets.UTF_8;
    final var detectedCharset = detector.getDetectedCharset();
    
    assertNotNull( detectedCharset );

    final var actualCharset = Charset.forName( detectedCharset );

    assertEquals( expectedCharset, actualCharset );
  }

  private static byte[] testBytes() {
    return
      "One humid afternoon during the harrowing heatwave of 2060, Renato Salvatierra, a man with blood sausage fingers and a footfall that silenced rooms, received a box at his police station. Taped to the box was a ransom note; within were his wife's eyes. By year's end, a supermax prison overflowed with felons, owing to Salvatierra's efforts to find his beloved. Soon after, he flipped profession into an entry-level land management position that, his wife insisted, would be, in her words, *infinitamente más relajante*---infinitely more relaxing."
      .getBytes( StandardCharsets.UTF_8 );
  }
}

Reports:

org.opentest4j.AssertionFailedError: 
Expected :UTF-8
Actual   :TIS-620

A similar input was also detected as US-ASCII, despite containing a diacritic.
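Where UTF-8 is the preferred answer, one option is to validate the bytes as strict UTF-8 before trusting a statistical guess. The sketch below uses only the JDK's `CharsetDecoder`; `Utf8Check`, `isValidUtf8` and `chooseCharset` are hypothetical names, and this is a caller-side workaround, not how the library decides internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Hypothetical helper: prefers UTF-8 when the bytes are valid UTF-8,
     * otherwise keeps the detector's guess (which may be null).
     */
    static String chooseCharset(byte[] data, String detected) {
        return isValidUtf8(data) ? "UTF-8" : detected;
    }

    public static void main(String[] args) {
        byte[] utf8 = "infinitamente m\u00e1s relajante".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "m\u00e1s".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(chooseCharset(utf8, "TIS-620"));
        System.out.println(chooseCharset(latin1, "ISO-8859-1"));
    }
}
```

Two caveats: a buffer cut off in the middle of a multi-byte sequence fails validation, so the check should run on complete input, and pure ASCII input always passes because it is also valid UTF-8.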

albfernandez self-assigned this Aug 22, 2020
