Always detecting US-ASCII for UTF-8 encoded files #35

Open
neerajjain92 opened this issue Jul 15, 2020 · 4 comments
@neerajjain92

I tried

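// Feed the file to the detector in chunks until it is confident or the input ends.
UniversalDetector detector = new UniversalDetector();
byte[] buf = new byte[4096];
try (FileInputStream fis = new FileInputStream(file)) {
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
// Signal that no more data will arrive so the detector can finalize its guess.
detector.dataEnd();

// May be null when no charset could be determined.
String encoding = detector.getDetectedCharset();
System.out.println(encoding);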

It prints US-ASCII, even though the file is UTF-8 encoded.

@albfernandez albfernandez self-assigned this Aug 22, 2020
@albfernandez
Owner

On small files, if all characters are ASCII, the default (now) is to return US-ASCII as the encoding.
I need to see some sample if it is not the case.

@amake

amake commented Sep 23, 2020

UTF-8 is a superset of ASCII, so if a file doesn't contain any characters outside of ASCII, I don't think there's a meaningful way to identify it as UTF-8.
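
To illustrate the point above, here is a minimal sketch using only the Java standard library (no juniversalchardet calls). The class name `AsciiSubsetDemo` and the helper `isPureAscii` are hypothetical names chosen for this example. It shows that ASCII-only text produces identical bytes under US-ASCII and UTF-8, so the bytes alone carry nothing that could distinguish the two.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiSubsetDemo {
  /** Returns true when every byte is in the 7-bit ASCII range (0x00 to 0x7F). */
  static boolean isPureAscii(final byte[] bytes) {
    for (final byte b : bytes) {
      // Java bytes are signed, so values 0x80 to 0xFF appear as negative numbers.
      if (b < 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(final String[] args) {
    final String text = "plain ASCII text";
    final byte[] asAscii = text.getBytes(StandardCharsets.US_ASCII);
    final byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);

    // Identical byte sequences: a detector cannot tell these apart.
    System.out.println(Arrays.equals(asAscii, asUtf8)); // prints "true"

    // A single accented character (U+00E1) pushes the UTF-8 bytes out of the ASCII range.
    System.out.println(isPureAscii("m\u00e1s".getBytes(StandardCharsets.UTF_8))); // prints "false"
  }
}
```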

@yangsichen

It doesn't work when the input is too short. How can I solve this?
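
One possible workaround, sketched here as an assumption rather than anything the library itself provides: because every US-ASCII byte sequence is also valid UTF-8, a caller that prefers UTF-8 can post-process the detector's answer. The class `DetectedCharsetFallback` and the method `resolve` are hypothetical; the `detectedName` argument stands for whatever string `getDetectedCharset()` returned.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DetectedCharsetFallback {
  /**
   * Maps a detected charset name to the charset used for decoding.
   * Assumption: the caller prefers UTF-8 whenever detection is
   * inconclusive (null), unusable (unknown name), or US-ASCII.
   */
  static Charset resolve(final String detectedName) {
    if (detectedName == null) {
      return StandardCharsets.UTF_8;
    }

    final Charset detected;
    try {
      detected = Charset.forName(detectedName);
    } catch (final IllegalArgumentException e) {
      // Covers both illegal and unsupported charset names.
      return StandardCharsets.UTF_8;
    }

    // ASCII is a strict subset of UTF-8, so decoding as UTF-8 loses nothing.
    return StandardCharsets.US_ASCII.equals(detected) ? StandardCharsets.UTF_8 : detected;
  }

  public static void main(final String[] args) {
    System.out.println(resolve("US-ASCII"));   // prints "UTF-8"
    System.out.println(resolve(null));         // prints "UTF-8"
    System.out.println(resolve("ISO-8859-1")); // prints "ISO-8859-1"
  }
}
```

This does not help when short input is misdetected as some unrelated charset; it only covers the US-ASCII and no-result cases.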

@DarkTyger

DarkTyger commented Apr 7, 2024

I need to see some sample if it is not the case.

The following unit test has a string that will be detected as TIS-620, where UTF-8 would be preferred:

import org.junit.jupiter.api.Test;
import org.mozilla.universalchardet.UniversalDetector;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

public class EncodingTest {
  @Test
  public void test_Encoding_UTF8_UTF8() {
    final var bytes = testBytes();

    final var detector = new UniversalDetector( null );
    detector.handleData( bytes, 0, bytes.length );
    detector.dataEnd();

    final var expectedCharset = StandardCharsets.UTF_8;
    final var detectedCharset = detector.getDetectedCharset();
    
    assertNotNull( detectedCharset );

    final var actualCharset = Charset.forName( detectedCharset );

    assertEquals( expectedCharset, actualCharset );
  }

  private static byte[] testBytes() {
    return
      "One humid afternoon during the harrowing heatwave of 2060, Renato Salvatierra, a man with blood sausage fingers and a footfall that silenced rooms, received a box at his police station. Taped to the box was a ransom note; within were his wife's eyes. By year's end, a supermax prison overflowed with felons, owing to Salvatierra's efforts to find his beloved. Soon after, he flipped profession into an entry-level land management position that, his wife insisted, would be, in her words, *infinitamente más relajante*---infinitely more relaxing."
      .getBytes( StandardCharsets.UTF_8 ); // explicit charset: the no-arg getBytes() uses the platform default
  }
}

Reports:

org.opentest4j.AssertionFailedError: 
Expected :UTF-8
Actual   :TIS-620

A similar input was also detected as US-ASCII, despite containing a diacritic.
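
For cases like the one above, a caller who would rather get UTF-8 whenever the bytes are well-formed UTF-8 could run a strict validity check before consulting the detector. The following is a sketch under that assumption, using only standard `java.nio.charset` classes; `Utf8Validator` and `isValidUtf8` are hypothetical names, and this is not how juniversalchardet itself decides.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {
  /** Returns true when the whole array decodes as UTF-8 without any error. */
  static boolean isValidUtf8(final byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (final CharacterCodingException e) {
      return false;
    }
  }

  public static void main(final String[] args) {
    final String text = "infinitamente m\u00e1s relajante";

    // U+00E1 becomes the two bytes 0xC3 0xA1 in UTF-8, which is well-formed.
    System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8))); // prints "true"

    // In ISO-8859-1 the same character is the lone byte 0xE1, which is malformed UTF-8.
    System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // prints "false"
  }
}
```

The check needs the complete input: if only a leading chunk of a file is validated, a multi-byte sequence cut off at the chunk boundary will be reported as invalid.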

5 participants