Character encoding detection #1

achudnov · 2014-05-03T03:11:29Z

Character encodings are currently taken from the HTTP headers. While the RFC mandates that encodings be specified, some websites omit them, and in that case the library assumes UTF-8 when Content-Type carries no encoding. This is brittle. A better approach is to detect the character encoding from the response body, for example with text-icu. Work on this has started in the chardet branch.
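To make the proposed fallback order concrete, here is a minimal sketch in Python using only the standard library. It is not the library's code and not what the chardet branch does: the function names are hypothetical, and the "detection" step is a deliberately crude stand-in (byte-order mark, then a strict UTF-8 check, then Latin-1) for a real statistical detector such as the one text-icu wraps.

```python
import codecs
from email.message import Message

# Byte-order marks, checked with UTF-32 first so that a UTF-32 LE mark
# (which begins with the UTF-16 LE mark) is not misread as UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def sniff_encoding(body):
    """Crude stand-in for real detection (assumption, not text-icu's method)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    try:
        body.decode("utf-8")  # strict: raises on invalid UTF-8 sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # every byte sequence decodes under Latin-1


def decode_body(content_type, body):
    """Prefer the declared charset; sniff when it is missing or unusable."""
    declared = charset_from_content_type(content_type)
    if declared:
        try:
            return body.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong label: fall through to sniffing
    return body.decode(sniff_encoding(body))
```

With this ordering, a response labelled only `text/html` whose body is not valid UTF-8 is still decoded rather than rejected or garbled by a blind UTF-8 assumption. A real detector would replace `sniff_encoding` and could also report a confidence value.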

achudnov added the enhancement label May 3, 2014
