
International characters are removed. #99

Open · cphsolutionslab opened this issue Oct 22, 2013 · 15 comments

@cphsolutionslab

The code below removes international characters, or so it seems. When I open the saved_file, which is UTF-8, and print out the content, all my international characters are shown correctly. After any_tableset is run and I print out all the rows, all the international characters are removed.

    from messytables import any_tableset, headers_guess

    f = open(result['saved_file'], 'rb')
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )
        # only first sheet in xls for time being
        row_set = table_sets.tables[0]
        offset, headers = headers_guess(row_set.sample)
    finally:
        f.close()
@scraperdragon

I've knocked together a similar case but can't reproduce the bug:

saved_file:
cat,dog,mouseß
æï,ñø,åÅ

code:

from messytables import any_tableset, headers_guess

f = open('saved_file', 'rb')
table_sets = any_tableset(f)
row_set = table_sets.tables[0]
print list(row_set)
offset, headers = headers_guess(row_set.sample)
print offset, headers

for row in row_set:
    for cell in row:
        print cell.value

output:
[[<Cell(String:u'cat'>, <Cell(String:u'dog'>, <Cell(String:u'mouse\xdf'>],
 [<Cell(String:u'\xe6\xef'>, <Cell(String:u'\xf1\xf8'>, <Cell(String:u'\xe5\xc5'>]]
0 [u'cat', u'dog', u'mouse\xdf']
cat
dog
mouseß
æï
ñø
åÅ

which seems right to me. (This happens regardless of whether the file is CSV or XLS.)

  1. Is it stripping characters or replacing them with gibberish?
  2. Can you provide the values of content_type and resource['format'].lower()?
  3. Could you run this code with your file and let us know whether it produces the expected strings?
  4. If your spreadsheet isn't sensitive, could you let us see it?

Dave.

@cphsolutionslab (Author)

It's currently an issue with DataStorer in CKAN: http://data.kk.dk/dataset/betalingszoner/resource/cde21ea2-6f87-46e1-be1f-f7a0d2cfc985

Debugging shows the characters being removed (stripped, not turned into gibberish) just after:

table_sets = any_tableset(
    f,
    mimetype=content_type,
    extension=resource['format'].lower()
)

@rossjones (Contributor)

Can you 100% confirm that your file is utf-8? Does chardet think your file is utf-8?

@cphsolutionslab (Author)

This is the output:

{'confidence': 0.99, 'encoding': 'utf-8'}

after this:

f = open(result['saved_file'], 'rb')
print chardet.detect(f.read())
# note: f.read() leaves the file pointer at EOF, so an f.seek(0) would be
# needed here for any_tableset to see any data
try:
    table_sets = any_tableset(
        f,
        mimetype=content_type,
        extension=resource['format'].lower()
    )

@scraperdragon

I don't think the file is UTF-8:

(Python 2)

>>> f = open("ows.csv", "r").read()
>>> f[-100:]
'03651972, 12.579757761196365 55.66998698041308))",Gr\xf8n,Gr\xf8n betalingszone,2010-02-22,2013-07-17,21\r\n'

Given that this is a bytestring, it shouldn't contain any single non-ASCII characters (since all UTF-8 characters that aren't ASCII are multibyte).
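
For example, in a Python 2 session, a lone 0xf8 byte can never decode as UTF-8, while latin-1 happily maps it to ø:

>>> '\xf8'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 0: invalid start byte
>>> '\xf8'.decode('iso-8859-1')
u'\xf8'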

'file' agrees:
$ file ows.csv
ows.csv: ISO-8859 text, with very long lines, with CRLF line terminators

Strange that chardet thinks it's UTF-8, given that UTF-8 is one of the easiest things to prove some text isn't.

@scraperdragon

commas.py line 24:

self.reader = codecs.getreader(encoding)(f, 'ignore')

... I'm not sure silently ignoring characters that don't decode will ever be the correct behaviour.
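
That 'ignore' would explain the "removal" being reported: decoding latin-1 bytes with a wrong codec (ascii or utf-8) and errors='ignore' silently drops every byte that fails to decode. In Python 2:

>>> 'Gr\xf8n'.decode('utf-8', 'ignore')   # the latin-1 ø byte is dropped
u'Grn'
>>> 'Gr\xf8n'.decode('iso-8859-1')        # with the right codec it survives
u'Gr\xf8n'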

Also, only checking the start of the file causes misdetection in chardet, due to the big polygons:

>>> chardet.detect(open("/home/dragon/ows.csv").read())
{'confidence': 0.766658867395801, 'encoding': 'ISO-8859-2'}
>>> chardet.detect(open("/home/dragon/ows.csv").read(2000))
{'confidence': 1.0, 'encoding': 'ascii'}

We could fall back to a possibly-mangling latin-1 instead of an always-wrong UTF-8-minus-the-bad-bits. It'd be good to warn when this has occurred.
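
A minimal sketch of that fallback, assuming whole-file detection (guess_encoding is a hypothetical helper, not messytables' actual API):

import warnings
import chardet

def guess_encoding(data):
    # hypothetical helper: analyse the whole buffer, not just the first 2k
    guess = chardet.detect(data)
    encoding = guess['encoding'] or 'iso-8859-1'
    try:
        data.decode(encoding)
        return encoding
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so decoding it never fails,
        # though accented characters may come out mangled -- hence the warning
        warnings.warn("falling back to iso-8859-1; text may be mangled")
        return 'iso-8859-1'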

@cphsolutionslab (Author)

My result of the following:

f = open(result['saved_file'], 'rb')
f_test = f.read()
print f_test[-100:]

gives me this:

651972, 12.579757761196365 55.66998698041308))",Grøn,Grøn betalingszone,2010-02-22,2013-07-17,21

@cphsolutionslab (Author)

But yes, it does say ascii when using read(2000)...

@scraperdragon

Ah; I was in an interactive Python session, so

f[-100:]

is equivalent to

print repr(f[-100:])
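
That's why my paste showed escapes like \xf8 while yours showed ø. For example, in Python 2:

>>> s = 'Gr\xf8n'
>>> s          # the interactive prompt shows repr(), escaping non-ASCII bytes
'Gr\xf8n'
>>> print s    # print writes the raw bytes; a latin-1 terminal shows them as text
Grøn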

@cphsolutionslab (Author)

When I try to open the file as UTF-8:

f = codecs.open(result['saved_file'], 'rb', 'utf-8')

I get the error:

/usr/lib/ckan/default/local/lib/python2.7/site-packages/chardet/universaldetector.py:90: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if aBuf[:len(chunk)] == chunk:
2013-10-29 11:09:22,057 ERROR [root] 'ascii' codec can't encode character u'\xd8' in position 1456: ordinal not in range(128)
...

Any idea on how to fix this? UTF-8 is a superset of ASCII.

@cphsolutionslab (Author)

? I think I got lost somewhere...
Should messytables be able to handle ISO-8859-1 encoded files?
Should messytables be able to handle ASCII characters?

I'm kinda lost as to whether I should fix the file, messytables, or...?

@scraperdragon
Copy link

The error message "'ascii' codec can't encode..." is caused by something trying to convert a character outside of ASCII (e.g. a byte bigger than 127 in ISO-8859, or a Unicode code point beyond 127) to ASCII; ASCII simply has no representation for it.
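
For instance, in Python 2:

>>> u'\xd8'.encode('ascii')        # Ø has no ASCII representation
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd8' in position 0: ordinal not in range(128)
>>> u'\xd8'.encode('iso-8859-1')   # but it is a single byte in latin-1
'\xd8'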

Yes, messytables should handle ISO-8859-1 files; however, it is not correctly detecting this file because detection is truncated at 2k.
I'm not sure why chardet is trying to coerce data into ASCII. It shouldn't be trying to do that.
Your data isn't UTF-8. Attempting to decode it as such is bound to fail.

I believe this would be fixed if:

  1. we didn't truncate the analysis at 2k
  2. we fell back to ISO-8859-1, not UTF-8, if the file can't be decoded correctly

@cphsolutionslab (Author)

Removing the 2K limit fixed the issue...
Should this be a permanent solution?

And thank you for your patience with me. I really do appreciate it.

@scraperdragon

Glad to hear the specific problem is fixed!

Making it permanent is a little more complicated; some users need to avoid loading the whole file multiple times. We'll have to think about it.
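
One possible compromise, sketched with chardet's incremental UniversalDetector (detect_encoding is a hypothetical helper, not the project's code): feed chunks until chardet is confident, rather than capping at a fixed 2k:

from chardet.universaldetector import UniversalDetector

def detect_encoding(f, chunk_size=8192):
    # hypothetical helper: read only as much as detection actually needs
    detector = UniversalDetector()
    for chunk in iter(lambda: f.read(chunk_size), ''):
        detector.feed(chunk)
        if detector.done:   # chardet is confident; stop reading early
            break
    detector.close()
    f.seek(0)               # rewind so the parser starts from the top
    return detector.result  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}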

@guibos (Nov 25, 2020)

Are there plans to solve this problem? I am quite interested in working on it.
