Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode error due to using surrogates #18

Open
jgbarah opened this issue Mar 15, 2016 · 3 comments · May be fixed by #22
Open

Unicode error due to using surrogates #18

jgbarah opened this issue Mar 15, 2016 · 3 comments · May be fixed by #22
Labels
bug good first issue Good issue for first-time contributors

Comments

@jgbarah
Copy link
Contributor

jgbarah commented Mar 15, 2016

When running the git backend to analyze the git repo of the torvalds/linux GitHub repository, and uploading it to ElasticSearch (using the Python elasticsearch module) i get an error which apparently is due to trying to UTF-8 encode a "surrogated" string. The relevant part of the debugging messages and the exception I get is:

...
'message': "[PATCH] intel8x0: AC'97 audio patch for Intel ESB2\n\nThis
patch adds the Intel ESB2 DID's to the intel8x0.c file for AC'97
audio\nsupport.\n\nSigned-off-by: \udca0Jason Gaston
<[email protected]>\nSigned-off-by: Andrew Morton
<[email protected]>\nSigned-off-by: Linus Torvalds <[email protected]>",
'commit': 'c4c8ea948aa21527d502e87227b2f1d951bc506d'}
...
Traceback (most recent call last):
...
  File "/home/jgb/venvs/grimoirelab/lib/python3.5/site-packages/elasticsearch/transport.py", line 312, in perform_request
     body = body.encode('utf-8')
 UnicodeEncodeError: 'utf-8' codec can't encode character '\udca0' in position 984: surrogates not allowed

The code causing the error is:

res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)

The problem seems to be that there is a character, decoded from the git log by Perceval into a Unicode string, which is using surrogates (the character is '\udca0'), and it cannot be properly encoded in UTF-8 by the elasticsearch code, thus raising the exception.

@jgbarah
Copy link
Contributor Author

jgbarah commented Mar 15, 2016

After some more inspection, the error seems to be that the elasticsearch module expects Unicode strings that can be directly encoded as UTF-8, but surrogates are Unicode characters that cannot.

The problem sees to come from the way Perceval reads the data from the git log (thanks @sduenas for pointing this out):

        for line in gitlog:
            line = line.decode(encoding, errors='surrogateescape')
            yield line

This means that Perceval is producing Unicode strings with surrogates, which raise the exception UnicodeEncodeError when elasticsearch tries to encode them as UTF-8.

@jgbarah
Copy link
Contributor Author

jgbarah commented Mar 15, 2016

After commenting with @sduenas and @acs the solutions I see are:

  • (1) Use some other method for non-UTF-8 strings in the git log. I suggest trying the "usual" Latin-1 encoding if UTF-8 fails, and then 'backslashreplace' as the error handler instead of 'surrogateescape'. A simple code (without trying another encoding if UTF-8 fails) would be:
        for line in gitlog:
            line = line.decode(encoding, errors='backslashreplace')
            yield line
  • (2) Removing surrogates when they are found. That is, "sanitizing" the strings to make sure they can be encoded to UTF-8. For example, in my code, this could amount to:
        try:
            body.encode("utf-8")
        except UnicodeEncodeError as e:
            if e.reason == 'surrogates not allowed':
                body = body.encode('utf-8', "backslashreplace").decode('utf-8')
        res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)
  • (3) Ignoring surrogated characters, by using "ignore" instead of "backslashreplace" in the code above.

Maybe (1) is the best solution (trying with another encodings, and if none works, using "backslashreplace"), since it allows for all strings to be "encodeable" as UTF-8, which is a good thing.

I would avoid (3), because that way some characters could be lost with a trace, but it is clearly better than doing nothing.

If (1) cannot be done for some reason, maybe (2) (use "backslashreplace") could be the next better thing to do. That would allow to capture most of the problematic cases, not losing any character, and still being able of encoding as UTF-8. The main drawback of this method is that it would force all strings to be checked for UTF-8 compliance. But if the operation raising the exception is idempotent, the code could be more efficient, sanitizing strings only when the exception is raised:

        try:
            res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)
        except UnicodeEncodeError as e:
            if e.reason == 'surrogates not allowed':
                body = body.encode('utf-8', "backslashreplace").decode('utf-8')
                res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)

Therefore, could (1) be implemented in Perceval (at least for git)? I can produce a PR for that, if convenient...

Some references which I read before making up my mind:

@sduenas
Copy link
Member

sduenas commented Mar 18, 2016

My idea using subrogateescape was the same you can get with blackslashreplace but I didn't know the weird behaviour of store those characters inside the Unicode private area.

(1) sounds good but I have to find a way to make it clean because it can add more complexity to the code when it isn't really needed.

jgbarah added a commit that referenced this issue Mar 18, 2016
When decoding as utf8, if the character cannnot be decoded,
use the backslashreplace error handler, instead of the
surrogateescape error handler.

Fixes #18 for git backend, maybe others should be fixed too.
jgbarah added a commit to jgbarah-chaoss/grimoirelab-perceval that referenced this issue Mar 19, 2016
When decoding as utf8, if the character cannnot be decoded,
use the backslashreplace error handler, instead of the
surrogateescape error handler.

Fixes chaoss#18 for git backend, maybe others should be fixed too.
valeriocos pushed a commit to valeriocos/perceval that referenced this issue Dec 6, 2017
@sduenas sduenas added bug good first issue Good issue for first-time contributors labels Oct 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug good first issue Good issue for first-time contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants