I am using the latest version from pip, readability-lxml 0.8.1, and I found a curious issue. When the input contains both non-ASCII UTF-8 characters and HTML entities, the output is not properly UTF-8 encoded.
<!DOCTYPE html>
<html><head>
<title>title</title>
<meta charset="utf-8" />
</head><body>
This is déjà vu …
</body></html>
doc.summary() returns This is déjà vu …
The ellipsis entity in the input is converted beforehand to the literal character ….
doc.summary().encode('raw_unicode_escape').decode('utf-8') returns This is déjà vu \u2026
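A minimal sketch of why that round trip leaves a literal \u2026 behind. It assumes the broken output consists of UTF-8 bytes that were decoded as Latin-1, followed by a real U+2026 character produced from the entity. That shape is an assumption, and it is simulated here without calling readability at all. repair_latin1_mojibake is a hypothetical helper, not part of readability-lxml.

```python
import re

# Assumed shape of the broken output: the UTF-8 bytes of the accented
# letters mis-decoded as Latin-1, plus a real U+2026 from the entity.
broken = "This is déjà vu ".encode("utf-8").decode("latin-1") + "\u2026"

# The round trip from the report. U+2026 has no single-byte form, so
# raw_unicode_escape writes it as the six ASCII characters \u2026,
# and decoding as UTF-8 keeps them as plain text.
round_trip = broken.encode("raw_unicode_escape").decode("utf-8")


def repair_latin1_mojibake(text):
    """Hypothetical helper: re-decode only runs in U+0080..U+00FF."""
    def fix(match):
        chunk = match.group(0)
        try:
            return chunk.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return chunk  # not valid UTF-8 once re-encoded, leave untouched

    return re.sub(r"[\x80-\xff]+", fix, text)


# Characters above U+00FF, such as the ellipsis, are never touched.
repaired = repair_latin1_mojibake(broken)
```

Under that assumption, round_trip ends in the escaped text \u2026 while repaired keeps the real ellipsis character. This only papers over the symptom; if the input really is being decoded with the wrong charset, fixing the decoding before the document is parsed would be the cleaner route.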
It is very common for a page to contain both non-ASCII UTF-8 characters and HTML entities.
As the output is HTML anyway, leaving entities unprocessed could be a solution.
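One way to picture an entity-only output, as a sketch rather than anything readability-lxml does today: Python's standard xmlcharrefreplace error handler writes every non-ASCII character as a numeric character reference, so the resulting HTML is pure ASCII and cannot be mis-decoded. The sample string is taken from the example above; the variable names are made up for illustration.

```python
# Correctly decoded text, with the ellipsis already a real character.
text = "This is déjà vu \u2026"

# Replace every non-ASCII character with a numeric character reference.
ascii_html = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Here ascii_html is "This is d&#233;j&#224; vu &#8230;", which a browser renders the same as the original text.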
Thank you.