Issue with utf8 and HTML entities #175

Closed
uuencode opened this issue Dec 7, 2022 · 2 comments

uuencode commented Dec 7, 2022

I am using the latest version on pip, readability-lxml 0.8.1, and I found a curious issue: when the input contains both non-ASCII UTF-8 characters and HTML entities, the output is not properly UTF-8 encoded.

<!DOCTYPE html>
<html><head>
<title>title</title>
<meta charset="utf-8" />
</head><body>
This is déjà vu &hellip;
</body></html>

doc.summary() returns This is déjà vu …


Because &hellip; has already been converted to … at that point,
doc.summary().encode('raw_unicode_escape').decode('utf-8') returns This is déjà vu \u2026
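The behaviour described above can be reproduced with the standard library alone. This is a minimal sketch, not readability's actual code path. It assumes the UTF-8 bytes were decoded as Latin-1 somewhere upstream, and it uses html.unescape as a stand-in for whatever step resolves the entity.

```python
import html

# The body text from the sample page, as UTF-8 bytes.
raw = "This is déjà vu &hellip;".encode("utf-8")

# Assumption: the bytes get decoded as Latin-1 instead of UTF-8,
# which turns each accented character into two wrong characters.
mojibake = raw.decode("latin-1")

# Stand-in for entity resolution: &hellip; becomes U+2026.
with_entity = html.unescape(mojibake)

# The workaround from the report. raw_unicode_escape maps characters
# below U+0100 back to single bytes, but writes anything higher as a
# literal \uXXXX sequence, so the ellipsis does not survive.
repaired = with_entity.encode("raw_unicode_escape").decode("utf-8")

print(repaired)
```

The accented characters are restored, but the ellipsis comes out as the six literal characters \u2026, which matches the result reported above.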


It is very common for a page to contain both non-ASCII UTF-8 characters and HTML entities.
Since the output is HTML anyway, leaving the entities unprocessed could be a solution.

Thank you.

uuencode (Author) commented Dec 7, 2022

I found another issue, #163, which mentions that Document(response.content) should be used instead of Document(response.text), and that fixed it.

It would be a good idea to update the README.
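To illustrate why passing bytes helps, here is a rough stdlib-only sketch. decode_html_bytes is a hypothetical helper, not part of readability-lxml or requests. It only shows that code receiving raw bytes can honour the charset declared in the page, whereas already-decoded text may have been decoded with the wrong charset (ISO-8859-1 is assumed here).

```python
import re


def decode_html_bytes(content: bytes, fallback: str = "utf-8") -> str:
    """Hypothetical helper: decode with the charset declared in a <meta> tag."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', content, re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else fallback
    return content.decode(encoding, errors="replace")


# The sample page from the report, as the bytes a server would send.
page = (
    "<!DOCTYPE html>\n<html><head>\n<title>title</title>\n"
    '<meta charset="utf-8" />\n</head><body>\n'
    "This is déjà vu &hellip;\n</body></html>"
).encode("utf-8")

# Roughly analogous to Document(response.content): bytes in, charset honoured.
from_bytes = decode_html_bytes(page)

# Roughly analogous to Document(response.text) when the text was
# decoded with the wrong charset (assumed ISO-8859-1 for this sketch).
from_text = page.decode("iso-8859-1")

print("déjà" in from_bytes, "déjà" in from_text)
```

With the bytes route the accented characters are intact before any entity handling happens, so no re-encoding workaround is needed afterwards.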

uuencode closed this as completed Dec 8, 2022

buriy (Owner) commented Dec 9, 2022

Thanks! Updated readme!
