.text may guess the encoding incorrectly #163

Open
097115 opened this issue Sep 15, 2021 · 4 comments

@097115

097115 commented Sep 15, 2021

Steps to reproduce:

    import requests
    from readability import Document

    response = requests.get('https://polit.ru/article/2021/09/14/ps_dennet/')
    print(Document(response.text).summary())

However, if we use .content:

    print(Document(response.content).summary())

everything will be just fine.
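
To illustrate the difference between the two attributes without a network call, here is a minimal sketch of what happens when bytes are decoded with a wrongly guessed encoding. The sample string and both encodings are assumptions chosen for illustration; the actual encoding of the page above, and what requests guessed for it, are not stated in this report.

```python
# Hypothetical page text; stands in for the article body.
original = "Привет, мир"

# What response.content would hold if the server sent UTF-8 bytes (assumed).
raw_bytes = original.encode("utf-8")

# What a .text-style decode produces if the guessed encoding is ISO-8859-1.
# ISO-8859-1 maps every byte to a character, so this never raises;
# it silently yields garbled text instead.
wrong_text = raw_bytes.decode("iso-8859-1")

# Decoding with the real encoding recovers the text.
right_text = raw_bytes.decode("utf-8")

print(wrong_text == original)  # False
print(right_text == original)  # True
```

Passing the raw bytes along, as with `.content`, leaves the decoding decision to the parser instead of committing to a guess up front.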

Maybe updating README.rst is worth a shot :)

@buriy
Owner

buriy commented Sep 15, 2021

So, do you think that the encoding guessing in requests is reliable?
I think it is not: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8

@097115
Author

097115 commented Sep 15, 2021

My point is exactly that guessing is unreliable (and therefore using .content is a better approach).

:)

@buriy
Owner

buriy commented Sep 15, 2021

Oh, thanks. That's a good point.
I'll update the README.
Actually, both ways are unreliable, so I think it is better if developers can choose the best option.
Technically, the requests library can sometimes guess better, because it can also access the Content-Type header. But that header can provide wrong information, and I know that happens sometimes.
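
One way a developer might make that choice explicit is sketched below: decode only when the Content-Type header declares a charset and the bytes actually decode with it, and otherwise hand over the raw bytes. The helper names `declared_charset` and `html_for_readability` are hypothetical and not part of requests or readability. The sketch relies on `Document` accepting bytes, as the `.content` example in this thread suggests, and it cannot detect a header that names a wrong charset which still happens to decode without error.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def html_for_readability(content, content_type):
    """Hypothetical helper: return str if the declared charset works, else bytes."""
    charset = declared_charset(content_type)
    if charset:
        try:
            return content.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # The header named an unknown codec or one that does not fit the bytes.
            pass
    # No usable charset: pass the bytes through undecoded.
    return content
```

With requests, this could be called as `html_for_readability(response.content, response.headers.get("Content-Type"))`, and the result passed to `Document`.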

@buriy
Owner

buriy commented Dec 9, 2022

Updated the README. Thanks to everyone involved!
