'lxml.etree._Element' object has no attribute 'text_content' #319

Closed
asjsrep opened this issue Apr 7, 2023 · 17 comments · Fixed by #331
Labels: bug (Something isn't working), documentation (Docs in need of update or extension)

Comments


asjsrep commented Apr 7, 2023

Extraction of the following URL fails:

trafilatura -u "https://buffer.com/resources/ai-content-creation/"

ERROR: 'lxml.etree._Element' object has no attribute 'text_content'
Traceback (most recent call last):
  File "venv/lib/python3.10/site-packages/trafilatura/cli_utils.py", line 339, in examine
    result = extract(htmlstring, url=url, no_fallback=args.fast,
  File "venv/lib/python3.10/site-packages/trafilatura/core.py", line 1055, in extract
    document = bare_extraction(
  File "venv/lib/python3.10/site-packages/trafilatura/core.py", line 927, in bare_extraction
    postbody, temp_text, len_text = compare_extraction(cleaned_tree_backup, tree_backup_1, url, postbody, temp_text, len_text, options)
  File "venv/lib/python3.10/site-packages/trafilatura/core.py", line 644, in compare_extraction
    algo_text = trim(temppost_algo.text_content())
AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'

trafilatura version: 1.5.0
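
For context, `text_content()` is a method of `lxml.html.HtmlElement`, not of the plain `lxml.etree._Element` class, so the traceback suggests that the tree reaching `compare_extraction` was not made of lxml's HTML element classes. A minimal defensive sketch follows; it is not trafilatura's actual code. The hypothetical helper `element_text` falls back to `itertext()` when `text_content()` is missing. The standard library's `xml.etree.ElementTree` stands in for a plain element so that the snippet runs without lxml, and the fallback's output may differ slightly from `text_content()` on real HTML.

```python
import xml.etree.ElementTree as ET


def element_text(element):
    """Return the concatenated text of an element and its descendants.

    Hypothetical helper: prefers text_content() (available on
    lxml.html.HtmlElement) and falls back to itertext(), which both
    lxml.etree._Element and the stdlib Element provide.
    """
    text_content = getattr(element, "text_content", None)
    if callable(text_content):
        return text_content()
    return "".join(element.itertext())


# Stand-in for a plain element that lacks text_content()
plain = ET.fromstring("<div><p>Hello <b>world</b>!</p></div>")
print(hasattr(plain, "text_content"))
print(element_text(plain))
```

A duck-typed check like this only hides the symptom; it does not explain why a plain element was produced in the first place.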

Owner

adbar commented Apr 11, 2023

Hi @asjsrep, I cannot reproduce the bug, and on a conceptual level this error should not happen, so I do not know what is happening here. Do you have more details to share?

Author

asjsrep commented Apr 11, 2023

@adbar sure,

I'm running it on a Mac M2 with Python 3.10.8.

This is the pip freeze in a new virtual environment:

certifi==2022.12.7
charset-normalizer==3.1.0
courlan==0.9.0
dateparser==1.1.8
htmldate==1.4.2
jusText==3.0.0
langcodes==3.3.0
lxml==4.9.2
python-dateutil==2.8.2
pytz==2023.3
pytz-deprecation-shim==0.1.0.post0
regex==2023.3.23
six==1.16.0
tld==0.13
trafilatura==1.5.0
tzdata==2023.3
tzlocal==4.3
urllib3==1.26.15

Is there any specific debugging data which would be useful?
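
One possible way to collect such data in a form that is easy to paste into a report, as a rough sketch: the function name `collect_debug_info` and the choice of fields are assumptions, not something the maintainers asked for. It relies on the standard library plus lxml's `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` attributes, and degrades gracefully if lxml cannot be imported.

```python
import platform
import sys
from importlib import metadata


def collect_debug_info(packages=("trafilatura", "lxml")):
    """Gather platform and package details worth pasting into a bug report."""
    info = {
        "python": sys.version.split()[0],
        "system": platform.system(),    # e.g. "Darwin" on macOS
        "machine": platform.machine(),  # e.g. "arm64" on Apple Silicon
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    try:
        from lxml import etree
        # libxml2 version in use at runtime vs. the one lxml was compiled against
        info["libxml2_runtime"] = etree.LIBXML_VERSION
        info["libxml2_compiled"] = etree.LIBXML_COMPILED_VERSION
    except ImportError:
        info["libxml2_runtime"] = info["libxml2_compiled"] = "unavailable"
    return info


for key, value in collect_debug_info().items():
    print(f"{key}: {value}")
```

A mismatch between the runtime and compiled libxml2 versions would be one thing to look for, since the workaround discussed in this thread concerns how lxml is built.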

Author

asjsrep commented Apr 12, 2023

I've had a chance to look into this in more detail now. The error was being raised in the lxml document_fromstring method via the try_readability method and appears to be due to a character encoding issue.

Changing trafilatura/external.py, line 46, to:

return fromstring(doc.summary().encode('UTF-8'), parser=HTML_PARSER)

i.e. forcing the UTF-8 encoding seems to fix the issue, but I haven't tested it widely. I'm sure there's a better place to correct the character encoding but I'm not very familiar with the codebase or Python.
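
The change above amounts to handing the parser UTF-8 bytes instead of a Python `str`. Here is a minimal sketch of that step in isolation. `to_utf8_bytes` is a hypothetical helper name, not part of trafilatura or lxml, and the sample markup is made up. How a parser then interprets the bytes depends on its settings and on any encoding declared in the markup.

```python
def to_utf8_bytes(markup):
    """Return markup as UTF-8 bytes, encoding it first if it is a str."""
    if isinstance(markup, bytes):
        return markup
    return markup.encode("utf-8")


# Made-up sample containing a non-ASCII character
summary = "<html><body><p>Café</p></body></html>"
encoded = to_utf8_bytes(summary)
print(type(encoded).__name__, len(summary), len(encoded))
```

The byte length exceeds the character count whenever non-ASCII characters are present, which is the kind of input where encoding handling matters.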

Owner

adbar commented Apr 12, 2023

Thanks for the additional information. The document has already been parsed once at this stage, so I don't quite understand how the problem can arise.

Unfortunately the underlying parser (LXML) is known to have trouble with Mac M1/M2 systems (see #166) and your issue could be related to that. It should be fixed soon (with version 5).

In the meantime we'll keep track of the issue.

adbar added the question (Further information is requested) label Apr 12, 2023
Author

asjsrep commented Apr 12, 2023

A workaround for lxml==4.9.2 on M1/M2/arm64 Macs seems to be building a specific wheel for lxml before installing trafilatura:

pip install wheel
STATIC_DEPS=true pip install lxml
pip install trafilatura

Owner

adbar commented Apr 12, 2023

Yes, there are already temporary fixes. Could you try one of them and report whether it solves the problem?

Author

asjsrep commented Apr 12, 2023

Yes, building lxml with STATIC_DEPS=true pip install lxml works for me

adbar added the documentation (Docs in need of update or extension) label and removed the question (Further information is requested) label Apr 12, 2023
Owner

adbar commented Apr 12, 2023

Nice, so it's more a documentation issue, unless LXML v5 gets released soon.

adbar mentioned this issue Apr 24, 2023

Snow314 commented Apr 26, 2023

Hi,

I am getting the same error on an M1 MacBook. I have tried the steps above, but they don't seem to work.

I have attached my pip freeze:

aiohttp==3.8.4
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.2.0
biopython==1.81
blinker==1.6.2
boto3==1.26.99
botocore==1.29.99
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
cognitojwt==1.4.1
contourpy==1.0.7
courlan==0.9.0
cryptography==39.0.2
cycler==0.11.0
Cython==0.29.34
dateparser==1.1.7
ecdsa==0.18.0
Flask==2.2.3
Flask-Cognito==1.18
Flask-Cors==3.0.10
fonttools==4.39.3
frozenlist==1.3.3
gunicorn==20.1.0
htmldate==1.4.2
idna==3.4
itsdangerous==2.1.2
Jinja2==3.1.2
jmespath==1.0.1
jusText==3.0.0
kiwisolver==1.4.4
langcodes==3.3.0
lxml==4.9.2
MarkupSafe==2.1.2
matplotlib==3.7.1
mock==1.0.1
multidict==6.0.4
numpy==1.24.3
openai==0.27.2
packaging==23.1
Pillow==9.5.0
pyasn1==0.4.8
pycparser==2.21
pyparsing==3.0.9
pysam==0.21.0
python-dateutil==2.8.2
python-jose==3.3.0
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
regex==2022.10.31
requests==2.28.2
rsa==4.9
RUST==0.1.1
s3transfer==0.6.0
six==1.16.0
tld==0.13
tqdm==4.65.0
trafilatura==1.5.0
tzdata==2022.7
tzlocal==4.2
urllib3==1.26.15
Werkzeug==2.2.3
yarl==1.8.2

I am using Python 3.11.2.

Are there any other steps I can do to resolve this issue?

Thanks!

Author

asjsrep commented Apr 26, 2023

@SnowstormAI not sure if it'll work, but try deleting the pip cache and then building the wheel for lxml

pip cache remove "*"
pip install wheel
STATIC_DEPS=true pip install lxml
pip install trafilatura


Snow314 commented Apr 26, 2023

Hi,

Thanks for the suggestion @asjsrep, but this doesn't seem to work either.

Are there other workarounds I can try out?

Author

asjsrep commented Apr 27, 2023

@SnowstormAI did you try editing the trafilatura code directly?

For me this also worked (but obviously isn't ideal)

Changing trafilatura/external.py, line 46, to:

return fromstring(doc.summary().encode('UTF-8'), parser=HTML_PARSER)

Or, though I haven't tried it myself, running your code inside a Docker container might be a workaround.

Owner

adbar commented Apr 27, 2023

See also PR #331. The problem is that I cannot run automated tests on such devices and thus I don't know if/when the issues are solved.

A new LXML version (v5) is pending and will hopefully solve this or the other problem.


surajtripathy07 commented May 6, 2023

> @SnowstormAI did you try editing the trafilatura code directly?
>
> For me this also worked (but obviously isn't ideal)
>
> Changing trafilatura/external.py, line 46, to:
>
> return fromstring(doc.summary().encode('UTF-8'), parser=HTML_PARSER)
>
> Or, I haven't tried it myself, but running your code inside a docker container might be a workaround

@asjsrep I had the same issue on an M1 Mac; updating the code as suggested above fixes it. Thank you!

> The problem is that I cannot run automated tests on such devices and thus I don't know if/when the issues are solved.

@adbar are there any tests that you would like to have run on an M1? I can run them and share the results if required.

Alternatively, how about adding a specific check for M1/arm64 and applying the above change there? It would be a band-aid, but it could serve as a workaround until the lxml v5 changes are done.
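
A rough sketch of what such a platform-gated check could look like, using only the standard library. The names `is_apple_silicon` and `prepare_markup` are hypothetical and this is not the change that was eventually merged. Note that a Python interpreter running under Rosetta reports `x86_64`, so the check only matches native arm64 builds.

```python
import platform


def is_apple_silicon():
    """Return True on macOS running natively on an arm64 (M1/M2) CPU."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def prepare_markup(markup, force_bytes=None):
    """Encode str markup to UTF-8 bytes only where the workaround applies.

    Hypothetical helper: force_bytes overrides the platform check, which
    makes the behaviour testable on any machine.
    """
    if force_bytes is None:
        force_bytes = is_apple_silicon()
    if force_bytes and isinstance(markup, str):
        return markup.encode("utf-8")
    return markup


print(is_apple_silicon())
print(type(prepare_markup("<p>text</p>")).__name__)
```

Gating on the platform keeps behaviour unchanged elsewhere, at the cost of a code path that cannot be exercised in CI without matching hardware.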


Snow314 commented May 7, 2023

Thank you, everyone! This seems to fix my issue, but it isn't an ideal long-term solution. Do we have an ETA on when lxml v5 will roll out?

Changing trafilatura/external.py, line 46, to:

return fromstring(doc.summary().encode('UTF-8'), parser=HTML_PARSER)

This issue can be closed as it isn't directly connected to Trafilatura code.

Thanks again for everyone's help!

adbar added the bug (Something isn't working) label May 9, 2023
Owner

adbar commented May 9, 2023

Dear all, thanks for your feedback!

As far as I know, it is not possible to test the software on Apple Silicon with GitHub Actions, which is the CI/CD solution I use (feel free to suggest ideas if you know a way).

But since there are several consistent reports, I am filing it as a bug. I have no idea when LXML v5 will be released, so I am planning to edit and accept PR #331 accordingly.

adbar linked a pull request May 11, 2023 that will close this issue
Owner

adbar commented May 11, 2023

I chose to apply two different fixes. However, I cannot reproduce the bug, so I cannot be sure both of them are necessary. Please get in touch if problems persist.
