Issue with LXML on M1 / Apple arm64 platforms #166

Closed
naftalibeder opened this issue Jan 22, 2022 · 9 comments · Fixed by #437
Labels
bug (Something isn't working) · documentation (Docs in need of update or extension) · wontfix (This will not be worked on)

Comments

@naftalibeder

On a clean install on the master branch, metadata_tests.py and realworld_tests.py fail.

short test summary info
=============================================================================================
FAILED tests/metadata_tests.py::test_pages - AssertionError: assert None == 'Want to See a More Diverse WordPress Contributor Community? So Do We.'
FAILED tests/realworld_tests.py::test_extract[False] - AssertionError: assert ('Zweitens wird der Genderstern' in '! D O C T Y P E h t m l >')
FAILED tests/realworld_tests.py::test_extract[True] - assert ('Zweitens wird der Genderstern' in '<doc sitename="scilogs.spektrum.de" source="https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-ge...
===================================================================================
3 failed, 48 passed, 10 warnings in 32.17s 

Please see the full clone/install/test flow and output below.

Shell output

~ % cd Desktop
~/Desktop % git clone git@github.com:adbar/trafilatura.git
Cloning into 'trafilatura'...
remote: Enumerating objects: 6635, done.
remote: Counting objects: 100% (1544/1544), done.
remote: Compressing objects: 100% (705/705), done.
remote: Total 6635 (delta 1148), reused 1063 (delta 831), pack-reused 5091
Receiving objects: 100% (6635/6635), 16.89 MiB | 22.93 MiB/s, done.
Resolving deltas: 100% (4575/4575), done.
~/Desktop % cd trafilatura
~/Desktop/trafilatura % python -m venv venv
~/Desktop/trafilatura % source venv/bin/activate
(venv) ~/Desktop/trafilatura %
(venv) ~/Desktop/trafilatura % git branch
* master
(venv) ~/Desktop/trafilatura % git pull
Already up to date.
(venv) ~/Desktop/trafilatura % python --version
Python 3.9.9
(venv) ~/Desktop/trafilatura % python -m pip --version
pip 21.3.1 from /Users/naftalibeder/Desktop/trafilatura/venv/lib/python3.9/site-packages/pip (python 3.9)
(venv) ~/Desktop/trafilatura % python -m pip install -r docs/requirements.txt
Collecting Sphinx==4.3.2
  Using cached Sphinx-4.3.2-py3-none-any.whl (3.1 MB)
Collecting pydata-sphinx-theme==0.7.2
  Using cached pydata_sphinx_theme-0.7.2-py3-none-any.whl (1.4 MB)
Collecting docutils==0.17.1
  Using cached docutils-0.17.1-py2.py3-none-any.whl (575 kB)
Collecting trafilatura
  Using cached trafilatura-1.0.0-py3-none-any.whl (180 kB)
Requirement already satisfied: setuptools in ./venv/lib/python3.9/site-packages (from Sphinx==4.3.2->-r docs/requirements.txt (line 2)) (59.0.1)
Collecting sphinxcontrib-htmlhelp>=2.0.0
  Using cached sphinxcontrib_htmlhelp-2.0.0-py2.py3-none-any.whl (100 kB)
Collecting imagesize
  Using cached imagesize-1.3.0-py2.py3-none-any.whl (5.2 kB)
Collecting Pygments>=2.0
  Using cached Pygments-2.11.2-py3-none-any.whl (1.1 MB)
Collecting sphinxcontrib-qthelp
  Using cached sphinxcontrib_qthelp-1.0.3-py2.py3-none-any.whl (90 kB)
Collecting Jinja2>=2.3
  Using cached Jinja2-3.0.3-py3-none-any.whl (133 kB)
Collecting sphinxcontrib-serializinghtml>=1.1.5
  Using cached sphinxcontrib_serializinghtml-1.1.5-py2.py3-none-any.whl (94 kB)
Collecting sphinxcontrib-devhelp
  Using cached sphinxcontrib_devhelp-1.0.2-py2.py3-none-any.whl (84 kB)
Collecting packaging
  Using cached packaging-21.3-py3-none-any.whl (40 kB)
Collecting snowballstemmer>=1.1
  Using cached snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
Collecting babel>=1.3
  Using cached Babel-2.9.1-py2.py3-none-any.whl (8.8 MB)
Collecting requests>=2.5.0
  Using cached requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting sphinxcontrib-applehelp
  Using cached sphinxcontrib_applehelp-1.0.2-py2.py3-none-any.whl (121 kB)
Collecting alabaster<0.8,>=0.7
  Using cached alabaster-0.7.12-py2.py3-none-any.whl (14 kB)
Collecting sphinxcontrib-jsmath
  Using cached sphinxcontrib_jsmath-1.0.1-py2.py3-none-any.whl (5.1 kB)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting urllib3<2,>=1.26
  Using cached urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
Collecting htmldate>=1.0.0
  Using cached htmldate-1.0.0-py3-none-any.whl (30 kB)
Collecting courlan>=0.6.0
  Using cached courlan-0.6.0-py3-none-any.whl (26 kB)
Collecting certifi
  Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting readability-lxml>=0.8.1
  Using cached readability_lxml-0.8.1-py3-none-any.whl (20 kB)
Collecting lxml>=4.6.4
  Using cached lxml-4.7.1.tar.gz (3.2 MB)
  Preparing metadata (setup.py) ... done
Collecting justext>=3.0.0
  Using cached jusText-3.0.0-py2.py3-none-any.whl (837 kB)
Collecting charset-normalizer>=2.0.8
  Using cached charset_normalizer-2.0.10-py3-none-any.whl (39 kB)
Collecting pytz>=2015.7
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Collecting langcodes>=3.2.1
  Using cached langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting tld>=0.12
  Using cached tld-0.12.6-py39-none-any.whl (412 kB)
Collecting dateparser>=1.1.0
  Using cached dateparser-1.1.0-py2.py3-none-any.whl (288 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting MarkupSafe>=2.0
  Using cached MarkupSafe-2.0.1-cp39-cp39-macosx_10_9_universal2.whl (18 kB)
Collecting cssselect
  Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting chardet
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.3-py3-none-any.whl (61 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.3.1-py3-none-any.whl (37 kB)
Collecting pyparsing!=3.0.5,>=2.0.2
  Downloading pyparsing-3.0.7-py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 4.9 MB/s
Collecting tzlocal
  Using cached tzlocal-4.1-py3-none-any.whl (19 kB)
Collecting regex!=2019.02.19,!=2021.8.27
  Downloading regex-2022.1.18-cp39-cp39-macosx_11_0_arm64.whl (281 kB)
     |████████████████████████████████| 281 kB 144.2 MB/s
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pytz-deprecation-shim
  Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
  Using cached tzdata-2021.5-py2.py3-none-any.whl (339 kB)
Using legacy 'setup.py install' for lxml, since package 'wheel' is not installed.
Installing collected packages: tzdata, six, pytz-deprecation-shim, urllib3, tzlocal, regex, pytz, python-dateutil, pyparsing, MarkupSafe, idna, charset-normalizer, certifi, tld, sphinxcontrib-serializinghtml, sphinxcontrib-qthelp, sphinxcontrib-jsmath, sphinxcontrib-htmlhelp, sphinxcontrib-devhelp, sphinxcontrib-applehelp, soupsieve, snowballstemmer, requests, Pygments, packaging, lxml, langcodes, Jinja2, imagesize, docutils, dateparser, cssselect, chardet, babel, alabaster, Sphinx, readability-lxml, justext, htmldate, courlan, beautifulsoup4, trafilatura, pydata-sphinx-theme
    Running setup.py install for lxml ... done
Successfully installed Jinja2-3.0.3 MarkupSafe-2.0.1 Pygments-2.11.2 Sphinx-4.3.2 alabaster-0.7.12 babel-2.9.1 beautifulsoup4-4.10.0 certifi-2021.10.8 chardet-4.0.0 charset-normalizer-2.0.10 courlan-0.6.0 cssselect-1.1.0 dateparser-1.1.0 docutils-0.17.1 htmldate-1.0.0 idna-3.3 imagesize-1.3.0 justext-3.0.0 langcodes-3.3.0 lxml-4.7.1 packaging-21.3 pydata-sphinx-theme-0.7.2 pyparsing-3.0.7 python-dateutil-2.8.2 pytz-2021.3 pytz-deprecation-shim-0.1.0.post0 readability-lxml-0.8.1 regex-2022.1.18 requests-2.27.1 six-1.16.0 snowballstemmer-2.2.0 soupsieve-2.3.1 sphinxcontrib-applehelp-1.0.2 sphinxcontrib-devhelp-1.0.2 sphinxcontrib-htmlhelp-2.0.0 sphinxcontrib-jsmath-1.0.1 sphinxcontrib-qthelp-1.0.3 sphinxcontrib-serializinghtml-1.1.5 tld-0.12.6 trafilatura-1.0.0 tzdata-2021.5 tzlocal-4.1 urllib3-1.26.8
(venv) ~/Desktop/trafilatura % pytest
zsh: command not found: pytest
(venv) ~/Desktop/trafilatura % python -m pip install pytest
Collecting pytest
  Using cached pytest-6.2.5-py3-none-any.whl (280 kB)
Collecting pluggy<2.0,>=0.12
  Using cached pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting iniconfig
  Using cached iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Collecting toml
  Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: packaging in ./venv/lib/python3.9/site-packages (from pytest) (21.3)
Collecting py>=1.8.2
  Using cached py-1.11.0-py2.py3-none-any.whl (98 kB)
Collecting attrs>=19.2.0
  Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in ./venv/lib/python3.9/site-packages (from packaging->pytest) (3.0.7)
Installing collected packages: toml, py, pluggy, iniconfig, attrs, pytest
Successfully installed attrs-21.4.0 iniconfig-1.1.1 pluggy-1.0.0 py-1.11.0 pytest-6.2.5 toml-0.10.2
(venv) ~/Desktop/trafilatura % pytest
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/naftalibeder/Desktop/trafilatura, configfile: pytest.ini
collected 51 items

tests/cli_tests.py .......                                                                                                                                                                                  [ 13%]
tests/downloads_tests.py ....                                                                                                                                                                               [ 21%]
tests/feeds_tests.py .....                                                                                                                                                                                  [ 31%]
tests/json_metadata_tests.py .                                                                                                                                                                              [ 33%]
tests/metadata_tests.py .........F                                                                                                                                                                          [ 52%]
tests/realworld_tests.py FF                                                                                                                                                                                 [ 56%]
tests/sitemaps_tests.py ...                                                                                                                                                                                 [ 62%]
tests/spider_tests.py .....                                                                                                                                                                                 [ 72%]
tests/unit_tests.py ..............                                                                                                                                                                          [100%]

==================================================================================================== FAILURES =====================================================================================================
___________________________________________________________________________________________________ test_pages ____________________________________________________________________________________________________

    def test_pages():
        '''Test on real web pages'''
        metadata = extract_metadata(load_mock_page('http://blog.python.org/2016/12/python-360-is-now-available.html'))
        assert metadata['title'] == 'Python 3.6.0 is now available!'
        assert metadata['description'] == 'Python 3.6.0 is now available! Python 3.6.0 is the newest major release of the Python language, and it contains many new features and opti...'
        assert metadata['author'] == 'Ned Deily'
        assert metadata['url'] == 'http://blog.python.org/2016/12/python-360-is-now-available.html'
        assert metadata['sitename'] == 'blog.python.org'

        metadata = extract_metadata(load_mock_page('https://en.blog.wordpress.com/2019/06/19/want-to-see-a-more-diverse-wordpress-contributor-community-so-do-we/'))
>       assert metadata['title'] == 'Want to See a More Diverse WordPress Contributor Community? So Do We.'
E       AssertionError: assert None == 'Want to See a More Diverse WordPress Contributor Community? So Do We.'

tests/metadata_tests.py:320: AssertionError
------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------
WARNING  trafilatura.metadata:metadata.py:239 no main title found
WARNING  trafilatura.metadata:metadata.py:247 no h2 title found
_______________________________________________________________________________________________ test_extract[False] _______________________________________________________________________________________________

xmloutput = False

    @pytest.mark.parametrize("xmloutput", [False, True])
    def test_extract(xmloutput): # xmloutput=False
        '''test extraction from HTML'''
        result = load_mock_page('https://die-partei.net/luebeck/2012/05/31/das-ministerium-fur-club-kultur-informiert/', xmloutput)
        assert 'Impressum' not in result and 'Die GEMA dreht völlig am Zeiger!' in result

        result = load_mock_page('https://www.bmjv.de/DE/Verbraucherportal/KonsumImAlltag/TransparenzPreisanpassung/TransparenzPreisanpassung_node.html', xmloutput)
        assert 'Impressum' not in result and 'Anbieter von Fernwärme haben innerhalb ihres Leitungsnetzes ein Monopol' in result

        result = load_mock_page('https://denkanstoos.wordpress.com/2012/04/11/denkanstoos-april-2012/', xmloutput)
        assert 'Two or three 10-15 min' in result and 'What type? Etc. (30 mins)' in result and 'Dieser Eintrag wurde veröffentlicht' not in result and 'Mit anderen Teillen' not in result

        result = load_mock_page('https://www.ebrosia.de/beringer-zinfandel-rose-stone-cellars-lieblich-suess', xmloutput)
        assert 'Das Bukett präsentiert sich' in result and 'Kunden kauften auch' not in result and 'Gutschein sichern' not in result # and 'Besonders gut passt er zu asiatischen Gerichten' in result

        result = load_mock_page('https://www.landwirt.com/Precision-Farming-Moderne-Sensortechnik-im-Kuhstall,,4229,,Bericht.html', xmloutput)
        assert 'Überwachung der somatischen Zellen' in result and 'tragbaren Ultraschall-Geräten' in result and 'Kotkonsistenz' in result  and 'Anzeigentarife' not in result # and 'Aktuelle Berichte aus dieser Kategorie' not in result

        result = load_mock_page('http://www.rs-ingenieure.de/de/hochbau/leistungen/tragwerksplanung', xmloutput)
        #print(result)
        if xmloutput is False:
            assert 'Wir bearbeiten alle Leistungsbilder' in result and 'Brückenbau' not in result

        result = load_mock_page('http://www.shingon-reiki.de/reiki-und-schamanismus/', xmloutput)
        assert 'Catch Evolution' not in result and 'und gekennzeichnet mit' not in result and 'Heut geht es' in result and 'Ich komme dann zu dir vor Ort.' in result

        result = load_mock_page('http://love-hina.ch/news/0409.html', xmloutput)
        assert 'Kapitel 121 ist' in result and 'Besucher online' not in result and 'Kommentare schreiben' not in result

        result = load_mock_page('http://www.cdu-fraktion-erfurt.de/inhalte/aktuelles/entwicklung-der-waldorfschule-ermoeglicht/index.html', xmloutput)
        assert 'der steigenden Nachfrage gerecht zu werden.' in result and 'Zurück zur Übersicht' not in result # and 'Erhöhung für Zoo-Eintritt' not in result

        result = load_mock_page('https://de.creativecommons.org/index.php/2014/03/20/endlich-wird-es-spannend-die-nc-einschraenkung-nach-deutschem-recht/', xmloutput)
        assert 'das letzte Wort sein kann.' in result and 'Ähnliche Beiträge' not in result # and 'Michael Blahm' not in result # comments

        result = load_mock_page('https://piratenpartei-mv.de/blog/2013/09/12/grundeinkommen-ist-ein-menschenrecht/', xmloutput)
        assert 'Unter diesem Motto findet am 14. September' in result and 'Volksinitiative Schweiz zum Grundeinkommen.' in result and 'getaggt mit:' not in result # and 'Was denkst du?' not in result

        result = load_mock_page('https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/', xmloutput)
>       assert 'Zweitens wird der Genderstern' in result and 'alldem leider – nichts.' in result # and 'Beitragsbild' not in result
E       AssertionError: assert ('Zweitens wird der Genderstern' in '! D O C T Y P E h t m l >')

tests/realworld_tests.py:190: AssertionError
_______________________________________________________________________________________________ test_extract[True] ________________________________________________________________________________________________

xmloutput = True

    @pytest.mark.parametrize("xmloutput", [False, True])
    def test_extract(xmloutput): # xmloutput=False
        '''test extraction from HTML'''
        result = load_mock_page('https://die-partei.net/luebeck/2012/05/31/das-ministerium-fur-club-kultur-informiert/', xmloutput)
        assert 'Impressum' not in result and 'Die GEMA dreht völlig am Zeiger!' in result

        result = load_mock_page('https://www.bmjv.de/DE/Verbraucherportal/KonsumImAlltag/TransparenzPreisanpassung/TransparenzPreisanpassung_node.html', xmloutput)
        assert 'Impressum' not in result and 'Anbieter von Fernwärme haben innerhalb ihres Leitungsnetzes ein Monopol' in result

        result = load_mock_page('https://denkanstoos.wordpress.com/2012/04/11/denkanstoos-april-2012/', xmloutput)
        assert 'Two or three 10-15 min' in result and 'What type? Etc. (30 mins)' in result and 'Dieser Eintrag wurde veröffentlicht' not in result and 'Mit anderen Teillen' not in result

        result = load_mock_page('https://www.ebrosia.de/beringer-zinfandel-rose-stone-cellars-lieblich-suess', xmloutput)
        assert 'Das Bukett präsentiert sich' in result and 'Kunden kauften auch' not in result and 'Gutschein sichern' not in result # and 'Besonders gut passt er zu asiatischen Gerichten' in result

        result = load_mock_page('https://www.landwirt.com/Precision-Farming-Moderne-Sensortechnik-im-Kuhstall,,4229,,Bericht.html', xmloutput)
        assert 'Überwachung der somatischen Zellen' in result and 'tragbaren Ultraschall-Geräten' in result and 'Kotkonsistenz' in result  and 'Anzeigentarife' not in result # and 'Aktuelle Berichte aus dieser Kategorie' not in result

        result = load_mock_page('http://www.rs-ingenieure.de/de/hochbau/leistungen/tragwerksplanung', xmloutput)
        #print(result)
        if xmloutput is False:
            assert 'Wir bearbeiten alle Leistungsbilder' in result and 'Brückenbau' not in result

        result = load_mock_page('http://www.shingon-reiki.de/reiki-und-schamanismus/', xmloutput)
        assert 'Catch Evolution' not in result and 'und gekennzeichnet mit' not in result and 'Heut geht es' in result and 'Ich komme dann zu dir vor Ort.' in result

        result = load_mock_page('http://love-hina.ch/news/0409.html', xmloutput)
        assert 'Kapitel 121 ist' in result and 'Besucher online' not in result and 'Kommentare schreiben' not in result

        result = load_mock_page('http://www.cdu-fraktion-erfurt.de/inhalte/aktuelles/entwicklung-der-waldorfschule-ermoeglicht/index.html', xmloutput)
        assert 'der steigenden Nachfrage gerecht zu werden.' in result and 'Zurück zur Übersicht' not in result # and 'Erhöhung für Zoo-Eintritt' not in result

        result = load_mock_page('https://de.creativecommons.org/index.php/2014/03/20/endlich-wird-es-spannend-die-nc-einschraenkung-nach-deutschem-recht/', xmloutput)
        assert 'das letzte Wort sein kann.' in result and 'Ähnliche Beiträge' not in result # and 'Michael Blahm' not in result # comments

        result = load_mock_page('https://piratenpartei-mv.de/blog/2013/09/12/grundeinkommen-ist-ein-menschenrecht/', xmloutput)
        assert 'Unter diesem Motto findet am 14. September' in result and 'Volksinitiative Schweiz zum Grundeinkommen.' in result and 'getaggt mit:' not in result # and 'Was denkst du?' not in result

        result = load_mock_page('https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/', xmloutput)
>       assert 'Zweitens wird der Genderstern' in result and 'alldem leider – nichts.' in result # and 'Beitragsbild' not in result
E       assert ('Zweitens wird der Genderstern' in '<doc sitename="scilogs.spektrum.de" source="https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-genderspr...t="2jmj7l5rSw0yVb/vlWAYkK/YBwk=">\n  <main>\n    <p>! D O C T Y P E h t m l &gt;</p>\n  </main>\n  <comments/>\n</doc>')

tests/realworld_tests.py:190: AssertionError
------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------
WARNING  trafilatura.metadata:metadata.py:239 no main title found
WARNING  trafilatura.metadata:metadata.py:247 no h2 title found
================================================================================================ warnings summary =================================================================================================
tests/cli_tests.py::test_parser
tests/cli_tests.py::test_parser
  /Users/naftalibeder/Desktop/trafilatura/trafilatura/cli.py:212: PendingDeprecationWarning: --notables will be deprecated in a future version,
                 use --no-tables instead
    warnings.warn(

tests/cli_tests.py::test_parser
tests/cli_tests.py::test_parser
tests/cli_tests.py::test_parser
tests/cli_tests.py::test_parser
  /Users/naftalibeder/Desktop/trafilatura/trafilatura/cli.py:205: PendingDeprecationWarning: --nocomments will be deprecated in a future version,
                 use --no-comments instead
    warnings.warn(

tests/cli_tests.py::test_parser
tests/cli_tests.py::test_parser
tests/cli_tests.py::test_cli_pipeline
  /Users/naftalibeder/Desktop/trafilatura/trafilatura/cli.py:219: PendingDeprecationWarning: --with-metadata will be deprecated in a future version,
                 use --only-with-metadata instead
    warnings.warn(

tests/downloads_tests.py::test_fetch
  /Users/naftalibeder/Desktop/trafilatura/venv/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'httpbin.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/metadata_tests.py::test_pages - AssertionError: assert None == 'Want to See a More Diverse WordPress Contributor Community? So Do We.'
FAILED tests/realworld_tests.py::test_extract[False] - AssertionError: assert ('Zweitens wird der Genderstern' in '! D O C T Y P E h t m l >')
FAILED tests/realworld_tests.py::test_extract[True] - assert ('Zweitens wird der Genderstern' in '<doc sitename="scilogs.spektrum.de" source="https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-ge...
=================================================================================== 3 failed, 48 passed, 10 warnings in 32.17s ====================================================================================

@adbar
Owner

adbar commented Jan 22, 2022

Hi @naftalibeder, all tests pass except for the dev version Python 3.11 which is experimental: https://github.com/adbar/trafilatura/actions/runs/1730002541

I don't understand what is happening here. You're using Python 3.9.9? On which OS?

Can you try pip install "trafilatura[all]" (additional packages; the quotes keep zsh from interpreting the brackets) to see if it solves the problem? It may be a decoding/encoding issue, since the tests on real data are failing while the other unit tests pass.
In particular, the fact that a document is interpreted as ! D O C T Y P E h t m l > shows that something is wrong.

@naftalibeder
Author

Very odd! I’m running this on an M1 Mac. I wouldn’t have thought this would cause a problem, since I haven’t had any transition issues with other Python projects.

But here’s the latest after some testing:

  • Python 3.8, 3.9, and 3.10 on M1 Mac: same failures
  • Python 3.7 and 3.8 on Intel Mac: all passed

I don’t know what the issue could be (especially since the actual software mostly works right), and I may look into it. I’ll also see if I can run this through the x86 emulator.

@adbar adbar added the feedback Feedback from users requested label Jan 25, 2022
@adbar
Owner

adbar commented Jan 25, 2022

This is really strange. It could have something to do with lxml and its underlying XML library (libxml2), but I'm not sure. Please keep me posted if you find an explanation.
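
To compare setups, it may help to print the lxml and libxml2 versions in use together with the CPU architecture Python reports. This is only a minimal sketch: the function name `environment_report` is made up for illustration, and it relies on the version tuples that `lxml.etree` exposes.

```python
import platform


def environment_report():
    """Collect interpreter, platform and (if installed) lxml/libxml2 versions."""
    report = {
        "python": platform.python_version(),
        "system": platform.system(),
        # Typically 'arm64' for a native Apple Silicon interpreter.
        "machine": platform.machine(),
    }
    try:
        from lxml import etree
    except ImportError:
        # lxml is missing in this environment; record that instead of failing.
        report["lxml"] = None
    else:
        report["lxml"] = ".".join(map(str, etree.LXML_VERSION))
        # Version of libxml2 that lxml was compiled against ...
        report["libxml2_compiled"] = ".".join(map(str, etree.LIBXML_COMPILED_VERSION))
        # ... and the version actually loaded at runtime.
        report["libxml2_runtime"] = ".".join(map(str, etree.LIBXML_VERSION))
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this on both the failing M1 machine and a passing Intel machine would show whether the two differ in their lxml or libxml2 builds.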

@naftalibeder
Author

I’ve narrowed it down to the offending line, and it looks like you were right.

elif isinstance(htmlobject, str):
    try:
        # htmlobject logs correctly as the entire contents of the webpage
        tree = html.fromstring(htmlobject, parser=HTML_PARSER)
        # tree logs as '! D O C T Y P E h t m l >'
    except ValueError:
        …

I’ve tried a variety of tweaks to the parser configuration, with no meaningful effect.

Given that almost all of the tests pass, my vague hypothesis is that the HTML of this particular webpage is invalid, and for whatever reason the underlying library on my M1 is more sensitive to that. (But I don’t have a good understanding of this stack.) Basic googling turned up nothing.

I would be happy to try any ideas you might have! Please let me know what you think.

@adbar
Owner

adbar commented Jan 27, 2022

Could you try the underlying library LXML alone on the problem at hand?

You open the file, load it, and try to perform an operation on the tree. Here is the gist of a possible test:

from lxml import html

# load the document e.g. as "mydoc"
# ...

lxml_tree = html.fromstring(mydoc)
print(len(lxml_tree))
print(list(lxml_tree))

This should print a number higher than 1 and a list of nodes in the document.
If it does not, that is a case for an LXML bug report here: https://bugs.launchpad.net/lxml/

Could you please try it out?
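
The gist above could be filled in along these lines. This is only a sketch under stated assumptions: `summarize` is a hypothetical helper, `sample` is a stand-in document to be replaced with the contents of the failing page, and parsing the same document a second time as UTF-8 bytes is an extra comparison added here to see whether the result depends on the input type. It is not confirmed that the input type matters for this problem.

```python
from lxml import html


def summarize(doc):
    """Parse `doc` (str or bytes) with lxml and describe the resulting tree."""
    tree = html.fromstring(doc)
    return {
        "root": tree.tag,
        "children": [child.tag for child in tree],
        # First characters of the extracted text, to spot garbled output.
        "text_start": tree.text_content().strip()[:40],
    }


# Stand-in document (assumption): replace with the saved HTML of the failing page.
sample = (
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body><p>Hello</p></body></html>"
)

print("str input:  ", summarize(sample))
# Extra comparison (assumption): the same document passed as UTF-8 bytes.
print("bytes input:", summarize(sample.encode("utf-8")))
```

On a working setup both lines should describe the same tree, with more than one child under the root. A root whose text starts with something like `! D O C T Y P E` would match the failure seen in the tests.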

@naftalibeder
Author

Sure enough, it's reproducible in a minimal project: https://github.com/naftalibeder/example-lxml.

I tried modifying the offending HTML file to see which part of it leads to the corruption, but I wasn't able to figure that out easily.

I submitted a bug report at https://bugs.launchpad.net/lxml/+bug/1959358.

@adbar
Owner

adbar commented Jan 28, 2022

Thanks, let's follow the resolution of the issue there.

@adbar adbar changed the title from "Tests fail on master" to "Issue with LXML on M1 / Apple arm64 platforms" Jan 28, 2022
@adbar adbar added the bug Something isn't working label Jan 28, 2022
@adbar adbar added wontfix This will not be worked on and removed feedback Feedback from users requested labels Mar 2, 2022
@adbar
Owner

adbar commented Apr 25, 2022

@naftalibeder It doesn't appear to be moving forward. Did you try building LXML from source?

@adbar
Owner

adbar commented Oct 12, 2023

Please note that brew can now be used to install Trafilatura on macOS in a seamless way:
https://formulae.brew.sh/formula/trafilatura

@adbar adbar added the documentation Docs in need of update or extension label Oct 20, 2023
@adbar adbar linked a pull request Nov 3, 2023 that will close this issue
@adbar adbar closed this as completed in #437 Nov 8, 2023