some extraction duplicated in xml #634

Open
fortyfourforty opened this issue Jun 27, 2024 · 3 comments
Labels
question Further information is requested

Comments

@fortyfourforty

Hi,

I was setting up a test site to play with trafilatura and found a weird bug.

Site URL:
https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/
The test site is only available for 2 days, so I have also attached the simple Gutenberg block code below so you can replicate the problem.

Command:

import trafilatura
url = "https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/"
html = trafilatura.fetch_url(url, no_ssl=True)
ts = trafilatura.extract(html, output_format='xml', include_comments=False)

The WordPress Gutenberg HTML is below:

<!-- wp:paragraph -->
<p>this is sample intro</p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">intro 2</h3>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>header table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><thead><tr><th>b</th><th>s</th><th>h</th></tr></thead><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>list below</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>numbered list below</p>
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
<ol><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->

It is a very simple extraction, but some elements are extracted twice.
The elements from "this is sample intro" onward appear twice, but not all of them: the "intro 2" heading and the list items only show up once.

See the extraction below:

<doc sitename="milkfriends.s1-tastewp.com" title="ok this" author="Admin" date="2024-06-27" url="https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/" hostname="s1-tastewp.com" fingerprint="f69d7033beefe32d">
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <list rend="ul">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>numbered list below</p>
    <list rend="ol">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>this is sample intro</p>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <p>numbered list below</p>
  </main>
</doc>
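To make the duplication easier to see than by reading the XML by eye, here is a minimal sketch that counts repeated blocks under `<main>` in an output like the one above. It uses only the standard library. The helper name `find_repeated_blocks` and the shortened `sample` string are assumptions made for illustration, not part of trafilatura.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeated_blocks(xml_output: str) -> dict:
    """Return {(tag, text): count} for direct children of <main> seen more than once."""
    root = ET.fromstring(xml_output)
    main = root.find("main")
    if main is None:
        return {}
    counts = Counter()
    for child in main:
        # Join all nested text and normalise whitespace so indentation does not matter.
        text = " ".join("".join(child.itertext()).split())
        counts[(child.tag, text)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical, shortened version of the extraction shown above.
sample = """<doc><main>
<p>this is sample intro</p>
<head rend="h3">intro 2</head>
<p>list below</p>
<list rend="ul"><item>this is 1</item><item>this is 2</item></list>
<p>this is sample intro</p>
<p>list below</p>
</main></doc>"""

repeated = find_repeated_blocks(sample)
print(repeated)
```

On the shortened sample, only the two paragraphs are reported as repeated, which matches the pattern described in this issue: paragraphs come back twice while the heading and the list do not.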
adbar added the question label (Further information is requested) Jun 27, 2024
@adbar
Owner

adbar commented Jun 27, 2024

I'm not sure what happens here, but this is odd indeed. Note that you can use a web archive so that the errors can be reproduced later.

In general, duplicated elements can be easily tackled by using the integrated deduplication filters and setting the right threshold.
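As a rough sketch of that suggestion, the original command can be repeated with the `deduplicate` argument of `trafilatura.extract` switched on. The wrapper name `extract_with_dedup` is hypothetical, and the threshold settings mentioned above are left out because their names are not given in this thread.

```python
def extract_with_dedup(html: str):
    """Run the extraction from this issue with trafilatura's deduplication enabled."""
    # Imported lazily so the sketch can be loaded without trafilatura installed.
    import trafilatura

    return trafilatura.extract(
        html,
        output_format="xml",
        include_comments=False,
        deduplicate=True,  # turn on the integrated deduplication filters
    )
```

Whether this removes the repeated paragraphs and tables on this particular page has not been verified here.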

@fortyfourforty
Author

Sorry, I forgot about archive.is. Noted.

I don't think using deduplicate=True is a valid workaround, as some pages do have the exact same text segments on the same page.

@adbar
Owner

adbar commented Jul 25, 2024

@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.
