fix: Fix various cases of HTML text missing after partition #1587

Merged
merged 9 commits into from
Oct 3, 2023

Conversation

@unifyh unifyh commented Sep 30, 2023

Fix 4 cases of text missing after partition:

  1. Text immediately after <body>:
       <body>
         missing1
         <div>hello</div>
       </body>
  2. Text inside a container and immediately after <br/>:
       <div>hello<br/>missing2</div>
  3. Text immediately after the opening tag of a text tag (such as <p>), if that tag contains <br/>:
       <p>missing3<br/>hello</p>
  4. Text inside <body> if it is the only content (different cause from case 1):
       <body>missing4</body>
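
For illustration only, and not the code from this PR: the sketch below assumes the missing text comes down to the ElementTree-style split between an element's `text` (content before its first child) and a child's `tail` (content after that child's closing tag). `CASES` and `collect_text` are hypothetical names, and the standard-library XML parser stands in for whatever HTML parser the partitioner actually uses.

```python
import xml.etree.ElementTree as ET

# The four snippets from the list above, wrapped in <body> where needed so
# that each one parses as a single well-formed tree (an assumption made for
# this sketch, not something the PR states).
CASES = {
    "missing1": "<body>\n  missing1\n  <div>hello</div>\n</body>",
    "missing2": "<body><div>hello<br/>missing2</div></body>",
    "missing3": "<body><p>missing3<br/>hello</p></body>",
    "missing4": "<body>missing4</body>",
}


def collect_text(elem):
    """Return non-empty text fragments in document order.

    Reads both elem.text (before the first child) and each child's .tail
    (after that child's closing tag), so text following <br/> is kept.
    """
    fragments = []
    if elem.text and elem.text.strip():
        fragments.append(elem.text.strip())
    for child in elem:
        fragments.extend(collect_text(child))
        if child.tail and child.tail.strip():
            fragments.append(child.tail.strip())
    return fragments


if __name__ == "__main__":
    for name, markup in CASES.items():
        print(name, collect_text(ET.fromstring(markup)))
```

In case 2, for example, `missing2` is stored as the `tail` of the `<br/>` element rather than as the `text` of `<div>`, so a walker that reads only `text` would plausibly skip it.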

Also fix a problem that caused test_unstructured/documents/test_html.py::test_exclude_tag_types to not work as intended.

This will close #1543

Example: <p>missing<br/>not missing</p>
Example: <div>not missing<br/>missing</div>
Example: <body>missing</body>
@unifyh changed the title from "fix: Fix various cases of HTML text missing" to "fix: Fix various cases of HTML text missing after partition" Sep 30, 2023
@unifyh unifyh marked this pull request as ready for review September 30, 2023 09:27
@cragwolfe

Thanks for the contribution, @unifyh! Please add a unit test confirming the fix. 🙏

@unifyh

unifyh commented Oct 2, 2023

Unit tests added. ☺️

@cragwolfe cragwolfe left a comment

LGTM!

@cragwolfe cragwolfe enabled auto-merge (squash) October 3, 2023 03:59
@cragwolfe cragwolfe merged commit 89bd2fa into Unstructured-IO:main Oct 3, 2023
37 checks passed
Successfully merging this pull request may close these issues.

bug/HTML partitioner doesn't partition plain text or break tags correctly