Fix/1209 tweak xycut ordering output #1630

Merged: 15 commits, Oct 5, 2023

christinestraub (Collaborator) commented Oct 3, 2023

Closes #1209.

Summary

  • add swapped xycut sorting
  • update xycut sorting evaluation script
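For readers unfamiliar with XY-cut, the difference between the two cut orders can be sketched as below. This is a minimal illustration, not the code in this PR: the names `xy_cut_order`, `_split_on_gaps`, `x_first`, and `TWO_COLUMN_PAGE` are hypothetical, and it assumes "swapped" means trying the column (x) cut before the row (y) cut.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1), origin at top-left.
Box = Tuple[float, float, float, float]


def _split_on_gaps(boxes: Sequence[Box], indices: List[int], axis: int) -> List[List[int]]:
    """Group indices into runs separated by empty gaps along axis (0 = x, 1 = y)."""
    ordered = sorted(indices, key=lambda i: boxes[i][axis])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][axis + 2]  # farthest extent covered so far
    for i in ordered[1:]:
        start, end = boxes[i][axis], boxes[i][axis + 2]
        if start > reach:  # empty gap in the projection: start a new group
            groups.append([i])
        else:
            groups[-1].append(i)
        reach = max(reach, end)
    return groups


def xy_cut_order(
    boxes: Sequence[Box],
    indices: Optional[List[int]] = None,
    x_first: bool = False,
) -> List[int]:
    """Return box indices in reading order using a recursive XY-cut.

    x_first=False tries row (y) cuts first; x_first=True tries column (x)
    cuts first, which is what this sketch assumes "swapped" refers to.
    """
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    for axis in ((0, 1) if x_first else (1, 0)):
        groups = _split_on_gaps(boxes, indices, axis)
        if len(groups) > 1:
            result: List[int] = []
            for group in groups:
                result.extend(xy_cut_order(boxes, group, x_first))
            return result
    # No clean cut on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


# Two columns, two paragraphs each, with paragraph gaps aligned across columns.
TWO_COLUMN_PAGE: List[Box] = [
    (0, 0, 45, 20),     # 0: left column, paragraph 1
    (55, 0, 100, 20),   # 1: right column, paragraph 1
    (0, 30, 45, 50),    # 2: left column, paragraph 2
    (55, 30, 100, 50),  # 3: right column, paragraph 2
]

if __name__ == "__main__":
    print(xy_cut_order(TWO_COLUMN_PAGE))                # row-first order
    print(xy_cut_order(TWO_COLUMN_PAGE, x_first=True))  # column-first order
```

On this toy page the row-first order interleaves the two columns, while the column-first order reads the left column completely before the right one, which is the column-friendly behavior the linked issue asks for.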

PDFs:

Testing

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res")
print("\n\n".join([str(el) for el in elements]))

Evaluation

PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only

@cragwolfe cragwolfe requested a review from Klaijan October 3, 2023 19:02
# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py
ryannikolaidis and others added 5 commits October 3, 2023 16:08
…1632)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: cragwolfe <[email protected]>
…1641)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: cragwolfe <[email protected]>

Klaijan (Contributor) left a comment

The majority of the fixture tests improve with this tweak. Both versions get confused when reading tables, or pages that mix a single paragraph with figures or tables. For table reading, I feel the previous version captures the reading order better, but not significantly so.

For reference, here are my notes:

test_unstructured/partition/utils/test_xycut.py meh
test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json worse
test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json worse (better ordering within its section, but worse overall)
test_unstructured_ingest/expected-structured-output/biomed-api/65/11/main.PMC6312790.pdf.json better
test_unstructured_ingest/expected-structured-output/biomed-api/75/29/main.PMC6312793.pdf.json better
test_unstructured_ingest/expected-structured-output/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-api/65/11/main.PMC6312790.pdf.json better (meh but will give benefit of the doubt)
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-api/75/29/main.PMC6312793.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/2023-Jan-economic-outlook.pdf.json both ok
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/Silent-Giant-(1).pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/recalibrating-risk-report.pdf.json better (both have trouble reading tables, but the new code did worse at reading paragraphs)
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/2023-Jan-economic-outlook.pdf.json both ok (still not very good at reading tables)
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/Silent-Giant-(1).pdf.json better
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/recalibrating-risk-report.pdf.json better

@cragwolfe cragwolfe enabled auto-merge (squash) October 5, 2023 07:14
@cragwolfe cragwolfe merged commit b30d6a6 into main Oct 5, 2023
39 checks passed
@cragwolfe cragwolfe deleted the fix/1209-tweak-xycut-ordering-output branch October 5, 2023 07:41
Development

Successfully merging this pull request may close these issues.

bug: tweak xycut ordering output to be more column friendly.
4 participants