Fix/1209 tweak xycut ordering output #1630
# Conflicts: # setup.py
# Conflicts: # CHANGELOG.md # unstructured/__version__.py
…1632) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: cragwolfe <[email protected]>
…1641) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: cragwolfe <[email protected]>
The majority of the fixture tests improve with this tweak. Both versions get confused when reading tables, or when a page mixes single paragraphs with figures or tables. For table reading, I feel the previous version captures the reading order better, but not significantly so.
For reference, here are my notes:
test_unstructured/partition/utils/test_xycut.py meh
test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json worse
test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json worse (better ordering within its section, but worse overall)
test_unstructured_ingest/expected-structured-output/biomed-api/65/11/main.PMC6312790.pdf.json better
test_unstructured_ingest/expected-structured-output/biomed-api/75/29/main.PMC6312793.pdf.json better
test_unstructured_ingest/expected-structured-output/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-api/65/11/main.PMC6312790.pdf.json better (meh but will give benefit of the doubt)
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-api/75/29/main.PMC6312793.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/2023-Jan-economic-outlook.pdf.json both ok
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/Silent-Giant-(1).pdf.json better
test_unstructured_ingest/expected-structured-output/pdf-fast-reprocess/s3/small-pdf-set/recalibrating-risk-report.pdf.json better (both have trouble reading tables, but the new code did worse at reading paragraphs)
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/2023-Jan-economic-outlook.pdf.json both ok (not very good at reading table still)
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/Silent-Giant-(1).pdf.json better
test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/recalibrating-risk-report.pdf.json better
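To make the "majority improve" claim easy to check, here is a minimal tally of the verdicts in the notes above. The list is transcribed by hand from those notes in order, with the parenthetical remarks dropped; the variable names are illustrative only and are not part of the PR.

```python
from collections import Counter

# Verdicts transcribed from the notes above, one per fixture, in the same order.
# Parenthetical remarks (e.g. "meh but will give benefit of the doubt") are dropped.
verdicts = [
    "meh",      # test_xycut.py
    "worse",    # azure/IRS-form-1987.pdf.json
    "worse",    # azure/IRS-form-1987.png.json
    "better",   # biomed-api/65/11
    "better",   # biomed-api/75/29
    "better",   # biomed-path/07/07
    "better",   # pdf-fast-reprocess/biomed-api/65/11
    "better",   # pdf-fast-reprocess/biomed-api/75/29
    "better",   # pdf-fast-reprocess/biomed-path/07/07
    "both ok",  # pdf-fast-reprocess/s3/.../2023-Jan-economic-outlook
    "better",   # pdf-fast-reprocess/s3/.../Silent-Giant-(1)
    "better",   # pdf-fast-reprocess/s3/.../recalibrating-risk-report
    "both ok",  # s3/.../2023-Jan-economic-outlook
    "better",   # s3/.../Silent-Giant-(1)
    "better",   # s3/.../recalibrating-risk-report
]

tally = Counter(verdicts)
total = len(verdicts)

# Print each verdict with its share of the reviewed fixtures.
for verdict, count in tally.most_common():
    print(f"{verdict}: {count}/{total}")
```

By this count, 10 of the 15 reviewed outputs are marked better, 2 worse, 2 both ok, and 1 meh.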
Closes #1209.
Summary
xycut sorting
xycut sorting evaluation script
PDFs:
Testing
Evaluation