feat: bbox shrinking in xycut algo, better natural reading order (#1560)
Closes GH Issue #1233.

### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort
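
As a rough illustration of the shrinking step described above, the sketch below scales a box about its center and shows how two slightly overlapping column boxes stop overlapping on the x axis afterwards. The names `shrink_bbox_sketch` and `x_projections_overlap` and the sample boxes are hypothetical. The `(left, top, right, bottom)` layout follows the tests in this commit, but this is not the code from `unstructured.partition.utils.sorting`.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def shrink_bbox_sketch(bbox: BBox, shrink_factor: float) -> BBox:
    """Scale a box to `shrink_factor` of its width and height, keeping its center."""
    left, top, right, bottom = bbox
    width = right - left
    height = bottom - top
    # Total amount removed along each axis; half comes off each side.
    dw = width - width * shrink_factor
    dh = height - height * shrink_factor
    return (left + dw / 2, top + dh / 2, right - dw / 2, bottom - dh / 2)


def x_projections_overlap(a: BBox, b: BBox) -> bool:
    """Return True if the two boxes overlap when projected onto the x axis."""
    return a[0] < b[2] and b[0] < a[2]


# Hypothetical boxes for two columns whose edges overlap by 5 units.
left_col: BBox = (0, 0, 105, 100)
right_col: BBox = (100, 0, 200, 100)

shrunk_left = shrink_bbox_sketch(left_col, 0.9)
shrunk_right = shrink_bbox_sketch(right_col, 0.9)

# Before shrinking, no vertical projection line fits between the columns;
# after shrinking, a gap opens up and a cut becomes possible.
print(x_projections_overlap(left_col, right_col))
print(x_projections_overlap(shrunk_left, shrunk_right))
```

With these sample boxes, the first check reports an overlap and the second does not, which is the situation where a projection line can be drawn only after shrinking.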

### Evaluation
Run the following command on this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf):

PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
christinestraub authored Sep 29, 2023
1 parent cd8c6a2 commit 94fbbed
Showing 20 changed files with 1,484 additions and 1,416 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

+* **Better detection of natural reading order in images and PDFs** The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. Achieved by improving the `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes to 90% of their original size along x and y axes (still centered around the same center point), which allows projection lines to be drawn where this was not possible before because layout bboxes overlapped.
* **Improves `partition_xml` to be faster and more memory efficient when partitioning large XML files** The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, and DeltaTable connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Add functionality to save embedded images in PDF's separately as images** This allows users to save embedded images in PDF's separately as images, given some directory path. The saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
2 changes: 1 addition & 1 deletion test_unstructured/partition/pdf-image/test_pdf.py
@@ -479,7 +479,7 @@ def test_partition_pdf_fast_groups_text_in_text_box():
             system=expected_coordinate_system_3,
         ),
     )
-    assert elements[3] == Text("2.5", metadata=expected_elem_metadata_3)
+    assert elements[2] == Text("2.5", metadata=expected_elem_metadata_3)


def test_partition_pdf_with_metadata_filename(
27 changes: 27 additions & 0 deletions test_unstructured/partition/utils/test_sorting.py
@@ -5,10 +5,19 @@
 from unstructured.partition.utils.constants import SORT_MODE_BASIC, SORT_MODE_XY_CUT
 from unstructured.partition.utils.sorting import (
     coord_has_valid_points,
+    coordinates_to_bbox,
+    shrink_bbox,
     sort_page_elements,
 )


+class MockCoordinatesMetadata(CoordinatesMetadata):
+    def __init__(self, points):
+        system = PixelSpace(width=300, height=500)
+
+        super().__init__(points, system)
+
+
 def test_coord_valid_coordinates():
     coordinates = CoordinatesMetadata([(1, 2), (3, 4), (5, 6), (7, 8)], PixelSpace)
     assert coord_has_valid_points(coordinates) is True
@@ -98,3 +107,21 @@ def test_sort_basic_pos_coordinates():

     sorted_elem_text = " ".join([str(elem.text) for elem in sorted_page_elements])
     assert sorted_elem_text == "7 8 9"
+
+
+def test_coordinates_to_bbox():
+    coordinates_data = MockCoordinatesMetadata([(10, 20), (10, 200), (100, 200), (100, 20)])
+    expected_result = (10, 20, 100, 200)
+    assert coordinates_to_bbox(coordinates_data) == expected_result
+
+
+def test_shrink_bbox():
+    bbox = (0, 0, 100, 100)
+    shrink_factor = 0.5
+    expected_result = (25, 25, 75, 75)
+    assert shrink_bbox(bbox, shrink_factor) == expected_result
+
+    bbox = (0, 0, 200, 100)
+    shrink_factor = 0.9
+    expected_result = (10, 5, 190, 95)
+    assert shrink_bbox(bbox, shrink_factor) == expected_result
@@ -266,8 +266,8 @@
"text": "Executive Summary"
},
{
-"type": "NarrativeText",
-"element_id": "2364a6d2f9a3858d51d91b817732e6c9",
+"type": "Title",
+"element_id": "6712d87f1d156abf6171f700e2875889",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -282,11 +282,11 @@
"filetype": "application/pdf",
"page_number": 1
},
-"text": "This report provides recommendations for a scientists based on analysis that draws on opinions of data scientists, curricula for existing science requirements science jobs."
+"text": "biomedical"
},
{
-"type": "Title",
-"element_id": "6712d87f1d156abf6171f700e2875889",
+"type": "NarrativeText",
+"element_id": "2364a6d2f9a3858d51d91b817732e6c9",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -301,7 +301,7 @@
"filetype": "application/pdf",
"page_number": 1
},
-"text": "biomedical"
+"text": "This report provides recommendations for a scientists based on analysis that draws on opinions of data scientists, curricula for existing science requirements science jobs."
},
{
"type": "Title",
@@ -836,8 +836,8 @@
"text": "The"
},
{
-"type": "NarrativeText",
-"element_id": "cdc3773cb12cf99d302b9f00c48ae1e8",
+"type": "Title",
+"element_id": "aa3b88196a6407c3866c85acdcc8c981",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -852,11 +852,11 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "required of"
+"text": "Workforce"
},
{
-"type": "Title",
-"element_id": "aa3b88196a6407c3866c85acdcc8c981",
+"type": "NarrativeText",
+"element_id": "cdc3773cb12cf99d302b9f00c48ae1e8",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -871,7 +871,7 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "Workforce"
+"text": "required of"
},
{
"type": "NarrativeText",
@@ -1083,8 +1083,8 @@
"text": "b)"
},
{
-"type": "NarrativeText",
-"element_id": "1117af46b0a22dd02d3869ab9738a8a8",
+"type": "Title",
+"element_id": "6b847a0ed0b2c484c73f2749e29b4db5",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -1099,11 +1099,11 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "Data science skills taught in BD2K-funded training programs. A qualitative content analysis applied to the descriptions of required offered under the BD2kK-funded training programs. Each course was coded using qualitative data analysis software, with each skill that was present in the description counted once. The coding schema of data science-related skills was inductively developed and was organized four major categories: (1) statistics and math skills; (2) computer science; (3) subject knowledge; (4) general skills, like communication and teamwork. The coding schema is detailed in Appendix A."
+"text": "into"
},
{
-"type": "Title",
-"element_id": "6b847a0ed0b2c484c73f2749e29b4db5",
+"type": "NarrativeText",
+"element_id": "1117af46b0a22dd02d3869ab9738a8a8",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -1118,7 +1118,7 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "into"
+"text": "Data science skills taught in BD2K-funded training programs. A qualitative content analysis applied to the descriptions of required offered under the BD2kK-funded training programs. Each course was coded using qualitative data analysis software, with each skill that was present in the description counted once. The coding schema of data science-related skills was inductively developed and was organized four major categories: (1) statistics and math skills; (2) computer science; (3) subject knowledge; (4) general skills, like communication and teamwork. The coding schema is detailed in Appendix A."
},
{
"type": "NarrativeText",
@@ -1197,8 +1197,8 @@
"text": "c)"
},
{
-"type": "NarrativeText",
-"element_id": "961a38da2886c3cc25091d912769aa0d",
+"type": "Title",
+"element_id": "6d0607a7a2ac9823f9fb2a62ea2b7385",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -1213,7 +1213,7 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "job job government (8.5%), (42.4%), industry (83.9%), and nonprofit (15.3%) were sampled from websites like Glassdoor, Linkedin, and Ziprecruiter. The content analysis methodology and coding schema in analyzing the training programs were applied to the job descriptions. Because many job ads mentioned the same skill more than once, each occurrence of the skill was coded, therefore weighting single ad."
+"text": "Desired"
},
{
"type": "NarrativeText",
@@ -1235,8 +1235,8 @@
"text": "important skills that were mentioned multiple times in"
},
{
-"type": "Title",
-"element_id": "6d0607a7a2ac9823f9fb2a62ea2b7385",
+"type": "NarrativeText",
+"element_id": "961a38da2886c3cc25091d912769aa0d",
"metadata": {
"data_source": {
"url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
@@ -1251,7 +1251,7 @@
"filetype": "application/pdf",
"page_number": 2
},
-"text": "Desired"
+"text": "job job government (8.5%), (42.4%), industry (83.9%), and nonprofit (15.3%) were sampled from websites like Glassdoor, Linkedin, and Ziprecruiter. The content analysis methodology and coding schema in analyzing the training programs were applied to the job descriptions. Because many job ads mentioned the same skill more than once, each occurrence of the skill was coded, therefore weighting single ad."
},
{
"type": "Title",