enhancement: entire page OCR output included with hi_res (#1263)
Bumps unstructured-inference to 0.5.19 to bring in @christinestraub's
enhancement, Unstructured-IO/unstructured-inference#186.

This is a **massive** improvement: previously, text was omitted from
`hi_res` output if the layout model had not put a bounding box around
it. In addition, the xycut sorting algorithm generally does a good job
of ordering the merged bboxes (OCR text not found by the layout model
plus layout-model bboxes) into "natural reading order." More details in
Unstructured-IO/unstructured-inference#186 (comment).
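
To illustrate the idea of combining entire-page OCR output with layout-detected elements, here is a minimal sketch. It is not the code from unstructured-inference: the `Box` class, `covered_fraction`, `merge_ocr_with_layout`, and the 0.5 overlap threshold are all hypothetical names and values chosen for illustration, and a plain top-to-bottom, left-to-right sort stands in for the xycut algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical text region: (x1, y1) is top-left, (x2, y2) is bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def covered_fraction(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    width = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    height = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    if width <= 0 or height <= 0 or inner.area == 0:
        return 0.0
    return (width * height) / inner.area


def merge_ocr_with_layout(
    layout: List[Box], ocr: List[Box], threshold: float = 0.5
) -> List[Box]:
    """Keep all layout boxes and add OCR boxes that no layout box covers.

    The threshold is an assumed value, not one taken from the library.
    """
    leftovers = [
        o for o in ocr
        if all(covered_fraction(o, l) < threshold for l in layout)
    ]
    merged = list(layout) + leftovers
    # Simplified stand-in for xycut: sort by top edge, then by left edge.
    return sorted(merged, key=lambda b: (b.y1, b.x1))


if __name__ == "__main__":
    layout = [Box(0, 0, 100, 20, "Title"), Box(0, 30, 100, 60, "Body paragraph")]
    ocr = [
        Box(5, 2, 95, 18, "Title"),  # already covered by a layout box
        Box(0, 70, 100, 80, "Footnote missed by layout model"),
    ]
    for box in merge_ocr_with_layout(layout, ocr):
        print(box.text)
```

In this sketch the OCR copy of the title is dropped because a layout box already covers it, while the footnote, which has no layout box, is kept and ordered after the body text.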

Bonus: changelog fix.
cragwolfe authored Sep 1, 2023
1 parent 9191be7 commit 6534411
Showing 14 changed files with 2,618 additions and 664 deletions.
8 changes: 6 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,18 @@
-## 0.10.11-dev0
+## 0.10.11

### Enhancements

* Bump unstructured-inference
* Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)

### Features

* Add in ingest cli s3 writer

### Fixes

* Fix a bug where `xy-cut` sorting attemps to sort elements without valid coordinates; now xy cut sorting only works when **all** elements have valid coordinates

## 0.10.10

### Enhancements
@@ -44,7 +49,6 @@
* Edit `add_pytesseract_bbox_to_elements`'s (`ocr_only` strategy) `metadata.coordinates.points` return type to `Tuple` for consistency.
* Re-enable test-ingest-confluence-diff for ingest tests
* Fix syntax for ingest test check number of files
* Fix a bug where `xy-cut` sorting attemps to sort elements without valid coordinates; now xy cut sorting only works when **all** elements have valid coordinates

## 0.10.8

2 changes: 1 addition & 1 deletion requirements/constraints.in
@@ -26,7 +26,7 @@ Pillow<10.0.0
# AttributeError: 'ResourcePath' object has no attribute 'collection'
Office365-REST-Python-Client<2.4.3
# NOTE(christine) Pinned to set the `unstructured-inference` version
-unstructured-inference==0.5.18
+unstructured-inference==0.5.19
# NOTE(klaijan) - Moved pin from test.in
# pinning to avoid error in argilla library
pydantic<2
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.txt
@@ -209,7 +209,7 @@ typing-extensions==4.7.1
# torch
tzdata==2023.3
# via pandas
-unstructured-inference==0.5.18
+unstructured-inference==0.5.19
# via
# -c requirements/constraints.in
# -r requirements/extra-pdf-image.in
2 changes: 1 addition & 1 deletion requirements/test.txt
@@ -41,7 +41,7 @@ idna==3.4
# yarl
iniconfig==2.0.0
# via pytest
-label-studio-sdk==0.0.30
+label-studio-sdk==0.0.32
# via -r requirements/test.in
label-studio-tools==0.0.3
# via label-studio-sdk