Skip to content

Commit

Permalink
fix: Don't call extractable_elements if strategy is ocr_only (#1160)
Browse files Browse the repository at this point in the history
- fixes #1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
  • Loading branch information
cdpierse authored Aug 23, 2023
1 parent e9c6492 commit 1ddf542
Show file tree
Hide file tree
Showing 3 changed files with 20 additions and 2 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@

### Fixes

* Fix bug in `partition_pdf_or_image` where two partitions were called if `strategy == "ocr_only"`.

## 0.10.5

### Enhancements
Expand Down
6 changes: 5 additions & 1 deletion test_unstructured/partition/pdf-image/test_pdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -325,10 +325,14 @@ def mock_exists(dep):
pdf,
"extractable_elements",
return_value=mock_return,
) as mock_partition:
) as mock_partition, mock.patch.object(
pdf,
"_partition_pdf_or_image_with_ocr",
) as mock_partition_ocr:
pdf.partition_pdf(filename=filename, url=None, strategy="ocr_only")

mock_partition.assert_called_once()
mock_partition_ocr.assert_not_called()
assert "pytesseract is not installed" in caplog.text


Expand Down
14 changes: 13 additions & 1 deletion unstructured/partition/pdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -155,7 +155,18 @@ def partition_pdf_or_image(
file=file,
filename=filename,
)
if not is_image:

if (
not is_image
and determine_pdf_or_image_strategy(
strategy,
filename=filename,
file=file,
is_image=is_image,
infer_table_structure=infer_table_structure,
)
!= "ocr_only"
):
extracted_elements = extractable_elements(
filename=filename,
file=spooled_to_bytes_io_if_needed(file),
Expand Down Expand Up @@ -209,6 +220,7 @@ def partition_pdf_or_image(
min_partition=min_partition,
metadata_last_modified=metadata_last_modified or last_modification_date,
)

return layout_elements


Expand Down

0 comments on commit 1ddf542

Please sign in to comment.