Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/add sources from unstructured inference #1538

Merged
merged 57 commits into from
Oct 5, 2023

Conversation

benjats07
Copy link
Contributor

@benjats07 benjats07 commented Sep 26, 2023

This PR adds support for source property from unstructured_inference, allowing the user to be able to see the origin of the data under detection_originfield environment variable UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true

In order to try this feature you can use this code:

from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)

And will print 'yolox' as source for all the elements

@benjats07 benjats07 linked an issue Sep 26, 2023 that may be closed by this pull request
@benjats07 benjats07 changed the title Benjamin/feat/add sources from unstructured inference feat/add sources from unstructured inference Sep 27, 2023
benjats07 and others added 4 commits September 27, 2023 17:55
Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some quick feedback. this is mainly for debugging purposes, right?

  • let's avoid overloading the word "source" which is has a meaning for data connectors.
  • assuming this "breadcrumb" exists for debugging purposes, it should not exist in any element json output by default

@benjats07
Copy link
Contributor Author

some quick feedback. this is mainly for debugging purposes, right?

* let's avoid overloading the word "source" which is has a meaning for data connectors.

* assuming this "breadcrumb" exists for debugging purposes, it should not exist in any element json output by default
  • "source" isn't used anymore in this PR
  • Restored original output

@benjats07 benjats07 closed this Sep 28, 2023
@benjats07 benjats07 reopened this Sep 28, 2023
@benjats07 benjats07 marked this pull request as ready for review September 28, 2023 22:34
@benjats07 benjats07 requested a review from qued October 4, 2023 18:24
Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cragwolfe cragwolfe dismissed their stale review October 5, 2023 17:16

changes addressed

Comment on lines +136 to +137
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in result} == {origin}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use mocker to set the constant to true so we are certain this is tested; I understand that in the ci we set the env variable so this is tested in CI but this if statement here can cause confusion for local testing vs. ci; and potentially have code silently fail if ci for some reason dropped the env

detection_origin: Optional[str] = None

def __setattr__(self, key, value):
if not UNSTRUCTURED_INCLUDE_DEBUG_METADATA and key == "detection_origin":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and below where we hard wire "detection_origin" maybe we can use a parameter to hold those debug_metadata_keys for checks so it is easier to change them later (just change one place)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but at this point, I prefer not doing that as this is the only parameter with that behavior, I can add it if we begin to add variables just during debugging

ryannikolaidis and others added 2 commits October 5, 2023 13:31
…update (#1655)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: benjats07 <[email protected]>
@benjats07 benjats07 enabled auto-merge (squash) October 5, 2023 19:42
@benjats07 benjats07 enabled auto-merge (squash) October 5, 2023 19:42
@benjats07 benjats07 merged commit e0201e9 into main Oct 5, 2023
30 of 39 checks passed
@benjats07 benjats07 deleted the benjamin/feat/add-sources-from-unstructured-inference branch October 5, 2023 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat/add sources from unstructured-inference
5 participants