feat/add sources from unstructured inference #1538

Merged on Oct 5, 2023 (57 commits)
Commits
3ae41c1
Changelog update
Sep 21, 2023
b31358c
CHANGELOG update
Sep 22, 2023
0dfbd90
feat: add support for store unstructured-infrence data sources
Sep 26, 2023
8320d49
refactor: change import location
Sep 26, 2023
f9ab8c9
lining
Sep 26, 2023
9bab134
fix: handles sources of non-hi_res elements
Sep 26, 2023
3b07daf
fix: correctly query dictionary
Sep 26, 2023
51d7257
Benjamin/feat/add sources from unstructured inference <- Ingest test …
ryannikolaidis Sep 27, 2023
2de6040
feat: add data_origin to all document elements
Sep 27, 2023
ef2f019
fix: add type ignore
Sep 27, 2023
44c5fb4
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 27, 2023
12b30a0
test: fix reference object
Sep 28, 2023
786ab88
fix: corrected image source
Sep 28, 2023
64ca051
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Sep 28, 2023
e700182
feat: add missing origins
Sep 28, 2023
ca23d44
test: add tests for checking source of elements on all types
Sep 28, 2023
15f42d4
linting
Sep 28, 2023
fe3671d
fix: remove data_origin from JSON outputs
Sep 28, 2023
ba194c7
fix: remove variable naming containing _source_
Sep 28, 2023
a265d0b
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 28, 2023
3f0618b
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Sep 28, 2023
f3d7d07
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 28, 2023
b186c5d
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 29, 2023
0d9bcd5
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 29, 2023
60992bb
Update CHANGELOG
Sep 28, 2023
7e95e48
refactor: makes data_source optional field
Oct 2, 2023
30a64f6
refactor: uses setattr for data_source
Oct 3, 2023
8ee672e
Linting
Oct 3, 2023
b7cc62b
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 3, 2023
8b03bf3
refactor: delete lines checking for debug variable
Oct 3, 2023
a423cf5
fix: add debug variable to test-no-extras
Oct 3, 2023
1ed45a5
fix: missing env variable assignation
Oct 3, 2023
25f54a6
fix: missing env variable assignation
Oct 3, 2023
7e4b5a3
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 3, 2023
a73bf65
refactor: change way to asign data_origin
Oct 3, 2023
61d3a4a
test: update tests for checking data_origin
Oct 3, 2023
95d9378
fix: missing data_origin on some elements
Oct 3, 2023
bb47079
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
164963d
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
860af76
refactor: change data_origin by detection_origin
Oct 4, 2023
63ef0ee
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
861d9ac
linting
Oct 4, 2023
1e7a085
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
9065e4e
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
c16b79c
linting
Oct 5, 2023
9dfa83a
fix: new elements missed detection_origin
Oct 5, 2023
b8d5167
Version and changelog update
Oct 5, 2023
519dba1
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
5c1d31d
fix: recovered line lost when merging main
Oct 5, 2023
0b0d68b
fix: variable incorrectly setted
Oct 5, 2023
2fb50ac
refactor: returns objects directly
Oct 5, 2023
2b4110d
refactor: change detection_origin by constants
Oct 5, 2023
73bd60a
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Oct 5, 2023
fbb3e3f
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
e3ed76a
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
1a9a303
linting
Oct 5, 2023
5fc6e82
Changelog update
Oct 5, 2023
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -147,7 +147,7 @@ jobs:
tesseract --version
# FIXME (yao): sometimes there is cache but we still miss argilla in the env; so we add make install-ci again
make install-ci
-make test CI=true
+make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
make check-coverage

test_unit_no_extras:
@@ -419,4 +419,4 @@ jobs:
source .venv/bin/activate
echo "UNS_API_KEY=${{ secrets.UNS_API_KEY }}" > uns_test_env_file
make docker-build
-make docker-test CI=true
+make docker-test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
15 changes: 11 additions & 4 deletions CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.20-dev3
+## 0.10.20-dev4

### Enhancements

@@ -9,10 +9,16 @@

### Features

+* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change, that information is added. Importance: With this information, developers and users can tell how an element was created and decide how to use it. To use this feature, set `UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true`.

### Fixes

* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth`= None even when `Headline` and/or `Subheadline` elements are present in the same page. Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent html `H1` when nested headings are present; otherwise, `category_depth` metadata can result ambiguous within elements in a page.
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
+* **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
+should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
+allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.

## 0.10.19

@@ -64,15 +70,16 @@
* **Fixes issue where unstructured-inference was not getting updated** Problem: unstructured-inference was not getting upgraded to the version to match unstructured release when doing a pip install. Solution: using `pip install unstructured[all-docs]` it will now upgrade both unstructured and unstructured-inference. Importance: This will ensure that the inference library is always in sync with the unstructured library, otherwise users will be using outdated libraries which will likely lead to unintended behavior.
* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents->IngestDocs are still instantiated and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
* **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
-* **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
-should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
-allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
* **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings.
* **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields.
* **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
* **Fixes occasionally SIGABTR when writing table with `deltalake` on Linux** Problem: occasionally on Linux ingest can throw a `SIGABTR` when writing `deltalake` table even though the table was written correctly. Fix: put the writing function into a `Process` to ensure its execution to the fullest extent before returning to the main process. Importance: Improves stability of connectors using `deltalake`


+* **Fix badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
+should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
+allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.

## 0.10.16

### Enhancements
11 changes: 8 additions & 3 deletions Makefile
@@ -241,11 +241,13 @@ uninstall-project-local:
#################

export CI ?= false
+export UNSTRUCTURED_INCLUDE_DEBUG_METADATA ?= false

## test: runs all unittests
.PHONY: test
test:
-PYTHONPATH=. CI=$(CI) pytest test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing
+PYTHONPATH=. CI=$(CI) \
+UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) pytest test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing

.PHONY: test-unstructured-api-unit
test-unstructured-api-unit:
@@ -254,7 +256,8 @@ test-unstructured-api-unit:
.PHONY: test-no-extras
# TODO(newelh) Add json test when fixed
test-no-extras:
-PYTHONPATH=. CI=$(CI) pytest \
+PYTHONPATH=. CI=$(CI) \
+UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) pytest \
test_${PACKAGE_NAME}/partition/test_text.py \
test_${PACKAGE_NAME}/partition/test_email.py \
test_${PACKAGE_NAME}/partition/test_html_partition.py \
@@ -394,7 +397,9 @@ docker-test:
-v ${CURRENT_DIR}/test_unstructured_ingest:/home/notebook-user/test_unstructured_ingest \
$(if $(wildcard uns_test_env_file),--env-file uns_test_env_file,) \
$(DOCKER_IMAGE) \
-bash -c "CI=$(CI) pytest $(if $(TEST_NAME),-k $(TEST_NAME),) test_unstructured"
+bash -c "CI=$(CI) \
+UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) \
+pytest $(if $(TEST_NAME),-k $(TEST_NAME),) test_unstructured"

.PHONY: docker-smoke-test
docker-smoke-test:
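
The Makefile changes above forward `UNSTRUCTURED_INCLUDE_DEBUG_METADATA` from the environment into each pytest invocation. This diff does not show how the Python constant of the same name is derived, so the following is only a minimal sketch of one common approach, assuming the value is compared against the string "true". The helper name `env_flag` is hypothetical and not part of `unstructured`.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Return True when environment variable `name` equals "true" (case-insensitive)."""
    return os.environ.get(name, default).strip().lower() == "true"


# Hypothetical definition for illustration only. The real constant lives in
# unstructured.partition.utils.constants and may be computed differently.
UNSTRUCTURED_INCLUDE_DEBUG_METADATA = env_flag("UNSTRUCTURED_INCLUDE_DEBUG_METADATA")
```

With a definition like this, the flag is read once at import time, so it has to be set in the environment before the test process starts, which is what the `make test` and `make docker-test` targets arrange.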
4 changes: 3 additions & 1 deletion test_unstructured/partition/csv/test_csv.py
@@ -13,6 +13,7 @@
from unstructured.documents.elements import Table
from unstructured.partition.csv import partition_csv
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

EXPECTED_FILETYPE = "text/csv"
@@ -55,12 +56,13 @@ def test_partition_csv_from_file(filename, expected_text, expected_table):
f_path = f"example-docs/{filename}"
with open(f_path, "rb") as f:
elements = partition_csv(file=f)

assert clean_extra_whitespace(elements[0].text) == expected_text
assert isinstance(elements[0], Table)
assert elements[0].metadata.text_as_html == expected_table
assert elements[0].metadata.filetype == EXPECTED_FILETYPE
assert elements[0].metadata.filename is None
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"csv"}


def test_partition_csv_from_file_with_metadata_filename(filename="example-docs/stanley-cups.csv"):
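
The assertion added to this test (and repeated in the other partition tests in this PR) uses a set comprehension to check that every element reports the same origin. A minimal, self-contained sketch of that pattern follows. `FakeElement` and `FakeMetadata` are hypothetical stand-ins for the real `unstructured` classes, and `detection_origins` is a helper introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class FakeMetadata:
    # Stand-in for ElementMetadata; only the field this check needs.
    detection_origin: Optional[str] = None


@dataclass
class FakeElement:
    # Stand-in for an unstructured element.
    text: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)


def detection_origins(elements: Iterable[FakeElement]) -> Set[Optional[str]]:
    """Collect the distinct detection_origin values across all elements."""
    return {element.metadata.detection_origin for element in elements}


elements = [
    FakeElement("Stanley Cups", FakeMetadata(detection_origin="csv")),
    FakeElement("Team Location", FakeMetadata(detection_origin="csv")),
]

# Passes only if every element carries exactly the origin "csv".
assert detection_origins(elements) == {"csv"}
```

Comparing against a one-member set is stricter than checking the first element: a single element with a missing origin yields `{"csv", None}`, and an empty result yields `set()`, so both cases fail the comparison.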
3 changes: 3 additions & 0 deletions test_unstructured/partition/docx/test_docx.py
@@ -25,6 +25,7 @@
from unstructured.partition.doc import partition_doc
from unstructured.partition.docx import _DocxPartitioner, partition_docx
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json


@@ -107,6 +108,8 @@ def test_partition_docx_from_filename(
assert elements[0].metadata.page_number is None
for element in elements:
assert element.metadata.filename == "mock_document.docx"
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"docx"}


def test_partition_docx_from_filename_with_metadata_filename(mock_document, tmpdir):
3 changes: 3 additions & 0 deletions test_unstructured/partition/epub/test_epub.py
@@ -5,6 +5,7 @@
from unstructured.documents.elements import Table, Text
from unstructured.partition.epub import partition_epub
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -33,6 +34,8 @@ def test_partition_epub_from_filename():
assert element.metadata.section is not None
all_sections.add(element.metadata.section)
assert all_sections == expected_sections
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"epub"}


def test_partition_epub_from_filename_returns_table_in_elements():
3 changes: 3 additions & 0 deletions test_unstructured/partition/markdown/test_md.py
@@ -9,6 +9,7 @@
from unstructured.documents.elements import Title
from unstructured.partition.json import partition_json
from unstructured.partition.md import partition_md
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -21,6 +22,8 @@ def test_partition_md_from_filename():
assert len(elements) > 0
for element in elements:
assert element.metadata.filename == "README.md"
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"md"}


def test_partition_md_from_filename_returns_uns_elements():
3 changes: 3 additions & 0 deletions test_unstructured/partition/msg/test_msg.py
@@ -14,6 +14,7 @@
from unstructured.partition.json import partition_json
from unstructured.partition.msg import extract_msg_attachment_info, partition_msg
from unstructured.partition.text import partition_text
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -59,6 +60,8 @@ def test_partition_msg_from_filename():
)
for element in elements:
assert element.metadata.filename == "fake-email.msg"
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"msg"}


def test_partition_msg_from_filename_returns_uns_elements():
5 changes: 5 additions & 0 deletions test_unstructured/partition/odt/test_odt.py
@@ -5,6 +5,7 @@
from unstructured.documents.elements import Table, TableChunk, Title
from unstructured.partition.json import partition_json
from unstructured.partition.odt import partition_odt
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -26,6 +27,10 @@ def test_partition_odt_from_filename():
]
for element in elements:
assert element.metadata.filename == "fake.odt"
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {
+"docx",
+} # this file is processed by docx backend


def test_partition_odt_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/pdf-image/test_image.py
@@ -10,6 +10,7 @@
from unstructured.chunking.title import chunk_by_title
from unstructured.partition import image, pdf
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -247,6 +248,8 @@ def test_partition_image_default_strategy_hi_res():
assert elements[idx].metadata.coordinates is not None
assert elements[idx].metadata.detection_class_prob is not None
assert isinstance(elements[idx].metadata.detection_class_prob, float)
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"image"}


def test_partition_image_metadata_date(
8 changes: 6 additions & 2 deletions test_unstructured/partition/pdf-image/test_pdf.py
@@ -18,6 +18,7 @@
)
from unstructured.partition import pdf, strategies
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json


@@ -114,15 +115,16 @@ def test_partition_pdf_local_raises_with_no_filename():

@pytest.mark.parametrize("file_mode", ["filename", "rb", "spool"])
@pytest.mark.parametrize(
-("strategy", "expected"),
+("strategy", "expected", "origin"),
# fast: can't capture the "intentionally left blank page" page
# others: will ignore the actual blank page
-[("fast", {1, 4}), ("hi_res", {1, 3, 4}), ("ocr_only", {1, 3, 4})],
+[("fast", {1, 4}, "pdfminer"), ("hi_res", {1, 3, 4}, "pdf"), ("ocr_only", {1, 3, 4}, "OCR")],
)
def test_partition_pdf(
file_mode,
strategy,
expected,
+origin,
filename="example-docs/layout-parser-paper-with-empty-pages.pdf",
):
# Test that the partition_pdf function can handle filename
@@ -131,6 +133,8 @@ def _test(result):
assert len(result) > 10
# check that the pdf has multiple different page numbers
assert {element.metadata.page_number for element in result} == expected
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in result} == {origin}
Review comment from a Collaborator on lines +136 to +137:

Maybe use `mocker` to set the constant to true so we are certain this is tested. I understand that CI sets the env variable, so this is tested in CI, but this `if` statement can cause confusion between local testing and CI, and the check could be silently skipped if CI for some reason dropped the env variable.


if file_mode == "filename":
result = pdf.partition_pdf(filename=filename, strategy=strategy)
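
The review comment on `test_pdf.py` suggests forcing the constant to true inside the test rather than relying on the environment. A minimal sketch of that idea follows, using the standard library's `unittest.mock.patch.object`, which the `mocker` fixture from pytest-mock wraps. The `constants` module and `build_metadata` function are hypothetical stand-ins and do not reflect how `unstructured` actually attaches `detection_origin`.

```python
import types
from unittest import mock

# Hypothetical stand-in for a constants module that holds the debug flag.
constants = types.ModuleType("constants")
constants.UNSTRUCTURED_INCLUDE_DEBUG_METADATA = False


def build_metadata(origin: str) -> dict:
    """Attach detection_origin only when the debug flag is on (illustrative logic)."""
    metadata = {"filetype": "application/pdf"}
    # The flag is looked up on the module at call time, so patching the module works.
    if constants.UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
        metadata["detection_origin"] = origin
    return metadata


def test_detection_origin_is_set_when_flag_enabled():
    # Force the flag on for the duration of the block, regardless of the environment.
    with mock.patch.object(constants, "UNSTRUCTURED_INCLUDE_DEBUG_METADATA", True):
        assert build_metadata("pdfminer")["detection_origin"] == "pdfminer"
    # The original value is restored when the context manager exits.
    assert "detection_origin" not in build_metadata("pdfminer")


test_detection_origin_is_set_when_flag_enabled()
```

One caveat with this approach: a module that does `from ...constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA` holds its own binding of the name, so the patch has to target the module where the name is actually looked up, not only the module that defines it.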
3 changes: 3 additions & 0 deletions test_unstructured/partition/pptx/test_ppt.py
@@ -7,6 +7,7 @@
from unstructured.documents.elements import ListItem, NarrativeText, Title
from unstructured.partition.json import partition_json
from unstructured.partition.ppt import partition_ppt
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -28,6 +29,8 @@ def test_partition_ppt_from_filename():
assert elements == EXPECTED_PPT_OUTPUT
for element in elements:
assert element.metadata.filename == "fake-power-point.ppt"
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"pptx"}


def test_partition_ppt_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/test_text.py
@@ -13,6 +13,7 @@
partition_text,
split_content_to_fit_max,
)
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -67,6 +68,8 @@ def test_partition_text_from_filename(filename, encoding):
assert elements == EXPECTED_OUTPUT
for element in elements:
assert element.metadata.filename == filename
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"text"}


def test_partition_text_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/test_xml_partition.py
@@ -6,6 +6,7 @@
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import NarrativeText, Title
from unstructured.partition.json import partition_json
+from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import elements_to_json

@@ -22,6 +23,8 @@ def test_partition_xml_from_filename(filename):

assert elements[0].text == "United States"
assert elements[0].metadata.filename == filename
+if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
+assert {element.metadata.detection_origin for element in elements} == {"xml"}


def test_partition_xml_from_filename_with_metadata_filename():
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.20-dev3" # pragma: no cover
+__version__ = "0.10.20-dev4" # pragma: no cover