feat/add sources from unstructured inference #1538

Merged
merged 57 commits into from
Oct 5, 2023
Changes from 43 commits
3ae41c1
Changelog update
Sep 21, 2023
b31358c
CHANGELOG update
Sep 22, 2023
0dfbd90
feat: add support for store unstructured-infrence data sources
Sep 26, 2023
8320d49
refactor: change import location
Sep 26, 2023
f9ab8c9
lining
Sep 26, 2023
9bab134
fix: handles sources of non-hi_res elements
Sep 26, 2023
3b07daf
fix: correctly query dictionary
Sep 26, 2023
51d7257
Benjamin/feat/add sources from unstructured inference <- Ingest test …
ryannikolaidis Sep 27, 2023
2de6040
feat: add data_origin to all document elements
Sep 27, 2023
ef2f019
fix: add type ignore
Sep 27, 2023
44c5fb4
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 27, 2023
12b30a0
test: fix reference object
Sep 28, 2023
786ab88
fix: corrected image source
Sep 28, 2023
64ca051
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Sep 28, 2023
e700182
feat: add missing origins
Sep 28, 2023
ca23d44
test: add tests for checking source of elements on all types
Sep 28, 2023
15f42d4
linting
Sep 28, 2023
fe3671d
fix: remove data_origin from JSON outputs
Sep 28, 2023
ba194c7
fix: remove variable naming containing _source_
Sep 28, 2023
a265d0b
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 28, 2023
3f0618b
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Sep 28, 2023
f3d7d07
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 28, 2023
b186c5d
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 29, 2023
0d9bcd5
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Sep 29, 2023
60992bb
Update CHANGELOG
Sep 28, 2023
7e95e48
refactor: makes data_source optional field
Oct 2, 2023
30a64f6
refactor: uses setattr for data_source
Oct 3, 2023
8ee672e
Linting
Oct 3, 2023
b7cc62b
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 3, 2023
8b03bf3
refactor: delete lines checking for debug variable
Oct 3, 2023
a423cf5
fix: add debug variable to test-no-extras
Oct 3, 2023
1ed45a5
fix: missing env variable assignation
Oct 3, 2023
25f54a6
fix: missing env variable assignation
Oct 3, 2023
7e4b5a3
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 3, 2023
a73bf65
refactor: change way to asign data_origin
Oct 3, 2023
61d3a4a
test: update tests for checking data_origin
Oct 3, 2023
95d9378
fix: missing data_origin on some elements
Oct 3, 2023
bb47079
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
164963d
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
860af76
refactor: change data_origin by detection_origin
Oct 4, 2023
63ef0ee
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 4, 2023
861d9ac
linting
Oct 4, 2023
1e7a085
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
9065e4e
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
c16b79c
linting
Oct 5, 2023
9dfa83a
fix: new elements missed detection_origin
Oct 5, 2023
b8d5167
Version and changelog update
Oct 5, 2023
519dba1
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
5c1d31d
fix: recovered line lost when merging main
Oct 5, 2023
0b0d68b
fix: variable incorrectly setted
Oct 5, 2023
2fb50ac
refactor: returns objects directly
Oct 5, 2023
2b4110d
refactor: change detection_origin by constants
Oct 5, 2023
73bd60a
feat/add sources from unstructured inference <- Ingest test fixtures …
ryannikolaidis Oct 5, 2023
fbb3e3f
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
e3ed76a
Merge branch 'main' into benjamin/feat/add-sources-from-unstructured-…
benjats07 Oct 5, 2023
1a9a303
linting
Oct 5, 2023
5fc6e82
Changelog update
Oct 5, 2023
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -147,7 +147,7 @@ jobs:
tesseract --version
# FIXME (yao): sometimes there is cache but we still miss argilla in the env; so we add make install-ci again
make install-ci
make test CI=true
make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
make check-coverage

test_unit_no_extras:
@@ -419,4 +419,4 @@ jobs:
source .venv/bin/activate
echo "UNS_API_KEY=${{ secrets.UNS_API_KEY }}" > uns_test_env_file
make docker-build
make docker-test CI=true
make docker-test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@

* **Adds `links` metadata in `partition_pdf` for `fast` strategy.** Problem: PDF files contain rich information and hyperlink that Unstructured did not captured earlier. Feature: `partition_pdf` now can capture embedded links within the file along with its associated text and page number. Importance: Providing depth in extracted elements give user a better understanding and richer context of documents. This also enables user to map to other elements within the document if the hyperlink is refered internally.
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embeddings related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: Ability to embed documents or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.
* **Adds data_origin field to metadata** Problem: Currently isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it.

### Fixes

@@ -67,6 +68,10 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
* **Fixes occasionally SIGABTR when writing table with `deltalake` on Linux** Problem: occasionally on Linux ingest can throw a `SIGABTR` when writing `deltalake` table even though the table was written correctly. Fix: put the writing function into a `Process` to ensure its execution to the fullest extent before returning to the main process. Importance: Improves stability of connectors using `deltalake`


* **Fix badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.

## 0.10.16

### Enhancements
11 changes: 8 additions & 3 deletions Makefile
@@ -241,11 +241,13 @@ uninstall-project-local:
#################

export CI ?= false
export UNSTRUCTURED_INCLUDE_DEBUG_METADATA ?= false

## test: runs all unittests
.PHONY: test
test:
PYTHONPATH=. CI=$(CI) pytest test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing
PYTHONPATH=. CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) pytest test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing

.PHONY: test-unstructured-api-unit
test-unstructured-api-unit:
@@ -254,7 +256,8 @@ test-unstructured-api-unit:
.PHONY: test-no-extras
# TODO(newelh) Add json test when fixed
test-no-extras:
PYTHONPATH=. CI=$(CI) pytest \
PYTHONPATH=. CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) pytest \
test_${PACKAGE_NAME}/partition/test_text.py \
test_${PACKAGE_NAME}/partition/test_email.py \
test_${PACKAGE_NAME}/partition/test_html_partition.py \
@@ -394,7 +397,9 @@ docker-test:
-v ${CURRENT_DIR}/test_unstructured_ingest:/home/notebook-user/test_unstructured_ingest \
$(if $(wildcard uns_test_env_file),--env-file uns_test_env_file,) \
$(DOCKER_IMAGE) \
bash -c "CI=$(CI) pytest $(if $(TEST_NAME),-k $(TEST_NAME),) test_unstructured"
bash -c "CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) \
pytest $(if $(TEST_NAME),-k $(TEST_NAME),) test_unstructured"

.PHONY: docker-smoke-test
docker-smoke-test:
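
The Makefile change above exports `UNSTRUCTURED_INCLUDE_DEBUG_METADATA` with a default of the literal string `false`. The definition of the Python constant that the tests import is not part of this diff, so the following is only a sketch, under that assumption, of how such a flag could be parsed. The helper `env_flag` and the spellings it accepts are hypothetical. The point it illustrates is that the string `"false"` is truthy in Python, so the value has to be compared rather than passed to `bool()`.

```python
import os


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Interpret an environment variable as a boolean flag.

    Hypothetical helper for illustration; not taken from the PR.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Variable not set at all: fall back to the default.
        return default
    # Compare against known "true" spellings instead of relying on truthiness,
    # because any non-empty string (including "false") is truthy in Python.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Assumed usage: a module-level constant derived from the exported variable.
INCLUDE_DEBUG_METADATA = env_flag("UNSTRUCTURED_INCLUDE_DEBUG_METADATA")
```

With the Makefile default in place, running `make test` without overriding the variable would leave this flag disabled, while `make test UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true` would enable it.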
4 changes: 3 additions & 1 deletion test_unstructured/partition/csv/test_csv.py
@@ -13,6 +13,7 @@
from unstructured.documents.elements import Table
from unstructured.partition.csv import partition_csv
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

EXPECTED_FILETYPE = "text/csv"
@@ -55,12 +56,13 @@ def test_partition_csv_from_file(filename, expected_text, expected_table):
f_path = f"example-docs/{filename}"
with open(f_path, "rb") as f:
elements = partition_csv(file=f)

assert clean_extra_whitespace(elements[0].text) == expected_text
assert isinstance(elements[0], Table)
assert elements[0].metadata.text_as_html == expected_table
assert elements[0].metadata.filetype == EXPECTED_FILETYPE
assert elements[0].metadata.filename is None
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"csv"}


def test_partition_csv_from_file_with_metadata_filename(filename="example-docs/stanley-cups.csv"):
3 changes: 3 additions & 0 deletions test_unstructured/partition/docx/test_docx.py
@@ -25,6 +25,7 @@
from unstructured.partition.doc import partition_doc
from unstructured.partition.docx import _DocxPartitioner, partition_docx
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json


@@ -107,6 +108,8 @@ def test_partition_docx_from_filename(
assert elements[0].metadata.page_number is None
for element in elements:
assert element.metadata.filename == "mock_document.docx"
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"docx"}


def test_partition_docx_from_filename_with_metadata_filename(mock_document, tmpdir):
3 changes: 3 additions & 0 deletions test_unstructured/partition/epub/test_epub.py
@@ -5,6 +5,7 @@
from unstructured.documents.elements import Table, Text
from unstructured.partition.epub import partition_epub
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -33,6 +34,8 @@ def test_partition_epub_from_filename():
assert element.metadata.section is not None
all_sections.add(element.metadata.section)
assert all_sections == expected_sections
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"epub"}


def test_partition_epub_from_filename_returns_table_in_elements():
3 changes: 3 additions & 0 deletions test_unstructured/partition/markdown/test_md.py
@@ -9,6 +9,7 @@
from unstructured.documents.elements import Title
from unstructured.partition.json import partition_json
from unstructured.partition.md import partition_md
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -21,6 +22,8 @@ def test_partition_md_from_filename():
assert len(elements) > 0
for element in elements:
assert element.metadata.filename == "README.md"
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"md"}


def test_partition_md_from_filename_returns_uns_elements():
3 changes: 3 additions & 0 deletions test_unstructured/partition/msg/test_msg.py
@@ -14,6 +14,7 @@
from unstructured.partition.json import partition_json
from unstructured.partition.msg import extract_msg_attachment_info, partition_msg
from unstructured.partition.text import partition_text
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -59,6 +60,8 @@ def test_partition_msg_from_filename():
)
for element in elements:
assert element.metadata.filename == "fake-email.msg"
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"msg"}


def test_partition_msg_from_filename_returns_uns_elements():
5 changes: 5 additions & 0 deletions test_unstructured/partition/odt/test_odt.py
@@ -5,6 +5,7 @@
from unstructured.documents.elements import Table, TableChunk, Title
from unstructured.partition.json import partition_json
from unstructured.partition.odt import partition_odt
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -26,6 +27,10 @@ def test_partition_odt_from_filename():
]
for element in elements:
assert element.metadata.filename == "fake.odt"
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {
"docx",
} # this file is processed by docx backend


def test_partition_odt_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/pdf-image/test_image.py
@@ -10,6 +10,7 @@
from unstructured.chunking.title import chunk_by_title
from unstructured.partition import image, pdf
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -245,6 +246,8 @@ def test_partition_image_default_strategy_hi_res():
assert elements[0].metadata.coordinates is not None
assert elements[0].metadata.detection_class_prob is not None
assert isinstance(elements[0].metadata.detection_class_prob, float)
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"image"}


def test_partition_image_metadata_date(
8 changes: 6 additions & 2 deletions test_unstructured/partition/pdf-image/test_pdf.py
@@ -18,6 +18,7 @@
)
from unstructured.partition import pdf, strategies
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json


@@ -114,15 +115,16 @@ def test_partition_pdf_local_raises_with_no_filename():

@pytest.mark.parametrize("file_mode", ["filename", "rb", "spool"])
@pytest.mark.parametrize(
("strategy", "expected"),
("strategy", "expected", "origin"),
# fast: can't capture the "intentionally left blank page" page
# others: will ignore the actual blank page
[("fast", {1, 4}), ("hi_res", {1, 3, 4}), ("ocr_only", {1, 3, 4})],
[("fast", {1, 4}, "pdfminer"), ("hi_res", {1, 3, 4}, "pdf"), ("ocr_only", {1, 3, 4}, "OCR")],
)
def test_partition_pdf(
file_mode,
strategy,
expected,
origin,
filename="example-docs/layout-parser-paper-with-empty-pages.pdf",
):
# Test that the partition_pdf function can handle filename
@@ -131,6 +133,8 @@ def _test(result):
assert len(result) > 10
# check that the pdf has multiple different page numbers
assert {element.metadata.page_number for element in result} == expected
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in result} == {origin}
Comment on lines +136 to +137
Collaborator

Maybe use `mocker` to set the constant to true so we are certain this is tested. I understand that we set the env variable in CI, so this is tested there, but this `if` statement can cause confusion between local testing and CI, and code could fail silently if CI for some reason dropped the env variable.
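
One way to read this suggestion is sketched here with the standard library's `unittest.mock.patch.object`, which pytest-mock's `mocker.patch.object` wraps. Everything in the block is a stand-in so that it runs without `unstructured` installed: `test_module` plays the role of a test module that imported the constant, and `check_origins` and `make_element` are hypothetical helpers. It shows only that the guarded assertion is skipped by default and is guaranteed to run once the flag is patched.

```python
import types
from unittest import mock

# Stand-in for a test module that did
# `from ...constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA` (assumed False locally).
test_module = types.SimpleNamespace(UNSTRUCTURED_INCLUDE_DEBUG_METADATA=False)


def make_element(origin):
    """Build a minimal fake element exposing `metadata.detection_origin`."""
    return types.SimpleNamespace(
        metadata=types.SimpleNamespace(detection_origin=origin),
    )


def check_origins(elements, expected_origin):
    """Mirror the guarded assertion; return True only if it actually ran."""
    if test_module.UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
        assert {e.metadata.detection_origin for e in elements} == {expected_origin}
        return True
    return False


elements = [make_element("pdfminer"), make_element("pdfminer")]

# Without patching, the check is silently skipped when the flag is off.
skipped = check_origins(elements, "pdfminer")

# Patching the name in the module under test forces the assertion to run,
# and the original value is restored when the block exits.
with mock.patch.object(test_module, "UNSTRUCTURED_INCLUDE_DEBUG_METADATA", True):
    ran = check_origins(elements, "pdfminer")
```

In a real test, the library code that attaches `detection_origin` would also need to see the flag as enabled wherever it reads it; this sketch does not cover that part, since the corresponding code is not shown in this diff.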


if file_mode == "filename":
result = pdf.partition_pdf(filename=filename, strategy=strategy)
3 changes: 3 additions & 0 deletions test_unstructured/partition/pptx/test_ppt.py
@@ -7,6 +7,7 @@
from unstructured.documents.elements import ListItem, NarrativeText, Title
from unstructured.partition.json import partition_json
from unstructured.partition.ppt import partition_ppt
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -28,6 +29,8 @@ def test_partition_ppt_from_filename():
assert elements == EXPECTED_PPT_OUTPUT
for element in elements:
assert element.metadata.filename == "fake-power-point.ppt"
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"pptx"}


def test_partition_ppt_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/test_text.py
@@ -13,6 +13,7 @@
partition_text,
split_content_to_fit_max,
)
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.staging.base import elements_to_json

DIRECTORY = pathlib.Path(__file__).parent.resolve()
@@ -67,6 +68,8 @@ def test_partition_text_from_filename(filename, encoding):
assert elements == EXPECTED_OUTPUT
for element in elements:
assert element.metadata.filename == filename
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"text"}


def test_partition_text_from_filename_with_metadata_filename():
3 changes: 3 additions & 0 deletions test_unstructured/partition/test_xml_partition.py
@@ -6,6 +6,7 @@
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import NarrativeText, Title
from unstructured.partition.json import partition_json
from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import elements_to_json

@@ -22,6 +23,8 @@ def test_partition_xml_from_filename(filename):

assert elements[0].text == "United States"
assert elements[0].metadata.filename == filename
if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
assert {element.metadata.detection_origin for element in elements} == {"xml"}


def test_partition_xml_from_filename_with_metadata_filename():
