
Commit

Merge branch 'main' into sebastian/fix-title-depth
qued authored Oct 4, 2023
2 parents 27ce065 + 19d8bff commit cdced8e
Showing 202 changed files with 8,427 additions and 11,460 deletions.
5 changes: 5 additions & 0 deletions .github/workflows/ci.yml
@@ -255,6 +255,10 @@ jobs:
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
- name: Setup docker-compose
uses: KengoTODA/actions-setup-docker-compose@v1
with:
version: '2.22.0'
- name: Test Ingest (unit)
run: |
source .venv/bin/activate
@@ -293,6 +297,7 @@ jobs:
AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
TABLE_OCR: "tesseract"
ENTIRE_PAGE_OCR: "tesseract"
CI: "true"
run: |
source .venv/bin/activate
sudo apt-get update
9 changes: 7 additions & 2 deletions .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -9,7 +9,7 @@ env:

jobs:
setup:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-m
if: |
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'push' && contains(github.event.head_commit.message, 'ingest-test-fixtures-update'))
@@ -37,7 +37,7 @@ jobs:
make install-ci
update-fixtures-and-pr:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-m
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup]
@@ -56,6 +56,10 @@
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
- name: Setup docker-compose
uses: KengoTODA/actions-setup-docker-compose@v1
with:
version: '2.22.0'
- name: Update test fixtures
env:
AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -91,6 +95,7 @@ jobs:
TABLE_OCR: "tesseract"
ENTIRE_PAGE_OCR: "tesseract"
OVERWRITE_FIXTURES: "true"
CI: "true"
run: |
source .venv/bin/activate
sudo apt-get update
34 changes: 32 additions & 2 deletions CHANGELOG.md
@@ -1,17 +1,44 @@
-## 0.10.17-dev9
+## 0.10.19-dev10

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable so its performance can be fine-tuned; it also improves element detection by adding a deduplication post-processing step to the `hi_res` partitioning of PDFs and images.
* **Detect text in HTML Heading Tags as Titles** This increases the accuracy of hierarchies in HTML documents and provides more accurate element categorization: if text is in an HTML heading tag and is not a list item, address, or narrative text, it is categorized as a Title (see the first sketch after this list).
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing (see the second sketch after this list).
* **Expose endpoint url for s3 connectors** By allowing the endpoint url to be explicitly overwritten, any non-AWS data provider supporting the s3 protocol (e.g., MinIO) can be supported.
* **change default `hi_res` model for pdf/image partition to `yolox`** Partitioning pdf/image using the `hi_res` strategy now utilizes the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
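
A quick sketch of the heading-tag behavior above; the HTML string is invented for illustration:

```python
from unstructured.partition.html import partition_html

# An <h1> that is not a list item, address, or narrative text now comes
# back as a Title element.
html = "<html><body><h1>Quarterly Report</h1><p>Revenue grew 10% year over year.</p></body></html>"
elements = partition_html(text=html)
print([type(el).__name__ for el in elements])  # expected: ['Title', 'NarrativeText']
```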

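A minimal sketch of the table chunking described above, assuming an input file that contains a table (the filename is hypothetical):

```python
from unstructured.partition.html import partition_html

# chunking_strategy activates the add_chunking_strategy decorator;
# max_characters bounds the text length of each resulting TableChunk.
elements = partition_html(
    filename="example-with-table.html",  # hypothetical input file
    chunking_strategy="by_title",
    max_characters=500,
)
for chunk in elements:
    print(type(chunk).__name__, len(chunk.text))
```
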
### Features

### Fixes

* **Fixes partition_pdf is_alnum reference bug** Problem: when `partition_pdf` attempted to get the bounding box from an element, it hit a reference-before-assignment error whenever the first object was not text-extractable. Fix: switched to a flag set when the condition is met. Importance: crucial to be able to partition PDFs.
* **Fix various cases of HTML text missing after partition**
Problem: Under certain circumstances, text immediately after some HTML tags would be missing from the partition result.
Fix: Updated code to deal with these cases.
Importance: This ensures correctness when partitioning HTML and Markdown documents.

## 0.10.18

### Enhancements

* **Better detection of natural reading order in images and PDFs** The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. Achieved by improving the `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes by 90% along the x and y axes (still centered around the same center point), which allows projection lines to be drawn where not possible before if layout bboxes overlapped (see the first sketch after this list).
* **Improves `partition_xml` to be faster and more memory efficient when partitioning large XML files** The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, and DeltaTable connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Add functionality to save embedded images in PDFs separately as images** Given a directory path, embedded images in PDFs are now saved separately as image files, and the saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
* **Azure Cognitive Search destination connector** New Azure Cognitive Search destination connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
* **Improves salesforce partitioning** Partitions Salesforce data as XML instead of text for improved detail and flexibility, and partitions `htmlbody` instead of `textbody` for Salesforce emails. Importance: Allows all Salesforce fields to be ingested and gives Salesforce emails more detailed partitioning.
* **Add document level language detection functionality.** Introduces the "auto" default for the languages param, which detects the languages present in the document using the `langdetect` package and adds them as ISO 639-3 codes to the element metadata. Implemented only for the `partition_text` function to start (see the second sketch after this list).
* **PPTX partitioner refactored in preparation for enhancement.** Behavior should be unchanged except that shapes enclosed in a group-shape are now included, as many levels deep as required (a group-shape can itself contain a group-shape).
* **Embeddings support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally create embeddings from the elements it pulls out during partition and upload those embeddings to Azure Cognitive Search index.
* **Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents** Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number).
* **Chunking support for the SharePoint SourceConnector via unstructured-ingest CLI** The SharePoint connector can now optionally chunk the elements pulled out during partition via the chunking unstructured brick. This can be used as a stage before creating embeddings.
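
An illustrative sketch (not the library's internal code) of the center-preserving bbox shrink used to preprocess boxes before `xy-cut` sorting:

```python
def shrink_bbox(x1: float, y1: float, x2: float, y2: float, factor: float = 0.9):
    """Scale a bbox's width and height by `factor`, keeping its center fixed."""
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
    # Once shrunk, previously overlapping layout boxes separate, so xy-cut
    # projection lines can be drawn between them.
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h
```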

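A short sketch of the language detection entry above, assuming the detected codes are exposed on `metadata.languages`:

```python
from unstructured.partition.text import partition_text

# With the "auto" default for the languages param, detection runs via langdetect.
elements = partition_text(text="Hola mundo. Este es un documento corto en español.")
print(elements[0].metadata.languages)  # e.g. ["spa"] (ISO 639-3), per this entry
```
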
### Features

* **Adds `links` metadata in `partition_pdf` for `fast` strategy.** Problem: PDF files contain rich information and hyperlinks that Unstructured did not capture earlier. Feature: `partition_pdf` can now capture embedded links within the file along with the associated text and page number. Importance: Providing depth in extracted elements gives users a better understanding and richer context of documents; it also enables users to map to other elements within the document when a hyperlink refers internally (see the first sketch after this list).
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: The embedding module can track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module can also embed query strings. Importance: The ability to embed documents or parts of documents lets users make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval-augmented generation (see the second sketch after this list).
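
A hedged sketch of reading the new `links` metadata; the file path is hypothetical:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example.pdf", strategy="fast")
for el in elements:
    if el.metadata.links:  # populated when the element's text carries hyperlinks
        print(el.text, el.metadata.links, el.metadata.page_number)
```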

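A usage sketch for the embedding module, assuming the OpenAI-backed encoder and that the class and method names below match this release's `unstructured.embed` API:

```python
import os

from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder  # assumed module path

encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
elements = encoder.embed_documents(elements=[Text("Hello world")])
print(elements[0].embeddings)  # elements gain an `embeddings` property
query_vector = encoder.embed_query(query="greeting")
```
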
### Fixes
@@ -22,9 +49,12 @@
* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents->IngestDocs are still instantiated and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
* **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
* **Fixes badly initialized Formula** Problem: YoloX contains new types of elements; when loading a document that contains formulas, a new element of that class should be generated. However, the Formula class inherited from Element instead of Text, preventing the document from loading. Fix: Change the parent class of Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
* **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings.
* **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields.
* **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
* **Fixes occasional SIGABRT when writing a table with `deltalake` on Linux** Problem: occasionally on Linux, ingest could throw a SIGABRT when writing a `deltalake` table even though the table was written correctly. Fix: put the writing function into a `Process` to ensure it runs to completion before returning to the main process. Importance: Improves stability of connectors using `deltalake` (see the sketch below).
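
An illustrative sketch (not the connector's actual code) of the workaround this fix describes, isolating the `deltalake` write in a child process:

```python
from multiprocessing import Process

def _write_table(table_uri: str, data) -> None:
    from deltalake import write_deltalake

    write_deltalake(table_uri, data)

def safe_write(table_uri: str, data) -> None:
    # Run the write in its own process so a late SIGABRT from native code
    # cannot take down the main ingest process.
    p = Process(target=_write_table, args=(table_uri, data))
    p.start()
    p.join()
```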


## 0.10.16
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:experimental
-FROM quay.io/unstructured-io/base-images:rocky9.2-4@sha256:b1063ffbf08c3037ee211620f011dd05bd2da9287c6e6a3473b15c1597724e4b as base
+FROM quay.io/unstructured-io/base-images:rocky9.2-5@sha256:1721c3b0711e4e90587e3b4917f1b616e4603ddf5b4986bfaa68d02d82a13aba as base

# NOTE(crag): NB_USER ARG for mybinder.org compat:
# https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
20 changes: 10 additions & 10 deletions docs/source/api.rst
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
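
For reference, a complete, runnable form of the snippet being corrected throughout this file; the endpoint URL, headers, and form fields below are placeholder assumptions:

.. code:: python

   import requests

   url = "https://api.unstructured.io/general/v0/general"  # placeholder endpoint
   headers = {"accept": "application/json", "unstructured-api-key": "<API-KEY>"}
   data = {"strategy": "fast"}

   file_path = "/Path/To/File"
   file_data = {'files': open(file_path, 'rb')}

   response = requests.post(url, headers=headers, files=file_data, data=data)
   file_data['files'].close()
   print(response.json())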
80 changes: 32 additions & 48 deletions docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally

.. code:: python
-import subprocess
-
-command = [
-    "unstructured-ingest",
-    "airtable",
-    "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-    "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-    "--output-dir", "airtable-ingest-output"
-    "--num-processes", "2",
-    "--reprocess",
-]
-
-# Run the command
-process = subprocess.Popen(command, stdout=subprocess.PIPE)
-output, error = process.communicate()
-
-# Print output
-if process.returncode == 0:
-    print('Command executed successfully. Output:')
-    print(output.decode())
-else:
-    print('Command failed. Error:')
-    print(error.decode())
+import os
+
+from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+from unstructured.ingest.runner.airtable import airtable
+
+if __name__ == "__main__":
+    airtable(
+        verbose=True,
+        read_config=ReadConfig(),
+        partition_config=PartitionConfig(
+            output_dir="airtable-ingest-output",
+            num_processes=2,
+        ),
+        personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+    )
Run via the API
---------------
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: python
-import subprocess
-
-command = [
-    "unstructured-ingest",
-    "airtable",
-    "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-    "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-    "--output-dir", "airtable-ingest-output"
-    "--num-processes", "2",
-    "--reprocess",
-    "--partition-by-api",
-    "--api-key", "<UNSTRUCTURED-API-KEY>",
-]
-
-# Run the command
-process = subprocess.Popen(command, stdout=subprocess.PIPE)
-output, error = process.communicate()
-
-# Print output
-if process.returncode == 0:
-    print('Command executed successfully. Output:')
-    print(output.decode())
-else:
-    print('Command failed. Error:')
-    print(error.decode())
+import os
+
+from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+from unstructured.ingest.runner.airtable import airtable
+
+if __name__ == "__main__":
+    airtable(
+        verbose=True,
+        read_config=ReadConfig(),
+        partition_config=PartitionConfig(
+            output_dir="airtable-ingest-output",
+            num_processes=2,
+            partition_by_api=True,
+            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+        ),
+        personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+    )
Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.

