
Commit

…into newelh/hierarchy-fast/pptx
newelh committed Oct 4, 2023
2 parents fcc239d + 19d8bff · commit 0f92a20
Showing 101 changed files with 5,244 additions and 6,964 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -255,6 +255,10 @@ jobs:
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
+ - name: Setup docker-compose
+   uses: KengoTODA/actions-setup-docker-compose@v1
+   with:
+     version: '2.22.0'
- name: Test Ingest (unit)
run: |
source .venv/bin/activate
6 changes: 5 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -9,7 +9,7 @@ env:

jobs:
setup:
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-m
if: |
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'push' && contains(github.event.head_commit.message, 'ingest-test-fixtures-update'))
@@ -56,6 +56,10 @@ jobs:
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
+ - name: Setup docker-compose
+   uses: KengoTODA/actions-setup-docker-compose@v1
+   with:
+     version: '2.22.0'
- name: Update test fixtures
env:
AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
10 changes: 8 additions & 2 deletions CHANGELOG.md
@@ -1,21 +1,27 @@
- ## 0.10.19-dev4
+ ## 0.10.19-dev11

### Enhancements

* **Bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable so its performance can be fine-tuned; it also improves element detection by adding a deduplication post-processing step to the `hi_res` partitioning of PDFs and images.
* **Detect text in HTML heading tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization: if text is in an HTML heading tag and is not a list item, address, or narrative text, it is categorized as a title. (A usage sketch follows this file's diff.)
* **Update Python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the CLI command itself.
* **Adds data source properties to SharePoint, Outlook, OneDrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing. (A usage sketch follows this file's diff.)
* **Expose endpoint URL for s3 connectors** By allowing the endpoint URL to be explicitly overridden, any non-AWS data provider supporting the s3 protocol (e.g. MinIO) can be supported.
* **Change default `hi_res` model for pdf/image partition to `yolox`** Partitioning PDFs and images with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
+ * **Improve title detection in pptx documents** The default title textboxes on a pptx slide are now categorized as titles.
+ * **Improve hierarchy detection in pptx documents** List items and other slide text are now properly nested under the slide title. This will enable better chunking of pptx documents.

### Features

### Fixes

* **Fixes partition_pdf is_alnum reference bug** Problem: `partition_pdf` hit a reference-before-assignment error when attempting to get the bounding box of an element whose first object was not text-extractable. Fix: switched to a flag that is set when the condition is met. Importance: crucial for being able to partition PDFs.
+ * **Fix various cases of HTML text missing after partition**
+   Problem: Under certain circumstances, text immediately after some HTML tags would be missing from the partition result.
+   Fix: Updated the code to handle these cases.
+   Importance: This ensures correctness when partitioning HTML and Markdown documents.


## 0.10.18

### Enhancements
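Two of the changelog enhancements above are easiest to grasp from a short example. First, heading-tag title detection: a minimal sketch, with a trivial HTML snippet invented for illustration (the category assigned to the second element depends on the usual text-type heuristics):

```python
from unstructured.partition.html import partition_html

# Text inside <h1>..<h6> that is not a list item, address, or narrative
# text should now be categorized as a Title element.
elements = partition_html(text="<h1>Quarterly Report</h1><p>Revenue grew modestly this quarter.</p>")
print([(el.category, el.text) for el in elements])
```

Second, Table chunking via `add_chunking_strategy`: a sketch under the assumption that the decorated partition function forwards `max_characters` to the chunker; the input file name is hypothetical:

```python
from unstructured.partition.docx import partition_docx

# With a chunking strategy enabled, Table elements whose text exceeds
# max_characters are split into TableChunk elements that carry both
# `text` and `text_as_html`, ready for downstream use.
chunks = partition_docx(
    filename="tables.docx",  # hypothetical input
    chunking_strategy="by_title",
    max_characters=1500,
)
```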
20 changes: 10 additions & 10 deletions docs/source/api.rst
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
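(Aside: the hunks in this file show only the changed ``requests.post`` line. For context, a self-contained sketch of the corrected pattern — the endpoint URL, headers, and form data below are illustrative values, not taken from this diff:)

.. code:: python

   import requests

   url = "https://api.unstructured.io/general/v0/general"
   headers = {"accept": "application/json", "unstructured-api-key": "<UNSTRUCTURED-API-KEY>"}
   data = {"coordinates": "true"}  # example form field from the section above

   file_path = "/Path/To/File"
   file_data = {"files": open(file_path, "rb")}

   # The fix: pass the dict that actually holds the open file handle,
   # not the undefined name `files`.
   response = requests.post(url, headers=headers, files=file_data, data=data)
   file_data["files"].close()

   print(response.json())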
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
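As one more worked sketch tying together the ``hi_res`` sections above: selecting the ``hi_res`` strategy and naming a model explicitly. The ``hi_res_model_name`` form field is an assumption inferred from the truncated hunk header above; verify against the current API docs:

.. code:: python

   import requests

   url = "https://api.unstructured.io/general/v0/general"
   headers = {"accept": "application/json", "unstructured-api-key": "<UNSTRUCTURED-API-KEY>"}

   # Assumed form fields: strategy selects hi_res partitioning and
   # hi_res_model_name picks the layout model (e.g. the Chipper model).
   data = {"strategy": "hi_res", "hi_res_model_name": "chipper"}

   file_data = {"files": open("/Path/To/File.pdf", "rb")}
   response = requests.post(url, headers=headers, files=file_data, data=data)
   file_data["files"].close()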
80 changes: 32 additions & 48 deletions docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "airtable",
-     "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-     "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-     "--output-dir", "airtable-ingest-output"
-     "--num-processes", "2",
-     "--reprocess",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.airtable import airtable
+ if __name__ == "__main__":
+     airtable(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="airtable-ingest-output",
+             num_processes=2,
+         ),
+         personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+     )
Run via the API
---------------
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "airtable",
-     "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-     "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-     "--output-dir", "airtable-ingest-output"
-     "--num-processes", "2",
-     "--reprocess",
-     "--partition-by-api",
-     "--api-key", "<UNSTRUCTURED-API-KEY>",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.airtable import airtable
+ if __name__ == "__main__":
+     airtable(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="airtable-ingest-output",
+             num_processes=2,
+             partition_by_api=True,
+             api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+         ),
+         personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+     )
Additionally, you will need to pass the ``--partition-endpoint`` argument if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
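In the Python-runner form shown above, the rough equivalent of ``--partition-endpoint`` would be a ``partition_endpoint`` field on ``PartitionConfig`` — an assumption, since this diff does not show that field; the localhost URL is the ``unstructured-api`` default:

.. code:: python

   from unstructured.ingest.interfaces import PartitionConfig

   # Assumption: PartitionConfig accepts partition_endpoint for pointing
   # the ingest run at a locally hosted API instead of the hosted one.
   partition_config = PartitionConfig(
       output_dir="airtable-ingest-output",
       num_processes=2,
       partition_by_api=True,
       partition_endpoint="http://localhost:8000/general/v0/general",
   )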

91 changes: 32 additions & 59 deletions docs/source/source_connectors/azure.rst
@@ -28,28 +28,20 @@ Run Locally

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "azure",
-     "--remote-url", "abfs://container1/",
-     "--account-name", "azureunstructured1"
-     "--output-dir", "/Output/Path/To/Files",
-     "--num-processes", "2",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.azure import azure
+ if __name__ == "__main__":
+     azure(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="azure-ingest-output",
+             num_processes=2,
+         ),
+         remote_url="abfs://container1/",
+         account_name="azureunstructured1",
+     )
Run via the API
---------------
@@ -62,43 +54,24 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: shell
unstructured-ingest \
azure \
--remote-url abfs://container1/ \
--account-name azureunstructured1 \
--output-dir azure-ingest-output \
--num-processes 2 \
--partition-by-api \
--api-key "<UNSTRUCTURED-API-KEY>"
.. tab:: Python

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "azure",
-     "--remote-url", "abfs://container1/",
-     "--account-name", "azureunstructured1"
-     "--output-dir", "/Output/Path/To/Files",
-     "--num-processes", "2",
-     "--partition-by-api",
-     "--api-key", "<UNSTRUCTURED-API-KEY>",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.azure import azure
+ if __name__ == "__main__":
+     azure(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="azure-ingest-output",
+             num_processes=2,
+             partition_by_api=True,
+             api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+         ),
+         remote_url="abfs://container1/",
+         account_name="azureunstructured1",
+     )
Additionally, you will need to pass the ``--partition-endpoint`` argument if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.