Merge branch 'main' into fix/1209-tweak-xycut-ordering-output
# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py
christinestraub committed Oct 3, 2023
2 parents 8c51d75 + 8821689 commit 8a88e93
Showing 90 changed files with 1,702 additions and 1,413 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -255,6 +255,10 @@ jobs:
         source .venv/bin/activate
         mkdir "$NLTK_DATA"
         make install-ci
+    - name: Setup docker-compose
+      uses: KengoTODA/actions-setup-docker-compose@v1
+      with:
+        version: '2.22.0'
     - name: Test Ingest (unit)
       run: |
         source .venv/bin/activate
6 changes: 5 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -9,7 +9,7 @@ env:

 jobs:
   setup:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-m
     if: |
       github.event_name == 'workflow_dispatch' ||
       (github.event_name == 'push' && contains(github.event.head_commit.message, 'ingest-test-fixtures-update'))
@@ -56,6 +56,10 @@ jobs:
         source .venv/bin/activate
         mkdir "$NLTK_DATA"
         make install-ci
+    - name: Setup docker-compose
+      uses: KengoTODA/actions-setup-docker-compose@v1
+      with:
+        version: '2.22.0'
     - name: Update test fixtures
       env:
         AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -1,14 +1,26 @@
-## 0.10.19-dev5
+## 0.10.19-dev9

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overwritten makes it possible to support any non-AWS data provider that supports the s3 protocol (e.g. minio).
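
The `max_characters` behaviour described for Table elements above can be pictured as fixed-width slicing of a string. The snippet below is only a rough sketch of that idea and not the library's implementation: `chunk_text` is a hypothetical helper introduced for illustration, and it does not touch `text_as_html` or produce real `TableChunk` elements.

```python
from typing import List


def chunk_text(text: str, max_characters: int) -> List[str]:
    """Split text into consecutive pieces of at most max_characters each.

    Hypothetical helper for illustration; not part of unstructured.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be a positive integer")
    return [
        text[start : start + max_characters]
        for start in range(0, len(text), max_characters)
    ]


# A short stand-in for the text of a Table element.
table_text = "abcdefghij"
chunks = chunk_text(table_text, max_characters=4)
print(chunks)
```

Every chunk except possibly the last has exactly `max_characters` characters, and joining the chunks reproduces the original text.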

### Features

### Features

### Fixes

* **Tweak `xy-cut` ordering output to be more column friendly** While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
* **Fixes partition_pdf is_alnum reference bug** Problem: When `partition_pdf` attempted to get the bounding box from an element, it raised a reference-before-assignment error if the first object was not text extractable. Fix: Switched to a flag that is set when the condition is met. Importance: Crucial for being able to partition PDFs.
* **Fix various cases of HTML text missing after partition**
Problem: Under certain circumstances, text immediately after some HTML tags will be missing from the partition result.
Fix: Updated code to deal with these cases.
Importance: This will ensure the correctness when partitioning HTML and Markdown documents.
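
The `xy-cut` fix above mentions sorting by X coordinate first and then Y coordinate. As a minimal sketch of that sort key alone (it is not the recursive `xy-cut` algorithm and not the code in this commit), assume each element is a hypothetical tuple of `(x_min, y_min, x_max, y_max, text)` with the origin at the top-left of the page:

```python
from typing import List, Tuple

# Hypothetical box representation: (x_min, y_min, x_max, y_max, text).
Box = Tuple[float, float, float, float, str]


def sort_x_then_y(boxes: List[Box]) -> List[Box]:
    """Order boxes by left edge first, then by top edge."""
    return sorted(boxes, key=lambda box: (box[0], box[1]))


boxes = [
    (300.0, 50.0, 500.0, 80.0, "right column, top"),
    (50.0, 200.0, 250.0, 230.0, "left column, bottom"),
    (50.0, 50.0, 250.0, 80.0, "left column, top"),
]
ordered = [box[4] for box in sort_x_then_y(boxes)]
print(ordered)
```

With this key, every box in the left-hand column is emitted before any box in the right-hand column, which is the column-friendly property the changelog entry describes.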

## 0.10.18

80 changes: 32 additions & 48 deletions docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "airtable",
-        "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-        "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-        "--output-dir", "airtable-ingest-output"
-        "--num-processes", "2",
-        "--reprocess",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    import os
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.airtable import airtable
+    if __name__ == "__main__":
+        airtable(
+            verbose=True,
+            read_config=ReadConfig(),
+            partition_config=PartitionConfig(
+                output_dir="airtable-ingest-output",
+                num_processes=2,
+            ),
+            personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+        )
Run via the API
---------------
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "airtable",
-        "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-        "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-        "--output-dir", "airtable-ingest-output"
-        "--num-processes", "2",
-        "--reprocess",
-        "--partition-by-api",
-        "--api-key", "<UNSTRUCTURED-API-KEY>",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    import os
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.airtable import airtable
+    if __name__ == "__main__":
+        airtable(
+            verbose=True,
+            read_config=ReadConfig(),
+            partition_config=PartitionConfig(
+                output_dir="airtable-ingest-output",
+                num_processes=2,
+                partition_by_api=True,
+                api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            ),
+            personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+        )
Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.

91 changes: 32 additions & 59 deletions docs/source/source_connectors/azure.rst
@@ -28,28 +28,20 @@ Run Locally

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "azure",
-        "--remote-url", "abfs://container1/",
-        "--account-name", "azureunstructured1"
-        "--output-dir", "/Output/Path/To/Files",
-        "--num-processes", "2",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.azure import azure
+    if __name__ == "__main__":
+        azure(
+            verbose=True,
+            read_config=ReadConfig(),
+            partition_config=PartitionConfig(
+                output_dir="azure-ingest-output",
+                num_processes=2,
+            ),
+            remote_url="abfs://container1/",
+            account_name="azureunstructured1",
+        )
Run via the API
---------------
@@ -62,43 +54,24 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: shell
unstructured-ingest \
azure \
--remote-url abfs://container1/ \
--account-name azureunstructured1 \
--output-dir azure-ingest-output \
--num-processes 2 \
--partition-by-api \
--api-key "<UNSTRUCTURED-API-KEY>"
.. tab:: Python

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "azure",
-        "--remote-url", "abfs://container1/",
-        "--account-name", "azureunstructured1"
-        "--output-dir", "/Output/Path/To/Files",
-        "--num-processes", "2",
-        "--partition-by-api",
-        "--api-key", "<UNSTRUCTURED-API-KEY>",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    import os
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.azure import azure
+    if __name__ == "__main__":
+        azure(
+            verbose=True,
+            read_config=ReadConfig(),
+            partition_config=PartitionConfig(
+                output_dir="azure-ingest-output",
+                num_processes=2,
+                partition_by_api=True,
+                api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            ),
+            remote_url="abfs://container1/",
+            account_name="azureunstructured1",
+        )
Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
82 changes: 34 additions & 48 deletions docs/source/source_connectors/biomed.rst
@@ -29,29 +29,21 @@ Run Locally

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "biomed",
-        "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-        "--output-dir", "/Output/Path/To/Files",
-        "--num-processes", "2",
-        "--verbose",
-        "--preserve-downloads",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.biomed import biomed
+    if __name__ == "__main__":
+        biomed(
+            verbose=True,
+            read_config=ReadConfig(
+                preserve_downloads=True,
+            ),
+            partition_config=PartitionConfig(
+                output_dir="biomed-ingest-output-path",
+                num_processes=2,
+            ),
+            path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+        )
Run via the API
---------------
@@ -78,31 +70,25 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: python
-    import subprocess
-    command = [
-        "unstructured-ingest",
-        "biomed",
-        "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-        "--output-dir", "/Output/Path/To/Files",
-        "--num-processes", "2",
-        "--verbose",
-        "--preserve-downloads",
-        "--partition-by-api",
-        "--api-key", "<UNSTRUCTURED-API-KEY>",
-    ]
-    # Run the command
-    process = subprocess.Popen(command, stdout=subprocess.PIPE)
-    output, error = process.communicate()
-    # Print output
-    if process.returncode == 0:
-        print('Command executed successfully. Output:')
-        print(output.decode())
-    else:
-        print('Command failed. Error:')
-        print(error.decode())
+    import os
+    from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+    from unstructured.ingest.runner.biomed import biomed
+    if __name__ == "__main__":
+        biomed(
+            verbose=True,
+            read_config=ReadConfig(
+                preserve_downloads=True,
+            ),
+            partition_config=PartitionConfig(
+                output_dir="biomed-ingest-output-path",
+                num_processes=2,
+                partition_by_api=True,
+                api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            ),
+            path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+        )
Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
