Roman/s3 minio all cloud support (#1606)
### Description
Exposes the endpoint URL as an access kwarg when using the s3 filesystem
library (`s3fs`) via the fsspec abstraction. This allows any non-AWS data
provider that supports the S3 protocol (e.g. MinIO) to be used with the S3
connector.

Closes out #950
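
As a quick illustration (not code from this PR), the change amounts to letting callers pass a custom endpoint through to the fsspec/s3fs layer. A minimal sketch, assuming a local MinIO server on port 9000 with the stock `minioadmin` dev credentials and the test bucket used elsewhere in this PR:

```python
# Sketch only: point fsspec's s3 implementation (s3fs) at a non-AWS,
# S3-compatible endpoint. The endpoint URL and credentials are assumed
# values for a local MinIO dev server, not values taken from this PR.
import fsspec

fs = fsspec.filesystem(
    "s3",
    anon=False,
    key="minioadmin",
    secret="minioadmin",
    endpoint_url="http://localhost:9000",  # the kwarg this PR exposes
)
print(fs.ls("utic-dev-tech-fixtures"))  # list the MinIO test bucket
```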

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
3 people authored Oct 3, 2023
1 parent 1fb4642 commit 8821689
Showing 11 changed files with 161 additions and 6 deletions.
8 changes: 5 additions & 3 deletions CHANGELOG.md
@@ -1,16 +1,18 @@
-## 0.10.19-dev7
+## 0.10.19-dev8

 ### Enhancements

 * **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine-tune table extraction performance; it also improves element detection by adding a deduplication post-processing step in the `hi_res` partitioning of PDFs and images.
 * **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
 * **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the CLI command itself.
-* * **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
+* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
 * **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing.
+
+* **Expose endpoint url for s3 connectors** By allowing the endpoint URL to be explicitly overridden, any non-AWS data provider that supports the S3 protocol (e.g. MinIO) can be used with the S3 connectors.

 ### Features

 ### Fixes

 * **Fixes partition_pdf is_alnum reference bug** Problem: `partition_pdf`, when attempting to get the bounding box from an element, hit a reference-before-assignment error when the first object was not text-extractable. Fix: switched to a flag when the condition is met. Importance: crucial to be able to partition PDFs.
25 changes: 25 additions & 0 deletions scripts/minio-test-helpers/create-and-check-minio.sh
@@ -0,0 +1,25 @@
#!/usr/bin/env bash

SCRIPT_DIR=$(dirname "$(realpath "$0")")

secret_key=minioadmin
access_key=minioadmin
region=us-east-2
endpoint_url=http://localhost:9000
bucket_name=utic-dev-tech-fixtures

function upload() {
  echo "Uploading test content to new bucket in minio"
  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
    aws --output json --endpoint-url $endpoint_url s3api create-bucket --bucket $bucket_name | jq
  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
    aws --endpoint-url $endpoint_url s3 cp "$SCRIPT_DIR"/wiki_movie_plots_small.csv s3://$bucket_name/
}

# Create Minio single server
docker compose version
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml up --wait
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml ps

echo "Cluster is live."
upload
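
For reference, a hedged Python equivalent of the helper's upload step, driving `s3fs` directly against the same endpoint (credentials, bucket, and file path mirror the script above; run it from the same directory):

```python
# Sketch: the helper's create-bucket and copy steps via s3fs instead of
# the aws CLI. All values mirror the shell script above.
import s3fs

fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    endpoint_url="http://localhost:9000",
)
fs.mkdir("utic-dev-tech-fixtures")  # equivalent of `s3api create-bucket`
fs.put(
    "wiki_movie_plots_small.csv",  # equivalent of `aws s3 cp`
    "utic-dev-tech-fixtures/wiki_movie_plots_small.csv",
)
```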
13 changes: 13 additions & 0 deletions scripts/minio-test-helpers/docker-compose.yaml
@@ -0,0 +1,13 @@
services:
  minio:
    image: quay.io/minio/minio
    container_name: minio-test
    ports:
      - 9000:9000
      - 9001:9001
    command: server --console-address ":9001" /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 5s
      timeout: 20s
      retries: 3
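
The healthcheck above polls MinIO's liveness endpoint until the server reports ready. A small standard-library sketch of the same probe (the function name is hypothetical; the URL and timing values are taken from the healthcheck):

```python
# Poll MinIO's liveness endpoint the way the compose healthcheck does:
# up to 3 attempts, 20 s timeout per attempt, 5 s between attempts.
import time
import urllib.request

def wait_for_minio(url: str = "http://localhost:9000/minio/health/live") -> bool:
    for _ in range(3):  # retries: 3
        try:
            with urllib.request.urlopen(url, timeout=20) as resp:  # timeout: 20s
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(5)  # interval: 5s
    return False
```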
31 changes: 31 additions & 0 deletions scripts/minio-test-helpers/wiki_movie_plots_small.csv

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions test_unstructured_ingest/test-ingest-s3-minio.sh
@@ -0,0 +1,46 @@
#!/usr/bin/env bash

set -e


SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=s3-minio
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
DOWNLOAD_DIR=$SCRIPT_DIR/download/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
secret_key=minioadmin
access_key=minioadmin

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh

function cleanup() {
  # Kill the container so the script can be repeatedly run using the same ports
  echo "Stopping Minio Docker container"
  docker-compose -f scripts/minio-test-helpers/docker-compose.yaml down --remove-orphans -v

  cleanup_dir "$OUTPUT_DIR"
}

trap cleanup EXIT

# shellcheck source=/dev/null
scripts/minio-test-helpers/create-and-check-minio.sh
wait

AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key PYTHONPATH=. ./unstructured/ingest/main.py \
  s3 \
  --num-processes "$max_processes" \
  --download-dir "$DOWNLOAD_DIR" \
  --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.data_source.date_modified,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
  --strategy hi_res \
  --preserve-downloads \
  --reprocess \
  --output-dir "$OUTPUT_DIR" \
  --verbose \
  --remote-url s3://utic-dev-tech-fixtures/ \
  --endpoint-url http://localhost:9000


"$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest.sh
@@ -10,6 +10,7 @@ export OMP_THREAD_LIMIT=1

 scripts=(
   'test-ingest-s3.sh'
+  'test-ingest-s3-minio.sh'
   'test-ingest-azure.sh'
   'test-ingest-biomed-api.sh'
   'test-ingest-biomed-path.sh'
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.10.19-dev7" # pragma: no cover
__version__ = "0.10.19-dev8" # pragma: no cover
9 changes: 9 additions & 0 deletions unstructured/ingest/cli/cmds/s3.py
@@ -1,4 +1,5 @@
 import logging
+import typing as t
 from dataclasses import dataclass

 import click
@@ -22,6 +23,7 @@
 @dataclass
 class S3CliConfig(BaseConfig, CliMixin):
     anonymous: bool = False
+    endpoint_url: t.Optional[str] = None

     @staticmethod
     def add_cli_options(cmd: click.Command) -> None:
@@ -32,6 +34,13 @@ def add_cli_options(cmd: click.Command) -> None:
             default=False,
             help="Connect to s3 without local AWS credentials.",
         ),
+        click.Option(
+            ["--endpoint-url"],
+            type=str,
+            default=None,
+            help="Use this endpoint_url, if specified. Needed for "
+            "connecting to non-AWS S3 buckets.",
+        ),
     ]
     cmd.params.extend(options)

6 changes: 5 additions & 1 deletion unstructured/ingest/runner/s3.py
@@ -15,6 +15,7 @@ def s3(
     verbose: bool = False,
     recursive: bool = False,
     anonymous: bool = False,
+    endpoint_url: t.Optional[str] = None,
     writer_type: t.Optional[str] = None,
     writer_kwargs: t.Optional[dict] = None,
     **kwargs,
@@ -31,11 +32,14 @@

     from unstructured.ingest.connector.s3 import S3SourceConnector, SimpleS3Config

+    access_kwargs: t.Dict[str, t.Any] = {"anon": anonymous}
+    if endpoint_url:
+        access_kwargs["endpoint_url"] = endpoint_url
     source_doc_connector = S3SourceConnector(  # type: ignore
         connector_config=SimpleS3Config(
             path=remote_url,
             recursive=recursive,
-            access_kwargs={"anon": anonymous},
+            access_kwargs=access_kwargs,
         ),
         read_config=read_config,
         partition_config=partition_config,
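
This hunk is the heart of the PR: build the access kwargs once, add `endpoint_url` only when it is set (so default AWS behavior is unchanged), and let the dict flow into the fsspec-backed `SimpleS3Config`. A self-contained sketch of that flow (the helper name and the direct `fsspec.filesystem` call are illustrative, not the connector's actual plumbing):

```python
# Sketch: how access_kwargs reaches the filesystem layer. Without
# endpoint_url, s3fs talks to AWS as before; with it, every S3 call is
# redirected to the given S3-compatible provider (e.g. MinIO).
import typing as t

import fsspec

def make_s3_fs(anonymous: bool, endpoint_url: t.Optional[str] = None):
    access_kwargs: t.Dict[str, t.Any] = {"anon": anonymous}
    if endpoint_url:
        access_kwargs["endpoint_url"] = endpoint_url
    return fsspec.filesystem("s3", **access_kwargs)

# e.g. make_s3_fs(False, "http://localhost:9000") targets the MinIO test server
```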
7 changes: 6 additions & 1 deletion unstructured/ingest/runner/writers.py
@@ -9,6 +9,7 @@
 def s3_writer(
     remote_url: str,
     anonymous: bool,
+    endpoint_url: t.Optional[str] = None,
     verbose: bool = False,
     **kwargs,
 ):
@@ -17,11 +18,15 @@ def s3_writer(
         SimpleS3Config,
     )

+    access_kwargs: t.Dict[str, t.Any] = {"anon": anonymous}
+    if endpoint_url:
+        access_kwargs["endpoint_url"] = endpoint_url
+
     return S3DestinationConnector(
         write_config=WriteConfig(),
         connector_config=SimpleS3Config(
             path=remote_url,
-            access_kwargs={"anon": anonymous},
+            access_kwargs=access_kwargs,
         ),
     )

