Skip to content

Commit

Permalink
Merge branch 'main' into chore/process-chipper-hierarchy
Browse files Browse the repository at this point in the history
  • Loading branch information
qued committed Oct 3, 2023
2 parents f78726f + 13453d6 commit 0a52d02
Show file tree
Hide file tree
Showing 35 changed files with 694 additions and 116 deletions.
14 changes: 7 additions & 7 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,26 +1,26 @@
## 0.10.19-dev6
## 0.10.19-dev9

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.

## 0.10.17-dev3

### Enhancements

* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, user's can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
* **Expose endpoint url for s3 connectors** By allowing for the endpoint url to be explicitly overwritten, this allows for any non-AWS data providers supporting the s3 protocol to be supported (i.e. minio).

### Features

### Features

### Fixes

* **Fixes partition_pdf is_alnum reference bug** Problem: The `partition_pdf` when attempt to get bounding box from element experienced a reference before assignment error when the first object is not text extractable. Fix: Switched to a flag when the condition is met. Importance: Crucial to be able to partition with pdf.
* **Fix various cases of HTML text missing after partition**
Problem: Under certain circumstances, text immediately after some HTML tags will be misssing from partition result.
Fix: Updated code to deal with these cases.
Importance: This will ensure the correctness when partitioning HTML and Markdown documents.


## 0.10.18

### Enhancements
Expand Down
20 changes: 10 additions & 10 deletions docs/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down Expand Up @@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
response = requests.post(url, headers=headers, files=files, data=data)
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
Expand Down
Binary file added example-docs/interface-config-guide-p93.pdf
Binary file not shown.
25 changes: 25 additions & 0 deletions scripts/minio-test-helpers/create-and-check-minio.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
#!/usr/bin/env bash

SCRIPT_DIR=$(dirname "$(realpath "$0")")

secret_key=minioadmin
access_key=minioadmin
region=us-east-2
endpoint_url=http://localhost:9000
bucket_name=utic-dev-tech-fixtures

function upload(){
echo "Uploading test content to new bucket in minio"
AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
aws --output json --endpoint-url $endpoint_url s3api create-bucket --bucket $bucket_name | jq
AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
aws --endpoint-url $endpoint_url s3 cp "$SCRIPT_DIR"/wiki_movie_plots_small.csv s3://$bucket_name/
}

# Create Minio single server
docker compose version
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml up --wait
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml ps

echo "Cluster is live."
upload
13 changes: 13 additions & 0 deletions scripts/minio-test-helpers/docker-compose.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
services:
minio:
image: quay.io/minio/minio
container_name: minio-test
ports:
- 9000:9000
- 9001:9001
command: server --console-address ":9001" /data
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
interval: 5s
timeout: 20s
retries: 3
Loading

0 comments on commit 0a52d02

Please sign in to comment.