Merge branch 'main' into chore/process-chipper-hierarchy

Unstructured-IO · Oct 3, 2023 · 0a52d02
2 parents f78726f + 13453d6

Showing 35 changed files with 694 additions and 116 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,26 +1,26 @@
-## 0.10.19-dev6
+## 0.10.19-dev9
 
 ### Enhancements
 
 * **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
+* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
 * **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
-
-## 0.10.17-dev3
-
-### Enhancements
-
 * **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
+* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
+* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overridden makes it possible to support any non-AWS data provider that implements the s3 protocol (e.g. minio).
+
+### Features 
 
 ### Features
 
 ### Fixes
 
+* **Fixes partition_pdf is_alnum reference bug** Problem: When `partition_pdf` attempted to get the bounding box from an element, it hit a reference before assignment error if the first object was not text extractable. Fix: Switched to using a flag when the condition is met. Importance: Crucial to be able to partition PDFs.
 * **Fix various cases of HTML text missing after partition**
   Problem: Under certain circumstances, text immediately after some HTML tags will be missing from the partition result.
   Fix: Updated code to deal with these cases.
   Importance: This will ensure the correctness when partitioning HTML and Markdown documents.
 
-
 ## 0.10.18
 
 ### Enhancements

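The Table chunking entry in the CHANGELOG.md diff above describes splitting a Table's `text` and `text_as_html` into pieces bounded by `max_characters`. The following is a minimal, library-free sketch of that idea, not the `unstructured` implementation: `TableChunkSketch`, `split_fixed`, and `chunk_table` are hypothetical names introduced here, and the fixed-width slicing is an assumption about how the split could work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableChunkSketch:
    """Hypothetical stand-in for a TableChunk element (not the library class)."""
    text: str
    text_as_html: str


def split_fixed(value: str, max_characters: int) -> List[str]:
    """Split a string into consecutive pieces of at most max_characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    return [value[i:i + max_characters] for i in range(0, len(value), max_characters)]


def chunk_table(text: str, text_as_html: str, max_characters: int) -> List[TableChunkSketch]:
    """Pair up fixed-width slices of the plain text and the HTML text."""
    text_parts = split_fixed(text, max_characters)
    html_parts = split_fixed(text_as_html, max_characters)
    # The HTML is usually longer than the plain text, so pad the shorter
    # list with empty strings to keep the two lists aligned.
    count = max(len(text_parts), len(html_parts))
    text_parts += [""] * (count - len(text_parts))
    html_parts += [""] * (count - len(html_parts))
    return [TableChunkSketch(t, h) for t, h in zip(text_parts, html_parts)]
```

Slicing HTML at a fixed width can cut through a tag, so a consumer of such chunks would likely need to rejoin `text_as_html` before rendering it.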
diff --git a/docs/source/api.rst b/docs/source/api.rst
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
       
       file_data['files'].close()
 
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
       file_path = "/Path/To/File"
       file_data = {'files': open(file_path, 'rb')}
 
-      response = requests.post(url, headers=headers, files=files, data=data)
+      response = requests.post(url, headers=headers, files=file_data, data=data)
 
       file_data['files'].close()
 

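Every api.rst hunk above makes the same fix: the snippets passed an undefined name `files` to `requests.post` where the dictionary is actually called `file_data`. One way to make that class of mistake harder is to open the file in a `with` block, so the handle is both named once and closed even if the request raises. This is a sketch under assumptions: `post_document` and its `post` parameter are hypothetical helpers introduced here, and `requests` is imported lazily so the function can be exercised without it.

```python
from typing import Any, Callable, Dict, Optional


def post_document(
    url: str,
    file_path: str,
    headers: Dict[str, str],
    data: Dict[str, Any],
    post: Optional[Callable[..., Any]] = None,
) -> Any:
    """Upload one file as multipart form data and return the response."""
    if post is None:
        # Third-party dependency; imported here so the sketch loads without it.
        import requests
        post = requests.post
    # The context manager closes the handle on success and on error,
    # replacing the manual file_data['files'].close() call.
    with open(file_path, "rb") as handle:
        file_data = {"files": handle}
        return post(url, headers=headers, files=file_data, data=data)
```

Accepting the sender as a parameter keeps the sketch testable offline. In ordinary use the argument would be omitted and `requests.post` used directly.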
diff --git a/example-docs/interface-config-guide-p93.pdf b/example-docs/interface-config-guide-p93.pdf
diff --git a/scripts/minio-test-helpers/create-and-check-minio.sh b/scripts/minio-test-helpers/create-and-check-minio.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+
+SCRIPT_DIR=$(dirname "$(realpath "$0")")
+
+secret_key=minioadmin
+access_key=minioadmin
+region=us-east-2
+endpoint_url=http://localhost:9000
+bucket_name=utic-dev-tech-fixtures
+
+function upload(){
+  echo "Uploading test content to new bucket in minio"
+  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
+  aws --output json --endpoint-url $endpoint_url s3api create-bucket --bucket $bucket_name | jq
+  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
+  aws --endpoint-url $endpoint_url s3 cp "$SCRIPT_DIR"/wiki_movie_plots_small.csv s3://$bucket_name/
+}
+
+# Create Minio single server
+docker compose version
+docker compose -f "$SCRIPT_DIR"/docker-compose.yaml up --wait
+docker compose -f "$SCRIPT_DIR"/docker-compose.yaml ps
+
+echo "Cluster is live."
+upload
diff --git a/scripts/minio-test-helpers/docker-compose.yaml b/scripts/minio-test-helpers/docker-compose.yaml
@@ -0,0 +1,13 @@
+services:
+  minio:
+    image: quay.io/minio/minio
+    container_name: minio-test
+    ports:
+      - 9000:9000
+      - 9001:9001
+    command: server --console-address ":9001" /data
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
+      interval: 5s
+      timeout: 20s
+      retries: 3
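
The compose healthcheck above probes `http://localhost:9000/minio/health/live` with an interval of 5s, a timeout of 20s, and 3 retries. A client-side analogue of that wait loop might look like the sketch below. It only approximates what Docker does internally, and `http_probe` and `wait_until_healthy` are hypothetical helper names introduced here.

```python
import http.client
import time
import urllib.request
from typing import Callable

# URL and timing values are copied from the compose healthcheck.
HEALTH_URL = "http://localhost:9000/minio/health/live"


def http_probe(url: str = HEALTH_URL, timeout: float = 20.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 3,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `retries` times, sleeping `interval` between tries."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

Calling `wait_until_healthy(http_probe)` before the upload step would give a script-level check that does not depend on `docker compose up --wait`.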