
Commit

…into newelh/hierarchy-fast/pptx
newelh committed Oct 4, 2023
2 parents fcc239d + 19d8bff · commit 0f92a20
Showing 101 changed files with 5,244 additions and 6,964 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -255,6 +255,10 @@ jobs:
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
+ - name: Setup docker-compose
+   uses: KengoTODA/actions-setup-docker-compose@v1
+   with:
+     version: '2.22.0'
- name: Test Ingest (unit)
run: |
source .venv/bin/activate
6 changes: 5 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -9,7 +9,7 @@ env:

jobs:
setup:
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-m
if: |
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'push' && contains(github.event.head_commit.message, 'ingest-test-fixtures-update'))
@@ -56,6 +56,10 @@ jobs:
source .venv/bin/activate
mkdir "$NLTK_DATA"
make install-ci
+ - name: Setup docker-compose
+   uses: KengoTODA/actions-setup-docker-compose@v1
+   with:
+     version: '2.22.0'
- name: Update test fixtures
env:
AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
10 changes: 8 additions & 2 deletions CHANGELOG.md
@@ -1,21 +1,27 @@
- ## 0.10.19-dev4
+ ## 0.10.19-dev11

### Enhancements

* **Bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable so its performance can be fine-tuned; it also improves element detection by adding a deduplication post-processing step to the `hi_res` partitioning of PDFs and images.
* **Detect text in HTML heading tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization: if text is in an HTML heading tag and is not a list item, address, or narrative text, it is categorized as a title. (A usage sketch follows this file's diff.)
* **Update Python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the CLI command itself.
* **Adds data source properties to SharePoint, Outlook, OneDrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing. (A usage sketch follows this file's diff.)
* **Expose endpoint URL for s3 connectors** By allowing the endpoint URL to be explicitly overridden, any non-AWS data provider supporting the s3 protocol (e.g. MinIO) can be supported.
* **Change default `hi_res` model for pdf/image partition to `yolox`** Partitioning PDFs and images with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
+ * **Improve title detection in pptx documents** The default title textboxes on a pptx slide are now categorized as titles.
+ * **Improve hierarchy detection in pptx documents** List items and other slide text are now properly nested under the slide title. This will enable better chunking of pptx documents.

### Features

### Fixes

* **Fixes partition_pdf is_alnum reference bug** Problem: `partition_pdf` hit a reference-before-assignment error when attempting to get the bounding box of an element whose first object was not text-extractable. Fix: switched to a flag that is set when the condition is met. Importance: crucial for being able to partition PDFs.
+ * **Fix various cases of HTML text missing after partition**
+   Problem: Under certain circumstances, text immediately after some HTML tags would be missing from the partition result.
+   Fix: Updated the code to handle these cases.
+   Importance: This ensures correctness when partitioning HTML and Markdown documents.


## 0.10.18

### Enhancements
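Two of the changelog enhancements above are easiest to grasp from a short example. First, heading-tag title detection: a minimal sketch, with a trivial HTML snippet invented for illustration (the category assigned to the second element depends on the usual text-type heuristics):

```python
from unstructured.partition.html import partition_html

# Text inside <h1>..<h6> that is not a list item, address, or narrative
# text should now be categorized as a Title element.
elements = partition_html(text="<h1>Quarterly Report</h1><p>Revenue grew modestly this quarter.</p>")
print([(el.category, el.text) for el in elements])
```

Second, Table chunking via `add_chunking_strategy`: a sketch under the assumption that the decorated partition function forwards `max_characters` to the chunker; the input file name is hypothetical:

```python
from unstructured.partition.docx import partition_docx

# With a chunking strategy enabled, Table elements whose text exceeds
# max_characters are split into TableChunk elements that carry both
# `text` and `text_as_html`, ready for downstream use.
chunks = partition_docx(
    filename="tables.docx",  # hypothetical input
    chunking_strategy="by_title",
    max_characters=1500,
)
```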
20 changes: 10 additions & 10 deletions docs/source/api.rst
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
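(Aside: the hunks in this file show only the changed ``requests.post`` line. For context, a self-contained sketch of the corrected pattern — the endpoint URL, headers, and form data below are illustrative values, not taken from this diff:)

.. code:: python

   import requests

   url = "https://api.unstructured.io/general/v0/general"
   headers = {"accept": "application/json", "unstructured-api-key": "<UNSTRUCTURED-API-KEY>"}
   data = {"coordinates": "true"}  # example form field from the section above

   file_path = "/Path/To/File"
   file_data = {"files": open(file_path, "rb")}

   # The fix: pass the dict that actually holds the open file handle,
   # not the undefined name `files`.
   response = requests.post(url, headers=headers, files=file_data, data=data)
   file_data["files"].close()

   print(response.json())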
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
- response = requests.post(url, headers=headers, files=files, data=data)
+ response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
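As one more worked sketch tying together the ``hi_res`` sections above: selecting the ``hi_res`` strategy and naming a model explicitly. The ``hi_res_model_name`` form field is an assumption inferred from the truncated hunk header above; verify against the current API docs:

.. code:: python

   import requests

   url = "https://api.unstructured.io/general/v0/general"
   headers = {"accept": "application/json", "unstructured-api-key": "<UNSTRUCTURED-API-KEY>"}

   # Assumed form fields: strategy selects hi_res partitioning and
   # hi_res_model_name picks the layout model (e.g. the Chipper model).
   data = {"strategy": "hi_res", "hi_res_model_name": "chipper"}

   file_data = {"files": open("/Path/To/File.pdf", "rb")}
   response = requests.post(url, headers=headers, files=file_data, data=data)
   file_data["files"].close()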
80 changes: 32 additions & 48 deletions docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "airtable",
-     "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-     "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-     "--output-dir", "airtable-ingest-output"
-     "--num-processes", "2",
-     "--reprocess",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.airtable import airtable
+ if __name__ == "__main__":
+     airtable(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="airtable-ingest-output",
+             num_processes=2,
+         ),
+         personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+     )
Run via the API
---------------
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "airtable",
-     "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-     "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-     "--output-dir", "airtable-ingest-output"
-     "--num-processes", "2",
-     "--reprocess",
-     "--partition-by-api",
-     "--api-key", "<UNSTRUCTURED-API-KEY>",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.airtable import airtable
+ if __name__ == "__main__":
+     airtable(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="airtable-ingest-output",
+             num_processes=2,
+             partition_by_api=True,
+             api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+         ),
+         personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+     )
Additionally, you will need to pass the ``--partition-endpoint`` argument if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
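In the Python-runner form shown above, the rough equivalent of ``--partition-endpoint`` would be a ``partition_endpoint`` field on ``PartitionConfig`` — an assumption, since this diff does not show that field; the localhost URL is the ``unstructured-api`` default:

.. code:: python

   from unstructured.ingest.interfaces import PartitionConfig

   # Assumption: PartitionConfig accepts partition_endpoint for pointing
   # the ingest run at a locally hosted API instead of the hosted one.
   partition_config = PartitionConfig(
       output_dir="airtable-ingest-output",
       num_processes=2,
       partition_by_api=True,
       partition_endpoint="http://localhost:8000/general/v0/general",
   )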

91 changes: 32 additions & 59 deletions docs/source/source_connectors/azure.rst
@@ -28,28 +28,20 @@ Run Locally

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "azure",
-     "--remote-url", "abfs://container1/",
-     "--account-name", "azureunstructured1"
-     "--output-dir", "/Output/Path/To/Files",
-     "--num-processes", "2",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.azure import azure
+ if __name__ == "__main__":
+     azure(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="azure-ingest-output",
+             num_processes=2,
+         ),
+         remote_url="abfs://container1/",
+         account_name="azureunstructured1",
+     )
Run via the API
---------------
@@ -62,43 +54,24 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

.. code:: shell
unstructured-ingest \
azure \
--remote-url abfs://container1/ \
--account-name azureunstructured1 \
--output-dir azure-ingest-output \
--num-processes 2 \
--partition-by-api \
--api-key "<UNSTRUCTURED-API-KEY>"
.. tab:: Python

.. code:: python
- import subprocess
- command = [
-     "unstructured-ingest",
-     "azure",
-     "--remote-url", "abfs://container1/",
-     "--account-name", "azureunstructured1"
-     "--output-dir", "/Output/Path/To/Files",
-     "--num-processes", "2",
-     "--partition-by-api",
-     "--api-key", "<UNSTRUCTURED-API-KEY>",
- ]
- # Run the command
- process = subprocess.Popen(command, stdout=subprocess.PIPE)
- output, error = process.communicate()
- # Print output
- if process.returncode == 0:
-     print('Command executed successfully. Output:')
-     print(output.decode())
- else:
-     print('Command failed. Error:')
-     print(error.decode())
+ import os
+ from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+ from unstructured.ingest.runner.azure import azure
+ if __name__ == "__main__":
+     azure(
+         verbose=True,
+         read_config=ReadConfig(),
+         partition_config=PartitionConfig(
+             output_dir="azure-ingest-output",
+             num_processes=2,
+             partition_by_api=True,
+             api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+         ),
+         remote_url="abfs://container1/",
+         account_name="azureunstructured1",
+     )
Additionally, you will need to pass the ``--partition-endpoint`` argument if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.