Merge branch 'main' into fix/1209-tweak-xycut-ordering-output

# Conflicts:
#   CHANGELOG.md
#   unstructured/__version__.py
Unstructured-IO · Oct 3, 2023 · 8a88e93
2 parents 8c51d75 + 8821689
commit 8a88e93
Showing 90 changed files with 1,702 additions and 1,413 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -255,6 +255,10 @@ jobs:
         source .venv/bin/activate
         mkdir "$NLTK_DATA"
         make install-ci
+    - name: Setup docker-compose
+      uses: KengoTODA/actions-setup-docker-compose@v1
+      with:
+        version: '2.22.0'
     - name: Test Ingest (unit)
       run: |
         source .venv/bin/activate

diff --git a/.github/workflows/ingest-test-fixtures-update-pr.yml b/.github/workflows/ingest-test-fixtures-update-pr.yml
@@ -9,7 +9,7 @@ env:
 
 jobs:
   setup:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-m
     if: |
       github.event_name == 'workflow_dispatch' ||
       (github.event_name == 'push' && contains(github.event.head_commit.message, 'ingest-test-fixtures-update'))
@@ -56,6 +56,10 @@ jobs:
           source .venv/bin/activate
           mkdir "$NLTK_DATA"
           make install-ci
+      - name: Setup docker-compose
+        uses: KengoTODA/actions-setup-docker-compose@v1
+        with:
+          version: '2.22.0'
       - name: Update test fixtures
         env:
           AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,14 +1,26 @@
-## 0.10.19-dev5
+## 0.10.19-dev9
 
 ### Enhancements
 
 * **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
+* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
+* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
+* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
+* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
+* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overridden adds support for any non-AWS data provider that implements the s3 protocol (e.g. minio).
+
+### Features 
 
 ### Features
 
 ### Fixes
 
 * **Tweak `xy-cut` ordering output to be more column friendly** While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
+* **Fixes partition_pdf is_alnum reference bug** Problem: When attempting to get the bounding box from an element, `partition_pdf` hit a reference-before-assignment error if the first object was not text extractable. Fix: Switched to a flag when the condition is met. Importance: Crucial for being able to partition PDFs.
+* **Fix various cases of HTML text missing after partition**
+  Problem: Under certain circumstances, text immediately after some HTML tags will be missing from the partition result.
+  Fix: Updated code to deal with these cases.
+  Importance: This ensures correctness when partitioning HTML and Markdown documents.
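
The `xy-cut` entry above describes a swapped ordering that sorts by X coordinate first and then by Y coordinate. A minimal Python sketch of that sort key follows, assuming each element is reduced to a hypothetical `(x, y, text)` tuple for its top-left corner. It illustrates the idea only and is not the code in `unstructured`.

```python
from typing import List, Tuple

# Hypothetical element representation: (left x, top y, text).
Element = Tuple[float, float, str]


def sort_x_then_y(elements: List[Element]) -> List[Element]:
    """Order elements by X coordinate first, then by Y coordinate."""
    return sorted(elements, key=lambda el: (el[0], el[1]))


# Two-column page: x=50 is the left column, x=300 is the right column.
page = [
    (300.0, 100.0, "right-top"),
    (50.0, 400.0, "left-bottom"),
    (300.0, 400.0, "right-bottom"),
    (50.0, 100.0, "left-top"),
]
ordered = [text for _, _, text in sort_x_then_y(page)]
print(ordered)  # ['left-top', 'left-bottom', 'right-top', 'right-bottom']
```

The sketch assumes every element in a column shares the same X value. Real layouts would likely need some tolerance or column grouping before a plain sort like this gives a sensible reading order.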
 
 ## 0.10.18
 

diff --git a/docs/source/source_connectors/airtable.rst b/docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally
 
       .. code:: python
 
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "airtable",
-          "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-          "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-          "--output-dir", "airtable-ingest-output"
-          "--num-processes", "2",
-          "--reprocess",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.airtable import airtable
+
+        if __name__ == "__main__":
+            airtable(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="airtable-ingest-output",
+                    num_processes=2,
+                ),
+                personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+            )
 
 Run via the API
 ---------------
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
 
       .. code:: python
 
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "airtable",
-          "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-          "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-          "--output-dir", "airtable-ingest-output"
-          "--num-processes", "2",
-          "--reprocess",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.airtable import airtable
+
+        if __name__ == "__main__":
+            airtable(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="airtable-ingest-output",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+            )
 
 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
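
One detail in the removed `subprocess` snippet is worth noting: there is no comma after `"airtable-ingest-output"`, so Python joins it with the following string literal and the process would receive one malformed argument. The sketch below reproduces only that behavior with a shortened argument list; it does not invoke `unstructured-ingest`.

```python
# Adjacent string literals are concatenated by Python, so a missing comma
# silently merges two intended list items into one.
command = [
    "--output-dir", "airtable-ingest-output"  # <- missing comma here
    "--num-processes", "2",
]

print(len(command))  # 3, not the intended 4
print(command[1])    # airtable-ingest-output--num-processes
```

Passing keyword arguments to the runner function, as the replacement code does, avoids this class of mistake because a missing comma between keyword arguments is a syntax error.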
 

diff --git a/docs/source/source_connectors/azure.rst b/docs/source/source_connectors/azure.rst
@@ -28,28 +28,20 @@ Run Locally
 
       .. code:: python
 
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "azure",
-          "--remote-url", "abfs://container1/",
-          "--account-name", "azureunstructured1"
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.azure import azure
+
+        if __name__ == "__main__":
+            azure(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="azure-ingest-output",
+                    num_processes=2,
+                ),
+                remote_url="abfs://container1/",
+                account_name="azureunstructured1",
+            )
 
 Run via the API
 ---------------
@@ -62,43 +54,24 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
 
       .. code:: shell
 
-        unstructured-ingest \
-          azure \
-          --remote-url abfs://container1/ \
-          --account-name azureunstructured1 \
-          --output-dir azure-ingest-output \
-          --num-processes 2 \
-          --partition-by-api \
-          --api-key "<UNSTRUCTURED-API-KEY>"
-
-   .. tab:: Python
-
-      .. code:: python
-
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "azure",
-          "--remote-url", "abfs://container1/",
-          "--account-name", "azureunstructured1"
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.azure import azure
+
+        if __name__ == "__main__":
+            azure(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="azure-ingest-output",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                remote_url="abfs://container1/",
+                account_name="azureunstructured1",
+            )
 
 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
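
The rewritten examples read credentials with `os.getenv`, which returns `None` when the variable is unset. A small sketch of failing fast before an ingest run starts is shown below; `require_env` and `EXAMPLE_API_KEY` are hypothetical names made up for this illustration and are not part of `unstructured`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a variable that is set only for this example.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))  # dummy-value
```

A value obtained this way could be passed as `api_key=require_env("UNSTRUCTURED_API_KEY")` in place of the bare `os.getenv` call, so that a missing key is reported before any documents are processed.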
 

diff --git a/docs/source/source_connectors/biomed.rst b/docs/source/source_connectors/biomed.rst
@@ -29,29 +29,21 @@ Run Locally
 
       .. code:: python
 
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "biomed",
-          "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--verbose",
-          "--preserve-downloads",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.biomed import biomed
+
+        if __name__ == "__main__":
+            biomed(
+                verbose=True,
+                read_config=ReadConfig(
+                    preserve_downloads=True,
+                ),
+                partition_config=PartitionConfig(
+                    output_dir="biomed-ingest-output-path",
+                    num_processes=2,
+                ),
+                path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+            )
 
 Run via the API
 ---------------
@@ -78,31 +70,25 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
 
       .. code:: python
 
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "biomed",
-          "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--verbose",
-          "--preserve-downloads",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.biomed import biomed
+
+        if __name__ == "__main__":
+            biomed(
+                verbose=True,
+                read_config=ReadConfig(
+                    preserve_downloads=True,
+                ),
+                partition_config=PartitionConfig(
+                    output_dir="biomed-ingest-output-path",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+            )
 
 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.