deploy: b534b2a
cragwolfe committed Sep 16, 2023
1 parent f47a0c9 commit a891eb4
Showing 89 changed files with 3,351 additions and 788 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 1f40850e79e2bb23b0edf30af367ffa5
config: ebdf42ef28445b576fd8bb65dffa2e1d
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion _sources/api.rst.txt
@@ -460,7 +460,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
Table Extraction for other filetypes
------------------------------------

We also provide support for enabling and disabling table extraction for file types other than PDF files. Set parameter ``skip_infer_table_types`` to specify the document types that you want to skip table extraction with. By default, we skip table extraction for PDFs and Images, which are ``pdf``, ``jpg`` and ``png``. Note that table extraction only works with ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:
We also provide support for enabling and disabling table extraction for file types other than PDF files. Set the ``skip_infer_table_types`` parameter to specify the document types for which you want to skip table extraction. By default, we skip table extraction for PDFs, Images, and Excel files, which are ``pdf``, ``jpg``, ``png``, ``xlsx``, and ``xls``. Note that table extraction only works with the ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:

.. tabs::

1 change: 1 addition & 0 deletions _sources/best_practices.rst.txt
@@ -13,3 +13,4 @@ High-level overview of available strategies and models in ``Unstructured`` libra
:maxdepth: 1

best_practices/strategies
best_practices/models
91 changes: 91 additions & 0 deletions _sources/best_practices/models.rst.txt
@@ -0,0 +1,91 @@
.. role:: raw-html(raw)
:format: html

Models
======

Depending on your needs, ``Unstructured`` provides OCR-based and Transformer-based models to detect elements in documents. These models are useful for detecting complex layouts in documents and predicting element types.

**Basic usage:**

.. code:: python

    elements = partition(filename=filename, strategy='hi_res', model_name='chipper')

Notes:

* To use the detection model, set ``strategy='hi_res'``.
* When ``model_name`` is not defined, the inferences will fall back to the default model.

:raw-html:`<br />`
**List of Available Models in the Partitions:**

* ``detectron2_onnx`` is a Computer Vision model by Facebook AI that provides object detection and segmentation algorithms with ONNX Runtime. It is the fastest model with the ``hi_res`` strategy.
* ``yolox`` is a single-stage real-time object detector that modifies YOLOv3 with a DarkNet53 backbone.
* ``yolox_quantized``: runs faster than YoloX and its speed is closer to Detectron2.
* ``chipper`` (beta version): the Chipper model is Unstructured’s in-house image-to-text model based on transformer-based Visual Document Understanding (VDU) models.


Using a Non-Default Model
^^^^^^^^^^^^^^^^^^^^^^^^^

``Unstructured`` will download the model specified in the ``UNSTRUCTURED_HI_RES_MODEL_NAME`` environment variable. If the variable is not defined, it will download the default model.
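
The lookup described above can be sketched as follows. Note this is illustrative only: the actual resolution happens inside ``unstructured-inference``, and ``detectron2_onnx`` is assumed as the default name purely for illustration (the real default may differ by version).

```python
import os

# Assumed default for illustration; the real default model name is
# resolved inside unstructured-inference and may differ across versions.
DEFAULT_HI_RES_MODEL = "detectron2_onnx"

def resolve_hi_res_model_name() -> str:
    """Return the model name from the environment, else the default."""
    return os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME", DEFAULT_HI_RES_MODEL)
```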

There are three ways to use a non-default model:

1. Store the model name in the environment variable

.. code:: python

    import os
    from unstructured.partition.pdf import partition_pdf

    os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = "yolox"
    out_yolox = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

2. Pass the model name in the ``partition`` function.

.. code:: python

    from unstructured.partition.auto import partition

    filename = "example-docs/layout-parser-paper-fast.pdf"
    elements = partition(filename=filename, strategy='hi_res', model_name='yolox')

3. Use the `unstructured-inference <url_>`_ library.

.. _url: https://github.com/Unstructured-IO/unstructured-inference

.. code:: python

    from unstructured_inference.models.base import get_model
    from unstructured_inference.inference.layout import DocumentLayout

    model = get_model("yolox")
    layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model)

Bring Your Own Models
^^^^^^^^^^^^^^^^^^^^^

**Utilizing Layout Detection Model Zoo**

In the `LayoutParser <layout_>`_ library, you can use various pre-trained models available in the `model zoo <modelzoo_>`_ for document layout analysis. Here's a guide on leveraging this feature using the ``UnstructuredDetectronModel`` class in ``unstructured-inference`` library.

The ``UnstructuredDetectronModel`` class in ``unstructured_inference.models.detectron2`` uses the ``faster_rcnn_R_50_FPN_3x`` model pretrained on ``DocLayNet``, but any model in the model zoo can be used by passing different construction parameters. ``UnstructuredDetectronModel`` is a light wrapper around LayoutParser's ``Detectron2LayoutModel`` object and accepts the same arguments.

.. _modelzoo: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html

.. _layout: https://layout-parser.readthedocs.io/en/latest/api_doc/models.html#layoutparser.models.Detectron2LayoutModel
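
For illustration, selecting a different model-zoo checkpoint comes down to passing different construction parameters. The kwargs below mirror the ``config_path`` / ``label_map`` arguments of LayoutParser's ``Detectron2LayoutModel``; the specific config path and label map are assumptions taken from LayoutParser's PubLayNet example, not verified values:

```python
# Hypothetical construction parameters for a model-zoo checkpoint, mirroring
# LayoutParser's Detectron2LayoutModel signature. The config path and label
# map below follow LayoutParser's PubLayNet example and are illustrative.
zoo_model_kwargs = {
    "config_path": "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    "label_map": {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
}

# With unstructured-inference installed, these kwargs could be passed
# through, e.g.:
#   from unstructured_inference.models.detectron2 import UnstructuredDetectronModel
#   model = UnstructuredDetectronModel(**zoo_model_kwargs)
```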

**Using Your Own Object Detection Model**

To seamlessly integrate your custom detection and extraction models into the ``unstructured_inference`` pipeline, start by wrapping your model within the ``UnstructuredObjectDetectionModel`` class. This class acts as an intermediary between your detection model and the Unstructured workflow.

Ensure your ``UnstructuredObjectDetectionModel`` subclass incorporates two vital methods:

1. The ``predict`` method, which accepts a ``PIL.Image.Image`` and returns a list of ``LayoutElement`` objects, communicating your model's results.
2. The ``initialize`` method, which loads and prepares your model for inference.

It's important that your model's outputs, specifically from the ``predict`` method, integrate smoothly with the ``DocumentLayout`` class.
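
A minimal sketch of the two methods above, using simplified stdlib stand-ins rather than the real ``unstructured_inference`` types (``LayoutElement`` here is a hypothetical reduction of the real class, and the base class is omitted so the sketch is self-contained):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class LayoutElement:
    # Hypothetical, simplified stand-in for the real LayoutElement in
    # unstructured_inference.inference.layout (illustration only).
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""
    type: str = ""

class MyDetectionModel:
    """Sketch of the interface an UnstructuredObjectDetectionModel
    subclass is expected to provide: initialize() loads the model,
    predict() maps an image to a list of layout elements."""

    def initialize(self, model_path: str) -> None:
        # A real implementation would load weights here; recording the
        # path stands in for an actual model load.
        self.model_path = model_path

    def predict(self, image: Any) -> List[LayoutElement]:
        # A real implementation would run inference on `image` (a
        # PIL.Image.Image); this fixed detection shows the return shape.
        return [LayoutElement(x1=0, y1=0, x2=100, y2=20, text="Example", type="Title")]
```

In practice, the class would subclass ``UnstructuredObjectDetectionModel`` and return the library's own ``LayoutElement`` objects so the results flow into ``DocumentLayout``.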

11 changes: 11 additions & 0 deletions _sources/destination_connectors.rst.txt
@@ -0,0 +1,11 @@
Destination Connectors
======================

Connect to your favorite data storage platforms for effortless batch processing of your files.
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_.

.. toctree::
:maxdepth: 1

destination_connectors/delta_table
67 changes: 67 additions & 0 deletions _sources/destination_connectors/delta_table.rst.txt
@@ -0,0 +1,67 @@
Delta Table
===========
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Delta Table.

First, you'll need to install the delta table dependencies as shown here.

.. code:: shell

    pip install "unstructured[delta-table]"

Run Locally
-----------
The upstream connector can be any of the supported connectors, but for convenience this example shows a sample command using the
upstream delta-table connector. This will create a new table on your local filesystem and will raise an error if that table already exists.

.. tabs::

.. tab:: Shell

      .. code:: shell

         unstructured-ingest \
           delta-table \
           --table-uri s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/ \
           --output-dir delta-table-example \
           --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
           --verbose \
           delta-table \
           --write-column json_data \
           --table-uri delta-table-dest

.. tab:: Python

      .. code:: python

         import subprocess

         command = [
             "unstructured-ingest",
             "delta-table",
             "--table-uri", "s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/",
             "--download-dir", "delta-table-ingest-download",
             "--output-dir", "delta-table-example",
             "--preserve-downloads",
             "--storage_options", "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY",
             "--verbose",
             "delta-table",
             "--write-column", "json_data",
             "--table-uri", "delta-table-dest",
         ]

         # Run the command, capturing both stdout and stderr
         process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         output, error = process.communicate()

         # Print the output
         if process.returncode == 0:
             print('Command executed successfully. Output:')
             print(output.decode())
         else:
             print('Command failed. Error:')
             print(error.decode())

For a full list of the options the CLI accepts, check ``unstructured-ingest <upstream connector> delta-table --help``.

NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
11 changes: 11 additions & 0 deletions _sources/downstream_connectors.rst.txt
@@ -0,0 +1,11 @@
Downstream Connectors
=====================

Connect to your favorite data storage platforms for effortless batch processing of your files.
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_.

.. toctree::
:maxdepth: 1

downstream_connectors/delta_table
10 changes: 7 additions & 3 deletions _sources/index.rst.txt
@@ -17,9 +17,12 @@ Library Documentation
:doc:`bricks`
Learn more about partitioning, cleaning, and staging bricks, including advanced usage patterns.

:doc:`upstream_connectors`
:doc:`source_connectors`
Connect to your favorite data storage platforms for effortless batch processing of your files.

:doc:`destination_connectors`
Connect to your favorite data storage platforms to write your ingest results to.

:doc:`metadata`
Learn more about how metadata is tracked in the ``unstructured`` library.

@@ -43,8 +46,9 @@ Library Documentation
installing
api
bricks
upstream_connectors
source_connectors
destination_connectors
metadata
examples
integrations
best_practices
20 changes: 10 additions & 10 deletions _sources/integrations.rst.txt
@@ -8,39 +8,39 @@ which take a list of ``Element`` objects as input and return formatted dictionar

``Integration with Argilla``
----------------------------
You can convert a list of ``Text`` elements to an `Argilla <https://www.argilla.io/>`_ ``Dataset`` using the `stage_for_argilla <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-argilla>`_ staging brick. Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``, ``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.
You can convert a list of ``Text`` elements to an `Argilla <https://www.argilla.io/>`_ ``Dataset`` using the `stage_for_argilla <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-argilla>`_ staging brick. Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``, ``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.


``Integration with Baseplate``
-------------------------------
`Baseplate <https://docs.baseplate.ai/introduction>`_ is a backend optimized for use with LLMs that has an easy-to-use spreadsheet
interface. The ``unstructured`` library offers a staging brick to convert a list of ``Element`` objects into the
`rows format <https://docs.baseplate.ai/api-reference/documents/overview>`_ required by the Baseplate API. See the
`stage_for_baseplate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-baseplate>`_ documentation for
`stage_for_baseplate <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-baseplate>`_ documentation for
information on how to stage elements for ingestion into Baseplate.


``Integration with Datasaur``
------------------------------
You can format a list of ``Text`` elements as input to token based tasks in `Datasaur <https://datasaur.ai/>`_ using the `stage_for_datasaur <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-datasaur>`_ staging brick. You will obtain a list of dictionaries indexed by the keys ``"text"`` with the content of the element, and ``"entities"`` with an empty list. Follow the link to learn how to customise your entities and for more details on usage.
You can format a list of ``Text`` elements as input to token based tasks in `Datasaur <https://datasaur.ai/>`_ using the `stage_for_datasaur <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-datasaur>`_ staging brick. You will obtain a list of dictionaries indexed by the keys ``"text"`` with the content of the element, and ``"entities"`` with an empty list. Follow the link to learn how to customise your entities and for more details on usage.


``Integration with Hugging Face``
----------------------------------
You can prepare ``Text`` elements for processing in Hugging Face `Transformers <https://huggingface.co/docs/transformers/index>`_
pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ staging brick. You can customise the transformation by defining
pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-transformers>`_ staging brick. You can customise the transformation by defining
the ``buffer`` and ``window_size``, the ``split_function`` and the ``chunk_separator``. If you need to operate on
text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ helper function. Follow the links for more details on usage.
text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-transformers>`_ helper function. Follow the links for more details on usage.


``Integration with Labelbox``
------------------------------
You can format your outputs for use with `LabelBox <https://labelbox.com/>`_ using the `stage_for_label_box <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-box>`_ staging brick. LabelBox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory`` to be uploaded to a cloud storage service (such as S3 buckets) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into LabelBox. Follow the link to see how to generate the ``config.json`` file that can be used with LabelBox, how to upload the staged data files to an S3 bucket, and for more details on usage.
You can format your outputs for use with `LabelBox <https://labelbox.com/>`_ using the `stage_for_label_box <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-label-box>`_ staging brick. LabelBox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory`` to be uploaded to a cloud storage service (such as S3 buckets) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into LabelBox. Follow the link to see how to generate the ``config.json`` file that can be used with LabelBox, how to upload the staged data files to an S3 bucket, and for more details on usage.


``Integration with Label Studio``
----------------------------------
You can format your outputs for upload to `Label Studio <https://labelstud.io/>`_ using the `stage_for_label_studio <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-studio>`_ staging brick. After running ``stage_for_label_studio``, you can write the results
You can format your outputs for upload to `Label Studio <https://labelstud.io/>`_ using the `stage_for_label_studio <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-label-studio>`_ staging brick. After running ``stage_for_label_studio``, you can write the results
to a JSON folder that is ready to be included in a new Label Studio project. You can also include pre-annotations and predictions
as part of your upload.

@@ -85,12 +85,12 @@ See `here <https://llamahub.ai/>`_ for more LlamaHub examples.
``Integration with Pandas``
----------------------------
You can convert a list of ``Element`` objects to a Pandas dataframe with columns for
the text from each element and their types such as ``NarrativeText`` or ``Title`` using the `convert_to_dataframe <https://unstructured-io.github.io/unstructured/bricks.html#convert-to-dataframe>`_ staging brick. Follow the link for more details on usage.
the text from each element and their types such as ``NarrativeText`` or ``Title`` using the `convert_to_dataframe <https://unstructured-io.github.io/unstructured/bricks/staging.html#convert-to-dataframe>`_ staging brick. Follow the link for more details on usage.


``Integration with Prodigy``
-----------------------------
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` |
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` |
``stage_csv_for_prodigy``, you can write the results to a ``.json`` | ``.jsonl`` or a ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.


@@ -99,6 +99,6 @@ You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/
`Weaviate <https://weaviate.io/>`_ is an open-source vector database that allows you to store data objects and vector embeddings
from a variety of ML models. Storing text and embeddings in a vector database such as Weaviate is a key component of the
`emerging LLM tech stack <https://medium.com/@unstructured-io/llms-and-the-emerging-ml-tech-stack-bdb189c8be5c>`_.
See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details
See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate>`_ docs for details
on how to upload ``unstructured`` outputs to Weaviate. An example notebook is also available
`here <https://github.com/Unstructured-IO/unstructured/tree/main/examples/weaviate>`_.
