deploy: b534b2a
cragwolfe committed Sep 16, 2023
1 parent f47a0c9 commit a891eb4
Showing 89 changed files with 3,351 additions and 788 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 1f40850e79e2bb23b0edf30af367ffa5
config: ebdf42ef28445b576fd8bb65dffa2e1d
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion _sources/api.rst.txt
@@ -460,7 +460,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
Table Extraction for other filetypes
------------------------------------

We also provide support for enabling and disabling table extraction for file types other than PDF files. Set parameter ``skip_infer_table_types`` to specify the document types that you want to skip table extraction with. By default, we skip table extraction for PDFs and Images, which are ``pdf``, ``jpg`` and ``png``. Note that table extraction only works with ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:
We also provide support for enabling and disabling table extraction for file types other than PDF files. Set the ``skip_infer_table_types`` parameter to specify the document types for which you want to skip table extraction. By default, we skip table extraction for PDFs, Images, and Excel files, which are ``pdf``, ``jpg``, ``png``, ``xlsx``, and ``xls``. Note that table extraction only works with the ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:

.. tabs::

1 change: 1 addition & 0 deletions _sources/best_practices.rst.txt
@@ -13,3 +13,4 @@ High-level overview of available strategies and models in ``Unstructured`` libra
:maxdepth: 1

best_practices/strategies
best_practices/models
91 changes: 91 additions & 0 deletions _sources/best_practices/models.rst.txt
@@ -0,0 +1,91 @@
.. role:: raw-html(raw)
:format: html

Models
======

Depending on your needs, ``Unstructured`` provides OCR-based and Transformer-based models to detect elements in documents. These models are useful for detecting complex layouts in documents and predicting element types.

**Basic usage:**

.. code:: python

    elements = partition(filename=filename, strategy='hi_res', model_name='chipper')

Notes:

* To use the detection model, set ``strategy='hi_res'``.
* When ``model_name`` is not defined, the inferences will fall back to the default model.

:raw-html:`<br />`
**List of Available Models in the Partitions:**

* ``detectron2_onnx`` is a Computer Vision model by Facebook AI that provides object detection and segmentation algorithms with ONNX Runtime. It is the fastest model with the ``hi_res`` strategy.
* ``yolox`` is a single-stage real-time object detector that modifies YOLOv3 with a DarkNet53 backbone.
* ``yolox_quantized``: runs faster than YoloX and its speed is closer to Detectron2.
* ``chipper`` (beta version): the Chipper model is Unstructured’s in-house image-to-text model based on transformer-based Visual Document Understanding (VDU) models.


Using a Non-Default Model
^^^^^^^^^^^^^^^^^^^^^^^^^

``Unstructured`` will download the model specified in the ``UNSTRUCTURED_HI_RES_MODEL_NAME`` environment variable. If the variable is not defined, it will download the default model.
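
The lookup described above can be sketched as follows. Note this is illustrative only: the actual resolution happens inside ``unstructured-inference``, and ``detectron2_onnx`` is assumed as the default name purely for illustration (the real default may differ by version).

```python
import os

# Assumed default for illustration; the real default model name is
# resolved inside unstructured-inference and may differ across versions.
DEFAULT_HI_RES_MODEL = "detectron2_onnx"

def resolve_hi_res_model_name() -> str:
    """Return the model name from the environment, else the default."""
    return os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME", DEFAULT_HI_RES_MODEL)
```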

There are three ways to use a non-default model:

1. Store the model name in the environment variable

.. code:: python

    import os
    from unstructured.partition.pdf import partition_pdf

    os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = "yolox"
    out_yolox = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

2. Pass the model name in the ``partition`` function.

.. code:: python

    from unstructured.partition.auto import partition

    filename = "example-docs/layout-parser-paper-fast.pdf"
    elements = partition(filename=filename, strategy='hi_res', model_name='yolox')

3. Use the `unstructured-inference <url_>`_ library.

.. _url: https://github.com/Unstructured-IO/unstructured-inference

.. code:: python

    from unstructured_inference.models.base import get_model
    from unstructured_inference.inference.layout import DocumentLayout

    model = get_model("yolox")
    layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model)

Bring Your Own Models
^^^^^^^^^^^^^^^^^^^^^

**Utilizing Layout Detection Model Zoo**

In the `LayoutParser <layout_>`_ library, you can use various pre-trained models available in the `model zoo <modelzoo_>`_ for document layout analysis. Here's a guide on leveraging this feature using the ``UnstructuredDetectronModel`` class in ``unstructured-inference`` library.

The ``UnstructuredDetectronModel`` class in ``unstructured_inference.models.detectron2`` uses the ``faster_rcnn_R_50_FPN_3x`` model pretrained on ``DocLayNet``, but any model in the model zoo can be used by passing different construction parameters. ``UnstructuredDetectronModel`` is a light wrapper around LayoutParser's ``Detectron2LayoutModel`` object and accepts the same arguments.

.. _modelzoo: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html

.. _layout: https://layout-parser.readthedocs.io/en/latest/api_doc/models.html#layoutparser.models.Detectron2LayoutModel
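
For illustration, selecting a different model-zoo checkpoint comes down to passing different construction parameters. The kwargs below mirror the ``config_path`` / ``label_map`` arguments of LayoutParser's ``Detectron2LayoutModel``; the specific config path and label map are assumptions taken from LayoutParser's PubLayNet example, not verified values:

```python
# Hypothetical construction parameters for a model-zoo checkpoint, mirroring
# LayoutParser's Detectron2LayoutModel signature. The config path and label
# map below follow LayoutParser's PubLayNet example and are illustrative.
zoo_model_kwargs = {
    "config_path": "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    "label_map": {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
}

# With unstructured-inference installed, these kwargs could be passed
# through, e.g.:
#   from unstructured_inference.models.detectron2 import UnstructuredDetectronModel
#   model = UnstructuredDetectronModel(**zoo_model_kwargs)
```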

**Using Your Own Object Detection Model**

To seamlessly integrate your custom detection and extraction models into the ``unstructured_inference`` pipeline, start by wrapping your model within the ``UnstructuredObjectDetectionModel`` class. This class acts as an intermediary between your detection model and the Unstructured workflow.

Ensure your ``UnstructuredObjectDetectionModel`` subclass incorporates two vital methods:

1. The ``predict`` method, which accepts a ``PIL.Image.Image`` and returns a list of ``LayoutElement`` objects, communicating your model's results.
2. The ``initialize`` method, which loads and prepares your model for inference.

It's important that your model's outputs, specifically from the ``predict`` method, integrate smoothly with the ``DocumentLayout`` class.
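
A minimal sketch of the two methods above, using simplified stdlib stand-ins rather than the real ``unstructured_inference`` types (``LayoutElement`` here is a hypothetical reduction of the real class, and the base class is omitted so the sketch is self-contained):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class LayoutElement:
    # Hypothetical, simplified stand-in for the real LayoutElement in
    # unstructured_inference.inference.layout (illustration only).
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""
    type: str = ""

class MyDetectionModel:
    """Sketch of the interface an UnstructuredObjectDetectionModel
    subclass is expected to provide: initialize() loads the model,
    predict() maps an image to a list of layout elements."""

    def initialize(self, model_path: str) -> None:
        # A real implementation would load weights here; recording the
        # path stands in for an actual model load.
        self.model_path = model_path

    def predict(self, image: Any) -> List[LayoutElement]:
        # A real implementation would run inference on `image` (a
        # PIL.Image.Image); this fixed detection shows the return shape.
        return [LayoutElement(x1=0, y1=0, x2=100, y2=20, text="Example", type="Title")]
```

In practice, the class would subclass ``UnstructuredObjectDetectionModel`` and return the library's own ``LayoutElement`` objects so the results flow into ``DocumentLayout``.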

11 changes: 11 additions & 0 deletions _sources/destination_connectors.rst.txt
@@ -0,0 +1,11 @@
Destination Connectors
======================

Connect to your favorite data storage platforms for effortless batch processing of your files.
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_.

.. toctree::
:maxdepth: 1

destination_connectors/delta_table
67 changes: 67 additions & 0 deletions _sources/destination_connectors/delta_table.rst.txt
@@ -0,0 +1,67 @@
Delta Table
===========
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Delta Table.

First, you'll need to install the delta table dependencies as shown here.

.. code:: shell

    pip install "unstructured[delta-table]"

Run Locally
-----------
The upstream connector can be any of the supported connectors, but for convenience this example shows a sample command using the
upstream delta-table connector. This will create a new table on your local filesystem and will raise an error if that table already exists.

.. tabs::

.. tab:: Shell

      .. code:: shell

         unstructured-ingest \
           delta-table \
           --table-uri s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/ \
           --output-dir delta-table-example \
           --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
           --verbose \
           delta-table \
           --write-column json_data \
           --table-uri delta-table-dest

.. tab:: Python

      .. code:: python

         import subprocess

         command = [
             "unstructured-ingest",
             "delta-table",
             "--table-uri", "s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/",
             "--download-dir", "delta-table-ingest-download",
             "--output-dir", "delta-table-example",
             "--preserve-downloads",
             "--storage_options", "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY",
             "--verbose",
             "delta-table",
             "--write-column", "json_data",
             "--table-uri", "delta-table-dest",
         ]

         # Run the command, capturing both stdout and stderr
         process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         output, error = process.communicate()

         # Print the output
         if process.returncode == 0:
             print('Command executed successfully. Output:')
             print(output.decode())
         else:
             print('Command failed. Error:')
             print(error.decode())

For a full list of the options the CLI accepts, check ``unstructured-ingest <upstream connector> delta-table --help``.

NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
11 changes: 11 additions & 0 deletions _sources/downstream_connectors.rst.txt
@@ -0,0 +1,11 @@
Downstream Connectors
=====================

Connect to your favorite data storage platforms for effortless batch processing of your files.
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_.

.. toctree::
:maxdepth: 1

downstream_connectors/delta_table
10 changes: 7 additions & 3 deletions _sources/index.rst.txt
@@ -17,9 +17,12 @@ Library Documentation
:doc:`bricks`
Learn more about partitioning, cleaning, and staging bricks, including advanced usage patterns.

:doc:`upstream_connectors`
:doc:`source_connectors`
Connect to your favorite data storage platforms for effortless batch processing of your files.

:doc:`destination_connectors`
Connect to your favorite data storage platforms to write your ingest results to.

:doc:`metadata`
Learn more about how metadata is tracked in the ``unstructured`` library.

@@ -43,8 +46,9 @@ Library Documentation
installing
api
bricks
upstream_connectors
source_connectors
destination_connectors
metadata
examples
integrations
best_practices
20 changes: 10 additions & 10 deletions _sources/integrations.rst.txt
@@ -8,39 +8,39 @@ which take a list of ``Element`` objects as input and return formatted dictionar

``Integration with Argilla``
----------------------------
You can convert a list of ``Text`` elements to an `Argilla <https://www.argilla.io/>`_ ``Dataset`` using the `stage_for_argilla <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-argilla>`_ staging brick. Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``, ``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.
You can convert a list of ``Text`` elements to an `Argilla <https://www.argilla.io/>`_ ``Dataset`` using the `stage_for_argilla <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-argilla>`_ staging brick. Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``, ``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.


``Integration with Baseplate``
-------------------------------
`Baseplate <https://docs.baseplate.ai/introduction>`_ is a backend optimized for use with LLMs that has an easy-to-use spreadsheet
interface. The ``unstructured`` library offers a staging brick to convert a list of ``Element`` objects into the
`rows format <https://docs.baseplate.ai/api-reference/documents/overview>`_ required by the Baseplate API. See the
`stage_for_baseplate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-baseplate>`_ documentation for
`stage_for_baseplate <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-baseplate>`_ documentation for
information on how to stage elements for ingestion into Baseplate.


``Integration with Datasaur``
------------------------------
You can format a list of ``Text`` elements as input to token based tasks in `Datasaur <https://datasaur.ai/>`_ using the `stage_for_datasaur <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-datasaur>`_ staging brick. You will obtain a list of dictionaries indexed by the keys ``"text"`` with the content of the element, and ``"entities"`` with an empty list. Follow the link to learn how to customise your entities and for more details on usage.
You can format a list of ``Text`` elements as input to token based tasks in `Datasaur <https://datasaur.ai/>`_ using the `stage_for_datasaur <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-datasaur>`_ staging brick. You will obtain a list of dictionaries indexed by the keys ``"text"`` with the content of the element, and ``"entities"`` with an empty list. Follow the link to learn how to customise your entities and for more details on usage.


``Integration with Hugging Face``
----------------------------------
You can prepare ``Text`` elements for processing in Hugging Face `Transformers <https://huggingface.co/docs/transformers/index>`_
pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ staging brick. You can customise the transformation by defining
pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-transformers>`_ staging brick. You can customise the transformation by defining
the ``buffer`` and ``window_size``, the ``split_function`` and the ``chunk_separator``. If you need to operate on
text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ helper function. Follow the links for more details on usage.
text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-transformers>`_ helper function. Follow the links for more details on usage.


``Integration with Labelbox``
------------------------------
You can format your outputs for use with `LabelBox <https://labelbox.com/>`_ using the `stage_for_label_box <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-box>`_ staging brick. LabelBox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory`` to be uploaded to a cloud storage service (such as S3 buckets) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into LabelBox. Follow the link to see how to generate the ``config.json`` file that can be used with LabelBox, how to upload the staged data files to an S3 bucket, and for more details on usage.
You can format your outputs for use with `LabelBox <https://labelbox.com/>`_ using the `stage_for_label_box <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-label-box>`_ staging brick. LabelBox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory`` to be uploaded to a cloud storage service (such as S3 buckets) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into LabelBox. Follow the link to see how to generate the ``config.json`` file that can be used with LabelBox, how to upload the staged data files to an S3 bucket, and for more details on usage.


``Integration with Label Studio``
----------------------------------
You can format your outputs for upload to `Label Studio <https://labelstud.io/>`_ using the `stage_for_label_studio <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-studio>`_ staging brick. After running ``stage_for_label_studio``, you can write the results
You can format your outputs for upload to `Label Studio <https://labelstud.io/>`_ using the `stage_for_label_studio <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-label-studio>`_ staging brick. After running ``stage_for_label_studio``, you can write the results
to a JSON folder that is ready to be included in a new Label Studio project. You can also include pre-annotations and predictions
as part of your upload.

@@ -85,12 +85,12 @@ See `here <https://llamahub.ai/>`_ for more LlamaHub examples.
``Integration with Pandas``
----------------------------
You can convert a list of ``Element`` objects to a Pandas dataframe with columns for
the text from each element and their types such as ``NarrativeText`` or ``Title`` using the `convert_to_dataframe <https://unstructured-io.github.io/unstructured/bricks.html#convert-to-dataframe>`_ staging brick. Follow the link for more details on usage.
the text from each element and their types such as ``NarrativeText`` or ``Title`` using the `convert_to_dataframe <https://unstructured-io.github.io/unstructured/bricks/staging.html#convert-to-dataframe>`_ staging brick. Follow the link for more details on usage.


``Integration with Prodigy``
-----------------------------
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` |
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` |
``stage_csv_for_prodigy``, you can write the results to a ``.json`` | ``.jsonl`` or a ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.


@@ -99,6 +99,6 @@ You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/
`Weaviate <https://weaviate.io/>`_ is an open-source vector database that allows you to store data objects and vector embeddings
from a variety of ML models. Storing text and embeddings in a vector database such as Weaviate is a key component of the
`emerging LLM tech stack <https://medium.com/@unstructured-io/llms-and-the-emerging-ml-tech-stack-bdb189c8be5c>`_.
See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details
See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate>`_ docs for details
on how to upload ``unstructured`` outputs to Weaviate. An example notebook is also available
`here <https://github.com/Unstructured-IO/unstructured/tree/main/examples/weaviate>`_.
