Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 89 changed files with 3,351 additions and 788 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 1f40850e79e2bb23b0edf30af367ffa5 | ||
config: ebdf42ef28445b576fd8bb65dffa2e1d | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
.. role:: raw-html(raw) | ||
   :format: html | ||
|
||
Models | ||
====== | ||
|
||
Depending on your needs, ``Unstructured`` provides OCR-based and Transformer-based models for detecting elements in documents. These models help detect complex document layouts and predict element types. | ||
|
||
**Basic usage:** | ||
|
||
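.. code:: python | ||
|
||
    from unstructured.partition.auto import partition | ||
    elements = partition(filename="example-docs/layout-parser-paper-fast.pdf", strategy='hi_res', model_name='chipper') | ||
|
||
Notes: | ||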
|
||
* To use a detection model, set ``strategy='hi_res'``. | ||
* When ``model_name`` is not defined, inference falls back to the default model. | ||
|
||
:raw-html:`<br />` | ||
**List of Available Models in the Partitions:** | ||
|
||
* ``detectron2_onnx`` is a Computer Vision model by Facebook AI that provides object detection and segmentation algorithms with ONNX Runtime. It is the fastest model with the ``hi_res`` strategy. | ||
* ``yolox`` is a single-stage real-time object detector that modifies YOLOv3 with a DarkNet53 backbone. | ||
* ``yolox_quantized`` runs faster than YOLOX, with speed closer to that of Detectron2. | ||
* ``chipper`` (beta version) is Unstructured's in-house image-to-text model, based on transformer-based Visual Document Understanding (VDU) models. | ||
|
||
|
||
Using a Non-Default Model | ||
^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
``Unstructured`` downloads the model specified in the ``UNSTRUCTURED_HI_RES_MODEL_NAME`` environment variable. If the variable is not defined, it downloads the default model. | ||
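The environment-variable fallback can be pictured roughly as follows. This is only an illustrative sketch, not the library's implementation: ``resolve_model_name`` and the ``DEFAULT_MODEL`` placeholder are hypothetical names introduced here, and the real default model name is decided by the library.

```python
import os

# Placeholder only (assumption): the actual default model name is chosen by the library.
DEFAULT_MODEL = "<library-default>"


def resolve_model_name(environ=os.environ):
    """Return the model name set in the environment, or the placeholder default."""
    return environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME", DEFAULT_MODEL)


# With the variable set, its value is used; without it, the default applies.
print(resolve_model_name({"UNSTRUCTURED_HI_RES_MODEL_NAME": "yolox"}))
print(resolve_model_name({}))
```

Passing a plain dictionary instead of ``os.environ`` keeps the sketch easy to try without touching the real environment.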
|
||
There are three ways to use a non-default model: | ||
|
||
1. Store the model name in the environment variable. | ||
|
||
.. code:: python | ||
|
||
    import os | ||
    from unstructured.partition.pdf import partition_pdf | ||
    os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = "yolox" | ||
    out_yolox = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res") | ||
|
||
2. Pass the model name to the ``partition`` function. | ||
|
||
.. code:: python | ||
|
||
    from unstructured.partition.auto import partition | ||
    filename = "example-docs/layout-parser-paper-fast.pdf" | ||
    elements = partition(filename=filename, strategy='hi_res', model_name='yolox') | ||
|
||
3. Use the `unstructured-inference <url_>`_ library. | ||
|
||
.. _url: https://github.com/Unstructured-IO/unstructured-inference | ||
|
||
.. code:: python | ||
|
||
    from unstructured_inference.models.base import get_model | ||
    from unstructured_inference.inference.layout import DocumentLayout | ||
    model = get_model("yolox") | ||
    layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model) | ||
|
||
Bring Your Own Models | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
**Utilizing Layout Detection Model Zoo** | ||
|
||
The `LayoutParser <layout_>`_ library offers various pre-trained models in its `model zoo <modelzoo_>`_ for document layout analysis. You can use them through the ``UnstructuredDetectronModel`` class in the ``unstructured-inference`` library. | ||
|
||
The ``UnstructuredDetectronModel`` class in ``unstructured_inference.models.detectron2`` uses the ``faster_rcnn_R_50_FPN_3x`` model pretrained on ``DocLayNet``, but any model in the model zoo can be used by passing different construction parameters. ``UnstructuredDetectronModel`` is a light wrapper around LayoutParser's ``Detectron2LayoutModel`` object and accepts the same arguments. | ||
|
||
.. _modelzoo: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html | ||
|
||
.. _layout: https://layout-parser.readthedocs.io/en/latest/api_doc/models.html#layoutparser.models.Detectron2LayoutModel | ||
|
||
**Using Your Own Object Detection Model** | ||
|
||
To integrate your custom detection and extraction models into the ``unstructured_inference`` pipeline, start by wrapping your model in a subclass of ``UnstructuredObjectDetectionModel``. This class acts as an intermediary between your detection model and the Unstructured workflow. | ||
|
||
Ensure your ``UnstructuredObjectDetectionModel`` subclass implements two methods: | ||
|
||
1. The ``predict`` method, which accepts a ``PIL.Image.Image`` and returns a list of ``LayoutElements`` carrying your model's results. | ||
2. The ``initialize`` method, which loads and prepares your model for inference. | ||
|
||
Make sure that your model's outputs, specifically those returned by the ``predict`` method, are compatible with the ``DocumentLayout`` class. | ||
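The two-method contract described above can be sketched as follows. To stay runnable without ``unstructured_inference`` installed, this sketch uses hypothetical stand-ins: ``StandInDetectionModel`` plays the role of ``UnstructuredObjectDetectionModel``, ``StandInElement`` plays the role of a layout element, and ``FakeImage`` mimics only the ``.size`` attribute of a ``PIL.Image.Image``. None of these names come from the library, and the real base class may expect different signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StandInElement:
    """Hypothetical stand-in for a layout element: a bounding box plus a type."""
    x1: float
    y1: float
    x2: float
    y2: float
    type: str


class StandInDetectionModel(ABC):
    """Hypothetical stand-in mirroring the initialize/predict contract."""

    @abstractmethod
    def initialize(self, **kwargs: Any) -> None:
        """Load and prepare the underlying model."""

    @abstractmethod
    def predict(self, image: Any) -> List[StandInElement]:
        """Run detection on one page image and return the detected elements."""


class WholePageModel(StandInDetectionModel):
    """Toy model that labels the entire page as a single block."""

    def initialize(self, label: str = "Text", **kwargs: Any) -> None:
        self.label = label
        self.ready = True

    def predict(self, image: Any) -> List[StandInElement]:
        if not getattr(self, "ready", False):
            raise RuntimeError("call initialize() before predict()")
        width, height = image.size  # PIL images expose .size as (width, height)
        return [StandInElement(0, 0, width, height, self.label)]


class FakeImage:
    """Minimal object with a PIL-like .size attribute, used only for this demo."""
    size = (612, 792)


model = WholePageModel()
model.initialize()
print(model.predict(FakeImage()))
```

In a real integration, the subclass would derive from ``UnstructuredObjectDetectionModel`` and return the library's own layout element type, so that the result can be consumed by ``DocumentLayout``.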
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Destination Connectors | ||
====================== | ||
|
||
Connect to your favorite data storage platforms for effortless batch processing of your files. | ||
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know | ||
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_. | ||
|
||
.. toctree:: | ||
   :maxdepth: 1 | ||
|
||
   destination_connectors/delta_table |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
Delta Table | ||
=========== | ||
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Delta Table. | ||
|
||
First, install the Delta Table dependencies: | ||
|
||
.. code:: shell | ||
|
||
    pip install "unstructured[delta-table]" | ||
|
||
Run Locally | ||
----------- | ||
The upstream connector can be any of the supported ones; for convenience, the sample command here uses the | ||
upstream delta-table connector. It creates a new table on your local filesystem and raises an error if that table already exists. | ||
|
||
.. tabs:: | ||
|
||
   .. tab:: Shell | ||
|
||
      .. code:: shell | ||
|
||
        unstructured-ingest \ | ||
          delta-table \ | ||
          --table-uri s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/ \ | ||
          --output-dir delta-table-example \ | ||
          --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \ | ||
          --verbose \ | ||
          delta-table \ | ||
          --write-column json_data \ | ||
          --table-uri delta-table-dest | ||
|
||
   .. tab:: Python | ||
|
||
      .. code:: python | ||
|
||
        import os | ||
        import subprocess | ||
        # subprocess does not expand shell variables, so expand $AWS_* from the environment here | ||
        storage_options = os.path.expandvars( | ||
            "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" | ||
        ) | ||
        command = [ | ||
            "unstructured-ingest", | ||
            "delta-table", | ||
            "--table-uri", "s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/", | ||
            "--download-dir", "delta-table-ingest-download", | ||
            "--output-dir", "delta-table-example", | ||
            "--preserve-downloads", | ||
            "--storage_options", storage_options, | ||
            "--verbose", | ||
            "delta-table", | ||
            "--write-column", "json_data", | ||
            "--table-uri", "delta-table-dest", | ||
        ] | ||
        # Run the command, capturing both stdout and stderr | ||
        process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) | ||
        output, error = process.communicate() | ||
        # Print output | ||
        if process.returncode == 0: | ||
            print('Command executed successfully. Output:') | ||
            print(output.decode()) | ||
        else: | ||
            print('Command failed. Error:') | ||
            print(error.decode()) | ||
|
||
For a full list of the options the CLI accepts, check ``unstructured-ingest <upstream connector> delta-table --help``. | ||
|
||
NOTE: If you're running this locally, you will need all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_. |
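Because the run raises an error when the destination table already exists, it can help to check the destination path beforehand. The following is a minimal sketch, assuming a local destination directory: ``delta_table_exists`` is a hypothetical helper (not part of ``unstructured-ingest``) that relies on the fact that a Delta table keeps its transaction log in a ``_delta_log`` subdirectory.

```python
import tempfile
from pathlib import Path


def delta_table_exists(table_path):
    """Return True if table_path looks like an existing local Delta table.

    Hypothetical helper: it only checks for the `_delta_log` directory
    in which a Delta table stores its transaction log.
    """
    return (Path(table_path) / "_delta_log").is_dir()


# Demo in a temporary directory so that no real table is touched.
with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "delta-table-dest"
    print(delta_table_exists(dest))  # checked before any table is created
    (dest / "_delta_log").mkdir(parents=True)
    print(delta_table_exists(dest))  # checked after the log directory exists
```

A check like this could run before the ingest command, pointing at the same path passed to the destination ``--table-uri`` option.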
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Downstream Connectors | ||
===================== | ||
|
||
Connect to your favorite data storage platforms for effortless batch processing of your files. | ||
We are constantly adding new data connectors, and if you don't see your favorite platform, let us know | ||
in our community `Slack <https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-20kd2e9ti-q5yz7RCa2nlyqmAba9vqRw>`_. | ||
|
||
.. toctree:: | ||
   :maxdepth: 1 | ||
|
||
   downstream_connectors/delta_table |