Update Metadata and Installation Documentation #1646

Merged · 20 commits · Oct 5, 2023
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.10.19-dev11
## 0.10.19

### Enhancements

@@ -10,8 +10,8 @@
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing.
* **Expose endpoint url for s3 connectors** By allowing the endpoint url to be explicitly overridden, any non-AWS data provider supporting the s3 protocol (e.g., MinIO) can be supported.
* **Change default `hi_res` model for pdf/image partition to `yolox`** Partitioning pdf/image with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
* **XLSX can now read subtables within one sheet** Problem: many .xlsx files are not designed to be read as one full table per sheet; a sheet may contain subtables, text, and headers with more information to extract. Feature: `partition_xlsx` can now read subtables within one .xlsx sheet, while also extracting titles and narrative text. Importance: this extends .xlsx reading beyond one table per sheet, letting users capture any additional data tables present in the file.
* **Update Documentation on Element Types and Metadata**: We have updated the documentation according to the latest element types and metadata. It includes the common and additional metadata provided by the Partitions and Connectors.
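The `max_characters=<n>` chunking behavior described above can be sketched conceptually. This is plain Python, not the library's actual implementation: splitting an element's text into consecutive pieces of at most `n` characters.

```python
def chunk_text(text: str, max_characters: int) -> list[str]:
    """Split `text` into consecutive pieces of at most `max_characters` each."""
    return [text[i:i + max_characters] for i in range(0, len(text), max_characters)]

# A 25-character text chunked with max_characters=10 yields pieces of 10, 10, and 5.
chunks = chunk_text("a" * 25, max_characters=10)
```

In the library, the same idea is applied to a Table element's `text` and `text_as_html`, producing TableChunk elements whose contents stay under the requested limit.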

### Fixes

1 change: 1 addition & 0 deletions docs/source/bricks.rst
@@ -19,3 +19,4 @@ After reading this section, you should understand the following:
bricks/extracting
bricks/staging
bricks/chunking
bricks/embedding
10 changes: 5 additions & 5 deletions docs/source/bricks/embedding.rst
@@ -1,21 +1,21 @@
########
#########
Embedding
########
#########

EmbeddingEncoder classes in ``unstructured`` use document elements detected
Embedding encoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).


``BaseEmbeddingEncoder``
------------------
------------------------

The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.


``OpenAIEmbeddingEncoder``
------------------
--------------------------

The ``OpenAIEmbeddingEncoder`` class uses langchain OpenAI integration under the hood
to connect to the OpenAI Text&Embedding API to obtain embeddings for pieces of text.
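The abstract-base-class pattern described for ``BaseEmbeddingEncoder`` can be sketched as follows. This is a minimal illustration, not the library's exact interface; the class and method names here are assumptions for demonstration only.

```python
from abc import ABC, abstractmethod


class BaseEncoder(ABC):
    """Hypothetical sketch of an embedding-encoder base class."""

    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""


class ToyEncoder(BaseEncoder):
    """Trivial concrete subclass: embeds each text as [length, vowel count]."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t))]
            for t in texts
        ]


# Each subclass supplies its own embedding backend; callers only see the base interface.
vectors = ToyEncoder().embed_documents(["hello", "rag"])
```

A real subclass such as ``OpenAIEmbeddingEncoder`` would replace the toy arithmetic with calls to an embedding API, while keeping the same interface contract.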
6 changes: 4 additions & 2 deletions docs/source/destination_connectors/azure_cognitive_search.rst
@@ -1,5 +1,6 @@
Azure Cognitive Search
==========
======================

Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an Azure Cognitive Search index.

First, you'll need to install the Azure Cognitive Search dependencies as shown here.
@@ -72,7 +73,8 @@ For a full list of the options the CLI accepts check ``unstructured-ingest <upst
NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.

Sample Index Schema
-----------
-------------------

To make sure the schema of the index matches the data being written to it, a sample schema json can be used:

.. literalinclude:: azure_cognitive_sample_index_schema.json
44 changes: 36 additions & 8 deletions docs/source/installation/full_installation.rst
@@ -1,28 +1,45 @@
.. role:: raw-html(raw)
:format: html

Full Installation
=================

1. **Installing Extras for Specific Document Types**:
If you're processing document types beyond the basics, you can install the necessary extras:
**Basic Usage**

For a complete set of extras catering to every document type, use:

.. code-block:: bash

pip install "unstructured[all-docs]"

**Installation for Specific Document Types**

If you're processing document types beyond the basics, you can install the necessary extras:

.. code-block:: bash

pip install "unstructured[docx,pptx]"

For a complete set of extras catering to every document type, use:
*Available document types:*

.. code-block:: bash

pip install "unstructured[all-docs]"
"csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"

2. **Note on Older Versions**:
For versions earlier than `unstructured<0.9.0`, the following installation pattern was recommended:
:raw-html:`<br />`
**Installation for Specific Data Connectors**

To use any of the data connectors, you must install the specific dependency:

.. code-block:: bash

pip install "unstructured[local-inference]"
pip install "unstructured[s3]"

While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.
*Available data connectors:*

.. code-block:: bash

"airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "notion", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"

Installation with ``conda`` on Windows
--------------------------------------
@@ -155,3 +172,14 @@ library. This is not included as an ``unstructured`` dependency because it only
to some tokenizers. See the
`sentencepiece install instructions <https://github.com/google/sentencepiece#installation>`_ for
information on how to install ``sentencepiece`` if your tokenizer requires it.

Note on Older Versions
----------------------
For versions of ``unstructured`` earlier than 0.9.0, the following installation pattern was recommended:

.. code-block:: bash

pip install "unstructured[local-inference]"

While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.

49 changes: 35 additions & 14 deletions docs/source/introduction/getting_started.rst
@@ -101,20 +101,41 @@ Document elements
When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Currently, the ``unstructured`` library supports the following element types:

* ``Element``
* ``Text``
* ``FigureCaption``
* ``NarrativeText``
* ``ListItem``
* ``Title``
* ``Address``
* ``Table``
* ``PageBreak``
* ``Header``
* ``Footer``
* ``EmailAddress``
* ``CheckBox``
* ``Image``
**Elements**
^^^^^^^^^^^^

* ``type``

* ``FigureCaption``

* ``NarrativeText``

* ``ListItem``

* ``Title``

* ``Address``

* ``Table``

* ``PageBreak``

* ``Header``

* ``Footer``

* ``UncategorizedText``

* ``Image``

* ``Formula``

* ``element_id``

* ``metadata`` - see: :ref:`Metadata page <metadata-label>`

* ``text``


Other element types that we will add in the future include tables and figures.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
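The element structure described above — a ``type``, an ``element_id``, ``metadata``, and ``text`` — can be sketched with a plain-Python stand-in. The class below is illustrative only (the real library defines its own Element classes); it shows a common downstream step of filtering partitioned elements by type.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Illustrative stand-in mirroring the element fields listed above."""
    type: str
    text: str
    element_id: str = ""
    metadata: dict = field(default_factory=dict)


# A partitioned document is a list of typed elements.
elements = [
    Element(type="Title", text="Getting Started"),
    Element(type="NarrativeText", text="Partitioning produces a list of elements."),
    Element(type="ListItem", text="Install the library."),
]

# A typical downstream step: keep only the narrative text.
narrative = [el.text for el in elements if el.type == "NarrativeText"]
```

Filtering by ``type`` like this is how downstream applications separate, for example, titles from body text after partitioning.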