
Commit

deploy: 8564d92
cragwolfe committed Oct 5, 2023
1 parent 8bf6680 commit e7ab50c
Showing 87 changed files with 2,415 additions and 2,517 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 914bf6c1fdcb7583d99bfd1548e60630
+config: 39ac1536ba6b738ef4f304e6af7e643a
tags: 645f666f9bcd5a90fca523b33c5a78b7
20 changes: 10 additions & 10 deletions _sources/api.rst.txt
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
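For reference, a minimal self-contained version of the corrected request with bounding boxes enabled might look like the following; the endpoint URL, API-key header, and the ``coordinates`` form field are assumptions drawn from the surrounding docs rather than from this diff.

.. code-block:: python

    import requests

    # Assumed hosted endpoint and placeholder API key.
    url = "https://api.unstructured.io/general/v0/general"
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"coordinates": "true"}  # assumed field name for element bounding boxes

    file_path = "/Path/To/File"
    file_data = {'files': open(file_path, 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()
    print(response.json())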
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
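A hedged sketch of the same corrected call with an explicit text encoding; the ``encoding`` field name and the endpoint details are assumptions based on this section's description.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"encoding": "utf_8"}  # assumed field name for the text-decoding encoding

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()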
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
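The section names the ``ocr_languages`` parameter; a small sketch using it follows, with the endpoint, API key, and the Tesseract-style language code treated as assumptions.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"ocr_languages": "eng"}  # language code format is an assumption

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()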
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
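For the CSV output described here, a sketch follows; the ``output_format`` field name is an assumption (the section only states that the result can be set to ``text/csv``), as are the endpoint details.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"output_format": "text/csv"}  # field name assumed; value from this section

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()
    print(response.text)  # CSV rather than JSON in this case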
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
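Using the ``include_page_breaks`` parameter mentioned above, the full request could look roughly like this (endpoint and key are placeholders):

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"include_page_breaks": "true"}

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()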
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
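A sketch of the corrected request selecting the ``hi_res`` strategy; the ``strategy`` field name and the endpoint details are assumptions from the surrounding docs.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"strategy": "hi_res"}  # field name assumed; value from this section

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()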
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
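The hunk header above is truncated, so the exact argument is not visible here; the sketch below assumes it is ``hi_res_model_name`` with the value ``chipper``, alongside the usual placeholder endpoint and key.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    # Both the field name and the model identifier below are assumptions.
    data = {"strategy": "hi_res", "hi_res_model_name": "chipper"}

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()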
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
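For table-structure extraction with ``hi_res``, a hedged sketch follows; ``pdf_infer_table_structure`` is an assumed field name, since the hunk header is cut off before the parameter appears.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"strategy": "hi_res", "pdf_infer_table_structure": "true"}  # field name assumed

    file_data = {'files': open("/Path/To/File.pdf", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()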
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
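To illustrate enabling or disabling table extraction per file type, the sketch below uses a ``skip_infer_table_types`` field; both the field name and the value format are assumptions, as this diff only shows the surrounding boilerplate.

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"skip_infer_table_types": ["xlsx"]}  # field name and value format assumed

    file_data = {'files': open("/Path/To/File", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()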
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
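With the ``xml_keep_tags`` parameter set as described, the complete call could look like this (endpoint and key are placeholders):

.. code-block:: python

    import requests

    url = "https://api.unstructured.io/general/v0/general"  # assumed endpoint
    headers = {"accept": "application/json", "unstructured-api-key": "<YOUR-API-KEY>"}
    data = {"xml_keep_tags": "true"}

    file_data = {'files': open("/Path/To/File.xml", 'rb')}
    response = requests.post(url, headers=headers, files=file_data, data=data)
    file_data['files'].close()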
1 change: 1 addition & 0 deletions _sources/bricks.rst.txt
@@ -19,3 +19,4 @@ After reading this section, you should understand the following:
bricks/extracting
bricks/staging
bricks/chunking
+bricks/embedding
10 changes: 5 additions & 5 deletions _sources/bricks/embedding.rst.txt
@@ -1,21 +1,21 @@
-########
+#########
Embedding
-########
+#########

-EmbeddingEncoder classes in ``unstructured`` use document elements detected
+Embedding encoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).


``BaseEmbeddingEncoder``
-------------------
+------------------------

The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.


``OpenAIEmbeddingEncoder``
-------------------
+--------------------------

The ``OpenAIEmbeddingEncoder`` class uses langchain OpenAI integration under the hood
to connect to the OpenAI Text&Embedding API to obtain embeddings for pieces of text.
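To make the relationship between these classes concrete, here is a hypothetical usage sketch; the module path, constructor argument, and method name are assumptions based on the class names above, so check the current ``unstructured`` API before relying on them.

.. code-block:: python

    from unstructured.embed.openai import OpenAIEmbeddingEncoder  # assumed module path
    from unstructured.partition.auto import partition

    elements = partition(filename="example-docs/fake-memo.pdf")  # placeholder document
    encoder = OpenAIEmbeddingEncoder(api_key="<OPENAI-API-KEY>")  # assumed constructor argument
    elements = encoder.embed_documents(elements=elements)  # assumed method name
    print(elements[0].embeddings[:5])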
@@ -1,5 +1,6 @@
Azure Cognitive Search
-==========
+======================

Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an Azure Cognitive Search index.

First you'll need to install the azure cognitive search dependencies as shown here.
@@ -72,7 +73,8 @@ For a full list of the options the CLI accepts check ``unstructured-ingest <upst
NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.

Sample Index Schema
------------
+-------------------

To make sure the schema of the index matches the data being written to it, a sample schema json can be used:

.. literalinclude:: azure_cognitive_sample_index_schema.json
44 changes: 36 additions & 8 deletions _sources/installation/full_installation.rst.txt
@@ -1,28 +1,45 @@
+.. role:: raw-html(raw)
+   :format: html

Full Installation
=================

-1. **Installing Extras for Specific Document Types**:
-If you're processing document types beyond the basics, you can install the necessary extras:
+**Basic Usage**

+For a complete set of extras catering to every document type, use:

.. code-block:: bash
+pip install "unstructured[all-docs]"
+**Installation for Specific Document Types**

+If you're processing document types beyond the basics, you can install the necessary extras:

+.. code-block:: bash
pip install "unstructured[docx,pptx]"
-For a complete set of extras catering to every document type, use:
+*Available document types:*

.. code-block:: bash
-pip install "unstructured[all-docs]"
+"csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
-2. **Note on Older Versions**:
-For versions earlier than `unstructured<0.9.0`, the following installation pattern was recommended:
+:raw-html:`<br />`
+**Installation for Specific Data Connectors**

+To use any of the data connectors, you must install the specific dependency:

.. code-block:: bash
-pip install "unstructured[local-inference]"
+pip install "unstructured[s3]"
-While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.
+*Available data connectors:*

+.. code-block:: bash
+"airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "notion", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"
Installation with ``conda`` on Windows
--------------------------------------
@@ -155,3 +172,14 @@ library. This is not included as an ``unstructured`` dependency because it only
to some tokenizers. See the
`sentencepiece install instructions <https://github.com/google/sentencepiece#installation>`_ for
information on how to install ``sentencepiece`` if your tokenizer requires it.

Note on Older Versions
----------------------
For versions earlier than `unstructured<0.9.0`, the following installation pattern was recommended:

.. code-block:: bash
pip install "unstructured[local-inference]"
While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.

49 changes: 35 additions & 14 deletions _sources/introduction/getting_started.rst.txt
@@ -101,20 +101,41 @@ Document elements
When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Currently, the ``unstructured`` library supports the following element types:

-* ``Element``
-* ``Text``
-* ``FigureCaption``
-* ``NarrativeText``
-* ``ListItem``
-* ``Title``
-* ``Address``
-* ``Table``
-* ``PageBreak``
-* ``Header``
-* ``Footer``
-* ``EmailAddress``
-* ``CheckBox``
-* ``Image``
+**Elements**
+^^^^^^^^^^^^

+* ``type``

+* ``FigureCaption``

+* ``NarrativeText``

+* ``ListItem``

+* ``Title``

+* ``Address``

+* ``Table``

+* ``PageBreak``

+* ``Header``

+* ``Footer``

+* ``UncategorizedText``

+* ``Image``

+* ``Formula``

+* ``element_id``

+* ``metadata`` - see: :ref:`Metadata page <metadata-label>`

+* ``text``


Other element types that we will add in the future include tables and figures.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
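As an illustration of the fields listed above, a small hypothetical snippet follows; the sample file path and the exact ``to_dict()`` keys are assumptions rather than something shown in this diff.

.. code-block:: python

    from unstructured.partition.auto import partition

    # Placeholder document path; any supported file type works here.
    elements = partition(filename="example-docs/fake-memo.pdf")
    for element in elements[:5]:
        d = element.to_dict()  # assumed to expose the fields listed above
        print(d["type"], d["element_id"], d["text"][:40])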
