From e0045f9e8137399f8ba9299e0b36ed0a9fe57d63 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Mon, 16 Sep 2024 13:18:36 -0700 Subject: [PATCH] Add Sycamore page (#8234) * Create sycamore.md Create Sycamore page Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update sycamore.md Add to docs Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update sycamore.md Add info to docs page Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update sycamore.md Correct typo Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update sycamore.md Updates from suggestions Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update index.md Updating index with Sycamore Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> * Update _tools/index.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Add front matter Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: Nathan Bower --- _tools/index.md | 7 +++++++ _tools/sycamore.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 55 insertions(+) create mode 100644 _tools/sycamore.md diff --git a/_tools/index.md b/_tools/index.md index 108f10da97..c9d446a81a 100644 --- a/_tools/index.md +++ b/_tools/index.md @@ -18,6 +18,7 @@ This section provides documentation for OpenSearch-supported tools, including: - [OpenSearch CLI](#opensearch-cli) - [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator) - [OpenSearch upgrade, migration, and comparison tools](#opensearch-upgrade-migration-and-comparison-tools) +- [Sycamore](#sycamore) for AI-powered extract, transform, load (ETL) on complex documents for vector and hybrid search For information about Data Prepper, the server-side data collector for filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization, see [Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/index/). @@ -122,3 +123,9 @@ The OpenSearch Kubernetes Operator is an open-source Kubernetes operator that he OpenSearch migration tools facilitate migrations to OpenSearch and upgrades to newer versions of OpenSearch. These can help you can set up a proof-of-concept environment locally using Docker containers or deploy to AWS using a one-click deployment script. This empowers you to fine-tune cluster configurations and manage workloads more effectively before migration. For more information about OpenSearch migration tools, see the documentation in the [OpenSearch Migration GitHub repository](https://github.com/opensearch-project/opensearch-migrations/tree/capture-and-replay-v0.1.0). + +## Sycamore + +[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using an [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). + +For more information, see [Sycamore]({{site.url}}{{site.baseurl}}/tools/sycamore/). diff --git a/_tools/sycamore.md b/_tools/sycamore.md new file mode 100644 index 0000000000..7ce55931ac --- /dev/null +++ b/_tools/sycamore.md @@ -0,0 +1,48 @@ +--- +layout: default +title: Sycamore +nav_order: 210 +has_children: false +--- + +# Sycamore + +[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). + +To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html). + +# Sycamore ETL pipeline structure + +A Sycamore extract, transform, load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes. + +A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps: + +* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets). +* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements. +* Extract metadata, filter, and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html). +* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements. +* Embed the chunks using the model of your choice. +* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes. + +For an example pipeline that uses this workflow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb). + + +# Install Sycamore + +We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed using extras. For example: + +```bash +pip install sycamore-ai[opensearch] +``` +{% include copy.html %} + +By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install Sycamore with the `local-inference` extra as follows: + +```bash +pip install sycamore-ai[opensearch,local-inference] +``` +{% include copy.html %} + +## Next steps + +For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html). \ No newline at end of file