diff --git a/examples/demo_basic.ipynb b/examples/demo_basic.ipynb index f7baf9d..9802d9d 100644 --- a/examples/demo_basic.ipynb +++ b/examples/demo_basic.ipynb @@ -4,7 +4,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# LlamaExtract Usage" + "# Infer a schema to extract data from files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we will demonstrate how to infer a schema from a set of files and use it to extract structured data from invoice PDF files.\n", + "\n", + "The steps are:\n", + "1. Infer a schema from the invoice files.\n", + "2. Extract structured data (i.e. JSONs) from invoice PDF files.\n", + "\n", + "Additional Resources:\n", + "- `LlamaExtract`: https://docs.cloud.llamaindex.ai/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "Install the `llama-extract` client library:" ] }, { @@ -16,6 +38,13 @@ "%pip install llama-extract" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Apply `nest_asyncio` and bring your own LlamaCloud API key:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -32,6 +61,14 @@ "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Infer the schema\n", + "First, let's infer the schema using the invoice files with `LlamaExtract`." + ] + }, { "cell_type": "code", "execution_count": null, @@ -47,6 +84,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Preview the inferred schema:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -64,6 +108,14 @@ "print(extraction_schema.data_schema)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extract structured data\n", + "Now, with the schema, we can extract structured data (i.e. JSON) from our invoice files."
+ ] + }, { "cell_type": "code", "execution_count": null, @@ -84,6 +136,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Preview the extracted data:" + ] + }, { "cell_type": "code", "execution_count": null, diff --git a/examples/demo_existent_schema.ipynb b/examples/demo_existent_schema.ipynb deleted file mode 100644 index 3f1596e..0000000 --- a/examples/demo_existent_schema.ipynb +++ /dev/null @@ -1,123 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Extracting data from files using an existing schema" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install llama-extract" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# llama-extract is async-first, running the sync code in a notebook requires the use of nest_asyncio\n", - "import nest_asyncio\n", - "\n", - "nest_asyncio.apply()\n", - "\n", - "import os\n", - "\n", - "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from llama_extract import LlamaExtract\n", - "\n", - "extractor = LlamaExtract()\n", - "\n", - "extraction_schema = extractor.get_schema(\"schema_id...\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "id='88ea0633-937b-42f1-a35d-7da19c2db74e' created_at=datetime.datetime(2024, 7, 24, 19, 48, 49, 968786, tzinfo=datetime.timezone.utc) updated_at=datetime.datetime(2024, 7, 24, 19, 48, 49, 968786, tzinfo=datetime.timezone.utc) name='Test Schema' project_id='b1be5ffd-3f90-4fd1-9742-ca7c0a30f6f7' data_schema={'type': 'object', 'properties': {'date': {'type': 'string'}, 'amount': {'type': 'number'}, 'number': {'type': 'string'}}}\n" - ] - } - ], - "source": [ - 
"print(extraction_schema)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Extracting files: 100%|██████████| 2/2 [00:03<00:00, 1.71s/it]\n" - ] - } - ], - "source": [ - "extractions = extractor.extract(\n", - " extraction_schema.id,\n", - " [\"./data/noisebridge_receipt.pdf\", \"./data/parallels_invoice.pdf\"],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'date': 'Jul 23, 2024', 'amount': '119.99', 'number': 'BKD-73649835575'}\n" - ] - } - ], - "source": [ - "print(extractions[1].data)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "llama-extract-tm5usU00-py3.11", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/examples/demo_manual.ipynb b/examples/demo_manual.ipynb index b2db5f1..3f657c2 100644 --- a/examples/demo_manual.ipynb +++ b/examples/demo_manual.ipynb @@ -4,7 +4,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Create a schema with your own schema to extract data from files" + "# Manually create a schema to extract data from files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we will demonstrate how to manually create a schema and use it to extract structured data from invoice PDF files.\n", + "\n", + "The steps are:\n", + "1. Create a schema using a valid JSON schema object.\n", + "2. Extract structured data (i.e.
JSONs) from invoice PDF files.\n", + "\n", + "Additional Resources:\n", + "- `LlamaExtract`: https://docs.cloud.llamaindex.ai/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "Install the `llama-extract` client library:" ] }, { @@ -16,6 +38,13 @@ "%pip install llama-extract" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Apply `nest_asyncio` and bring your own LlamaCloud API key:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -32,6 +61,14 @@ "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create the schema\n", + "First, let's create the schema using a valid JSON schema object with `LlamaExtract`." + ] + }, { "cell_type": "code", "execution_count": null, @@ -54,6 +91,13 @@ "extraction_schema = extractor.create_schema(\"Test Schema\", data_schema)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's preview the created schema:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -71,6 +115,14 @@ "print(extraction_schema)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extract structured data\n", + "Now, with the schema, we can extract structured data (i.e. JSON) from our invoice files." + ] + }, { "cell_type": "code", "execution_count": null, @@ -91,6 +143,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Preview the extracted data:" + ] + }, { "cell_type": "code", "execution_count": null,