From 8036d9d6a8f25cac19eaa2b485aa624bcc2020c8 Mon Sep 17 00:00:00 2001 From: Wes Kennedy Date: Wed, 1 Nov 2023 13:08:15 -0400 Subject: [PATCH] added link to pavans RAG blog post --- .../Semantic Search with OpenAI QA.ipynb | 620 ++++++++++++++++-- 1 file changed, 572 insertions(+), 48 deletions(-) diff --git a/notebooks/Semantic Search with OpenAI QA/Semantic Search with OpenAI QA.ipynb b/notebooks/Semantic Search with OpenAI QA/Semantic Search with OpenAI QA.ipynb index 2266cda..fa5b77c 100644 --- a/notebooks/Semantic Search with OpenAI QA/Semantic Search with OpenAI QA.ipynb +++ b/notebooks/Semantic Search with OpenAI QA/Semantic Search with OpenAI QA.ipynb @@ -13,25 +13,40 @@ }, "tags": [] }, - "source": "
\n \n \n
" + "source": [ + "
\n", + " \n", + " \n", + "
" + ] }, { "cell_type": "markdown", "id": "ea7cafdc-bb37-4ad8-8534-fb3ce6683fe4", "metadata": {}, - "source": "# Semantic Search with OpenAI QA" + "source": [ + "# Semantic Search with OpenAI QA\n", + "\n", + "In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to what RAG is, please read [A Beginner's Guide to Retrieval Augmented Generation (RAG)](https://www.singlestore.com/blog/a-guide-to-retrieval-augmented-generation-rag/)." + ] }, { "cell_type": "markdown", "id": "f801cd94-180c-4dea-b85c-67659aad0ea6", "metadata": {}, - "source": "## Prerequisites for interacting with ChatGPT" + "source": [ + "## Prerequisites for interacting with ChatGPT" + ] }, { "cell_type": "markdown", "id": "df04d713-f330-4335-9837-3ab79eb552d6", "metadata": {}, - "source": "### Install OpenAI package\n\nLet's start by installing tho [openai](https://platform.openai.com/docs/api-reference?lang=python) Python package." + "source": [ + "### Install OpenAI package\n", + "\n", + "Let's start by installing the [openai](https://platform.openai.com/docs/api-reference?lang=python) Python package." 
+ ] }, { "cell_type": "code", @@ -49,13 +64,17 @@ "trusted": true }, "outputs": [], - "source": "!pip install openai --quiet" + "source": [ + "!pip install openai --quiet" + ] }, { "cell_type": "markdown", "id": "62bd45fa-daac-4d71-ab76-be014ddd3a32", "metadata": {}, - "source": "### Connect to ChatGPT and display the response" + "source": [ + "### Connect to ChatGPT and display the response" + ] }, { "cell_type": "code", @@ -73,13 +92,20 @@ "trusted": true }, "outputs": [], - "source": "import openai\n\nEMBEDDING_MODEL = \"text-embedding-ada-002\"\nGPT_MODEL = \"gpt-3.5-turbo\"" + "source": [ + "import openai\n", + "\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "GPT_MODEL = \"gpt-3.5-turbo\"" + ] }, { "cell_type": "markdown", "id": "c244aa25-f548-47b2-8942-991552dc0ca1", "metadata": {}, - "source": "You will need an OpenAI API key in order to use the the `openai` Python library." + "source": [ + "You will need an OpenAI API key in order to use the `openai` Python library." + ] }, { "cell_type": "code", @@ -97,13 +123,17 @@ "trusted": true }, "outputs": [], - "source": "openai.api_key = ''" + "source": [ + "openai.api_key = ''" + ] }, { "cell_type": "markdown", "id": "0663c6f2-7741-4966-aea8-d5629e4a1cd4", "metadata": {}, - "source": "Test the connection." + "source": [ + "Test the connection." + ] }, { "cell_type": "code", @@ -124,22 +154,38 @@ { "name": "stdout", "output_type": "stream", - "text": "I'm sorry, I cannot provide information about future events as they have not happened yet. The next Winter Olympics where curling will be contested is in 2022, but the winners have not been determined yet.\n" + "text": [ + "I'm sorry, I cannot provide information about future events as they have not happened yet. 
The next Winter Olympics where curling will be contested is in 2022, but the winners have not been determined yet.\n" + ] } ], - "source": "response = openai.ChatCompletion.create(\n model=GPT_MODEL,\n messages=[\n {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n {\"role\": \"user\", \"content\": \"Who won the gold medal for curling in Olymics 2022?\"},\n ]\n)\n\nprint(response['choices'][0]['message']['content'])" + "source": [ + "response = openai.ChatCompletion.create(\n", + " model=GPT_MODEL,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n", + " {\"role\": \"user\", \"content\": \"Who won the gold medal for curling in Olympics 2022?\"},\n", + " ]\n", + ")\n", + "\n", + "print(response['choices'][0]['message']['content'])" + ] }, { "cell_type": "markdown", "id": "d287b813-2885-4b22-a431-03c6b4eab058", "metadata": {}, - "source": "# Get the data about Winter Olympics and provide the information to ChatGPT as context" + "source": [ + "# Get the data about Winter Olympics and provide the information to ChatGPT as context" + ] }, { "cell_type": "markdown", "id": "682326b6-a475-4d79-828d-951780a6fb96", "metadata": {}, - "source": "## 1. 
Install and import libraries" + ] }, { "cell_type": "code", @@ -157,7 +203,14 @@ "trusted": true }, "outputs": [], - "source": "!pip install matplotlib --quiet\n!pip install plotly.express --quiet\n!pip install scikit-learn --quiet\n!pip install tabulate --quiet\n!pip install tiktoken --quiet\n!pip install wget --quiet" + "source": [ + "!pip install matplotlib --quiet\n", + "!pip install plotly.express --quiet\n", + "!pip install scikit-learn --quiet\n", + "!pip install tabulate --quiet\n", + "!pip install tiktoken --quiet\n", + "!pip install wget --quiet" + ] }, { "cell_type": "code", @@ -175,19 +228,29 @@ "trusted": true }, "outputs": [], - "source": "import json\nimport numpy as np\nimport os\nimport pandas as pd\nimport wget" + "source": [ + "import json\n", + "import numpy as np\n", + "import os\n", + "import pandas as pd\n", + "import wget" + ] }, { "cell_type": "markdown", "id": "5f7aee40-4774-4ef1-b700-a83f9fed4fbb", "metadata": {}, - "source": "## 2. Fetch the CSV data and read it into a DataFrame" + "source": [ + "## 2. Fetch the CSV data and read it into a DataFrame" + ] }, { "cell_type": "markdown", "id": "05fcb9a8-2290-4507-aad1-a3002cab0ba6", "metadata": {}, - "source": "Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so may take a minute depending on your connection speed." + "source": [ + "Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so it may take a minute depending on your connection speed." 
+ ] }, { "cell_type": "code", @@ -208,16 +271,30 @@ { "name": "stdout", "output_type": "stream", - "text": "File downloaded successfully.\n" + "text": [ + "File downloaded successfully.\n" + ] } ], - "source": "embeddings_url = \"https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv\"\nembeddings_path = \"winter_olympics_2022.csv\"\n\nif not os.path.exists(embeddings_path):\n wget.download(embeddings_url, embeddings_path)\n print(\"File downloaded successfully.\")\nelse:\n print(\"File already exists in the local file system.\")" + "source": [ + "embeddings_url = \"https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv\"\n", + "embeddings_path = \"winter_olympics_2022.csv\"\n", + "\n", + "if not os.path.exists(embeddings_path):\n", + " wget.download(embeddings_url, embeddings_path)\n", + " print(\"File downloaded successfully.\")\n", + "else:\n", + " print(\"File already exists in the local file system.\")" + ] }, { "cell_type": "markdown", "id": "1faf22b7-5b99-4a9b-a88a-24acb16d133e", "metadata": {}, - "source": "Here we are using the `converters=` parameter of the `pd.read_csv` function to convert the JSON\narray in the CSV file to numpy arrays." + "source": [ + "Here we are using the `converters=` parameter of the `pd.read_csv` function to convert the JSON\n", + "array in the CSV file to numpy arrays." + ] }, { "cell_type": "code", @@ -237,15 +314,133 @@ "outputs": [ { "data": { - "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| text | embedding
0 | Lviv bid for the 2022 Winter Olympics\\n\\n{{Oly... | [-0.005021067801862955, 0.00026050032465718687...
1 | Lviv bid for the 2022 Winter Olympics\\n\\n==His... | [0.0033927420154213905, -0.007447326090186834,...
2 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [-0.00915789045393467, -0.008366798982024193, ...
3 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [0.0030951891094446182, -0.006064314860850573,...
4 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [-0.002936174161732197, -0.006185177247971296,...
... | ... | ...
6054 | Anaïs Chevalier-Bouchet\\n\\n==Personal life==\\n... | [-0.027750400826334953, 0.001746018067933619, ...
6055 | Uliana Nigmatullina\\n\\n{{short description|Rus... | [-0.021714167669415474, 0.016001321375370026, ...
6056 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.029143543913960457, 0.014654331840574741, ...
6057 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.024266039952635765, 0.011665306985378265, ...
6058 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.021818075329065323, 0.005420385394245386, ...
\n

6059 rows × 2 columns

\n
", - "text/plain": " text \\\n0 Lviv bid for the 2022 Winter Olympics\\n\\n{{Oly... \n1 Lviv bid for the 2022 Winter Olympics\\n\\n==His... \n2 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n3 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n4 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n... ... \n6054 Anaïs Chevalier-Bouchet\\n\\n==Personal life==\\n... \n6055 Uliana Nigmatullina\\n\\n{{short description|Rus... \n6056 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n6057 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n6058 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n\n embedding \n0 [-0.005021067801862955, 0.00026050032465718687... \n1 [0.0033927420154213905, -0.007447326090186834,... \n2 [-0.00915789045393467, -0.008366798982024193, ... \n3 [0.0030951891094446182, -0.006064314860850573,... \n4 [-0.002936174161732197, -0.006185177247971296,... \n... ... \n6054 [-0.027750400826334953, 0.001746018067933619, ... \n6055 [-0.021714167669415474, 0.016001321375370026, ... \n6056 [-0.029143543913960457, 0.014654331840574741, ... \n6057 [-0.024266039952635765, 0.011665306985378265, ... \n6058 [-0.021818075329065323, 0.005420385394245386, ... \n\n[6059 rows x 2 columns]" + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| text | embedding
0 | Lviv bid for the 2022 Winter Olympics\\n\\n{{Oly... | [-0.005021067801862955, 0.00026050032465718687...
1 | Lviv bid for the 2022 Winter Olympics\\n\\n==His... | [0.0033927420154213905, -0.007447326090186834,...
2 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [-0.00915789045393467, -0.008366798982024193, ...
3 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [0.0030951891094446182, -0.006064314860850573,...
4 | Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... | [-0.002936174161732197, -0.006185177247971296,...
... | ... | ...
6054 | Anaïs Chevalier-Bouchet\\n\\n==Personal life==\\n... | [-0.027750400826334953, 0.001746018067933619, ...
6055 | Uliana Nigmatullina\\n\\n{{short description|Rus... | [-0.021714167669415474, 0.016001321375370026, ...
6056 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.029143543913960457, 0.014654331840574741, ...
6057 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.024266039952635765, 0.011665306985378265, ...
6058 | Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... | [-0.021818075329065323, 0.005420385394245386, ...
\n", + "

6059 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " text \\\n", + "0 Lviv bid for the 2022 Winter Olympics\\n\\n{{Oly... \n", + "1 Lviv bid for the 2022 Winter Olympics\\n\\n==His... \n", + "2 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n", + "3 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n", + "4 Lviv bid for the 2022 Winter Olympics\\n\\n==Ven... \n", + "... ... \n", + "6054 Anaïs Chevalier-Bouchet\\n\\n==Personal life==\\n... \n", + "6055 Uliana Nigmatullina\\n\\n{{short description|Rus... \n", + "6056 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n", + "6057 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n", + "6058 Uliana Nigmatullina\\n\\n==Biathlon results==\\n\\... \n", + "\n", + " embedding \n", + "0 [-0.005021067801862955, 0.00026050032465718687... \n", + "1 [0.0033927420154213905, -0.007447326090186834,... \n", + "2 [-0.00915789045393467, -0.008366798982024193, ... \n", + "3 [0.0030951891094446182, -0.006064314860850573,... \n", + "4 [-0.002936174161732197, -0.006185177247971296,... \n", + "... ... \n", + "6054 [-0.027750400826334953, 0.001746018067933619, ... \n", + "6055 [-0.021714167669415474, 0.016001321375370026, ... \n", + "6056 [-0.029143543913960457, 0.014654331840574741, ... \n", + "6057 [-0.024266039952635765, 0.011665306985378265, ... \n", + "6058 [-0.021818075329065323, 0.005420385394245386, ... 
\n", + "\n", + "[6059 rows x 2 columns]" + ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], - "source": "def json_to_numpy_array(x: str | None) -> np.ndarray | None:\n \"\"\"Convert JSON array string into numpy array.\"\"\"\n return np.array(json.loads(x)) if x else None\n\ndf = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))\ndf" + "source": [ + "def json_to_numpy_array(x: str | None) -> np.ndarray | None:\n", + " \"\"\"Convert JSON array string into numpy array.\"\"\"\n", + " return np.array(json.loads(x)) if x else None\n", + "\n", + "df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))\n", + "df" + ] }, { "cell_type": "code", @@ -266,22 +461,38 @@ { "name": "stdout", "output_type": "stream", - "text": "\nRangeIndex: 6059 entries, 0 to 6058\nData columns (total 2 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 text 6059 non-null object\n 1 embedding 6059 non-null object\ndtypes: object(2)\nmemory usage: 94.8+ KB\n" + "text": [ + "\n", + "RangeIndex: 6059 entries, 0 to 6058\n", + "Data columns (total 2 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 text 6059 non-null object\n", + " 1 embedding 6059 non-null object\n", + "dtypes: object(2)\n", + "memory usage: 94.8+ KB\n" + ] } ], - "source": "df.info(show_counts=True)" + "source": [ + "df.info(show_counts=True)" + ] }, { "cell_type": "markdown", "id": "cb523f8c-78b2-4a75-be15-52d29fac0fff", "metadata": {}, - "source": "## 3. Set up the database" + "source": [ + "## 3. Set up the database" + ] }, { "cell_type": "markdown", "id": "ca811e5f-6dcd-471b-a4de-03eab42acf4f", "metadata": {}, - "source": "Create the database." + "source": [ + "Create the database." 
+ ] }, { "cell_type": "code", @@ -301,20 +512,31 @@ "outputs": [ { "data": { - "text/plain": "" + "text/plain": [] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], - "source": "%%sql\nDROP DATABASE IF EXISTS winter_wikipedia;\n\nCREATE DATABASE winter_wikipedia;" + "source": [ + "%%sql\n", + "DROP DATABASE IF EXISTS winter_wikipedia;\n", + "\n", + "CREATE DATABASE winter_wikipedia;" + ] }, { "cell_type": "markdown", "id": "393e0d4a-8020-447e-b0ae-aa4199b1a016", "metadata": {}, - "source": "
\n

\n

Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook.\n It updates the connection_url to connect to that database.

\n
" + "source": [ + "
\n", + "

\n", + "

Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook.\n", + " It updates the connection_url to connect to that database.

\n", + "
" + ] }, { "cell_type": "code", @@ -334,26 +556,37 @@ "outputs": [ { "data": { - "text/plain": "" + "text/plain": [] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], - "source": "%%sql\nCREATE TABLE IF NOT EXISTS winter_olympics_2022 (\n id INT PRIMARY KEY,\n text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,\n embedding BLOB\n);" + "source": [ + "%%sql\n", + "CREATE TABLE IF NOT EXISTS winter_olympics_2022 (\n", + " id INT PRIMARY KEY,\n", + " text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,\n", + " embedding BLOB\n", + ");" + ] }, { "cell_type": "markdown", "id": "6b7ab530-4f55-482f-8e4c-475df06fe9b3", "metadata": {}, - "source": "## 4. Populate the table with our DataFrame" + "source": [ + "## 4. Populate the table with our DataFrame" + ] }, { "cell_type": "markdown", "id": "51be94b1-9901-499c-a364-c85782239e2a", "metadata": {}, - "source": "Create a SQLAlchemy connection." + "source": [ + "Create a SQLAlchemy connection." + ] }, { "cell_type": "code", @@ -371,13 +604,19 @@ "trusted": true }, "outputs": [], - "source": "import sqlalchemy as sa\n\ndb_connection = sa.create_engine(connection_url).connect()" + "source": [ + "import sqlalchemy as sa\n", + "\n", + "db_connection = sa.create_engine(connection_url).connect()" + ] }, { "cell_type": "markdown", "id": "2ce8c4c5-f389-4d0d-b434-6cd628343688", "metadata": {}, - "source": "Use the `to_sql` method of the DataFrame to upload the data to the requested table." + "source": [ + "Use the `to_sql` method of the DataFrame to upload the data to the requested table." 
+ ] }, { "cell_type": "code", @@ -397,20 +636,26 @@ "outputs": [ { "data": { - "text/plain": "6059" + "text/plain": [ + "6059" + ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], - "source": "df.to_sql('winter_olympics_2022', con=db_connection, index=True, index_label='id', if_exists='append', chunksize=1000)" + "source": [ + "df.to_sql('winter_olympics_2022', con=db_connection, index=True, index_label='id', if_exists='append', chunksize=1000)" + ] }, { "cell_type": "markdown", "id": "c4d4602c-bfec-4819-904b-4d376b920e44", "metadata": {}, - "source": "## 5. Do a semantic search with the same question from above and use the response to send to OpenAI again" + "source": [ + "## 5. Do a semantic search with the same question from above and use the response to send to OpenAI again" + ] }, { "cell_type": "code", @@ -428,7 +673,45 @@ "trusted": true }, "outputs": [], - "source": "from openai.embeddings_utils import get_embedding\n\n\ndef strings_ranked_by_relatedness(\n query: str,\n df: pd.DataFrame,\n table_name: str,\n relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),\n top_n: int=100,\n) -> tuple:\n \"\"\"Returns a list of strings and relatednesses, sorted from most related to least.\"\"\"\n\n # Get the embedding of the query.\n query_embedding_response = get_embedding(query, EMBEDDING_MODEL)\n\n # Create the SQL statement.\n stmt = f\"\"\"\n SELECT\n text,\n DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), embedding) AS score\n FROM {table_name}\n ORDER BY score DESC\n LIMIT %s\n \"\"\"\n\n # Execute the SQL statement.\n results = db_connection.execute(stmt, [json.dumps(query_embedding_response), top_n])\n\n strings = []\n relatednesses = []\n\n for row in results:\n strings.append(row[0])\n relatednesses.append(row[1])\n\n # Return the results.\n return strings[:top_n], relatednesses[:top_n]" + "source": [ + "from openai.embeddings_utils import get_embedding\n", + "\n", + "\n", + "def strings_ranked_by_relatedness(\n", + " query: 
str,\n", + " df: pd.DataFrame,\n", + " table_name: str,\n", + " relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),\n", + " top_n: int=100,\n", + ") -> tuple:\n", + " \"\"\"Returns a list of strings and relatednesses, sorted from most related to least.\"\"\"\n", + "\n", + " # Get the embedding of the query.\n", + " query_embedding_response = get_embedding(query, EMBEDDING_MODEL)\n", + "\n", + " # Create the SQL statement.\n", + " stmt = f\"\"\"\n", + " SELECT\n", + " text,\n", + " DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), embedding) AS score\n", + " FROM {table_name}\n", + " ORDER BY score DESC\n", + " LIMIT %s\n", + " \"\"\"\n", + "\n", + " # Execute the SQL statement.\n", + " results = db_connection.execute(stmt, [json.dumps(query_embedding_response), top_n])\n", + "\n", + " strings = []\n", + " relatednesses = []\n", + "\n", + " for row in results:\n", + " strings.append(row[0])\n", + " relatednesses.append(row[1])\n", + "\n", + " # Return the results.\n", + " return strings[:top_n], relatednesses[:top_n]" + ] }, { "cell_type": "code", @@ -449,10 +732,186 @@ { "name": "stdout", "output_type": "stream", - "text": "relatedness=0.879\n╒═══════════════════════════════════════════════════╕\n│ Result │\n╞═══════════════════════════════════════════════════╡\n│ Curling at the 2022 Winter Olympics │\n│ │\n│ ==Medal summary== │\n│ │\n│ ===Medal table=== │\n│ │\n│ {{Medals table │\n│ | caption = │\n│ | host = │\n│ | flag_template = flagIOC │\n│ | event = 2022 Winter │\n│ | team = │\n│ | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1 │\n│ | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0 │\n│ | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0 │\n│ | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2 │\n│ | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0 │\n│ | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0 │\n│ }} │\n╘═══════════════════════════════════════════════════╛\n\n\n\nrelatedness=0.872\n╒══════════════════════════════════════════════════════════════════════╕\n│ 
Result │\n╞══════════════════════════════════════════════════════════════════════╡\n│ Curling at the 2022 Winter Olympics │\n│ │\n│ ==Results summary== │\n│ │\n│ ===Women's tournament=== │\n│ │\n│ ====Playoffs==== │\n│ │\n│ =====Gold medal game===== │\n│ │\n│ ''Sunday, 20 February, 9:05'' │\n│ {{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}} │\n│ {{Player percentages │\n│ | team1 = {{flagIOC|JPN|2022 Winter}} │\n│ | [[Yurika Yoshida]] | 97% │\n│ | [[Yumi Suzuki]] | 82% │\n│ | [[Chinami Yoshida]] | 64% │\n│ | [[Satsuki Fujisawa]] | 69% │\n│ | teampct1 = 78% │\n│ | team2 = {{flagIOC|GBR|2022 Winter}} │\n│ | [[Hailey Duff]] | 90% │\n│ | [[Jennifer Dodds]] | 89% │\n│ | [[Vicky Wright]] | 89% │\n│ | [[Eve Muirhead]] | 88% │\n│ | teampct2 = 89% │\n│ }} │\n╘══════════════════════════════════════════════════════════════════════╛\n\n\n\nrelatedness=0.869\n╒═══════════════════════════════════════════════════════════════════════════════╕\n│ Result │\n╞═══════════════════════════════════════════════════════════════════════════════╡\n│ Curling at the 2022 Winter Olympics │\n│ │\n│ ==Results summary== │\n│ │\n│ ===Mixed doubles tournament=== │\n│ │\n│ ====Playoffs==== │\n│ │\n│ =====Gold medal game===== │\n│ │\n│ ''Tuesday, 8 February, 20:05'' │\n│ {{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}} │\n│ {| class=\"wikitable\" │\n│ !colspan=4 width=400|Player percentages │\n│ |- │\n│ !colspan=2 width=200 style=\"white-space:nowrap;\"| {{flagIOC|ITA|2022 Winter}} │\n│ !colspan=2 width=200 style=\"white-space:nowrap;\"| {{flagIOC|NOR|2022 Winter}} │\n│ |- │\n│ | [[Stefania Constantini]] || 83% │\n│ | [[Kristin Skaslien]] || 70% │\n│ |- │\n│ | [[Amos Mosaner]] || 90% │\n│ | [[Magnus Nedregotten]] || 69% │\n│ |- │\n│ | '''Total''' || 87% │\n│ | '''Total''' || 69% │\n│ |} 
│\n╘═══════════════════════════════════════════════════════════════════════════════╛\n\n\n\nrelatedness=0.868\n╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕\n│ Result │\n╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡\n│ Curling at the 2022 Winter Olympics │\n│ │\n│ ==Medal summary== │\n│ │\n│ ===Medalists=== │\n│ │\n│ {| {{MedalistTable|type=Event|columns=1}} │\n│ |- │\n│ |Men
{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}} │\n│ |{{flagIOC|SWE|2022 Winter}}
[[Niklas Edin]]
[[Oskar Eriksson]]
[[Rasmus Wranå]]
[[Christoffer Sundgren]]
[[Daniel Magnusson (curler)|Daniel Magnusson]] │\n│ |{{flagIOC|GBR|2022 Winter}}
[[Bruce Mouat]]
[[Grant Hardie]]
[[Bobby Lammie]]
[[Hammy McMillan Jr.]]
[[Ross Whyte]] │\n│ |{{flagIOC|CAN|2022 Winter}}
[[Brad Gushue]]
[[Mark Nichols (curler)|Mark Nichols]]
[[Brett Gallant]]
[[Geoff Walker (curler)|Geoff Walker]]
[[Marc Kennedy]] │\n│ |- │\n│ |Women
{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}} │\n│ |{{flagIOC|GBR|2022 Winter}}
[[Eve Muirhead]]
[[Vicky Wright]]
[[Jennifer Dodds]]
[[Hailey Duff]]
[[Mili Smith]] │\n│ |{{flagIOC|JPN|2022 Winter}}
[[Satsuki Fujisawa]]
[[Chinami Yoshida]]
[[Yumi Suzuki]]
[[Yurika Yoshida]]
[[Kotomi Ishizaki]] │\n│ |{{flagIOC|SWE|2022 Winter}}
[[Anna Hasselborg]]
[[Sara McManus]]
[[Agnes Knochenhauer]]
[[Sofia Mabergs]]
[[Johanna Heldin]] │\n│ |- │\n│ |Mixed doubles
{{DetailsLink|Curling at the 2022 Winter Olympics – Mixed doubles tournament}} │\n│ |{{flagIOC|ITA|2022 Winter}}
[[Stefania Constantini]]
[[Amos Mosaner]] │\n│ |{{flagIOC|NOR|2022 Winter}}
[[Kristin Skaslien]]
[[Magnus Nedregotten]] │\n│ |{{flagIOC|SWE|2022 Winter}}
[[Almida de Val]]
[[Oskar Eriksson]] │\n│ |} │\n╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛\n\n\n\nrelatedness=0.867\n╒════════════════════════════════════════════════════════════════════╕\n│ Result │\n╞════════════════════════════════════════════════════════════════════╡\n│ Curling at the 2022 Winter Olympics │\n│ │\n│ ==Results summary== │\n│ │\n│ ===Men's tournament=== │\n│ │\n│ ====Playoffs==== │\n│ │\n│ =====Gold medal game===== │\n│ │\n│ ''Saturday, 19 February, 14:50'' │\n│ {{#lst:Curling at the 2022 Winter Olympics – Men's tournament|GM}} │\n│ {{Player percentages │\n│ | team1 = {{flagIOC|GBR|2022 Winter}} │\n│ | [[Hammy McMillan Jr.]] | 95% │\n│ | [[Bobby Lammie]] | 80% │\n│ | [[Grant Hardie]] | 94% │\n│ | [[Bruce Mouat]] | 89% │\n│ | teampct1 = 90% │\n│ | team2 = {{flagIOC|SWE|2022 Winter}} │\n│ | [[Christoffer Sundgren]] | 99% │\n│ | [[Rasmus Wranå]] | 95% │\n│ | [[Oskar Eriksson]] | 93% │\n│ | [[Niklas Edin]] | 87% │\n│ | teampct2 = 94% │\n│ }} │\n╘════════════════════════════════════════════════════════════════════╛\n\n\n\n" + "text": [ + "relatedness=0.879\n", + "╒═══════════════════════════════════════════════════╕\n", + "│ Result │\n", + "╞═══════════════════════════════════════════════════╡\n", + "│ Curling at the 2022 Winter Olympics │\n", + "│ │\n", + "│ ==Medal summary== │\n", + "│ │\n", + "│ ===Medal table=== │\n", + "│ │\n", + "│ {{Medals table │\n", + "│ | caption = │\n", + "│ | host = │\n", + "│ | flag_template = flagIOC │\n", + "│ | event = 2022 Winter │\n", + "│ | team = │\n", + "│ | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1 │\n", + "│ | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0 │\n", + "│ | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0 │\n", + "│ | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2 │\n", + "│ | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0 │\n", + "│ | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0 
│\n", + "│ }} │\n", + "╘═══════════════════════════════════════════════════╛\n", + "\n", + "\n", + "\n", + "relatedness=0.872\n", + "╒══════════════════════════════════════════════════════════════════════╕\n", + "│ Result │\n", + "╞══════════════════════════════════════════════════════════════════════╡\n", + "│ Curling at the 2022 Winter Olympics │\n", + "│ │\n", + "│ ==Results summary== │\n", + "│ │\n", + "│ ===Women's tournament=== │\n", + "│ │\n", + "│ ====Playoffs==== │\n", + "│ │\n", + "│ =====Gold medal game===== │\n", + "│ │\n", + "│ ''Sunday, 20 February, 9:05'' │\n", + "│ {{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}} │\n", + "│ {{Player percentages │\n", + "│ | team1 = {{flagIOC|JPN|2022 Winter}} │\n", + "│ | [[Yurika Yoshida]] | 97% │\n", + "│ | [[Yumi Suzuki]] | 82% │\n", + "│ | [[Chinami Yoshida]] | 64% │\n", + "│ | [[Satsuki Fujisawa]] | 69% │\n", + "│ | teampct1 = 78% │\n", + "│ | team2 = {{flagIOC|GBR|2022 Winter}} │\n", + "│ | [[Hailey Duff]] | 90% │\n", + "│ | [[Jennifer Dodds]] | 89% │\n", + "│ | [[Vicky Wright]] | 89% │\n", + "│ | [[Eve Muirhead]] | 88% │\n", + "│ | teampct2 = 89% │\n", + "│ }} │\n", + "╘══════════════════════════════════════════════════════════════════════╛\n", + "\n", + "\n", + "\n", + "relatedness=0.869\n", + "╒═══════════════════════════════════════════════════════════════════════════════╕\n", + "│ Result │\n", + "╞═══════════════════════════════════════════════════════════════════════════════╡\n", + "│ Curling at the 2022 Winter Olympics │\n", + "│ │\n", + "│ ==Results summary== │\n", + "│ │\n", + "│ ===Mixed doubles tournament=== │\n", + "│ │\n", + "│ ====Playoffs==== │\n", + "│ │\n", + "│ =====Gold medal game===== │\n", + "│ │\n", + "│ ''Tuesday, 8 February, 20:05'' │\n", + "│ {{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}} │\n", + "│ {| class=\"wikitable\" │\n", + "│ !colspan=4 width=400|Player percentages │\n", + "│ |- │\n", + "│ !colspan=2 width=200 
style=\"white-space:nowrap;\"| {{flagIOC|ITA|2022 Winter}} │\n", + "│ !colspan=2 width=200 style=\"white-space:nowrap;\"| {{flagIOC|NOR|2022 Winter}} │\n", + "│ |- │\n", + "│ | [[Stefania Constantini]] || 83% │\n", + "│ | [[Kristin Skaslien]] || 70% │\n", + "│ |- │\n", + "│ | [[Amos Mosaner]] || 90% │\n", + "│ | [[Magnus Nedregotten]] || 69% │\n", + "│ |- │\n", + "│ | '''Total''' || 87% │\n", + "│ | '''Total''' || 69% │\n", + "│ |} │\n", + "╘═══════════════════════════════════════════════════════════════════════════════╛\n", + "\n", + "\n", + "\n", + "relatedness=0.868\n", + "╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕\n", + "│ Result │\n", + "╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡\n", + "│ Curling at the 2022 Winter Olympics │\n", + "│ │\n", + "│ ==Medal summary== │\n", + "│ │\n", + "│ ===Medalists=== │\n", + "│ │\n", + "│ {| {{MedalistTable|type=Event|columns=1}} │\n", + "│ |- │\n", + "│ |Men
{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}} │\n", + "│ |{{flagIOC|SWE|2022 Winter}}
[[Niklas Edin]]
[[Oskar Eriksson]]
[[Rasmus Wranå]]
[[Christoffer Sundgren]]
[[Daniel Magnusson (curler)|Daniel Magnusson]] │\n", + "│ |{{flagIOC|GBR|2022 Winter}}
[[Bruce Mouat]]
[[Grant Hardie]]
[[Bobby Lammie]]
[[Hammy McMillan Jr.]]
[[Ross Whyte]] │\n", + "│ |{{flagIOC|CAN|2022 Winter}}
[[Brad Gushue]]
[[Mark Nichols (curler)|Mark Nichols]]
[[Brett Gallant]]
[[Geoff Walker (curler)|Geoff Walker]]
[[Marc Kennedy]] │\n", + "│ |- │\n", + "│ |Women
{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}} │\n", + "│ |{{flagIOC|GBR|2022 Winter}}
[[Eve Muirhead]]
[[Vicky Wright]]
[[Jennifer Dodds]]
[[Hailey Duff]]
[[Mili Smith]] │\n", + "│ |{{flagIOC|JPN|2022 Winter}}
[[Satsuki Fujisawa]]
[[Chinami Yoshida]]
[[Yumi Suzuki]]
[[Yurika Yoshida]]
[[Kotomi Ishizaki]] │\n", + "│ |{{flagIOC|SWE|2022 Winter}}
[[Anna Hasselborg]]
[[Sara McManus]]
[[Agnes Knochenhauer]]
[[Sofia Mabergs]]
[[Johanna Heldin]] │\n", + "│ |- │\n", + "│ |Mixed doubles
{{DetailsLink|Curling at the 2022 Winter Olympics – Mixed doubles tournament}} │\n", + "│ |{{flagIOC|ITA|2022 Winter}}
[[Stefania Constantini]]
[[Amos Mosaner]] │\n", + "│ |{{flagIOC|NOR|2022 Winter}}
[[Kristin Skaslien]]
[[Magnus Nedregotten]] │\n", + "│ |{{flagIOC|SWE|2022 Winter}}
[[Almida de Val]]
[[Oskar Eriksson]] │\n", + "│ |} │\n", + "╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛\n", + "\n", + "\n", + "\n", + "relatedness=0.867\n", + "╒════════════════════════════════════════════════════════════════════╕\n", + "│ Result │\n", + "╞════════════════════════════════════════════════════════════════════╡\n", + "│ Curling at the 2022 Winter Olympics │\n", + "│ │\n", + "│ ==Results summary== │\n", + "│ │\n", + "│ ===Men's tournament=== │\n", + "│ │\n", + "│ ====Playoffs==== │\n", + "│ │\n", + "│ =====Gold medal game===== │\n", + "│ │\n", + "│ ''Saturday, 19 February, 14:50'' │\n", + "│ {{#lst:Curling at the 2022 Winter Olympics – Men's tournament|GM}} │\n", + "│ {{Player percentages │\n", + "│ | team1 = {{flagIOC|GBR|2022 Winter}} │\n", + "│ | [[Hammy McMillan Jr.]] | 95% │\n", + "│ | [[Bobby Lammie]] | 80% │\n", + "│ | [[Grant Hardie]] | 94% │\n", + "│ | [[Bruce Mouat]] | 89% │\n", + "│ | teampct1 = 90% │\n", + "│ | team2 = {{flagIOC|SWE|2022 Winter}} │\n", + "│ | [[Christoffer Sundgren]] | 99% │\n", + "│ | [[Rasmus Wranå]] | 95% │\n", + "│ | [[Oskar Eriksson]] | 93% │\n", + "│ | [[Niklas Edin]] | 87% │\n", + "│ | teampct2 = 94% │\n", + "│ }} │\n", + "╘════════════════════════════════════════════════════════════════════╛\n", + "\n", + "\n", + "\n" + ] } ], - "source": "from tabulate import tabulate\n\nstrings, relatednesses = strings_ranked_by_relatedness(\n \"curling gold medal\",\n df,\n \"winter_olympics_2022\",\n top_n=5\n)\n\nfor string, relatedness in zip(strings, relatednesses):\n print(f\"{relatedness=:.3f}\")\n print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))\n print('\\n\\n')" + "source": [ + "from tabulate import tabulate\n", + "\n", + "strings, relatednesses = strings_ranked_by_relatedness(\n", + " \"curling gold medal\",\n", + " df,\n", + " \"winter_olympics_2022\",\n", + " top_n=5\n", + ")\n", + 
"\n", + "for string, relatedness in zip(strings, relatednesses):\n", + " print(f\"{relatedness=:.3f}\")\n", + " print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))\n", + " print('\\n\\n')" + ] }, { "cell_type": "code", @@ -470,7 +929,62 @@ "trusted": true }, "outputs": [], - "source": "import tiktoken\n\n\ndef num_tokens(text: str, model: str=GPT_MODEL) -> int:\n \"\"\"Return the number of tokens in a string.\"\"\"\n encoding = tiktoken.encoding_for_model(model)\n return len(encoding.encode(text))\n\n\ndef query_message(\n query: str,\n df: pd.DataFrame,\n model: str,\n token_budget: int\n) -> str:\n \"\"\"Return a message for GPT, with relevant source texts pulled from SingleStoreDB.\"\"\"\n strings, relatednesses = strings_ranked_by_relatedness(query, df, \"winter_olympics_2022\")\n introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write \"I could not find an answer.\"'\n question = f\"\\n\\nQuestion: {query}\"\n message = introduction\n for string in strings:\n next_article = f'\\n\\nWikipedia article section:\\n\"\"\"\\n{string}\\n\"\"\"'\n if (\n num_tokens(message + next_article + question, model=model)\n > token_budget\n ):\n break\n else:\n message += next_article\n return message + question\n\n\ndef ask(\n query: str,\n df: pd.DataFrame=df,\n model: str=GPT_MODEL,\n token_budget: int=4096 - 500,\n print_message: bool=False,\n) -> str:\n \"\"\"Answers a query using GPT and a table of relevant texts and embeddings in SingleStoreDB.\"\"\"\n message = query_message(query, df, model=model, token_budget=token_budget)\n if print_message:\n print(message)\n messages = [\n {\"role\": \"system\", \"content\": \"You answer questions about the 2022 Winter Olympics.\"},\n {\"role\": \"user\", \"content\": message},\n ]\n response = openai.ChatCompletion.create(\n model=model,\n messages=messages,\n temperature=0\n )\n response_message = 
response[\"choices\"][0][\"message\"][\"content\"]\n return response_message" + "source": [ + "import tiktoken\n", + "\n", + "\n", + "def num_tokens(text: str, model: str=GPT_MODEL) -> int:\n", + " \"\"\"Return the number of tokens in a string.\"\"\"\n", + " encoding = tiktoken.encoding_for_model(model)\n", + " return len(encoding.encode(text))\n", + "\n", + "\n", + "def query_message(\n", + " query: str,\n", + " df: pd.DataFrame,\n", + " model: str,\n", + " token_budget: int\n", + ") -> str:\n", + " \"\"\"Return a message for GPT, with relevant source texts pulled from SingleStoreDB.\"\"\"\n", + " strings, relatednesses = strings_ranked_by_relatedness(query, df, \"winter_olympics_2022\")\n", + " introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write \"I could not find an answer.\"'\n", + " question = f\"\\n\\nQuestion: {query}\"\n", + " message = introduction\n", + " for string in strings:\n", + " next_article = f'\\n\\nWikipedia article section:\\n\"\"\"\\n{string}\\n\"\"\"'\n", + " if (\n", + " num_tokens(message + next_article + question, model=model)\n", + " > token_budget\n", + " ):\n", + " break\n", + " else:\n", + " message += next_article\n", + " return message + question\n", + "\n", + "\n", + "def ask(\n", + " query: str,\n", + " df: pd.DataFrame=df,\n", + " model: str=GPT_MODEL,\n", + " token_budget: int=4096 - 500,\n", + " print_message: bool=False,\n", + ") -> str:\n", + " \"\"\"Answers a query using GPT and a table of relevant texts and embeddings in SingleStoreDB.\"\"\"\n", + " message = query_message(query, df, model=model, token_budget=token_budget)\n", + " if print_message:\n", + " print(message)\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": \"You answer questions about the 2022 Winter Olympics.\"},\n", + " {\"role\": \"user\", \"content\": message},\n", + " ]\n", + " response = openai.ChatCompletion.create(\n", + " 
model=model,\n", + " messages=messages,\n", + " temperature=0\n", + " )\n", + " response_message = response[\"choices\"][0][\"message\"][\"content\"]\n", + " return response_message" + ] }, { "cell_type": "code", @@ -491,16 +1005,26 @@ { "name": "stdout", "output_type": "stream", - "text": "There were three curling events at the 2022 Winter Olympics: men's, women's, and mixed doubles. The gold medalists for each event are:\n\n- Men's: Sweden (Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, Daniel Magnusson)\n- Women's: Great Britain (Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, Mili Smith)\n- Mixed doubles: Italy (Stefania Constantini, Amos Mosaner)\n" + "text": [ + "There were three curling events at the 2022 Winter Olympics: men's, women's, and mixed doubles. The gold medalists for each event are:\n", + "\n", + "- Men's: Sweden (Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, Daniel Magnusson)\n", + "- Women's: Great Britain (Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, Mili Smith)\n", + "- Mixed doubles: Italy (Stefania Constantini, Amos Mosaner)\n" + ] } ], - "source": "print(ask('Who won the gold medal for curling in Olymics 2022?'))" + "source": [ + "print(ask('Who won the gold medal for curling in Olympics 2022?'))" + ] }, { "cell_type": "markdown", "id": "30cca5fc-9cf5-474b-820f-440255193976", "metadata": {}, - "source": "" + "source": [ + "" + ] }, { "cell_type": "code", @@ -508,7 +1032,7 @@ "id": "aac72aa8-ce6e-4f1e-9684-efe36275be48", "metadata": {}, "outputs": [], - "source": "" + "source": [] } ], "metadata": {