From 14a2aa59a0d474fe786979ade1ae9839c9468143 Mon Sep 17 00:00:00 2001
From: Kevin D Smith
Date: Tue, 8 Oct 2024 10:18:56 -0500
Subject: [PATCH] Update notebook.ipynb

---
 notebooks/load-data-json/notebook.ipynb | 404 +++++++++++-------------
 1 file changed, 190 insertions(+), 214 deletions(-)

diff --git a/notebooks/load-data-json/notebook.ipynb b/notebooks/load-data-json/notebook.ipynb
index ee616d2..adee439 100644
--- a/notebooks/load-data-json/notebook.ipynb
+++ b/notebooks/load-data-json/notebook.ipynb
@@ -2,146 +2,143 @@
  "cells": [
  {
   "cell_type": "markdown",
-  "id": "",
+  "id": "deb8dbf4-2368-41b4-9f09-b14c96ccb344",
  "metadata": {},
  "source": [
-   "
\n", + "
\n", "
\n", - " \n", + " \n", "
\n", "
\n", "
SingleStore Notebooks
\n", - "

Employee Data Analysis JSON Dataset

\n", + "

Load JSON files with Pipeline from S3

\n", "
\n", "
" ] }, { "cell_type": "markdown", - "id": "", + "id": "b4b337ff", "metadata": {}, "source": [ - "
\n", - " \n", + "
\n", + " \n", "
\n", "

Note

\n", - "

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

\n", + "

This tutorial is meant for Standard and Premium Workspaces. You can't run it on a Free Starter Workspace due to restrictions on storage. Create a Workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with \"Starter\" are suitable to run on a Free Starter Workspace.

\n", "
\n", "
" ] }, { - "attachments": {}, - "cell_type": "markdown", - "id": "", - "metadata": {}, - "source": [ - "In this example, we want to create a pipeline from multiple JSON files stored in an AWS S3 bucket called singlestoredb and a folder called **employeedata**. This bucket is located in **ap-south-1**." - ] - }, - { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "50093846-9ea3-441d-89f0-fbe0576f78bf", "metadata": {}, "source": [ - "Each file has the following shape with nested arrays:\n", - "\n", - "```json\n", - "{\n", - " \"userId\": \"88-052-8576\",\n", - " \"jobTitleName\": \"Social Worker\",\n", - " \"firstName\": \"Mavis\",\n", - " \"lastName\": \"Hilldrop\",\n", - " \"dataofjoining\": \"20/09/2020\",\n", - " \"contactinfo\": {\n", - " \"city\": \"Dallas\",\n", - " \"phone\": \"972-454-9822\",\n", - " \"emailAddress\": \"mhilldrop0@google.ca\",\n", - " \"state\": \"TX\",\n", - " \"zipcode\": \"75241\"\n", - " },\n", - " \"Children\": [\n", - " \"Evaleen\",\n", - " \"Coletta\",\n", - " \"Leonelle\"\n", - " ],\n", - " \"salary\": 203000\n", - "}\n", - "```" + "This notebook helps you navigate through different scenarios data ingestion of JSON files from an AWS S3 location:\n", + "* Ingest JSON files in AWS S3 using wildcards with pre-defined schema\n", + "* Ingest JSON files in AWS S3 using wildcards into a JSON column" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "b2ed410a-87b8-452a-b906-431fb0e949b3", "metadata": {}, "source": [ - "

Demo Flow

" + "## Create a Pipeline from JSON files in AWS S3 using wildcards" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "9996b479-586d-4af3-b0ee-b61eead39ebc", "metadata": {}, "source": [ - "" + "In this example, we want to create a pipeline from two JSON files called **actors1.json** and **actors2.json** stored in an AWS S3 bucket called singlestoredb and a folder called **actors**. This bucket is located in **us-east-1**." ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "9a4caf68-0610-41a6-bfd1-59612b8e959a", "metadata": {}, "source": [ - "## How to use this notebook" + "Each file has the following shape with nested objects and arrays:\n", + "```json\n", + "{\n", + " \"Actors\": [\n", + " {\n", + " \"name\": \"Tom Cruise\",\n", + " \"age\": 56,\n", + " \"Born At\": \"Syracuse, NY\",\n", + " \"Birthdate\": \"July 3, 1962\",\n", + " \"photo\": \"https://jsonformatter.org/img/tom-cruise.jpg\",\n", + " \"wife\": null,\n", + " \"weight\": 67.5,\n", + " \"hasChildren\": true,\n", + " \"hasGreyHair\": false,\n", + " \"children\": [\n", + " \"Suri\",\n", + " \"Isabella Jane\",\n", + " \"Connor\"\n", + " ]\n", + " },\n", + " {\n", + " \"name\": \"Robert Downey Jr.\",\n", + " \"age\": 53,\n", + " \"Born At\": \"New York City, NY\",\n", + " \"Birthdate\": \"April 4, 1965\",\n", + " \"photo\": \"https://jsonformatter.org/img/Robert-Downey-Jr.jpg\",\n", + " \"wife\": \"Susan Downey\",\n", + " \"weight\": 77.1,\n", + " \"hasChildren\": true,\n", + " \"hasGreyHair\": false,\n", + " \"children\": [\n", + " \"Indio Falconer\",\n", + " \"Avri Roel\",\n", + " \"Exton Elias\"\n", + " ]\n", + " }\n", + " ]\n", + "}\n", + "```" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "98a8e14f-808e-43ff-b670-b6656091b81a", "metadata": {}, "source": [ - "" + "### Create a Table" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "a70e168d-de32-4988-90c4-651089ac25a0", "metadata": {}, "source": [ - "## Create a database (You can skip this Step if you are using Free Starter Tier)\n", - "\n", - "We need to create a database to work with in the following examples." + "We first create a table called **actors** in the database **demo_database**" ] }, { "cell_type": "code", "execution_count": 1, - "id": "", + "id": "b703aab8-7449-43db-af04-9d65520239a5", "metadata": {}, "outputs": [], "source": [ - "shared_tier_check = %sql show variables like 'is_shared_tier'\n", - "if not shared_tier_check or shared_tier_check[0][1] == 'OFF':\n", - " %sql DROP DATABASE IF EXISTS HRData;\n", - " %sql CREATE DATABASE HRData;" + "%%sql\n", + "CREATE DATABASE IF NOT EXISTS demo_database;" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "6dfc5b0b-9308-46c9-8cc8-be08fb07c1b6", "metadata": {}, "source": [ "
\n", " \n", "
\n", "

Action Required

\n", - "

If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

\n", + "

Make sure to select the demo_database database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

\n", "
\n", "
" ] @@ -149,322 +146,301 @@ { "cell_type": "code", "execution_count": 2, - "id": "", + "id": "b09528cf-0beb-4fe0-9e60-6edefb72f8b1", "metadata": {}, "outputs": [], "source": [ "%%sql\n", - "\n", - "#creating table for sample data\n", - "\n", - "CREATE TABLE IF NOT EXISTS employeeData (\n", - " userId text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", - " jobTitleName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", - " firstName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", - " lastName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", - " dataofjoining text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", - " contactinfo JSON COLLATE utf8_bin NOT NULL,\n", - " salary int NOT NULL,\n", - " Children JSON COLLATE utf8_bin NOT NULL\n", - " );" + "CREATE TABLE IF NOT EXISTS demo_database.actors (\n", + " name text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + " age int NOT NULL,\n", + " born_at text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + " Birthdate text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + " photo text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + " wife text CHARACTER SET utf8 COLLATE utf8_general_ci,\n", + " weight float NOT NULL,\n", + " haschildren boolean,\n", + " hasGreyHair boolean,\n", + " children JSON COLLATE utf8_bin NOT NULL,\n", + " SHARD KEY ()\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "e4c15a63-eb17-432d-b0b5-d7485bcf028d", + "metadata": {}, + "source": [ + "### Create a pipeline" ] }, { - "attachments": {}, "cell_type": "markdown", - "id": "", + "id": "5e09146a-74cb-4e0d-bd0a-3502c2d15a00", "metadata": {}, "source": [ - "### Create Pipeline To Insert JSON Data into Table" + "We then create a pipeline called **actors** in the database **demo_database**. Since those files are small, batch_interval is not as important and the maximum partitions per batch is only 1. For faster performance, we recommend increasing the maximum partitions per batch.\n", + "Note, that since the bucket is publcly accessible, you do not need to provide access key and secret." 
]
 },
 {
  "cell_type": "code",
  "execution_count": 3,
-  "id": "",
+  "id": "92df7943-e68d-4509-b7f5-4a93697f6578",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "\n",
-   "#creating pipeline for sample data\n",
-   "\n",
-   "CREATE PIPELINE IF NOT EXISTS employeeData AS\n",
-   "LOAD DATA S3 'singlestoreloaddata/employeedata/*.json'\n",
-   "CONFIG '{ \\\"region\\\": \\\"ap-south-1\\\" }'\n",
   "    /*\n",
   "    CREDENTIALS '{\"aws_access_key_id\": \"\",\n",
   "                  \"aws_secret_access_key\": \"\"}'\n",
   "    */\n",
-   "INTO TABLE employeeData\n",
-   "FORMAT JSON\n",
-   "(\n",
-   "    userId <- userId,\n",
-   "    jobTitleName <- jobTitleName,\n",
-   "    firstName <- firstName,\n",
-   "    lastName <- lastName,\n",
-   "    dataofjoining <- dataofjoining,\n",
-   "    contactinfo <- contactinfo,\n",
-   "    salary <- salary,\n",
-   "    Children <- Children\n",
-   ");\n",
-   "\n",
-   "START PIPELINE employeeData;"
+   "CREATE PIPELINE IF NOT EXISTS demo_database.actors\n",
+   "    AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\n",
+   "    CONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n",
+   "    BATCH_INTERVAL 2500\n",
+   "    MAX_PARTITIONS_PER_BATCH 1\n",
+   "    DISABLE OUT_OF_ORDER OPTIMIZATION\n",
+   "    DISABLE OFFSETS METADATA GC\n",
+   "    SKIP DUPLICATE KEY ERRORS\n",
+   "    INTO TABLE `actors`\n",
+   "    FORMAT JSON\n",
+   "    (\n",
+   "        actors.name <- name,\n",
+   "        actors.age <- age,\n",
+   "        actors.born_at <- `Born At`,\n",
+   "        actors.Birthdate <- Birthdate,\n",
+   "        actors.photo <- photo,\n",
+   "        actors.wife <- wife,\n",
+   "        actors.weight <- weight,\n",
+   "        actors.haschildren <- hasChildren,\n",
+   "        actors.hasGreyHair <- hasGreyHair,\n",
+   "        actors.children <- children\n",
+   "    );"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "5410c1b9-573f-4326-ba4c-b7af71e069ad",
  "metadata": {},
  "source": [
-   "### Check if Data is Loaded"
+   "### Start and monitor the pipeline"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 4,
-  "id": "",
+  "id": "eeddd12e-e28c-4000-859b-6d1291c4a137",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "SELECT * from employeeData limit 5;"
+   "START PIPELINE demo_database.actors;"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "a555997d-38dc-4b69-821b-390e52bb4d00",
  "metadata": {},
  "source": [
-   "### Sample Queries"
-  ]
- },
- {
-  "attachments": {},
-  "cell_type": "markdown",
-  "id": "",
-  "metadata": {},
-  "source": [
-   "#### Select Top 2 Employees with highest salary risiding in State 'MS'"
+   "If the pipeline runs without errors or warnings, the query below returns an empty result.",
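+   "\n",
+   "You can also check what each batch actually did. As a rough sanity check (the exact columns of this view vary by version), inspect the batch-level summary in `information_schema`:\n",
+   "\n",
+   "```sql\n",
+   "SELECT * FROM information_schema.pipelines_batches_summary\n",
+   "    WHERE pipeline_name = 'actors';\n",
+   "```"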
]
 },
 {
  "cell_type": "code",
  "execution_count": 5,
-  "id": "",
+  "id": "f48de155-af85-4c40-ad56-955573a434f8",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "select * from employeeData where contactinfo::$state = 'MS' order by salary desc limit 2"
+   "SELECT * FROM information_schema.pipelines_errors\n",
+   "    WHERE pipeline_name = 'actors';"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "c18ac453-63de-424a-b9bf-ae6846817ea6",
  "metadata": {},
  "source": [
-   "#### Select Top 5 Cities with highest Average salary"
+   "### Query the table"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 6,
-  "id": "",
+  "id": "09a739cb-4925-4699-ab61-71016a04bfb6",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "select contactinfo::$city as City,AVG(salary) as 'Avg Salary' from employeeData\n",
-   "    group by contactinfo::$city order by AVG(salary) desc limit 5"
+   "SELECT * FROM demo_database.actors;"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "c4815572-10d8-4c31-a246-05ad6e7e6e99",
  "metadata": {},
  "source": [
-   "#### Number of employees with Children grouped by No of children"
+   "### Cleanup resources"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 7,
-  "id": "",
+  "id": "6a6dfc1d-c758-4287-a797-6cc3e4fff934",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "SELECT\n",
-   "    JSON_LENGTH(Children) as No_of_Kids,\n",
-   "    COUNT(*) AS employees_with_children\n",
-   "FROM employeeData\n",
-   "    group by JSON_LENGTH(Children);"
+   "DROP PIPELINE IF EXISTS demo_database.actors;\n",
+   "DROP TABLE IF EXISTS demo_database.actors;"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "09fbffac-9a0a-45fd-ad07-ede4e11b3691",
  "metadata": {},
  "source": [
-   "#### Average salary of employees who have children"
+   "## Ingest JSON files in AWS S3 using wildcards into a JSON column"
  ]
 },
 {
-  "cell_type": "code",
-  "execution_count": 8,
-  "id": "",
+  "cell_type": "markdown",
+  "id": "d3e8ff65-1b2d-47c5-8754-28fa4c254edd",
  "metadata": {},
-  "outputs": [],
  "source": [
-   "%%sql\n",
-   "SELECT\n",
-   "    AVG(salary) AS average_salary_with_children\n",
-   "FROM employeeData\n",
-   "WHERE JSON_LENGTH(Children) > 0;"
+   "As the schema of your files might change, you might want to keep the flexibility of ingesting the data into one JSON column, which we name **json_data**. The table we create is named **actors_json**.",
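+   "\n",
+   "The raw documents stay queryable after the load: SingleStore's JSON path syntax can reach into the column. As a small sketch (assuming each row holds one file's top-level object), this counts the actors per ingested document:\n",
+   "\n",
+   "```sql\n",
+   "SELECT JSON_LENGTH(json_data::Actors) AS num_actors\n",
+   "    FROM demo_database.actors_json;\n",
+   "```"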
]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "d761f324-0d28-4713-a866-3f96673d8317",
  "metadata": {},
  "source": [
-   "#### Select the total and average salary by State"
+   "### Create a Table"
  ]
 },
 {
  "cell_type": "code",
-  "execution_count": 9,
-  "id": "",
+  "execution_count": 8,
+  "id": "bcb14814-7b79-4df2-ab47-7def7ae03ce3",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "SELECT\n",
-   "    contactinfo::$state AS State,\n",
-   "    COUNT(*) AS 'No of Employees',\n",
-   "    SUM(salary) AS 'Total Salary',\n",
-   "    AVG(salary) AS 'Average Salary'\n",
-   "FROM employeeData\n",
-   "GROUP BY contactinfo::$state limit 5;"
+   "CREATE TABLE IF NOT EXISTS demo_database.actors_json (\n",
+   "    json_data JSON NOT NULL,\n",
+   "    SHARD KEY ()\n",
+   ");"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "429fce4b-c529-4acf-af7e-5d802f79eda6",
  "metadata": {},
  "source": [
-   "#### Top 5 job title with highest number of employees"
+   "### Create a pipeline"
  ]
 },
 {
  "cell_type": "code",
-  "execution_count": 10,
-  "id": "",
+  "execution_count": 9,
+  "id": "a1d60130-095e-45da-b55d-b427a0af3d26",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "SELECT\n",
-   "    jobTitleName,\n",
-   "    COUNT(*) AS num_employees\n",
-   "FROM employeeData\n",
-   "GROUP BY jobTitleName order by num_employees desc limit 5;"
+   "CREATE PIPELINE IF NOT EXISTS demo_database.actors_json\n",
+   "    AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\n",
+   "    CONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n",
   "    /*\n",
   "    CREDENTIALS '{\"aws_access_key_id\": \"\",\n",
   "                  \"aws_secret_access_key\": \"\"}'\n",
   "    */\n",
+   "    BATCH_INTERVAL 2500\n",
+   "    MAX_PARTITIONS_PER_BATCH 1\n",
+   "    DISABLE OUT_OF_ORDER OPTIMIZATION\n",
+   "    DISABLE OFFSETS METADATA GC\n",
+   "    SKIP DUPLICATE KEY ERRORS\n",
+   "    INTO TABLE `actors_json`\n",
+   "    FORMAT JSON\n",
+   "    (json_data <- %);"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "bd296bf5-db20-4028-a1d7-b5c9da0a6cb2",
  "metadata": {},
  "source": [
-   "#### Select the highest and lowest salary"
+   "### Start and monitor the pipeline"
  ]
 },
 {
  "cell_type": "code",
-  "execution_count": 11,
-  "id": "",
+  "execution_count": 10,
+  "id": "b374598a-f9cb-43c4-a2a4-ebcd298108c4",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "SELECT\n",
-   "    MAX(salary) AS highest_salary,\n",
-   "    MIN(salary) AS lowest_salary\n",
-   "FROM employeeData;"
+   "START PIPELINE demo_database.actors_json;"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 11,
  "id": "ca06781b-61fa-4fea-97de-cd0dbacd86e8",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
   "# Monitor and see if there is any error or warning\n",
   "SELECT * FROM information_schema.pipelines_errors\n",
   "    WHERE pipeline_name = 'actors_json';"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "7419ccdd-0f85-414e-bd05-fbe8d9656305",
  "metadata": {},
  "source": [
-   "## Conclusion\n",
-   "\n",
-   "\n",
-   "We have shown how to connect to S3 using `Pipelines` and insert JSON data into SinglestoreDB. These techniques should enable you to\n",
-   "integrate and query your JSON data with SingleStoreDB."
-  ]
- },
- {
-  "attachments": {},
-  "cell_type": "markdown",
-  "id": "",
-  "metadata": {},
-  "source": [
-   "## Clean up\n",
-   "\n",
-   "Remove the '#' to uncomment and execute the queries below to clean up the pipeline and table created."
+   "### Query the table"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 12,
-  "id": "",
+  "id": "e34c5b49-0e97-4b07-9026-38bb6c370f73",
  "metadata": {},
  "outputs": [],
  "source": [
   "%%sql\n",
-   "#STOP PIPELINE employeeData;\n",
-   "\n",
-   "#DROP PIPELINE employeeData;"
+   "SELECT * FROM demo_database.actors_json;"
  ]
 },
 {
-  "attachments": {},
  "cell_type": "markdown",
-  "id": "",
+  "id": "c4c155e5-a4a5-4b01-a8a7-e7e626e5fac8",
  "metadata": {},
  "source": [
-   "Drop data"
+   "### Cleanup resources"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 13,
-  "id": "",
+  "id": "6f0bd356-8a11-4cd9-b774-569d8f5e2520",
  "metadata": {},
  "outputs": [],
  "source": [
-   "#shared_tier_check = %sql show variables like 'is_shared_tier'\n",
-   "#if not shared_tier_check or shared_tier_check[0][1] == 'OFF':\n",
-   "#    %sql DROP DATABASE IF EXISTS HRData;\n",
-   "#else:\n",
-   "#    %sql DROP TABLE employeeData;"
+   "%%sql\n",
+   "DROP DATABASE IF EXISTS demo_database;"
  ]
 },
 {
  "cell_type": "markdown",
-  "id": "",
+  "id": "c572193e-7f5b-4637-af5d-2f33f5ba5d86",
  "metadata": {},
  "source": [
   "
\n", @@ -494,7 +470,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.6" + "version": "3.11.4" } }, "nbformat": 4,