
[RFC] OpenSearch and Apache Spark Integration #4

Open · penghuo opened this issue Nov 29, 2022 · 16 comments

@penghuo (Collaborator) commented Nov 29, 2022

Introduction

We received a feature request for query execution on object stores in OpenSearch.

We investigated the possibility of building a new solution for OpenSearch that leverages an object store as storage.

We found the main challenges to be:

  • The OpenSearch aggregation framework is a simplified MPP framework and does not support a shuffle stage.
  • The OpenSearch query framework is missing key features, e.g. JOIN and subqueries.

We found these problems have already been solved by general-purpose data processing systems, e.g. Presto, Spark, and Trino, and building such a platform from scratch would take years to mature.

Idea

The initial idea is:

  1. Use SQL as the interface.
  2. Leverage Spark as the query/compute execution engine.

High-level diagram:

[Image: high-level architecture diagram]

User Experience

  1. The user configures a Spark cluster as the computation resource, e.g. https://SPARK:7707.
  2. The user submits SQL to the OpenSearch cluster using the _plugins/_sql REST API.
    1. The SQL engine parses and analyzes the SQL query.
    2. The SQL engine decides whether to route the query to the Spark cluster or run it locally (an example routed query follows this list).
  3. In phase 1, we provide an interface that lets the user create a derived dataset from data on the object store and store it in OpenSearch. Queries are then optimized automatically at query time based on the derived dataset.
  4. In phase 2, we provide an opt-in optimization choice for the user: derived datasets are created automatically based on query patterns.
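
For illustration, here is the kind of query we expect to route to Spark: it joins data on the object store, which the OpenSearch query framework cannot execute locally (see the challenges above). The table names (http_logs on the object store, endpoints in OpenSearch) are hypothetical:

SELECT l.status, COUNT(*) AS hits
FROM http_logs l                      -- hypothetical table backed by the object store
JOIN endpoints e ON l.url = e.url     -- JOIN/subqueries are unsupported by OpenSearch DSL
WHERE l.day >= '2022-11-01'
GROUP BY l.status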

Epic

@penghuo added the enhancement (New feature or request) label on Nov 29, 2022

@anirudha (Collaborator)

This enables Spark as a compute connector to OpenSearch data, correct?
Can we set this up as a remote compute connection, similar to a data source?

  • Is the query from the Spark cluster to the index SQL?

@penghuo (Collaborator, Author) commented Nov 29, 2022

> This enables Spark as a compute connector to OpenSearch data, correct?

Yes, but it is not limited to OpenSearch data.

> Can we set this up as a remote compute connection, similar to a data source?

It is one option, but I feel we should make it more generic; ML could also leverage Spark as a computation engine.

> Is the query from the Spark cluster to the index SQL?

  • The OpenSearch SQL engine will submit the job to the Spark cluster, e.g. a job that leverages OpenSearch to store a materialized view and accelerate the query.
  • Potentially, we could also leverage a Spark job to query/load cold Lucene segments on the object store, providing an alternative solution for the UltraWarm query path.

@YANG-DB (Member) commented Dec 8, 2022

Will the OpenSearch SQL engine be responsible for analyzing the query and dispatching all queries to the MPP engine?
Or will it have the ability to execute parts of the query itself (for OpenSearch indices) and delegate other parts to Spark? Will this require adding rules to Catalyst?

@dai-chen (Collaborator)

Just some thoughts for discussion and a later PoC: we need to verify and confirm the role of Spark RDD (with/without Spark SQL) in OpenSearch:

  1. Provide the capability to query the object store only, via Spark RDD: this is the immediate requirement, and our own query engine is still needed for querying OpenSearch indices and planning execution jobs for Spark.
  2. Provide the capability to query OpenSearch indices as well, via Spark SQL + RDD: higher complexity, and it needs to get Spark SQL/RDD working with OpenSearch DSL, the OpenSearch index, or the Lucene index directly, and thus:
    2a. Provide a faster execution path or remove limitations in OpenSearch aggregation/join.
    2b. Replace the OpenSearch DSL query.

@dai-chen (Collaborator) commented Dec 19, 2022

As discussed, Spark SQL and RDD are only for purpose 1 above. Leveraging them for querying OpenSearch indices is a totally different story and not our current goal. So the question for introducing Spark SQL is: do we need it for optimizing and planning the Spark RDD jobs that query the object store?

Implementation options:

  1. Introduce Spark SQL as a library.
  2. Introduce only part of it, such as the Catalyst optimizer.
  3. Copy the source code and make the required changes.
  4. Reuse our own engine to plan RDD jobs.

Research items:

  1. Metastore: how/where to manage table metadata for Spark SQL.
  2. Fault tolerance: get the WAL and intermediate data store working for streaming.
  3. Thread pool: check whether there are blocking operations in Spark SQL.
  4. Data source integration: pass credentials from the Data Source introduced in 2.4 to the file system reader.
  5. Plugin settings: for example, the response size limit needs to work as well.

@anirudha changed the title from [FEATURE] OpenSearch SQL on Spark to [FEATURE] OpenSearch and Spark Integration on Jan 31, 2023
@anirudha changed the title from [FEATURE] OpenSearch and Spark Integration to [FEATURE] OpenSearch and Apace Spark Integration on Jan 31, 2023
@anirudha changed the title from [FEATURE] OpenSearch and Apace Spark Integration to [FEATURE] OpenSearch and Apache Spark Integration on Jan 31, 2023
@anirudha (Collaborator) commented Mar 3, 2023

[Video attachment: os-sql-viz-final.mov]

@ps48 (Member) commented Mar 3, 2023

[Video attachment: OS-SQL-SPARK.mp4]

@ryn9 commented Mar 9, 2023

Amazing stuff!

How will you support filtering (e.g. timestamp ranges and/or keywords) in relation to the S3 path schema?

For example, if using Fluent Bit's S3 output with s3_key_format /$TAG[2]/$TAG[0]/%Y/%m/%d/%H_%M_%S/$UUID.gz, how will we map a keyword so that we pull only the objects matching the tags in a supplied filter and the desired time range?

ref: https://docs.fluentbit.io/manual/pipeline/outputs/s3/

@dai-chen (Collaborator)

> Amazing stuff!
>
> How will you support filtering (e.g. timestamp ranges and/or keywords) in relation to the S3 path schema?
>
> For example, if using Fluent Bit's S3 output with s3_key_format /$TAG[2]/$TAG[0]/%Y/%m/%d/%H_%M_%S/$UUID.gz, how will we map a keyword so that we pull only the objects matching the tags in a supplied filter and the desired time range?
>
> ref: https://docs.fluentbit.io/manual/pipeline/outputs/s3/

@ryn9 As with optimizations in other query engines, we can leverage partition pruning and data skipping on your data (path or content). Please see a general example of data skipping in opensearch-project/sql#1379 (comment). We may look into Fluent Bit later. Thanks!
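
As a rough sketch of what partition pruning over such a path layout could look like, assuming the log objects are exposed to Spark SQL as a partitioned table (all names here are hypothetical; this also assumes the S3 layout is in, or rewritten into, a partition-discoverable form, or that partitions are registered explicitly):

CREATE TABLE app_logs (message STRING, tag STRING, year INT, month INT, day INT)
USING json
PARTITIONED BY (tag, year, month, day)
LOCATION 's3://my-bucket/logs/';

-- Only partitions matching the filter are listed and scanned, so objects
-- outside the requested tag and time range are never read from S3.
SELECT count(*)
FROM app_logs
WHERE tag = 'app-a' AND year = 2023 AND month = 3 AND day = 9;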

@anirudha (Collaborator)

Decorators will be available in Fluent Bit / Data Prepper / otel-exporter.


@muralikpbhat
Great initiative. I really like the price-performance trade-off that this solution will bring. A few questions below:

  1. Can we think about and call out the downsides of doing query planning in Spark? Will it restrict some existing OpenSearch features? Which ones? A few pointers:
    a. What types of queries don't work with SQL today?
    b. How will DLS/FLS work?
    c. How will document-level alerting/percolator work?
  2. How are we thinking about lifecycle management of materialised views? We need the ability to delete old MVs. (Assuming the maximus table and skipping indices don't need that, as they will not be very large.)
  3. Are we using data streams for MVs so that we don't need explicit index rotation?
  4. Can we think of on-demand materialised views instead of keeping them up to date (cost reduction)?
  5. In the case of an MV, can a query span the MV and raw data? (The case where one data file is projected completely and another is not.)
  6. Similarly, can a query span fields in the MV and raw data for the same document? (Not for fields in the skipping index, but for fields in the MV's covered index.)

@dai-chen (Collaborator)

@muralikpbhat Thanks for all the comments! Please find my answers inline below.

> 1. Can we think about and call out the downsides of doing query planning in Spark? Will it restrict some existing OpenSearch features? Which ones?

In our demo, we use Spark SQL mostly for building the skipping index and MV into an OpenSearch index; all queries and dashboards then work against that index as before.

Since you ask below, I assume we're talking about Spark SQL queries with an OS index involved; if so, there are limitations:

> a. What types of queries don't work with SQL today?

OpenSearch functions, including full-text search and aggregations: this may be solved by either improving OpenSearch-Hadoop or introducing our OS SQL plugin into Spark.

> b. How will DLS/FLS work?

I think we need separate AuthN/Z for raw data on S3. If you're talking about an OS index, the query sent to OS is still DSL, which may work; we need a deep dive.

> c. How will document-level alerting/percolator work?

I think all OS features can work with the MV. For raw data, I'm not sure; we need to understand the use case and workflow.


> 2. How are we thinking about lifecycle management of materialised views? We need the ability to delete old MVs.

Yes, we're considering the MV as a second-level, on-demand acceleration strategy. We will provide a standard SQL API for higher-level applications to use, such as SHOW/DROP MV.
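
As an illustration of that lifecycle (the exact DDL surface is not finalized in this RFC; the view name and statements below are hypothetical):

-- Create an MV whose results are stored in an OpenSearch index.
CREATE MATERIALIZED VIEW top_pages_daily
AS SELECT page_id, COUNT(*) AS views FROM page_views GROUP BY page_id;

-- Inspect and delete old MVs through the same SQL API.
SHOW MATERIALIZED VIEWS;
DROP MATERIALIZED VIEW top_pages_daily;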


> 3. Are we using data streams for MVs so that we don't need explicit index rotation?

As shown in the demo above, the sink (destination) of the streaming job behind the MV is a regular OpenSearch index. I think we can make it any OpenSearch object, as long as the OpenSearch-Hadoop connector supports it.


> 4. Can we think of on-demand materialised views instead of keeping them up to date (cost reduction)?

Yes, that's what we're doing in the demo. We intentionally relax strong consistency between the MV and its source.


> 5. In the case of an MV, can a query span the MV and raw data? (The case where one data file is projected completely and another is not.)

Yes, because the MV itself is a table too, so the user can use it in any query together with raw data. We didn't do this in the demo because OS-Hadoop doesn't currently extend the Spark Catalog, so effort is required to register an MV, or any OS index, in the Spark catalog.

Meanwhile, I'm not sure what specific use case or query you're referring to. We are also considering this, and may need it in the future, for a Hybrid Scan capability: hybrid scan will union the MV data with the latest raw data, which is helpful for customers who want strong consistency.
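
Until then, a minimal sketch of the manual registration, assuming the opensearch-hadoop Spark SQL data source (org.opensearch.spark.sql) with a resource option naming the index; all names are hypothetical and option keys may differ by connector version:

-- Register an OS index (here, an MV) as a temporary view in Spark SQL.
CREATE TEMPORARY VIEW top_pages_daily
USING org.opensearch.spark.sql
OPTIONS (resource 'top_pages_daily', nodes 'https://opensearch-host:9200');

-- Once registered, the MV can be joined with raw data on the object store.
SELECT r.page_id, r.url, v.views
FROM raw_pages r JOIN top_pages_daily v ON r.page_id = v.page_id;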


> 6. Similarly, can a query span fields in the MV and raw data for the same document?

I'm not entirely sure what that query looks like. I think it's possible as long as there is a primary-key field in the MV correlated to rows in the raw data.

@dai-chen changed the title from [FEATURE] OpenSearch and Apache Spark Integration to [RFC] OpenSearch and Apache Spark Integration on Apr 3, 2023
@sathishbaskar

Would joins involve pulling data into RDDs?

@penghuo (Collaborator, Author) commented Jul 3, 2023

> Would joins involve pulling data into RDDs?

Could you elaborate more? Do you mean joining an OpenSearch index and S3?

@sathishbaskar

> Could you elaborate more? Do you mean joining an OpenSearch index and S3?

An example would help explain this better. Consider the following datasets:

users [20 billion docs, ~2 TB]
  user_id, user_name, user_location

pages [1 trillion docs, ~90 TB]
  page_id, website_id

page_views [10 trillion docs, over 1 PB]
  hour_timestamp, user_id, page_id

If I have to prepare a report every day summarizing the page-view pattern over the last 7 days (top 100 pages and top 100 locations), with the following result schemas:

  1. day, hour, page_id, website_id, views
  2. day, hour, user_location, views

SELECT
  DATE(pv.hour_timestamp) AS day, HOUR(pv.hour_timestamp) AS hour, pv.page_id, p.website_id, COUNT(*) AS views
FROM
  page_views pv JOIN pages p ON pv.page_id = p.page_id
WHERE
  pv.hour_timestamp >= date_sub(current_date(), 7)
GROUP BY
  day, hour, pv.page_id, p.website_id
ORDER BY views DESC
LIMIT 100

SELECT
  DATE(pv.hour_timestamp) AS day, HOUR(pv.hour_timestamp) AS hour, u.user_location, COUNT(*) AS views
FROM
  page_views pv JOIN users u ON pv.user_id = u.user_id
WHERE
  pv.hour_timestamp >= date_sub(current_date(), 7)
GROUP BY
  day, hour, u.user_location
ORDER BY views DESC
LIMIT 100

Assuming users and pages are completely available in OpenSearch storage in a reasonably large cluster, and page_views is a materialized view with most data in S3, I'd like to understand how we plan to make the joins work. Would Spark DataFrames be loaded with data fetched from the OpenSearch index and OpenSearch materialized views, and then processed within the Spark runtime? And do we intend to push down some of the compute to OpenSearch, since that could avoid a good amount of network transfer?
