[RFC] OpenSearch and Apache Spark Integration #4
Comments
This enables Spark as a compute connector to OpenSearch data, correct?
|
Yes, but it is not limited to OpenSearch data.
It is one option, but I feel we should make it more generic. ML could also leverage Spark as a computation engine.
|
Will the OpenSearch SQL engine be responsible for analyzing the query and dispatching all queries to the MPP engine? |
Just some thoughts for discussion and PoC later: we need to verify and confirm the role of Spark RDD (with/without Spark SQL) in OpenSearch:
|
As discussed, Spark SQL and RDD are only for purpose #1 above. Leveraging them to query an OpenSearch index is a totally different story and not our current goal. So in this case, the question for introducing Spark SQL is whether we need it to optimize and plan the Spark RDD jobs that query the object store. Implementation options:
Research items:
|
os-sql-viz-final.mov |
OS-SQL-SPARK.mp4 |
Amazing stuff! How will you support filtering (e.g. timestamp ranges and/or keywords) in relation to the S3 path schema? For example, if we use fluentbit's S3 output with s3_key_format /$TAG[2]/$TAG[0]/%Y/%m/%d/%H_%M_%S/$UUID.gz, how will we map a keyword so that only objects with the tags in a supplied filter and within the desired time range are pulled? |
@ryn9 Similar to the optimizations in other query engines, we can leverage partition pruning and data skipping on your data (path or content). Please see the general example of data skipping in opensearch-project/sql#1379 (comment). We may look into FluentBit later. Thanks! |
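To make the path-based pruning idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the function names `parse_key` and `prune_keys` are hypothetical, and it assumes every object key follows the layout from the question above, /$TAG[2]/$TAG[0]/%Y/%m/%d/%H_%M_%S/$UUID.gz. Keys whose tag or timestamp falls outside the filter are dropped before any object is read.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Set, Tuple


def parse_key(key: str) -> Optional[Tuple[str, str, datetime]]:
    """Split a key laid out as /<tag2>/<tag0>/YYYY/mm/dd/HH_MM_SS/<uuid>.gz.

    Returns (tag2, tag0, timestamp), or None if the key does not match
    the assumed layout.
    """
    parts = key.strip("/").split("/")
    if len(parts) != 7:
        return None
    tag2, tag0, year, month, day, hms, _object_name = parts
    try:
        ts = datetime.strptime(f"{year}/{month}/{day}/{hms}", "%Y/%m/%d/%H_%M_%S")
    except ValueError:
        return None
    return tag2, tag0, ts


def prune_keys(
    keys: Iterable[str],
    start: datetime,
    end: datetime,
    tags: Optional[Set[str]] = None,
) -> List[str]:
    """Keep only keys whose first tag is in `tags` and whose path
    timestamp lies in the half-open range [start, end)."""
    selected = []
    for key in keys:
        parsed = parse_key(key)
        if parsed is None:
            continue  # skip keys that do not follow the assumed layout
        tag2, _tag0, ts = parsed
        if tags is not None and tag2 not in tags:
            continue  # pruned by tag
        if start <= ts < end:
            selected.append(key)  # survives both tag and time pruning
    return selected
```

In a real engine the same decision would more likely be made per path prefix (for example, listing only /app/kube/2023/03/01/) rather than per key, so that pruned partitions are never listed at all.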
Decorators will be available in FluentBit, Data Prepper, and otel-exporter
Great initiative. I really like the price-performance trade-off that this solution will bring. A few questions below:
a. What types of queries don't work with SQL today?
|
@muralikpbhat Thanks for all the comments! Please find my answers inline below.
In our demo, we use Spark SQL mostly for building the skipping index and MV into an OpenSearch index. In the end, all queries and dashboards work with the index as before. As you ask below, I assume we're talking about Spark SQL queries with an OS index involved; if so, there are limitations:
OpenSearch functions, including full-text search and aggregation: this may be solved by either improving OS-Hadoop or introducing our OS SQL plugin into Spark.
I think we need separate AuthN/Z for raw data on S3. If you're talking about an OS index, the query sent to OS is still DSL, which may work. We need a deep dive.
I think all OS features can work with MV. But for raw data, I'm not sure. We need to understand the use case and workflow.
Yes, we're considering MV as a second-level, on-demand acceleration strategy. We will provide a standard SQL API for higher-level applications to use, such as SHOW/DROP MV.
As shown in the demo above, the sink (destination) of the streaming job behind the MV is a regular OpenSearch index. I think we can make it any OpenSearch object as long as the OpenSearch-Hadoop connector can support it.
Yes, that's what we're doing in the demo. We intentionally ignore strong consistency between the MV and the source.
Yes, because an MV is itself a table too. Users can use it in any query with raw data. We didn't do this in the demo because OS-Hadoop currently doesn't extend the Spark catalog, so effort is required to register an MV or any OS index in the Spark catalog. Meanwhile, I'm not sure what specific use case or query you're referring to. Actually, we are also considering this and may need it in the future for a Hybrid Scan capability. Hybrid scan will union the MV data and the latest raw data. This will be helpful for customers who want strong consistency.
I'm not quite sure what the query looks like. I think it's possible as long as there is a primary key field in the MV correlated to a row in the raw data. |
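The hybrid scan idea mentioned above (union the MV data with the latest raw data) can be sketched as follows. This is only an illustration under stated assumptions: the function name `hybrid_scan`, the `ts` field, and the notion of a refresh watermark are hypothetical and not taken from the RFC, and the MV is assumed to be a row-for-row copy of the source up to the watermark rather than an aggregate.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def hybrid_scan(mv_rows: Iterable[Row], raw_rows: Iterable[Row], mv_watermark: int) -> List[Row]:
    """Union MV rows with raw rows that arrived after the MV's last refresh.

    `mv_watermark` is the (assumed) timestamp up to which the MV already
    reflects the source; only raw rows newer than it are read.
    """
    # Raw rows at or before the watermark are already covered by the MV,
    # so they are skipped to avoid duplicates.
    fresh_rows = [row for row in raw_rows if row["ts"] > mv_watermark]
    return list(mv_rows) + fresh_rows
```

If the MV were an aggregate instead, the fresh raw rows would need to be aggregated the same way and merged with the MV result rather than simply appended.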
Would joins involve pulling data to RDDs? |
Could you elaborate more? Do you mean joining an OpenSearch index and S3? |
An example would help explain this better. Consider the following datasets,
If I have to prepare a report every day to summarize the page view pattern over the last 7 days - top 100 pages and top 100 locations - with the following result schemas,
Assuming |
Introduction
We received a feature request for query execution on object stores in OpenSearch.
We have investigated the possibility of building a new solution for OpenSearch use that leverages an object store as storage. This includes
We found the challenges are
We found that this work has already been solved by general-purpose data preprocessing systems, e.g. Presto, Spark, Trino, and that building such a platform requires years to mature.
Idea
The initial idea is
High level diagram:
User Experience
Epic