[RFC] - OpenSearch Table #14524
Comments
Thanks for raising this @penghuo, looking forward to discussion on this. Removed untriaged label.
Some high level comments after a discussion with @penghuo:
Last year, @noCharger and I built a little prototype that avoided storing _source. I wonder if we could do something similar here, where a query against an OpenSearch index retrieves matching doc IDs, sorted or scored as appropriate, and then you use those doc IDs to fetch content from Parquet (or DynamoDB or Cassandra or whatever).
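A rough sketch of that pattern in Spark, assuming a hypothetical fetchMatchingDocIds helper for the OpenSearch query, plus placeholder Parquet path and doc_id column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch only: query OpenSearch for matching doc IDs, then fetch the document
// content from an external store (Parquet here) keyed by those IDs.
object DocIdJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("docid-join-sketch").getOrCreate()

    // Hypothetical helper: runs the OpenSearch query (e.g. via the REST client)
    // and returns matching _id values, sorted or scored as appropriate.
    val docIds: Seq[String] =
      fetchMatchingDocIds("my-index", """{"query":{"match":{"message":"error"}}}""")

    // Fetch the full content from Parquet (could equally be DynamoDB, Cassandra, ...).
    val content = spark.read.parquet("s3://my-bucket/my-table/")
      .filter(col("doc_id").isin(docIds: _*))

    content.show()
  }

  // Placeholder signature; a real implementation would call the OpenSearch _search API.
  def fetchMatchingDocIds(index: String, queryJson: String): Seq[String] = ???
}
```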
Not sure how Spark Dataset's API thinks about partitions, but a possible option may be to use a point-in-time (PIT) with slices. You could query multiple slices concurrently. Not sure how much that piece helps, though. Edit: Oh -- I see this is mentioned in opensearch-project/opensearch-spark#430 (assuming the partitioning there is the same slicing).
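For reference, a minimal sketch of querying one slice against an already-opened PIT through the OpenSearch search API (endpoint and PIT id are placeholders); each slice id could be fetched concurrently from its own task:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: query slice 0 of 4 within a point-in-time. Replace <pit-id>
// with the id returned by the create-PIT call.
object PitSliceSketch {
  def main(args: Array[String]): Unit = {
    val body =
      """{
        |  "slice": { "id": 0, "max": 4 },
        |  "pit":   { "id": "<pit-id>", "keep_alive": "1m" },
        |  "query": { "match_all": {} }
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:9200/_search"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```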
@amberzsy has done some really neat experiments using Protobuf serialization between client and coordinator. I would imagine that would help here too. (See opensearch-project/opensearch-clients#69.) Given that Spark mostly relies on streaming, would Spark Dataset be able to benefit from a client/server API that supports streaming, like gRPC? Would it help if OpenSearch could stream larger result sets from each request?
Is your feature request related to a problem? Please describe
1. Current status
Currently, users can use the Spark Dataset API to directly read and write OpenSearch indices, and the OpenSearch Spark extension internally leverages the Dataset API to access OpenSearch indices. However, we have observed several problems and requirements that motivate this proposal.
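For context, a sketch of what reading and writing an index with the Dataset API looks like today; the "opensearch" format name and opensearch.* options assume the opensearch-hadoop Spark connector, and the cluster address and index name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of today's experience with the opensearch-hadoop Spark connector.
object DatasetApiToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("opensearch-dataset-api").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", "hello"), ("2", "world")).toDF("id", "message")

    // Write a DataFrame to an OpenSearch index.
    df.write
      .format("opensearch")
      .option("opensearch.nodes", "localhost")
      .option("opensearch.port", "9200")
      .mode("append")
      .save("logs-demo")

    // Read the index back as a Dataset.
    val logs = spark.read
      .format("opensearch")
      .option("opensearch.nodes", "localhost")
      .load("logs-demo")
    logs.show()
  }
}
```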
Describe the solution you'd like
2. Vision of the future
Our goal is to enable users to utilize OpenSearch indices within popular query engines such as Spark. Spark users should be able to directly use OpenSearch clusters as catalogs and access OpenSearch indices as tables. We aim to enable Spark users to leverage OpenSearch's rich query and aggregation capabilities to efficiently query OpenSearch. Given OpenSearch's rich data type support, we plan to extend Spark's data type system and functions to incorporate more features from OpenSearch.
We also intend to formally define the OpenSearch Table specification, covering schema and data types, partitioning, and table metadata. Users should be able to define OpenSearch tables in the AWS Glue catalog and use Lake Formation to define ACLs on OpenSearch tables.
To improve performance, we will invest in more efficient data storage formats and data transmission protocols for OpenSearch. We are considering Apache Parquet as a storage format instead of _source (similar to proposal #13668) and Apache Arrow for zero-copy data transmission. To achieve cost savings, we aim to enable users to query OpenSearch cold indices and snapshots. This will allow them to eagerly move data from hot to cold storage without losing OpenSearch's key features.
In summary, the end-to-end user experience is:
2.1. Direct access
2.1.1. Configure Spark
By default, the OpenSearch domain is used as the catalog.
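A sketch of what that configuration might look like; the catalog implementation class and option keys below are hypothetical placeholders for the proposed integration, not an existing API:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: register an OpenSearch domain as a Spark catalog named "dev".
// The class name and opensearch.* options are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("opensearch-catalog-demo")
  .config("spark.sql.catalog.dev", "org.opensearch.spark.catalog.OpenSearchCatalog")
  .config("spark.sql.catalog.dev.opensearch.nodes", "https://my-domain.example.com")
  .config("spark.sql.catalog.dev.opensearch.auth", "sigv4")
  .getOrCreate()
```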
2.1.2. Query index as Table
Users can directly access an OpenSearch index without creating a table first.
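For example (a sketch; "dev" is the catalog configured above, and default.http_logs is a hypothetical existing index exposed as a table):

```scala
// Sketch only: query an existing OpenSearch index as a table, no DDL required.
spark.sql(
  """SELECT clientip, status, count(*) AS hits
    |FROM dev.default.http_logs
    |WHERE status >= 500
    |GROUP BY clientip, status
    |ORDER BY hits DESC
    |LIMIT 10""".stripMargin
).show()
```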
2.2. Create Table
2.2.1. Configure Spark and Create Table (Spark)
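Reusing the catalog configuration from 2.1.1, table creation might look like the sketch below; the USING clause and table properties are illustrative assumptions, pending the OpenSearch Table specification:

```scala
// Sketch only: hypothetical DDL for an OpenSearch-backed table.
spark.sql(
  """CREATE TABLE dev.default.tbl00001 (
    |  id         STRING,
    |  message    STRING,
    |  created_at TIMESTAMP
    |) USING opensearch
    |TBLPROPERTIES ('number_of_shards' = '1')""".stripMargin
)
```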
The table uses dev.default.tbl00001.metadata to store metadata and tbl00001 to store data.

2.2.2. Writes (Spark)
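A sketch of writing to the table created above, via SQL or the DataFrameWriterV2 API (table and column names follow the example in 2.2.1):

```scala
import org.apache.spark.sql.functions.current_timestamp

// Sketch only: SQL insert into the OpenSearch-backed table.
spark.sql(
  "INSERT INTO dev.default.tbl00001 VALUES ('1', 'hello', timestamp'2024-01-01 00:00:00')")

// Equivalent DataFrame append using the DataSourceV2 writeTo API.
import spark.implicits._
Seq(("2", "world")).toDF("id", "message")
  .withColumn("created_at", current_timestamp())
  .writeTo("dev.default.tbl00001")
  .append()
```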
2.2.3. Query (Spark)
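And a sketch of querying the table back; ideally the planner would push supported filters and aggregations down to OpenSearch:

```scala
// Sketch only: read from the OpenSearch-backed table like any other Spark table.
spark.sql(
  """SELECT id, message
    |FROM dev.default.tbl00001
    |WHERE message LIKE 'hello%'
    |ORDER BY created_at DESC""".stripMargin
).show()
```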
Related component
Search:Query Capabilities
Describe alternatives you've considered
n/a
Additional context
3. Next Steps
We will incorporate the feedback from this RFC into a more detailed proposal and high-level design that integrates the storage-related efforts in OpenSearch. We will create meta-issues to delve deeper into the components involved and continue with the detailed design.
4. How Can You Help?
Any general comments about the overall direction are welcome. Some specific questions: