[Streaming Indexing] Introduce bulk Protobuf API streaming flavour #15447

Open
reta opened this issue Aug 27, 2024 · 1 comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@reta
Collaborator

reta commented Aug 27, 2024

Is your feature request related to a problem? Please describe

The bulk HTTP API does not support streaming (neither over HTTP/2 nor via chunked transfer).

Describe the solution you'd like
Introduce a streaming flavour of the bulk Protobuf API (please see #9070 (comment)), based on the new experimental transport (#9067).

Describe alternatives you've considered
N/A

Additional context
Please see #9067

Introduce efficient (binary?) format for streaming ingestion

An alternative option (to #9070) is to introduce a new efficient (binary?) format for streaming ingestion (for example, one based on Protocol Buffers).

Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are a collection of small pieces, where each small piece is structured data. - https://protobuf.dev/programming-guides/techniques/
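The technique on the page quoted above is to write the size of each message before the message itself, so a reader can tell where one message ends and the next begins. A minimal sketch of that framing follows, assuming a varint length prefix and treating each payload as opaque bytes. The function names and the `max_frame_bytes` guard are hypothetical and are not part of any proposed API.

```python
import io
from typing import BinaryIO, Iterator


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one frame: varint(len(payload)) followed by the payload."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_frames(stream: BinaryIO, max_frame_bytes: int = 1 << 20) -> Iterator[bytes]:
    """Yield payloads one at a time until the stream is exhausted."""
    while True:
        length, shift, first = 0, 0, True
        while True:
            b = stream.read(1)
            if not b:
                if first:
                    return  # clean end of stream between frames
                raise EOFError("stream ended inside a length prefix")
            first = False
            length |= (b[0] & 0x7F) << shift
            if not b[0] & 0x80:
                break
            shift += 7
            if shift > 63:
                raise ValueError("length prefix is too long")
        # Hypothetical guard: reject oversized frames before reading them.
        if length > max_frame_bytes:
            raise ValueError(f"frame of {length} bytes exceeds limit")
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("stream ended inside a frame")
        yield payload
```

In a real client each payload would be a serialized action message rather than raw bytes; the framing itself does not depend on what the payload contains.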

The example message schema may look like this:

syntax = "proto3";
import "google/protobuf/any.proto";

// Mirrors the bulk API "index" action.
message Index {
  optional string index = 1;
  optional string _id = 2;
  optional bool require_alias = 3;
  map<string, google.protobuf.Any> fields = 4;  // document fields
}

// Mirrors the bulk API "create" action.
message Create {
  optional string index = 1;
  optional string _id = 2;
  optional bool require_alias = 3;
  map<string, google.protobuf.Any> fields = 4;  // document fields
}

// Mirrors the bulk API "delete" action; _id is required.
message Delete {
  optional string index = 1;
  string _id = 2;
  optional bool require_alias = 3;
}

// Mirrors the bulk API "update" action; _id is required.
message Update {
  optional string index = 1;
  string _id = 2;
  optional bool require_alias = 3;
  optional google.protobuf.Any doc = 4;
}

// One bulk item: exactly one of the four actions.
message Action {
  oneof action {
    Index index = 1;
    Create create = 2;
    Delete delete = 3;
    Update update = 4;
  }
}

The schema actively relies on google.protobuf.Any to pass freestyle JSON-like structures around (for example, documents or scripts):

The Any message type lets you use messages as embedded types without having their .proto definition. An Any contains an arbitrary serialized message as bytes, along with a URL that acts as a globally unique identifier for and resolves to that message’s type. - https://protobuf.dev/programming-guides/proto3/#any
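To make the quoted description concrete, here is a rough sketch of what an Any carries on the wire: a type URL in field 1 and the serialized inner message as bytes in field 2. The encoding is hand-rolled purely for illustration, wrapping a `google.protobuf.StringValue`; the helper names are hypothetical, and real code would use a Protobuf runtime instead.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number: int, data: bytes) -> bytes:
    """Encode one length-delimited field (wire type 2): tag, length, bytes."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(data)) + data


def pack_string_as_any(text: str) -> bytes:
    """Hand-encode an Any that wraps a StringValue (illustration only)."""
    # google.protobuf.StringValue has a single field: string value = 1
    inner = length_delimited(1, text.encode("utf-8"))
    type_url = b"type.googleapis.com/google.protobuf.StringValue"
    # google.protobuf.Any has: string type_url = 1; bytes value = 2
    return length_delimited(1, type_url) + length_delimited(2, inner)


plain = length_delimited(1, "hello".encode("utf-8"))
wrapped = pack_string_as_any("hello")
print(len(plain), len(wrapped))
```

Comparing the two lengths shows that every wrapped value also carries its type URL, which may matter when weighing Any against other ways of representing document fields.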

Risks to consider:

  • dealing with very large messages (basically, documents)

Related component

Indexing

Describe alternatives you've considered

Stay on HTTP APIs only (#9070)

Additional context

Please see #9067

@reta reta added the enhancement Enhancement or improvement to existing feature or request label Aug 27, 2024
@reta reta self-assigned this Aug 27, 2024
@github-actions github-actions bot added Indexing Indexing, Bulk Indexing and anything related to indexing untriaged labels Aug 27, 2024
@reta reta removed the untriaged label Aug 27, 2024
@reta reta changed the title [Streaming Indexing] Introduce bulk Protofobuf API streaming flavor [Streaming Indexing] Introduce bulk Protofobuf API streaming flavour Aug 27, 2024
@reta reta added v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0 labels Aug 27, 2024
@msfroh
Collaborator

msfroh commented Aug 27, 2024

The schema actively relies on google.protobuf.Any to pass freestyle JSON-like structures around (for example, documents or scripts):

I've seen two other options used to pass around documents when using Protobuf in search use-cases:

  1. Fields are a list of key-value pairs.
    1. Keys are strings (since they're the field names).
    2. The values may either be strings (which get parsed to other primitive types based on mapping) or may be a union type to support passing numbers as numbers (which is still trickier than JSON, since you potentially need to support multiple number types). The union type could support lists, or you just represent lists as k-v pairs with a repeated key.
    3. You can support object nesting by either allowing a field value to be a Document (where Document is the type with the fields), or you could have a separate k-v list for nested documents. (To be fair, I think I've only seen the separate list when nested objects were added later.)
  2. A document is an opaque (to Protobuf) byte array, which would probably just be a JSON string encoded as UTF-8.
    1. Doing "JSON over Protobuf" probably loses a lot of the advantage of Protobuf, but it's very easy.

I'm still not sure what solution I'd like to see, but wanted to document those options.
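The two options above can be sketched in plain Python to show the shape of the data each one would carry. This is only an illustration under assumptions: `Document`, `to_key_values`, and `to_opaque_bytes` are hypothetical names, the union type is modelled with Python primitives, lists use the "repeated key" variant, and lists nested directly inside lists are not handled.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Document:
    """Option 1: a document is an ordered list of key-value pairs."""
    fields: List[Tuple[str, "Value"]] = field(default_factory=list)


# Union type for values: primitives, or a nested Document for objects.
Value = Union[str, int, float, bool, Document]


def to_key_values(obj: dict) -> Document:
    """Flatten a JSON-like dict into key-value pairs.

    Lists become repeated keys; nested dicts become nested Documents.
    """
    doc = Document()
    for key, value in obj.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                doc.fields.append((key, to_key_values(item)))
            else:
                doc.fields.append((key, item))
    return doc


def to_opaque_bytes(obj: dict) -> bytes:
    """Option 2: the document is an opaque UTF-8 encoded JSON string."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


source = {"title": "a", "tags": ["x", "y"], "author": {"name": "n"}}
as_pairs = to_key_values(source)
as_bytes = to_opaque_bytes(source)
```

With option 1 the transport sees the structure of every field, whereas with option 2 the receiver still has to parse JSON after decoding the Protobuf envelope.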
