Feature Request: Add Batch Ingestion Endpoint for OpenLineage Events #2918

Open
algorithmy1 opened this issue Oct 9, 2024 · 1 comment

Contributor

algorithmy1 commented Oct 9, 2024

Currently, the Marquez API for OpenLineage events (/api/v1/lineage) accepts one event per request, as seen in OpenLineageResource.java#L67. This works well for real-time ingestion, but it becomes inefficient when many events need to be ingested at once.
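
For context, here is a minimal sketch of what one-event-per-request looks like from a client, using the JDK's `java.net.http` API. The base URL `http://localhost:5000`, the class name `SingleEventPost`, and the abbreviated event payload are assumptions for illustration only; the request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SingleEventPost {
    /** Builds (but does not send) a POST carrying exactly one OpenLineage event. */
    static HttpRequest buildRequest(String baseUrl, String eventJson) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/lineage"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(eventJson))
                .build();
    }

    public static void main(String[] args) {
        // Abbreviated placeholder event; a real OpenLineage event carries more fields.
        String event = "{\"eventType\":\"COMPLETE\","
                + "\"eventTime\":\"2024-10-09T00:00:00Z\","
                + "\"run\":{\"runId\":\"00000000-0000-0000-0000-000000000000\"},"
                + "\"job\":{\"namespace\":\"example\",\"name\":\"example_job\"},"
                + "\"producer\":\"https://example.com/producer\"}";

        // Assumed base URL; adjust to wherever the Marquez API is reachable.
        HttpRequest request = buildRequest("http://localhost:5000", event);
        System.out.println(request.method() + " " + request.uri());
        // Actually sending it would need HttpClient.send(...) and a running Marquez instance.
    }
}
```

Re-ingesting N events this way means building and sending N such requests.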

Use Cases:

  • Database Migration or Restoration: When changing the database or restoring from backups, we may need to re-ingest a large number of events to rebuild the lineage graph.
  • Bulk Event Replay: In scenarios like system recovery or batch processing, ingesting events one by one is not practical.
  • Performance Optimization: Sending many events per HTTP request reduces per-request overhead, which can significantly improve ingestion performance.
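
To make the request-count argument concrete, here is a rough sketch of how a replay client might group stored events into fixed-size batches. The class name `ReplayBatcher`, the batch size of 1,000, and the placeholder event strings are all hypothetical; nothing here is part of Marquez.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplayBatcher {
    /** Splits events into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> chunk(List<T> events, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(events.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Placeholder payloads standing in for events read from a backup.
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            events.add("{\"eventType\":\"COMPLETE\"}");
        }
        System.out.println("one event per request: " + events.size() + " requests");
        System.out.println("batched: " + chunk(events, 1000).size() + " requests");
    }
}
```

With 2,500 events and a batch size of 1,000, the replay needs 3 requests instead of 2,500. The right batch size would depend on payload size and server limits, which this sketch does not model.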

Proposal:

  • New Endpoint: Introduce a batch ingestion endpoint (e.g., /api/v1/lineage/batch) that accepts an array of OpenLineage events.
  • Batch Processing: Update the OpenLineageResource class to handle a list of events in a single request.
  • Response Format: Provide a response that indicates the success or failure of each event within the batch.

(Alternatively, the existing /api/v1/lineage endpoint could be updated to accept either a single event or an array of events.)
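
As a rough illustration of the per-event response idea, the sketch below processes a list of events and records an outcome for each one, so a single bad event does not fail the whole batch. It is plain Java with no Marquez or JAX-RS dependencies; `BatchIngestSketch`, `EventResult`, and `ingestAll` are hypothetical names, and the `ingestOne` callback stands in for whatever the existing single-event path does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchIngestSketch {
    /** Outcome for one event in the batch (hypothetical response shape). */
    static final class EventResult {
        final int index;        // position of the event in the submitted array
        final boolean accepted; // true if the event was ingested
        final String error;     // failure message, or null on success

        EventResult(int index, boolean accepted, String error) {
            this.index = index;
            this.accepted = accepted;
            this.error = error;
        }

        @Override
        public String toString() {
            return index + ": " + (accepted ? "accepted" : "rejected (" + error + ")");
        }
    }

    /** Applies ingestOne to every event and collects one result per event. */
    static <E> List<EventResult> ingestAll(List<E> events, Consumer<E> ingestOne) {
        List<EventResult> results = new ArrayList<>();
        for (int i = 0; i < events.size(); i++) {
            try {
                ingestOne.accept(events.get(i));
                results.add(new EventResult(i, true, null));
            } catch (RuntimeException e) {
                // Record the failure and keep going with the rest of the batch.
                results.add(new EventResult(i, false, e.getMessage()));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> events = List.of("{\"eventType\":\"START\"}", "", "{\"eventType\":\"COMPLETE\"}");
        // Stand-in validation: reject blank payloads.
        List<EventResult> results = ingestAll(events, json -> {
            if (json.isBlank()) {
                throw new IllegalArgumentException("empty event");
            }
        });
        results.forEach(System.out::println);
    }
}
```

Whether a batch should be processed event by event like this or as one all-or-nothing transaction is a design choice the proposal leaves open; this sketch only shows the partial-success variant.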

Benefits:

  • Efficiency: Streamlines the ingestion process for multiple events.
  • Scalability: Enhances Marquez's ability to handle large-scale data operations.
  • User Convenience: Simplifies workflows that require bulk event ingestion.
@wslulciuc wslulciuc added this to the 0.51.0 milestone Oct 23, 2024
Member

wslulciuc commented Oct 23, 2024

Thanks for the suggestion, @algorithmy1! We couldn't agree more on the benefits you outlined. The good news is that we've been prototyping such an endpoint for OpenLineage batch events; see v2.LineageResource.collectBatchOf(BatchOfEvents). The endpoint will be available in Marquez 0.51.0.
