Support start_time or range options for the first scan of scheduled s3 scan #4929

Merged

@graytaylor0 graytaylor0 commented Sep 9, 2024

Description

This change adds support for start_time and range in scheduled S3 scan, so that the first scan of the buckets filters objects by time. Previously, a scheduled scan always processed all objects on the first scan.

Tested with a pipeline to confirm that only objects within the given start_time and range are processed.
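A minimal sketch of the first-scan behavior described above, not the merged implementation. The class name `FirstScanTimeFilterSketch`, the helper `resolveEnd`, the inclusive window boundaries, and the reading of `range` as a duration added to `start_time` are all assumptions made for illustration; only the `isFirstScan` flag and the start/end parameters come from the diff.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical sketch: apply the time window only on the first scan.
class FirstScanTimeFilterSketch {

    // Assumption: when both start_time and range are set, the window ends at start + range.
    // Returns null (no upper bound) when either value is missing.
    static LocalDateTime resolveEnd(final LocalDateTime start, final Duration range) {
        return (start == null || range == null) ? null : start.plus(range);
    }

    // Returns true when the object should be processed.
    // On later scans the time window is ignored, so every object passes this check.
    static boolean isWithinWindow(final LocalDateTime lastModified,
                                  final LocalDateTime start,
                                  final LocalDateTime end,
                                  final boolean isFirstScan) {
        if (!isFirstScan) {
            return true;
        }
        if (start != null && lastModified.isBefore(start)) {
            return false;
        }
        if (end != null && lastModified.isAfter(end)) {
            return false;
        }
        return true;
    }
}
```

With a `start_time` of 2024-09-01T00:00 and a `range` of seven days, an object timestamped 2024-08-01 would be skipped on the first scan under these assumptions, and would pass this check on any later scan.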

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -124,7 +125,7 @@ private List<PartitionIdentifier> listFilteredS3ObjectsForBucket(final List<Stri
 .filter(keyTimestampPair -> !keyTimestampPair.left().endsWith("/"))
 .filter(keyTimestampPair -> excludeKeyPaths.stream()
 .noneMatch(excludeItem -> keyTimestampPair.left().endsWith(excludeItem)))
-.filter(keyTimestampPair -> isKeyMatchedBetweenTimeRange(keyTimestampPair.right(), startDateTime, endDateTime))
+.filter(keyTimestampPair -> isKeyMatchedBetweenTimeRange(keyTimestampPair.right(), startDateTime, endDateTime, isFirstScan))
Contributor

Not sure whether the S3 scan API itself can take this filter. If it can, that would filter out these records on the S3 side.

Member Author

Yeah, it would be nice to do this server side. ListObjectsV2 does not support it, though (https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html).
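Since the listing call cannot filter by time, the filtering has to happen client side after the keys come back. A rough sketch of that shape, mirroring the filter chain in the diff: the class name `ClientSideListingFilterSketch`, the use of `Map.Entry` for the key/timestamp pair, and the inclusive boundaries are assumptions for illustration, and the listing itself is represented by a plain in-memory list rather than a real S3 call.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: filter already-listed (key, timestamp) pairs on the client.
class ClientSideListingFilterSketch {

    static List<String> filterKeys(final List<Map.Entry<String, LocalDateTime>> listed,
                                   final List<String> excludeSuffixes,
                                   final LocalDateTime start,
                                   final LocalDateTime end) {
        return listed.stream()
                // Skip "folder" placeholder keys that end with a slash.
                .filter(entry -> !entry.getKey().endsWith("/"))
                // Skip keys that end with any excluded suffix.
                .filter(entry -> excludeSuffixes.stream()
                        .noneMatch(suffix -> entry.getKey().endsWith(suffix)))
                // Keep only timestamps inside the window; a null bound means unbounded.
                .filter(entry -> start == null || !entry.getValue().isBefore(start))
                .filter(entry -> end == null || !entry.getValue().isAfter(end))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Every listed key still crosses the network in this approach; the filter only reduces what gets turned into work afterwards.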

@graytaylor0 graytaylor0 merged commit 76d9640 into opensearch-project:main Sep 10, 2024
39 of 47 checks passed
@graytaylor0 graytaylor0 deleted the StartTimeRangeScheduledScan branch September 10, 2024 18:53
3 participants