SQS S3 backoff delays #3726
Replies: 4 comments 4 replies
-
considering that when …
-
@rhys-evans, presently the maximum back-off delay is hard-coded. This could be made into a configuration; would you like to create a GitHub issue requesting this? I think you also raise an interesting point that the back-off should vary between the SQS queue and the S3 bucket. This would require that Data Prepper pull the message from SQS, check the bucket name, and then apply a back-off for that bucket. This could introduce another problem, however: once we pull from SQS, the visibility timeout starts. It could expire, and then another node would take the same message. Using the visibility duplication protection (#2485) could help with this, but without it you could go back to a bad state.
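For illustration, this is roughly where such a setting could live in the s3 source's sqs configuration. The `maximum_backoff` key below is hypothetical (it does not exist today; the cap is hard-coded as noted above), and the pipeline name and queue URL are placeholders:

```yaml
# Sketch only: `maximum_backoff` is a proposed/hypothetical option, not a current one.
sqs-s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
        visibility_timeout: "30s"
        maximum_backoff: "1m"   # hypothetical: cap on the SQS read back-off delay
      aws:
        region: "us-east-1"
  sink:
    - stdout:
```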
-
@Omarimcblack, The …
-
So regarding the bucket snipping issue: this can be mitigated to some extent via the IAM role used for the ingestor, by only allowing it to connect to buckets within the relevant org(s). Also, if the SQS queue only allows publishing from specific accounts, we should never see those messages (the messages being the S3 event notifications)? 🤔
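As a sketch of the queue-policy half of that idea (CloudFormation-style YAML with placeholder account IDs, not from the thread), an SQS queue policy can restrict `SendMessage` to S3 event notifications originating from known accounts:

```yaml
# Sketch (placeholder values): only accept S3 event notifications from known accounts.
IngestQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - !Ref IngestQueue
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: s3.amazonaws.com
          Action: sqs:SendMessage
          Resource: !GetAtt IngestQueue.Arn
          Condition:
            StringEquals:
              aws:SourceAccount:
                - "111111111111"
                - "222222222222"
```

The ingestor's IAM role can similarly be scoped to `s3:GetObject` on only the expected bucket ARNs, which is the other half of the mitigation described above.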
-
Hi,
Is there any option to set the maximum backoff delay? Or any option to "quarantine" the source SQS message and move on?
Essentially we have a single Data Prepper cluster reading a single SQS queue, which is fed by multiple buckets (we don't manage the buckets). From time to time a bucket becomes misconfigured with respect to our access to it. If we receive the SQS message but get a 403 when accessing the bucket, we start to see delays in data ingestion, because Data Prepper backs off on reading from the SQS queue (which I think is due to #2574) for up to 5 minutes. We then get a backlog of messages until the source issue is resolved.
I am happy to be told our design is wrong and that we should have a queue per bucket, DLQs (which we have), etc.
But my question would then be: we would not want to run multiple Data Prepper clusters, so would the backoff only apply to the problem source SQS queue? I.e., we would need multiple "input" pipelines sinking to a central "output" pipeline (rough sketch below).
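Roughly this kind of layout is what I mean (made-up names and URLs, using the pipeline-to-pipeline connector; whether several input pipelines like `queue-a-pipeline` can all feed the same output pipeline is part of the question):

```yaml
# Rough sketch with made-up names: one "input" pipeline per SQS queue,
# handing events to a separate "output" pipeline via the pipeline connector.
queue-a-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.eu-west-1.amazonaws.com/111111111111/queue-a"
      aws:
        region: "eu-west-1"
  sink:
    - pipeline:
        name: "output-pipeline"

output-pipeline:
  source:
    pipeline:
      name: "queue-a-pipeline"
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        index: "ingested-logs"
```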
So how are others handling these types of failures?
Any help is appreciated.
Thanks