SQS S3 backoff delays #3726
Replies: 4 comments 4 replies
-
considering that when …
-
@rhys-evans, presently the maximum back-off delay is hard-coded. This could be made into a configuration; would you like to create a GitHub issue requesting this? I think you also raise an interesting point that the back-off should vary between the SQS queue and the S3 bucket. This would require that Data Prepper pull the message from SQS, check the bucket name, and then apply a back-off for that bucket. This could introduce another problem, however: once we pull from SQS, the visibility timeout starts. It could expire, and then another node would take the same message. Using the visibility duplication protection (#2485) could help with this, but without it you could go back to a bad state.
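For illustration, this is roughly where such a setting could live in the s3 source's sqs configuration. The `maximum_backoff` key below is hypothetical (it does not exist today; the cap is hard-coded as noted above), and the pipeline name and queue URL are placeholders:

```yaml
# Sketch only: `maximum_backoff` is a proposed/hypothetical option, not a current one.
sqs-s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
        visibility_timeout: "30s"
        maximum_backoff: "1m"   # hypothetical: cap on the SQS read back-off delay
      aws:
        region: "us-east-1"
  sink:
    - stdout:
```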
-
@Omarimcblack, The …
-
So regarding the bucket snipping issue: this can be mitigated to some extent via the IAM role used for the ingestor, by only allowing it to connect to buckets within the relevant org(s). Also, if the SQS queue only allows publishing from specific accounts, we should never see those messages (the messages being the S3 event notifications)? 🤔
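As a sketch of the queue-policy half of that idea (CloudFormation-style YAML with placeholder account IDs, not from the thread), an SQS queue policy can restrict `SendMessage` to S3 event notifications originating from known accounts:

```yaml
# Sketch (placeholder values): only accept S3 event notifications from known accounts.
IngestQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - !Ref IngestQueue
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: s3.amazonaws.com
          Action: sqs:SendMessage
          Resource: !GetAtt IngestQueue.Arn
          Condition:
            StringEquals:
              aws:SourceAccount:
                - "111111111111"
                - "222222222222"
```

The ingestor's IAM role can similarly be scoped to `s3:GetObject` on only the expected bucket ARNs, which is the other half of the mitigation described above.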
-
Hi,
Is there any option to set the maximum backoff delay? Or any option to "quarantine" the source SQS message and move on?
Essentially we have a single Data Prepper cluster reading a single SQS queue, which is fed by multiple buckets (we don't manage the buckets). From time to time a bucket becomes misconfigured with respect to our access to it. If we receive the SQS message but get a 403 when accessing the bucket, we start to see delays in data ingestion, because Data Prepper backs off on reading from the SQS queue (which I think is due to #2574) for up to 5 minutes. We then get a backlog of messages until the source issue is resolved.
I am happy to be told our design is wrong and that we should have a queue per bucket, DLQs (which we have), etc.
But my question would then be: we would not want to run multiple Data Prepper clusters, so would the backoff only apply to the problem source SQS queue? I.e., we would need multiple "input" pipelines sinking to a central "output" pipeline (rough sketch below).
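Roughly this kind of layout is what I mean (made-up names and URLs, using the pipeline-to-pipeline connector; whether several input pipelines like `queue-a-pipeline` can all feed the same output pipeline is part of the question):

```yaml
# Rough sketch with made-up names: one "input" pipeline per SQS queue,
# handing events to a separate "output" pipeline via the pipeline connector.
queue-a-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.eu-west-1.amazonaws.com/111111111111/queue-a"
      aws:
        region: "eu-west-1"
  sink:
    - pipeline:
        name: "output-pipeline"

output-pipeline:
  source:
    pipeline:
      name: "queue-a-pipeline"
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        index: "ingested-logs"
```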
So how are others handling these types of failures?
Any help is appreciated.
Thanks