
Increased lambda duration during March 8th incident #131

Open
santiagoaguiar opened this issue Mar 8, 2023 · 2 comments
Labels: enhancement New feature or request

Comments

santiagoaguiar commented Mar 8, 2023

During the March 8th, 2023 incident, our Lambdas' average execution duration increased very significantly (about 2x-3x), causing an increase in concurrency and additional load across the board. These Lambdas are executed thousands of times per minute and normally take ~90ms on average to complete. We ended up disabling the Lambda extension to restore our normal durations.

Looking at https://github.com/DataDog/datadog-lambda-extension#overhead, it seems we shouldn't have seen this. My interpretation is that roughly one invocation per minute would be expected to show a larger-than-normal duration while it flushes the buffered metrics/spans, but most Lambda invocations should have kept running at their usual speed.
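
In case it is relevant: my (possibly wrong) reading of the extension docs is that the flush cadence can be tuned through the DD_SERVERLESS_FLUSH_STRATEGY environment variable on the function, something along these lines (variable name and value format are my understanding of the docs, worth double-checking):

  environment:
    # ask the extension to flush buffered metrics/spans roughly once a minute
    # instead of at the end of every invocation
    DD_SERVERLESS_FLUSH_STRATEGY: "periodically,60000"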

I wanted to check that interpretation, see whether there were any other reasons that could have caused such an increase in average duration, and ask whether anything could be done to prevent this in the future, in light of today's incident.

This is our current configuration for the extension:

  enableDDTracing: false
  # logs are forwarded from CW to DD
  enableDDLogs: false
  subscribeToAccessLogs: false
  # as tracing is disabled, do not add DD context to logs
  injectLogContext: false
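
For context, these are serverless-plugin-datadog options, so in a serverless.yml they sit under the plugin's custom block, roughly like this (API key / site settings omitted):

  custom:
    datadog:
      enableDDTracing: false
      # logs are forwarded from CW to DD
      enableDDLogs: false
      subscribeToAccessLogs: false
      # as tracing is disabled, do not add DD context to logs
      injectLogContext: false
      # setting addExtension: false is how the extension layer can be pulled
      # back out entirely (assuming I have that option name right)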

Thank you, and #hugops, as I bet this was a hard one!

tianchu self-assigned this Mar 14, 2023
tianchu commented Mar 14, 2023

@santiagoaguiar Thanks for reporting what you experienced during the incident! That is extremely valuable for us, as we are still actively investigating the exact impact on our serverless customers. Do you mind following up in another week or so? I believe we will have something concrete to share by then.

tianchu commented Mar 30, 2023

@santiagoaguiar We were able to identify a few places in the Datadog Agent where the existing retry and buffer logic was not optimized for serverless. We are looking into potential improvements in Q2.

tianchu added the enhancement (New feature or request) label Apr 11, 2023