Resolve kafka consumer dropped traces errors in Datadog #736

Open
4 of 6 tasks
robrap opened this issue Jul 26, 2024 · 12 comments
@robrap
Contributor

robrap commented Jul 26, 2024

I noticed that the following error in our logs seems to be hitting our Kafka consumers:
"failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries". It may also be hitting some other workers, but I'm not sure whether that just reflects inconsistent naming on our side.

I'm wondering if this has anything to do with the long-running infinite loop on the consumers, and if we need to clean up the trace, like we clean up the db connection, etc.?
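
To make the question above concrete, here is a minimal sketch of what "cleaning up the trace" per message could look like in a long-running consumer loop. Everything in it is an assumption for illustration: `RecordingTracer`, `consume`, and the span names are hypothetical stand-ins, not the actual consumer code or the Datadog client. Tracing libraries such as ddtrace offer a similar `tracer.trace(...)` context manager, but nothing in this issue confirms that unclosed traces are the cause of the dropped-trace errors.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)


class RecordingTracer:
    """Hypothetical stand-in for a real tracer; it only records span names."""

    def __init__(self):
        self.open_spans = []
        self.finished = []

    @contextmanager
    def trace(self, name):
        self.open_spans.append(name)
        try:
            yield name
        finally:
            # Always finish the span, even if the handler raised.
            self.open_spans.remove(name)
            self.finished.append(name)


def consume(messages, handler, tracer):
    """Process each message inside its own short-lived span.

    The idea is that no span stays open for the lifetime of the loop,
    in the same way a DB connection would be cleaned up per message.
    """
    for index, message in enumerate(messages):
        try:
            with tracer.trace(f"consume.message.{index}"):
                handler(message)
        except Exception:
            # One failing message should not stop the consumer loop.
            logger.exception("Failed to process message %r", message)
```

With this shape, every span is finished before the next message is read, so a single trace never grows for as long as the process runs.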

A/C:

  • Confirm with DD Support that this is actually important to fix and doesn't just represent an expected level of failed trace sends
    • They should also be able to give us some debugging pointers
  • Either document as wontfix, or fix it
    • Confirm fix (on or after Sept 13th)
    • Communicate fix for non-edxapp workers.
    • Confirm if edxapp workers get fixed in Nov 2024 (SRE believes this to be fixed)
    • Maybe update thread once edxapp workers have been fixed.
robrap changed the title from "Resolve consumer worker dropped traces errors in Datadog" to "Resolve kafka consumer dropped traces errors in Datadog" on Jul 26, 2024
robrap self-assigned this on Aug 12, 2024
@robrap
Contributor Author

robrap commented Aug 12, 2024

@robrap
Contributor Author

robrap commented Aug 13, 2024

After thinking about how to respond to the DD ticket, and looking at logs more closely, I decided to open an SRE ticket for further investigation: https://2u-internal.atlassian.net/browse/GSRE-1988.

@robrap
Contributor Author

robrap commented Aug 14, 2024

I tried looking in AWS a bit, but this really needs SRE support for now. Marking as blocked and I'll check in on the GSRE ticket in a few weeks.

@robrap
Contributor Author

robrap commented Aug 26, 2024

[update] Blocked on the GSRE ticket, which was picked up on Aug 22, but no comments have been added yet.

@robrap
Contributor Author

robrap commented Sep 6, 2024

I will confirm that this has actually gone away.

@robrap
Contributor Author

robrap commented Sep 6, 2024

  • The fix was deployed on Sept 5th @ 10:00am ET, and the GSRE ticket was closed.
  • I'd like to leave this blocked for at least a week (until Sept 13th), and then rerun the original log search to confirm that the issue has gone away. So far so good, but it has only been a day.
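
For the confirmation step, one option is to tally the dropped-trace error per day from exported log lines. The following is a rough sketch under stated assumptions: the function name `dropped_traces_by_day` is hypothetical, each log line is assumed to start with an ISO date (`YYYY-MM-DD`), and the message pattern is taken from the error quoted at the top of this issue. It is not the Datadog search that was actually used.

```python
import re
from collections import Counter

# Pattern based on the error text quoted in this issue.
DROPPED = re.compile(
    r"failed to send, dropping (?P<count>\d+) traces to intake at "
    r"(?P<intake>\S+) after (?P<retries>\d+) retries"
)
# Assumption: every exported log line begins with an ISO date.
DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def dropped_traces_by_day(lines):
    """Return a dict mapping each date to the number of traces dropped that day."""
    totals = Counter()
    for line in lines:
        match = DROPPED.search(line)
        date = DATE_PREFIX.match(line)
        if match and date:
            totals[date.group(1)] += int(match.group("count"))
    return dict(totals)
```

If the fix holds, dates after the deploy should simply be absent from the result.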

Todo:

  • Confirm fix.
  • Communicate, in case anyone was aware.

@robrap
Contributor Author

robrap commented Sep 25, 2024

@robrap
Contributor Author

robrap commented Sep 25, 2024

Slack announcement of initial fix.

@robrap
Contributor Author

robrap commented Oct 2, 2024

Note: The original GSRE comment from Sept 26 has not received a response yet. Posted a reminder today, Oct 2.

@robrap
Contributor Author

robrap commented Oct 4, 2024

SRE added a fix for edxapp workers on the morning of Oct 4. There were 2-week gaps between issues, so I'll leave this as blocked and wait until Nov to confirm.

@robrap
Contributor Author

robrap commented Oct 10, 2024

Another spike on Oct-10, but this was for the k8s servers.

  • Nov is still a good time to confirm the edxapp workers fix.
  • But, we need to check in on the GSRE ticket to see if there will be a new change to review.

@robrap
Contributor Author

robrap commented Oct 11, 2024

The Oct-10 spike was because the Datadog agent was restarted for other purposes (SRE working on log parsing issues).

Projects
Status: Blocked