Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Avoid missing conditional variable update and simplify atomic bool #2553

Merged
merged 15 commits into from
May 8, 2024

Conversation

arekay
Copy link
Contributor

@arekay arekay commented Feb 23, 2024

Addresses two issues -

  1. Fix the use of a conditional variable where a wait on the variable might not be in flight when a notify is called. This is fixed by ensuring that an associated lock is aquired before calling the notify.
  2. Instead of relying on a lock an a boolean, replace the use with a single atomic boolean.

Fixes # (issue)

Changes

  • Adds a scoped lock around notify call to conditional var in periodic exporter
  • Replaces a spinlock+bool with an atomic bool in metrics reader

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

Addresses two issues -
1. Fix the use of a conditional variable where a wait on the variable might not be in flight when a notify is called. This is fixed by ensuring that an associated lock is aquired before calling the notify.
2. Instead of relying on a lock an a boolean, replace the use wit a single atomic boolean.
@arekay arekay requested a review from a team February 23, 2024 23:00
Copy link

linux-foundation-easycla bot commented Feb 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@marcalff marcalff added the pr:waiting-on-cla Waiting on CLA label Feb 23, 2024
@marcalff
Copy link
Member

Thanks for the PR.

Not a full review, but this looks good based on a first read.

Please sign the EasyCLA to pass the CI checks.

@marcalff marcalff removed the pr:waiting-on-cla Waiting on CLA label Mar 5, 2024
@arekay arekay requested a review from owent March 5, 2024 12:12
@ThomsonTan
Copy link
Contributor

Is it possible to add a test case to reproduce the issue and verify the fix?

@owent
Copy link
Member

owent commented Mar 29, 2024

Is it possible to add a test case to reproduce the issue and verify the fix?

Good point. But this problem is hard to reproduce. I have no idea about how to add test case by now. Maybe someone else can help?

Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix.

@marcalff
Copy link
Member

marcalff commented Apr 2, 2024

Is it possible to add a test case to reproduce the issue and verify the fix?

Good point. But this problem is hard to reproduce. I have no idea about how to add test case by now. Maybe someone else can help?

It is very unlikely a test case can be written for this.

From other projects using mutex and condition variables, testing requires some serious code injection in debug, to control exactly how threads executes, to block a thread at a given point. opentelemetry-cpp does not have the tooling for this.

@marcalff
Copy link
Member

marcalff commented Apr 5, 2024

LGTM and approved, waiting for @owent to confirm since he had comments earlier.

@owent
Copy link
Member

owent commented Apr 6, 2024

LGTM and approved, waiting for @owent to confirm since he had comments earlier.

Sorry, I'm on holiday these days.

#2584 may also fixes the problems this PR try to slove, and it also solve other deadlock problems in traces and metrics. And it has conflicts with this PR. Do you think we can just use #2584 and drop this one?

@arekay
Copy link
Contributor Author

arekay commented Apr 9, 2024

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

@owent
Copy link
Member

owent commented Apr 10, 2024

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

@arekay
Copy link
Contributor Author

arekay commented Apr 18, 2024

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

@arekay
Copy link
Contributor Author

arekay commented Apr 30, 2024

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

@owent Just checking if there is any update on this. Thanks.

@owent
Copy link
Member

owent commented Apr 30, 2024

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

@owent Just checking if there is any update on this. Thanks.

@marcalff could you please check the lock problem again when you have time, or could #2584 be merged now?

@marcalff marcalff changed the title Avoid missing conditional variable update and simplify atomic bool [SDK] Avoid missing conditional variable update and simplify atomic bool May 8, 2024
Copy link

codecov bot commented May 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.57%. Comparing base (497eaf4) to head (808c3c4).
Report is 58 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2553      +/-   ##
==========================================
+ Coverage   87.12%   87.57%   +0.46%     
==========================================
  Files         200      188      -12     
  Lines        6109     5848     -261     
==========================================
- Hits         5322     5121     -201     
+ Misses        787      727      -60     
Files Coverage Δ
.../include/opentelemetry/sdk/metrics/metric_reader.h 50.00% <ø> (ø)
...metrics/export/periodic_exporting_metric_reader.cc 79.79% <100.00%> (+0.44%) ⬆️
sdk/src/metrics/metric_reader.cc 77.42% <100.00%> (-1.36%) ⬇️

... and 55 files with indirect coverage changes

@marcalff
Copy link
Member

marcalff commented May 8, 2024

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

I looked again in details about the last remaining concern from @owent about a possible deadlock.

After code analysis, I don't think the deadlock is possible, as the code uses a well known pattern of locking a mutex before signalling a condition variable, and the condition wait_for() internally releases the mutex when the wait starts, to re acquire the lock once the wait ends.

Now merging.

@arekay

Thanks for the fix, and also for your patience, sorry this review took such a long time.

@marcalff marcalff merged commit 4f32bc6 into open-telemetry:main May 8, 2024
49 checks passed
@arekay
Copy link
Contributor Author

arekay commented May 8, 2024

Thanks for the fix, and also for your patience, sorry this review took such a long time.

@marcalff @owent Thanks for all the feedback and diligence - and of course, thanks for all the work you folks put in to maintaining something all of us benefit tremendously from!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants