Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2910] Fix data corruption due to insufficient shim context locking #924

Closed
wants to merge 1 commit into from

Conversation

craigcondit
Copy link
Contributor

What is this PR for?

Restore context locking that was removed as part of YUNIKORN-2629. The locks are necessary to prevent logical data corruption due to concurrent processing of both pod and node events.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2910

How should this be tested?

Ensure existing tests still pass and new deadlocks are not introduced.

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@craigcondit craigcondit self-assigned this Oct 9, 2024
Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 88.13559% with 7 lines in your changes missing coverage. Please review.

Project coverage is 68.69%. Comparing base (ac4f141) to head (b7c83c0).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
pkg/cache/context.go 88.13% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #924      +/-   ##
==========================================
+ Coverage   68.21%   68.69%   +0.47%     
==========================================
  Files          70       70              
  Lines        7621     7752     +131     
==========================================
+ Hits         5199     5325     +126     
- Misses       2213     2217       +4     
- Partials      209      210       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…locking

Restore context locking that was removed as part of YUNIKORN-2629. The
locks are necessary to prevent logical data corruption due to concurrent
processing of both pod and node events.
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a good way to mock or reproduce those data race/corruption issues?

It will be very helpful if we can do it easily, for both e2e testing or unit test.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pbacsko
Copy link
Contributor

pbacsko commented Oct 10, 2024

Do we have a good way to mock or reproduce those data race/corruption issues?

It will be very helpful if we can do it easily, for both e2e testing or unit test.

I don't think so. It's very difficult to repro things like that. Sometimes it's totally workload dependent.

@pbacsko pbacsko closed this Oct 10, 2024
@pbacsko pbacsko reopened this Oct 10, 2024
@zhuqi-lucas
Copy link
Contributor

Do we have a good way to mock or reproduce those data race/corruption issues?
It will be very helpful if we can do it easily, for both e2e testing or unit test.

I don't think so. It's very difficult to repro things like that. Sometimes it's totally workload dependent.

Ok, i see.

craigcondit added a commit that referenced this pull request Oct 10, 2024
…locking (#924)

Restore context locking that was removed as part of YUNIKORN-2629. The
locks are necessary to prevent logical data corruption due to concurrent
processing of both pod and node events.

Closes: #924
@shravan-achar
Copy link

I am working on making issues more reproducible. Will share updates more widely.

@craigcondit craigcondit deleted the YUNIKORN-2910 branch October 10, 2024 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants