Test injection_queue_depth_multi_thread is flaky #6847

Closed
Darksonn opened this issue Sep 14, 2024 · 12 comments · Fixed by #6916

Comments

@Darksonn (Contributor)

This test has been observed to fail in CI:

#[test]
fn injection_queue_depth_multi_thread() {
    let rt = threaded();
    let metrics = rt.metrics();

    let barrier1 = Arc::new(Barrier::new(3));
    let barrier2 = Arc::new(Barrier::new(3));

    // Spawn a task per runtime worker to block it.
    for _ in 0..2 {
        let barrier1 = barrier1.clone();
        let barrier2 = barrier2.clone();
        rt.spawn(async move {
            barrier1.wait();
            barrier2.wait();
        });
    }

    barrier1.wait();

    for i in 0..10 {
        assert_eq!(i, metrics.injection_queue_depth());
        rt.spawn(async {});
    }

    barrier2.wait();
}

To close this issue, figure out why it is failing and fix it.

Darksonn added the E-help-wanted (Call for participation: Help is requested to fix this issue), A-tokio (Area: The main tokio crate), E-medium (Call for participation: Experience needed to fix: Medium / intermediate), and M-metrics (Module: tokio/runtime/metrics) labels on Sep 14, 2024
Darksonn changed the title from injection_queue_depth_multi_thread to Test injection_queue_depth_multi_thread is flaky on Sep 14, 2024
@jofas (Contributor) commented Sep 15, 2024

I presume it's these two recent jobs that timed out that have been affected by the flakiness of injection_queue_depth_multi_thread? I'd assume so, because if this assert

assert_eq!(i, metrics.injection_queue_depth());

fails, the main thread panics and we never get to synchronise on barrier2

barrier2.wait();

here, causing the test to run forever (or until CI times out).

@Darksonn (Contributor, Author)

Yes. Good point with the assert. That explains why it times out instead of failing normally.

@jofas (Contributor) commented Sep 16, 2024

I wonder if a first step¹ would be to convert the assert into an if-statement where we'd panic only after calling barrier2.wait() (a sketch of this follows the footnote). Then we'd get better diagnostics, but more importantly, wouldn't time out CI any more.

Footnotes

  1. In case this isn't a quick fix
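
For illustration, a minimal sketch of that idea (hypothetical code, not necessarily what the eventual PR does): record the first mismatch instead of asserting, release the blocked workers via barrier2, and only then fail.

let mut mismatch: Option<(usize, usize)> = None;

for i in 0..10 {
    // Record the first mismatch instead of panicking immediately, so the
    // blocked worker tasks are still released via barrier2 below.
    let depth = metrics.injection_queue_depth();
    if depth != i && mismatch.is_none() {
        mismatch = Some((i, depth));
    }
    rt.spawn(async {});
}

barrier2.wait();

// Only fail after the barrier, so a mismatch can't hang CI.
if let Some((expected, actual)) = mismatch {
    panic!("expected injection_queue_depth {expected}, got {actual}");
}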

@Darksonn (Contributor, Author) commented Sep 23, 2024

Making it fail instead of hanging would be nice. I submitted a PR for that. (But the flakiness is not fixed by this.)

@jofas (Contributor) commented Sep 24, 2024

Interestingly enough, after I pulled your changes, the test still stalls instead of failing. It seems I was wrong about the assert causing a panic that keeps us from ever reaching the second barrier.

At least I built this highly professional bash script—which I'm going to share here in case someone else wants to use it—that finally allowed me to trigger the flakiness on my local x86_64 Linux system:

run_test() {
  # Run only the flaky test, with output capturing disabled so debug
  # prints show up.
  cargo test \
    --all-features \
    --test rt_metrics \
    injection_queue_depth_multi_thread \
    -- --nocapture
}

# Repeat until the test fails (or hangs), counting successful runs.
i=0
while run_test; do
  let i++
  echo -e "$i: \e[32m☑\e[0m"
done

Now that I can litter my fork with debug print statements, I'll investigate some more 🙂

@jofas (Contributor) commented Sep 24, 2024

My preliminary findings after adding debug prints before and after the line¹ are that we deadlock on barrier1.wait() here:

barrier1.wait();

Footnotes

  1. Best way to debug control flow
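
For context, the instrumentation is nothing more than stderr prints around the suspect line (visible thanks to --nocapture in the script above; the spawned tasks can be wrapped the same way), e.g.:

eprintln!("[main] before barrier1.wait()");
barrier1.wait();
eprintln!("[main] after barrier1.wait()"); // not reached in the stalled runs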

@jofas (Contributor) commented Sep 24, 2024

Uhm, isn't the flakiness just a mundane case of accidentally blocking the runtime before it can schedule the second task?

@Darksonn (Contributor, Author)

The runtime isn't supposed to get blocked by this. If there is an idle worker thread and a runnable task, the worker thread must pick up a runnable task. The only exception is the LIFO slot which is not relevant here.

@jofas (Contributor) commented Sep 24, 2024

Could there be some weirdness around parked threads? The deadlocked test parks the second worker before it gets the second task in its work queue and does not unpark it again on my machine.

@carllerche (Member)

Looking at the source, my guess is one worker gets both tasks off the injection queue in one batch and doesn’t notify a peer to steal.

@jofas would you be able to try to isolate this case as a loom test?
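
For readers unfamiliar with the idea: loom exhaustively explores thread interleavings, so a missed wakeup shows up as a deterministic model failure rather than a rare CI hang. The sketch below is not tokio's internal scheduler test (that would go through the runtime's own loom harness); it only shows the general shape of such a test, with a Mutex/Condvar pair standing in for the injection queue and worker parking.

use loom::sync::{Arc, Condvar, Mutex};
use loom::thread;

#[test]
fn parked_peer_is_notified_after_batch_push() {
    loom::model(|| {
        // Shared "pending work" counter plus a condvar standing in for
        // the parked worker's wakeup mechanism.
        let state = Arc::new((Mutex::new(0usize), Condvar::new()));
        let state2 = state.clone();

        // "Worker": parks until it observes some work.
        let worker = thread::spawn(move || {
            let (lock, cvar) = &*state2;
            let mut work = lock.lock().unwrap();
            while *work == 0 {
                work = cvar.wait(work).unwrap();
            }
        });

        // "Injector": pushes a batch of work and notifies the peer. If the
        // notify_one() were forgotten, loom would report the resulting
        // deadlock instead of CI silently timing out.
        let (lock, cvar) = &*state;
        *lock.lock().unwrap() = 2;
        cvar.notify_one();

        worker.join().unwrap();
    });
}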

@jofas (Contributor) commented Sep 24, 2024

I'd love to give it a try for sure. This is quite exciting. @Darksonn may I ask you questions in case I get stuck on this?

@Darksonn (Contributor, Author)

Yes, feel free to send questions my way. A loom test would be a good start.
