Local pipeline randomly stalls #441
Hi, have you confirmed there's nothing in top/htop spinning on CPU? What's the last entry in pipeline.log? |
The only things still "spinning" are a series of forked MBM.py scripts, which appear to be doing nothing, i.e.:
The last few entries in pipeline.log are:
The command in this case being:
Checking the whole run up to that point, it looks like stage 2 didn't complete, going by a count of the completion messages. This seems to have prevented stage 3 and beyond from running at all, while other stages presumably not dependent on stage 3 have run. Also: the pipeline state
|
Hi, if you're running locally, why does your config say "queue-type sge"? You need to run MBM.py with "--local". |
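For illustration, a purely local run along these lines might look like the following (a sketch combining the --local flag suggested here with flags that appear in the command later in the thread; the pipeline name is a placeholder):
MBM.py --local --pipeline-name=MBM_local_test --num-executors=3 --files <mnc file list>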
I was under the impression that the "local" option in the config set the run mode to local. In this particular case I am just using the sample configuration to evaluate the tool. Should I be using a different queue type in this case? Many thanks for your assistance so far. |
All these settings:
Refer to a cluster config and should be removed for your application. This:
May be a bug triggered by |
Looking at the code, is_time_to_drain uses time-to-accept-jobs. It seems to be checked here during the main loop: https://github.com/Mouse-Imaging-Centre/pydpiper/blob/master/pydpiper/execution/pipeline_executor.py#L542 - in which case it will always return True, which causes the loop to continue running (assuming I followed through right). But this seems to be done before grabbing the next command, so as long as time-to-accept-jobs has been exceeded the executor will never exit.
There is also another check before that against max-idle-time, which actually kills the executor if it has been exceeded.
I'll see if removing time-to-accept-jobs and the like from my config improves the situation, though I am still a little suspicious there may be some other variable at play here, as sometimes the stall occurs well before 180 minutes has passed.
Thanks again. |
Note that the 180 is seconds, not minutes.
We rarely use this configuration so it's certainly possible some other issue is at play.
|
Hi @jamesu, I've had this problem when I've been launching MBM.py myself (for example, the pipeline misses some early mincblur stages). For local runs, I just have a config file containing only [string] and specify --mem and --num-executors in the MBM.py command-line call. As Pydpiper automatically saves which stage you are at, I've just been relaunching the command and that fixes the problem (I'm sorry that it's tedious). Kind regards, |
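A minimal sketch of that setup, assuming a near-empty config file named local.cfg and placeholder pipeline name and resource values (--local is the flag recommended earlier in the thread for local runs):
# local.cfg contains nothing but a section header, e.g. [local]
MBM.py --local --pipeline-name=my_local_run --config-file=local.cfg --mem 16 --num-executors 4 --files <mnc file list>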
Ah, good to hear someone else is having this issue. Indeed, you can keep restarting the pipeline, but it seems a little silly to me that we have to resort to working around the problem in a tool which should be doing the job for us. For reference, on my end |
How long are your filenames? I think that might have an effect on whether stages get missed or not, but I haven't tested it out properly (I stumbled upon this while trying to correct an incorrect filename being written). |
@Dorky-Lever no longer than the filenames used in the test brain files in this case |
Hi James, see if specifying more RAM (if possible) and reducing the number of executors makes it stall less; I've noticed that reduces some stalling. When I'm doing a local run with MBM.py, I have h_vmem set to 60gb in my shell and --mem set to 300gb (I'm running the code on a subnode with 1000gb of memory):
#!/bin/bash
MBM.py --pipeline-name Hen_shrt --mem 300 --num-executors 40 |
I've been trying to run MBM.py as part of the test_MBM_and_MAGeT.py test script, and also separately with some other mouse brain images I have. What seems to happen is that the pipeline will randomly stall completely. Often I will see a set of MBM.py scripts still running, but doing nothing... even past the default 1 minute idle time. I then need to restart the pipeline to progress further. Sometimes this even happens right at the start of a run (after ~8 stages). The problem seems to get worse the more executors I use.
Is there any easy way to debug this, or are there better default config options I can try to mitigate the problem?
Usually I run with some variant of this (with csv files from test-data.tar.gz):
MBM.py --pipeline-name=MBM_latest_test --num-executors=3 --verbose --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --config-file=/execute/test-data/sample.cfg --lsq6-large-rotations --no-run-maget --maget-no-mask --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --files <mnc file list>
But I've also noticed this issue with full runs (with MAGeT). Also, my configuration is just the sample local test config, i.e.:
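As a rough illustration only, not the actual sample.cfg, the cluster-style options discussed in this thread (queue-type sge and time-to-accept-jobs 180) would appear in such a config in roughly this form; the section name and exact key=value syntax are assumptions:
[sample]
queue-type=sge
time-to-accept-jobs=180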