Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local pipeline randomly stalls #441

Open
jamesu opened this issue Oct 28, 2019 · 12 comments
Open

Local pipeline randomly stalls #441

jamesu opened this issue Oct 28, 2019 · 12 comments
Assignees
Milestone

Comments

@jamesu
Copy link

jamesu commented Oct 28, 2019

I've been trying to run MBM.py as part of the test_MBM_and_MAGeT.py test script and also separately with some other mouse brain images I have.

What seems to happen is randomly the pipeline will completely stall. Often I will see a set of MBM.py scripts still running, but doing nothing... even past the default 1 minute idle time. I then need to restart the pipeline to progress further. Sometimes this even happens right at the start of a run (after ~8 stages). The problem seems to get worse the more executors I use.

Is there any easy way to debug this, or are there better default config options I can try to mitigate the problem?

Usually I run with some variant of this (with csv files from test-data.tar.gz):

MBM.py --pipeline-name=MBM_latest_test --num-executors=3 --verbose --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --config-file=/execute/test-data/sample.cfg --lsq6-large-rotations --no-run-maget --maget-no-mask --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --files <mnc file list>

But I've also noticed this issue with full runs (with MaGET). Also my configuration is just the sample local test config, i.e.

[sample]
queue-type sge
queue-name all.q
proc 1
mem  10
time-to-accept-jobs 180
local
@gdevenyi
Copy link

Hi, have you confirmed there's nothing in top/htop spinning on CPU?

What's the last entry in pipeline.log?

@jamesu
Copy link
Author

jamesu commented Oct 28, 2019

The only thing still "spinning" are a series of forked MBM.py scripts which appears to be doing nothing. i.e.:

 3027 docker    20   0 2646476  68436   4884 S   0.0  0.4   0:00.00 MBM.py                                                                                                                                                                                                                              
 3501 docker    20   0  931948 102324  34028 S   0.0  0.6   0:03.91 MBM.py                                                                                                                                                                                                                              
 3524 docker    20   0 2947928  84476  14504 S   0.0  0.5   0:56.97 MBM.py                                                                                                                                                                                                                              
 3525 docker    20   0  931948  73036   4740 S   0.0  0.5   0:02.15 MBM.py                                                                                                                                                                                                                              
 3531 docker    20   0 1267860  72824   4248 S   0.0  0.5   0:00.00 MBM.py                                                                                                                                                                                                                              
 3532 docker    20   0 1267860  72824   4248 S   0.0  0.5   0:00.00 MBM.py                                                                                                                                                                                                                              
 3537 docker    20   0 1562820  73640   4960 S   0.0  0.5   0:00.05 MBM.py                                                                                                                                                                                                                              
 3538 docker    20   0 1562820  73640   4960 S   0.0  0.5   0:00.03 MBM.py                                                                                                                                                                                                                              
 3539 docker    20   0 1562820  73644   4960 S   0.0  0.5   0:00.05 MBM.py                                                                                                                                                                                                                              
 3541 docker    20   0 1571016  73648   4960 S   0.0  0.5   0:00.01 MBM.py                                                                                                                                                                                                                              
 3549 docker    20   0 2152692  73396   4560 S   0.0  0.5   0:00.00 MBM.py                                                                                                                                                                                                                              
 3550 docker    20   0 2152692  73812   4960 S   0.0  0.5   0:00.25 MBM.py  

Last few entries in the pipeline.log are:

[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] Running stage 550 (on PYRO:[email protected]:43115). Memory requested: 1.75
[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_labels.mnc mbm_new_images_atlases/Dorr_2008_labels/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled_labels.mnc
[2019-10-28 14:36:17.474,pydpiper.execution.pipeline,INFO] Stage 548 finished, return was: 0 (on PYRO:[email protected]:43115)
FIN: Finished Stage 548
FIN: Starting Stage 442
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] Running stage 442 (on PYRO:[email protected]:43115). Memory requested: 1.75
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -labels -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_mask.mnc mbm_new_images_atlases/Dorr_2008_mask/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled.mnc
[2019-10-28 14:36:57.423,pydpiper.execution.pipeline,INFO] Stage 550 finished, return was: 0 (on PYRO:[email protected]:43115)
FIN: Finished Stage 550
[2019-10-28 14:37:04.065,pydpiper.execution.pipeline,INFO] Stage 442 finished, return was: 0 (on PYRO:[email protected]:43115)
FIN: Finished Stage 442
[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Main loop finished
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Unsetting the registered-with-the-server flag for executor: PYRO:[email protected]:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Client un-registered (Cheers!): PYRO:[email protected]:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Done calling unregisterClient
[2019-10-28 14:38:04.083,pydpiper.execution.pipeline,INFO] Executor shutting down.

With the command in this case being:

/usr/local/bin/MBM.py --pipeline-name mbm_new_images --config-file=/execute/test-data/sample.cfg --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --registration-method=minctracc --maget-registration-method=minctracc --num-executors=5 --lsq6-large-rotations --lsq6-large-rotations-parameters=mousebrain --maget-atlas-library=/execute/test-data/ex-vivo-atlases --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --maget-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --maget-masking-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --proc 2 --files /execute/test-data/dti-images/DW_Recon.15_dti.mnc /execute/test-data/dti-images/DW_Recon.5_dti.mnc /execute/test-data/dti-images/DW_Recon.6_dti.mnc /execute/test-data/dti-images/DW_Recon.7_dti.mnc /execute/test-data/dti-images/DW_Recon.8_dti.mnc /execute/test-data/dti-images/DW_Recon.9_dti.mnc

Checking on the whole run up to that point, it looks like stage 2 didn't complete if I do a count of the completion messages. This seems to have caused some other stages like 3 and beyond not to run at all, while others presumably not dependent on stage 3 have run.

ALSO: pipeline state

STALLED PIPELINE STATE:

Total number of stages in the pipeline:  1761
Number of stages already processed:      344 

Number of active clients:                2
Number of clients waiting in the queue:  0 

Currently running stages (3): 
0	mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.15_dti.mnc mbm_new_images_processed/DW_Recon.15_dti/tmp/DW_Recon.15_dti_fwhm0.28 -gradient

3	mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.5_dti.mnc mbm_new_images_processed/DW_Recon.5_dti/tmp/DW_Recon.5_dti_fwhm0.28 -gradient

110	mincblur -clobber -no_apodize -fwhm 0.25 /execute/test-data/ex-vivo-atlases/Dorr_2008_on_VVW_Sert_v1_average.mnc mbm_new_images_atlases/Dorr_2008_on_VVW_Sert_v1_average/tmp/Dorr_2008_on_VVW_Sert_v1_average_fwhm0.25 -gradient


Number of runnable stages:                0 


Number of failed stages:                  0
Number of failed/lost/dead executors:     0 


Memory requirement of runnable stages: []
Memory available in registered clients: [3.5, 3.5] 

@gdevenyi
Copy link

Hi,

If you're running locally, why is your config "queue-type sge". You need to run MBM.py with "--local"

@jamesu
Copy link
Author

jamesu commented Oct 28, 2019

I was under the impression that the "local" option in the config set the run mode to local. In this particular case to evaluate I am just using the sample configuration. Should I be using a different queue type in this case?

Many thanks for your assistance so far.

@gdevenyi
Copy link

All these settings:

queue-type sge
queue-name all.q
proc 1
mem  10
time-to-accept-jobs 180

Refer to a cluster config and should be removed for your application.

This:

[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!

May be a bug triggered by time-to-accept-jobs 180 which shouldn't apply in the local case, I leave that for the developers to determine.

@jamesu
Copy link
Author

jamesu commented Oct 28, 2019

Looking at the code, is_time_to_drain uses time-to-accept-jobs. Seems to be checked here during the main loop: https://github.com/Mouse-Imaging-Centre/pydpiper/blob/master/pydpiper/execution/pipeline_executor.py#L542 - in which case, it will always return True which causes the loop to continue running (assuming I followed through right). But this seems to be done before grabbing the next command, so as long as time-to-accept-jobs has been exceeded the executor will never exit.

There is also another check before that against max-idle-time which actually kills the executor if it has been exceeded.

I'll see if removing time-to-accept-jobs and the like from my config improves the situation, though I am still a little suspicious there may be some other variable at play here as sometimes the stall occurs well before 180 minutes has passed.

Thanks again.

@bcdarwin
Copy link
Member

bcdarwin commented Oct 28, 2019 via email

@nzxwang nzxwang added this to the v2.0.14 milestone Oct 31, 2019
@dorkylever
Copy link

Hi @jamesu,

I've had this problem when I've been launching MBM.py myself (for example the pipeline misses some early mincblur stages). For local runs, I just have a config file containing only [string] and specify --mem and --num-executors in the MBM.py command line call.

As Pydpiper automatically saves which stage you are at, I've just been relaunching the command and that will fix the problem (I'm sorry that its tedious).

Kind Regards,
Kyle Drover

@jamesu
Copy link
Author

jamesu commented Nov 14, 2019

Ah, good to hear someone else is having this issue. Indeed, you can keep restarting the pipeline but it seems a little silly to me we're having to resort to working around the problem for a tool which should be doing the job for us.

For reference, on my endtime-to-accept-jobs didn't seem to fix the problem, I can only assume there must be some edge-case issue being triggered here. I'll see if I can get any more info on the state of the processes on a future run.

@dorkylever
Copy link

How long are your filenames?

I think that might have an effect on whether stages get missed or not but I haven't tested it out properly (I stumbled upon this while trying to correct an incorrect filename been written).

@jamesu
Copy link
Author

jamesu commented Nov 20, 2019

@Dorky-Lever no longer than the filenames used in the test brain files in this case

@dorkylever
Copy link

Hi James,

See if specifying more RAM (if possible) and reducing the number of executors will cause it to stall less. I've noticed that reduces some stalling.

So when I'm doing a local run with MBM.py:

I have the h_vmem specified to 60gb in my shell and --mem specified to 300gb (I'm running the code on a subnode with 1000gb of memory.

#!/bin/bash
#$ -N "test_kat_crop"
#$ -cwd
#$ -V
#$ -S /bin/bash
#$ -q "hugemem.q"
#$ -l h_vmem=60g,virtual_free=59.9g
#$ -pe threads 40
#$ -m e
#$ -M "[email protected]"

MBM.py --pipeline-name Hen_shrt --mem 300 --num-executors 40
--init-model /home/kyle_arkell/Henrik/TCP-1700003F12Rik/initial-model/E15.5_mouse_model_final_image.mnc
--registration-method minctracc --registration-strategy pairwise_and_build_model
--nlin-protocol /home/kyle_arkell/embryo_minctracc_protocol.csv
--no-run-maget --maget-no-mask --no-common-space-registration --verbose --config-file /home/kyle_arkell/arkell.cfg
--lsq12-protocol /home/kyle_arkell/Henrik/TCP-1700003F12Rik/MBM_lsq12_embryo_protocol.csv
--output-dir /home/kyle_arkell/Hen_files/ --files /home/kyle_arkell/Henrik/TCP-1700003F12Rik/m_*mnc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants