
LOFAR_dd.py pipeline hangs at 'predict' step without error message #32

Open
mknapp55 opened this issue Nov 27, 2023 · 4 comments

@mknapp55

I'm running the LOFAR_dd.py pipeline on an Ubuntu 22 system, 128 G RAM, 12 cores. I'm using Singularity 20220805 for the environment. The pipeline runs fine until it gets to the "Predict full model" step. It works through the first few parts of that step, but then hangs silently at "Adding model data column...". I see that entry in the wsclean log and then nothing else happens. This has recurred at least 3 times and it gets stuck in exactly the same place. No error message. When I check resource utilization, the computer is at idle levels, so it isn't a case of this step just taking a really long time.
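A quick way to distinguish a genuine hang from slow I/O is to inspect the state of the stuck process. This is a generic Linux diagnostic sketch, not part of the pipeline: a wsclean sitting in uninterruptible sleep (STAT "D") points at blocked disk or lock I/O rather than computation, and the WCHAN column hints at the kernel wait point.

```shell
# Generic diagnostic (assumes Linux procps `ps`): show the state of any
# running wsclean process while the pipeline appears hung. STAT "D" means
# uninterruptible sleep (usually disk/lock I/O); "S" means ordinary sleep.
ps -o pid,stat,wchan:20,cmd -C wsclean || echo "no wsclean process found"
```

If no wsclean process shows up at all, the hang is in the pipeline wrapper rather than in wsclean itself.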

Let me know if seeing the logs either from the pipeline or wsclean would be helpful.

@revoltek
Owner

It shouldn't take that long. Is disk space OK? You can try running the command by hand to get a bit more control.

@mknapp55
Author

mknapp55 commented Nov 30, 2023

I'm not out of disk space: I have ~7 TB available. What appears to be happening is that the wsclean command on line 295 of LOFAR_dd.py executes until it needs to write the MODEL_DATA column, and then it hangs. All steps prior to this one seem to execute without issue. I have attached the wsclean log. I checked the MSs in mss-avg/ and they do not have a MODEL_DATA column. I also ran the wsclean command outside the pipeline framework; it errored out much sooner than the wsclean command run within the pipeline (command and error below). Perhaps there's a write lock that needs to be cleared?

Wsclean command:
wsclean -predict -name ddcal/init/wideM-1 -j 12 -channels-out 12 -reorder -parallel-reordering 4 mss-avg/TC00.MS mss-avg/TC01.MS mss-avg/TC02.MS mss-avg/TC03.MS

Error:

Reading ddcal/init/wideM-1-0000-model.fits...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
+ + + + + + + + + + + + + + + + + + +
+ An exception occured:
+ >>> Error opening meta file for ms mss-avg/TC00.MS, dataDescId 0
+ + + + + + + + + + + + + + + + + + +

I've also noticed a pronounced slowdown in basic text-based terminal commands while the pipeline is hanging. This goes away when I reboot the machine. I do not see much CPU or memory usage, so I don't understand why the hanging pipeline would cause this problem...but I've observed the same effect multiple times.

Edited to add: I tried the dd pipeline on a different dataset and saw the same hanging behavior and the same overall system lag while the pipeline was hanging. I also ran the wsclean command outside of the pipeline and it completed for this data set.

wscleanPRE-c0.log

@mknapp55
Author

OK, please disregard the wsclean error above: I had one of the MSs improperly write-locked. When I fixed that problem, the wsclean command executed to completion with no errors. So something different happens when the wsclean command is executed within the pipeline framework than when it is run on its own from the command line. Is it possible this is due to a version difference between the wsclean in the Singularity container and the one installed on my OS?

What's the best way to force the pipeline to skip that step? Can I run the pipeline to that step, terminate it, run the wsclean command from the command line, and then restart the pipeline at the next step? Of course it would be great to find the root cause; I'm just looking for a workaround in case that proves difficult.
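The proposed workaround might look like the following sketch. Note that LOFAR_dd.py's ability to resume past a completed step is an assumption here (its checkpointing is not documented in this thread), and the command line is the one quoted earlier, so paths must match your run.

```shell
# Hedged workaround sketch -- assumes the pipeline can be interrupted and
# later resumed past the predict step (not verified against LOFAR_dd.py).
# 1. When the run hangs at "Adding model data column...", kill the stuck
#    wsclean and the pipeline (no reboot needed).
# 2. Run the predict step by hand, exactly as quoted earlier in this thread:
PREDICT_CMD="wsclean -predict -name ddcal/init/wideM-1 -j 12 \
-channels-out 12 -reorder -parallel-reordering 4 \
mss-avg/TC00.MS mss-avg/TC01.MS mss-avg/TC02.MS mss-avg/TC03.MS"
echo "would run: $PREDICT_CMD"
# $PREDICT_CMD   # uncomment on a machine with wsclean and these MSs present
# 3. Restart LOFAR_dd.py and let it continue from the next step, if its
#    internal bookkeeping detects the predict output (again, an assumption).
```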

@mknapp55
Author

mknapp55 commented Feb 7, 2024

I've now run the dd pipeline on a different field with a different calibrator and I get the same hanging behavior at the same place. Would it be possible to run the command that hangs manually and then restart the pipeline past that point? Are there other workaround options?
