
LOFAR_dd.py pipeline hangs at 'predict' step without error message #32

Open
mknapp55 opened this issue Nov 27, 2023 · 4 comments

@mknapp55

I'm running the LOFAR_dd.py pipeline on an Ubuntu 22 system, 128 G RAM, 12 cores. I'm using Singularity 20220805 for the environment. The pipeline runs fine until it gets to the "Predict full model" step. It works through the first few parts of that step, but then hangs silently at "Adding model data column...". I see that entry in the wsclean log and then nothing else happens. This has recurred at least 3 times and it gets stuck in exactly the same place. No error message. When I check resource utilization, the computer is at idle levels, so it isn't a case of this step just taking a really long time.
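A quick way to distinguish a genuine hang from slow I/O is to inspect the state of the stuck process. This is a generic Linux diagnostic sketch, not part of the pipeline: a wsclean sitting in uninterruptible sleep (STAT "D") points at blocked disk or lock I/O rather than computation, and the WCHAN column hints at the kernel wait point.

```shell
# Generic diagnostic (assumes Linux procps `ps`): show the state of any
# running wsclean process while the pipeline appears hung. STAT "D" means
# uninterruptible sleep (usually disk/lock I/O); "S" means ordinary sleep.
ps -o pid,stat,wchan:20,cmd -C wsclean || echo "no wsclean process found"
```

If no wsclean process shows up at all, the hang is in the pipeline wrapper rather than in wsclean itself.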

Let me know if seeing the logs either from the pipeline or wsclean would be helpful.

@revoltek
Owner

It shouldn't take that long. Is disk space OK? You can try running the command by hand to get a bit more control.

@mknapp55
Author

mknapp55 commented Nov 30, 2023

I'm not out of disk space: I have ~7 TB available. What appears to be happening is that the wsclean command on line 295 of LOFAR_dd.py executes until it needs to write the MODEL_DATA column, and then it hangs. All steps prior to this one seem to execute without issue. I have attached the wsclean log. I checked the MSs in mss-avg/ and they do not have a MODEL_DATA column. I also ran the wsclean command outside the pipeline framework; it errored out much sooner than the wsclean command run within the pipeline (command and error below). Perhaps there's a write lock that needs to be cleared?

Wsclean command:
wsclean -predict -name ddcal/init/wideM-1 -j 12 -channels-out 12 -reorder -parallel-reordering 4 mss-avg/TC00.MS mss-avg/TC01.MS mss-avg/TC02.MS mss-avg/TC03.MS

Error:

Reading ddcal/init/wideM-1-0000-model.fits...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
An exception occurred, writing back will be skipped. Cleaning up...
+ + + + + + + + + + + + + + + + + + +
+ An exception occured:
+ >>> Error opening meta file for ms mss-avg/TC00.MS, dataDescId 0
+ + + + + + + + + + + + + + + + + + +

I've also noticed a pronounced slowdown in basic text-based terminal commands while the pipeline is hanging. This goes away when I reboot the machine. I do not see much CPU or memory usage, so I don't understand why the hanging pipeline would cause this problem...but I've observed the same effect multiple times.

Edited to add: I tried the dd pipeline on a different dataset and saw the same hanging behavior and the same overall system lag while the pipeline was hanging. I also ran the wsclean command outside of the pipeline and it completed for this data set.

wscleanPRE-c0.log

@mknapp55
Author

OK, please disregard the wsclean error above: I had one of the MSs improperly write-locked. When I fixed that problem, the wsclean command executed to completion with no errors. So something different happens when the wsclean command is executed within the pipeline framework than when it is run on its own from the command line. Is it possible this is due to a version difference between the wsclean in the Singularity container and the one installed on my OS?

What's the best way to force the pipeline to skip that step? Can I run the pipeline to that step, terminate it, run the wsclean command from the command line, and then restart the pipeline at the next step? Of course it would be great to find the root cause; I'm just looking for a workaround in case that proves difficult.
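The proposed workaround might look like the following sketch. Note that LOFAR_dd.py's ability to resume past a completed step is an assumption here (its checkpointing is not documented in this thread), and the command line is the one quoted earlier, so paths must match your run.

```shell
# Hedged workaround sketch -- assumes the pipeline can be interrupted and
# later resumed past the predict step (not verified against LOFAR_dd.py).
# 1. When the run hangs at "Adding model data column...", kill the stuck
#    wsclean and the pipeline (no reboot needed).
# 2. Run the predict step by hand, exactly as quoted earlier in this thread:
PREDICT_CMD="wsclean -predict -name ddcal/init/wideM-1 -j 12 \
-channels-out 12 -reorder -parallel-reordering 4 \
mss-avg/TC00.MS mss-avg/TC01.MS mss-avg/TC02.MS mss-avg/TC03.MS"
echo "would run: $PREDICT_CMD"
# $PREDICT_CMD   # uncomment on a machine with wsclean and these MSs present
# 3. Restart LOFAR_dd.py and let it continue from the next step, if its
#    internal bookkeeping detects the predict output (again, an assumption).
```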

@mknapp55
Author

mknapp55 commented Feb 7, 2024

I've now run the dd pipeline on a different field with a different calibrator and I get the same hanging behavior at the same place. Would it be possible to run the command that hangs manually and then restart the pipeline past that point? Are there other workaround options?
