LOFAR_dd.py pipeline hangs at 'predict' step without error message #32
Comments
It shouldn't take that long. Is disk space OK? You can try running the command by hand to have a bit more control.
I'm not out of disk space; I have ~7 TB available. What appears to be happening is that the wsclean command on line 295 of LOFAR_dd.py runs until it needs to write the MODEL_DATA column, and then it hangs. All steps before this one seem to execute without issue. I have attached the wsclean log. I checked the MSs in mss-avg/ and they do not have a MODEL_DATA column. I also ran the wsclean command outside the pipeline framework. It errored out much sooner than the wsclean command run within the pipeline (command and error below). Perhaps there's a write lock that needs to be cleared? Wsclean command: Error:
I've also noticed a pronounced slowdown in basic text-based terminal commands while the pipeline is hanging. This goes away when I reboot the machine. I do not see much CPU or memory usage, so I don't understand why the hanging pipeline would cause this problem, but I've observed the same effect multiple times. Edited to add: I tried the dd pipeline on a different dataset and saw the same hanging behavior and the same overall system lag while the pipeline was hanging. I also ran the wsclean command outside of the pipeline and it completed for this dataset.
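One possible explanation for a stalled process with idle CPU plus system-wide lag is a process stuck in uninterruptible sleep (state `D`), which usually means it is blocked on I/O. That is a guess and has not been confirmed in this thread. Below is a minimal, Linux-only sketch for checking it by reading `/proc/<pid>/stat`. The helper names `parse_proc_state` and `find_processes_in_state` are hypothetical, and matching on the name fragment "wsclean" is an assumption about how the process appears in `/proc`.

```python
from pathlib import Path


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def find_processes_in_state(name_fragment: str, state: str = "D"):
    """List (pid, command name) for processes whose name contains
    name_fragment and whose current state equals `state`."""
    hits = []
    for stat_path in Path("/proc").glob("[0-9]*/stat"):
        try:
            line = stat_path.read_text(errors="replace")
        except OSError:
            continue  # the process exited while we were scanning
        comm = line[line.index("(") + 1:line.rindex(")")]
        if name_fragment in comm and parse_proc_state(line) == state:
            hits.append((int(stat_path.parent.name), comm))
    return hits


if __name__ == "__main__":
    # Assumption: the hung process shows up with "wsclean" in its name.
    for pid, comm in find_processes_in_state("wsclean"):
        print(f"PID {pid} ({comm}) is in uninterruptible sleep (D)")
```

If this prints a PID while the pipeline is hanging, the stall is more likely in storage or the filesystem layer than in wsclean's own computation. An empty result would point elsewhere.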
OK, please disregard the wsclean error above: I had one of the MSs improperly write-locked. When I fixed that problem, the wsclean command executed to completion with no errors. So something different happens when the wsclean command is executed in the pipeline framework than when it is executed on its own from the command line. Is it possible this has to do with a version difference between the wsclean used in the Singularity container and the one I have installed on my OS? What's the best way to force the pipeline to skip over that step? Could I run the pipeline to that step, terminate it, run the wsclean command from the command line, and then restart the pipeline on the next step? Of course it would be great to find the root cause; I'm just looking for a workaround in case finding the root cause proves difficult.
I've now run the dd pipeline on a different field with a different calibrator and I get the same hanging behavior at the same place. Would it be possible to run the command that hangs manually and then restart the pipeline past that point? Are there other workaround options?
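If the manual-run workaround is attempted, it may be worth confirming that the hand-run wsclean command actually wrote MODEL_DATA into every MS before restarting the pipeline past the predict step. Here is a rough sketch of that check. It assumes python-casacore is available in the environment, that the MSs sit in `mss-avg/`, and that they match `*.MS` (the naming pattern is a guess). `mss_missing_column` and `read_columns_with_casacore` are hypothetical helper names.

```python
from pathlib import Path


def read_columns_with_casacore(ms_path):
    """Return the column names of a Measurement Set's main table."""
    # Assumption: python-casacore is installed. It is imported lazily so
    # the rest of the sketch can be used without it.
    from casacore.tables import table

    t = table(str(ms_path), readonly=True, ack=False)
    try:
        return t.colnames()
    finally:
        t.close()


def mss_missing_column(ms_paths, column="MODEL_DATA",
                       read_columns=read_columns_with_casacore):
    """Return the paths (as strings) of MSs that lack `column`."""
    return [str(p) for p in ms_paths if column not in read_columns(p)]


if __name__ == "__main__":
    # Assumption: MSs are directories matching mss-avg/*.MS
    mss = sorted(Path("mss-avg").glob("*.MS"))
    for ms in mss_missing_column(mss):
        print(f"{ms} has no MODEL_DATA column")
```

The column reader is passed in as a parameter so the check can be tried without casacore, for example with a dictionary of known column lists.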
I'm running the LOFAR_dd.py pipeline on an Ubuntu 22 system with 128 GB of RAM and 12 cores. I'm using Singularity 20220805 for the environment. The pipeline runs fine until it gets to the "Predict full model" step. It works through the first few parts of that step, but then hangs silently at "Adding model data column...". I see that entry in the wsclean log and then nothing else happens. This has recurred at least 3 times and it gets stuck in exactly the same place, with no error message. When I check resource utilization, the computer is at idle levels, so it isn't a case of this step just taking a really long time.
Let me know if seeing the logs either from the pipeline or wsclean would be helpful.
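Since the hang is silent, one way to make it visible is to run the command under a small watchdog that gives up when the log stops growing. This is only a sketch, not part of the pipeline: `run_with_stall_timeout` is a hypothetical helper, and the default thresholds are arbitrary.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_with_stall_timeout(cmd, log_path, stall_seconds=600.0, poll_seconds=5.0):
    """Run cmd with its output redirected to log_path, and stop it if the
    log has not grown for stall_seconds.

    Returns (returncode, stalled). returncode may be None if the process
    could not be stopped.
    """
    log_path = Path(log_path)
    with open(log_path, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        last_size = 0
        last_growth = time.monotonic()
        while True:
            try:
                # Returns as soon as the process exits on its own.
                return proc.wait(timeout=poll_seconds), False
            except subprocess.TimeoutExpired:
                pass  # still running; check whether the log is growing
            size = log_path.stat().st_size
            if size != last_size:
                last_size, last_growth = size, time.monotonic()
            elif time.monotonic() - last_growth > stall_seconds:
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()  # may not take effect if blocked in the kernel
                return proc.poll(), True


if __name__ == "__main__":
    # Harmless demonstration: prints one line, then goes quiet.
    demo = [sys.executable, "-c",
            "import time; print('working', flush=True); time.sleep(30)"]
    print(run_with_stall_timeout(demo, "demo.log", stall_seconds=2, poll_seconds=0.5))
```

For the real case, `cmd` would be the wsclean argument list taken from the pipeline log, which isn't reproduced here. A process that is blocked inside the kernel may ignore both signals, so a `True` result reports the stall but does not guarantee the process is gone.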