Investigate use of Google Batch instead of Life Sciences API as a Cromwell backend #40
To perform a test run using an updated version of Cromwell with Google Batch as the backend, I believe the following changes will be required: change from the Google Life Sciences API to Batch
... and delete two related entries that should not be needed
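As a quick sanity check (a hedged sketch, not a required step), one can grep the generated config to confirm which backend actor-factory it names; the class names mentioned in the comment below are as they appear in Cromwell's example backend configs and should be verified against the Cromwell version in use:

```bash
# Confirm which backend provider cromwell.conf currently points at.
# Life Sciences configs name a PipelinesApiLifecycleActorFactory class;
# Cromwell's GCPBATCH.conf example names GcpBatchBackendLifecycleActorFactory.
grep -n "actor-factory" cromwell.conf
```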
|
When attempting this for the first time, this error was encountered:
Enabling the API was seemingly as simple as visiting that URL and hitting the "ENABLE API" button. I believe this could be done automatically for the project in question by modifying scripts/enable_api.sh, which is called by manual-workflows/resources.sh, to enable the Batch API in place of the Life Sciences API. In the short term, since we will be experimenting with this backend while continuing to use the Life Sciences API, we will want to add to the list of allowed APIs rather than replace it.
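A hedged sketch of what that addition to scripts/enable_api.sh might look like (the exact contents of our script may differ; the service identifiers are the standard Google Cloud API names):

```bash
# Sketch only: enable the Batch API alongside (not instead of) the Life Sciences API.
gcloud services enable lifesciences.googleapis.com   # existing Life Sciences backend
gcloud services enable batch.googleapis.com          # new GCP Batch backend
```
|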
Next errors encountered:
It seems that additional Google Cloud IAM permissions may be required. Initial guess is that the service account we are using here would need something like:
We should already have that last one. Service accounts are currently configured in scripts/create_resources.sh which is called by manual-workflows/resources.sh. At present we have defined two service accounts according to IAM: [email protected] (described as "Cromwell backend compute") with roles:
[email protected] (described as "Cromwell Task Compute VM") with roles:
In the short term, since we will be experimenting with this backend while continuing to use the Life Sciences API, we will want to add rather than replace permissions. I think this can be done simply by updating the two scripts mentioned above and rerunning them. In summary, I think we could start by trying to add the following currently missing role(s) in scripts/create_resources.sh:
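For illustration only, adding a project-level role to a service account generally looks like the following; the project ID and account email are placeholders, and roles/batch.jobsEditor is an assumption about what the Batch backend may need rather than a confirmed requirement:

```bash
# Placeholder values; the real project ID and service account emails live in our scripts.
PROJECT_ID="my-project"
SERVER_ACCOUNT="cromwell-server@${PROJECT_ID}.iam.gserviceaccount.com"

# Additive: grants one more project-level role without touching existing bindings.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SERVER_ACCOUNT}" \
    --role="roles/batch.jobsEditor"   # assumed role for submitting GCP Batch jobs
```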
|
Adding the
This sounds related to this:
So I will next add the following to scripts/create_resources.sh:
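A hedged sketch of that kind of addition (account names are placeholders, and whether the binding belongs at the project level or on the compute service account itself should be double-checked):

```bash
PROJECT_ID="my-project"
SERVER_ACCOUNT="cromwell-server@${PROJECT_ID}.iam.gserviceaccount.com"
COMPUTE_ACCOUNT="cromwell-compute@${PROJECT_ID}.iam.gserviceaccount.com"

# Let the server account act as (actAs) the compute account when launching jobs.
gcloud iam service-accounts add-iam-policy-binding "$COMPUTE_ACCOUNT" \
    --member="serviceAccount:${SERVER_ACCOUNT}" \
    --role="roles/iam.serviceAccountUser"
```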
|
Still no success after this last change. I do see Cromwell report that task code is being created and jobs are being launched. The basic structure of the run files, including localization and run scripts, is being created in the Google bucket by Cromwell. But nothing is coming back from the VMs. And I see VMs being started and running in the console. Cromwell seems to be requesting specific resources for each task, and I see things like this in the Cromwell logging:
But if I log onto one of these instances through the console, while I see differing amounts of memory and CPU, I see no evidence that such a storage disk has been attached. Nothing seems to be happening. I suspect Cromwell tries something, times out, and fails the task. I'm still getting these events as well:
I have not seen any other informative logging in the Cromwell log. One thing I don't fully understand is that we are setting this in our cromwell.conf file:
However, all the IAM permissions we have been conferring are to: We could try adding these permissions to that service account user as well... |
This last change seems to have helped, and now input data is being written to a separate volume and mount point on a machine that I logged into: Summary of the addition of permissions so far for SERVER_ACCOUNT:
Summary of the addition of permissions so far for COMPUTE_ACCOUNT:
|
In my latest test, tasks are now running as expected and a few have actually succeeded. But there seems to be a problem related to the use of preemptible machines. Our current strategy is to specify something like this in the
And also in the
When using this with the Google Life Sciences API, this gets interpreted as: try at most 1 attempt on a preemptible (much cheaper) instance; if that gets preempted, try again on a non-preemptible instance; if that fails, try again 2 more times, again only on non-preemptible instances. This was all working as expected on the old backend. So far, in my limited testing of GCP Batch, I have a few observations:
The reason I think this is that, when I query the Cromwell log like this:
I get something like this. In other words, the provisioning model is always reported as
From lurking on GitHub issues for Cromwell and Nextflow, it sounds like support for a mixed model of preemptible/non-preemptible instances in failure handling is perhaps half-baked, perhaps with changes to the GCP Batch API itself still being contemplated. This would be unfortunate from a cost perspective, but also from a testing perspective. Is there even a convenient way to do a test run with no use of preemptible instances? Every task in the WDL currently sets the above parameters. Does the setting in workflow_options.json override those task-specific settings? Or is that just a default if no specific retry behavior is specified in a task? If that is the case, we might need to modify the WDLs to move forward here. Next testing ideas: Note that according to the Cromwell docs:
If that doesn't work (which seems likely), then we must change every relevant WDL on the VM:
Note that when running these sed commands on a Mac I had to tweak them slightly:
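As an illustration of the macOS difference (BSD sed requires an explicit, possibly empty, backup suffix after -i, while GNU sed does not), and assuming the tasks set a literal runtime value such as preemptible: 1, the edit could look something like this; this is a sketch, not necessarily the exact commands used here:

```bash
# GNU sed (Linux VM): in-place edit of every WDL under the current directory.
find . -name "*.wdl" -exec sed -i 's/preemptible: *[0-9][0-9]*/preemptible: 0/' {} +

# BSD sed (macOS): -i requires a suffix argument; '' means "no backup file".
find . -name "*.wdl" -exec sed -i '' 's/preemptible: *[0-9][0-9]*/preemptible: 0/' {} +
```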
|
Note this change log: https://github.com/broadinstitute/cromwell/blob/develop/CHANGELOG.md#gcp-batch The latest release of Cromwell (v88, not actually available at this time?) appears to describe some updates relevant to preemption.
Not sure if that fixes the problem we have, or just makes the error messages clearer. It seems that maybe the
The last commit to that branch was Jul 14, 2023. If one wanted to experiment with building a .jar from the current develop branch, see the sketch below.
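A hedged sketch of that build, following the general Cromwell developer instructions (sbt assembly); the JDK/sbt setup and the checkout target are assumptions:

```bash
# Assumes a recent JDK and sbt are installed on the build machine.
git clone https://github.com/broadinstitute/cromwell.git
cd cromwell
git checkout develop     # or whichever branch/commit is being tested
sbt assembly             # builds a cromwell-<version>.jar under server/target/
```
|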
Using the method above to change all WDLs (only locally on the VM, for experimentation purposes) to use no preemptible instances resulted in the following apparent change in the Cromwell log output. Now |
Note: It is possible that the issue with handling preemption has been fixed and will be incorporated into the next release of Cromwell. Various reported issues mention GCP Batch and preemption, such as broadinstitute/cromwell#7407, with associated PRs that have been merged in the |
The current test got very far, but failed on the germline VEP task. I believe I saw this failure with the stable pipeline as well (I believe I had to increase disk space for this step). In this failure, I don't see any relevant error messages, and it seems like no second attempt was made... Perhaps running out of disk resulted in an unusual failure mode. But this could also indicate additional issues with the reattempt logic for GCP Batch not working as expected. I will increase disk space for VEP for this test and attempt a restart to see if the call caching is working. I think this step is not calculating its disk size needs aggressively enough anyway (we have seen this failure a few times now):
|
Call caching seems to be working at first glance. Though, some steps appear to be redoing work that was already done:
Note this is far from the first time I have observed Cromwell to redo work that was completed in a previous run. We have never dug too deep to understand these cases. In other words, this may not have anything to do with GCP Batch. |
The diff API endpoint is good for finding out why call caching didn't match the second time: https://cromwell.readthedocs.io/en/stable/api/RESTAPI/#explain-hashing-differences-for-2-calls If the inputs are somehow busting the cache, it could be a simple change to get it to work.
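For reference, a hedged example of calling that endpoint from the Cromwell head node; the server URL, workflow IDs, and call name below are placeholders, and the query parameter names follow the REST API documentation linked above:

```bash
CROMWELL_URL="http://localhost:8000"                  # Cromwell's default server port; adjust if needed
WORKFLOW_A="11111111-1111-1111-1111-111111111111"     # earlier run (placeholder ID)
WORKFLOW_B="22222222-2222-2222-2222-222222222222"     # restarted run (placeholder ID)
CALL="immuno.germline_vep"                            # fully qualified call name (placeholder)

# Ask Cromwell to explain the hashing differences between the two calls.
curl -s "${CROMWELL_URL}/api/workflows/v1/callcaching/diff?workflowA=${WORKFLOW_A}&callA=${CALL}&workflowB=${WORKFLOW_B}&callB=${CALL}" | jq .
```
|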
My first end-to-end test worked (with one restart) and I was able to pull the results down. The results superficially look good, but of course more formal testing and comparisons must be done. Other than having to turn off the use of preemptible nodes, the other thing that did not work was the
This gave the following Python error:
My first guess would be that something related to updating the Cromwell version we are using has changed the structure of the outputs being parsed by this code. At first glance, the .json metadata files appear to contain the relevant information. |
To facilitate further testing of GCP Batch and Cromwell 87, I have created a pre-release for cloud-workflows and analysis-wdls. If you check out the cloud-workflows pre-release, it will automatically take into account all the changes described above, including automatically cloning the analysis-wdls pre-release on the Cromwell head node. In other words, once you clone that pre-release, you can follow the usual procedure for performing a test run. analysis-wdls (v1.2.2-gcpbatch): I am currently using this version for a test with the hcc1395 full test data. |
This test completed smoothly, without any need to restart or any other issues. Conclusion for now: we can start using GCP Batch if we need/want to, with the following two known caveats:
|
I looked into the estimate billing issue. According to the output, the error is occurring at the line "assert is_run_task(task)". The function is_run_task is checking whether the "jes" key is in the associated JSON data. This key is present in the v1.1.4 JSON data, but is not present in the v1.2.1 JSON data. The function is therefore returning "False", which is causing the assertion error. This "jes" key contains the following entries:
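As a quick check, the presence or absence of that key can be confirmed with jq (file names are placeholders for the two metadata files being compared):

```bash
# Count JSON objects containing a "jes" key in each metadata file.
jq '[.. | objects | select(has("jes"))] | length' metadata_v1.1.4.json   # non-zero under the old backend
jq '[.. | objects | select(has("jes"))] | length' metadata_v1.2.1.json   # expected to be 0 now
```
|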
The current backend we are using with Cromwell on GCP is deprecated and will be turned off July 8, 2025. Google now recommends migrating to Google Batch:
https://cloud.google.com/life-sciences/docs/getting-support
Newer versions of Cromwell now support GCP Batch as a backend. Cromwell documentation on using Batch, including an example cromwell.conf file, can be found here:
https://cromwell.readthedocs.io/en/develop/tutorials/Batch101/
https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/GCPBATCH.conf
https://cromwell.readthedocs.io/en/develop/backends/GCPBatch/
https://github.com/broadinstitute/cromwell/blob/develop/CHANGELOG.md#gcp-batch
The version of Cromwell and the way the cromwell.conf file is created to specify the backend used are determined by these helper scripts and config files in this repo (these are used in our tutorial on how to run the pipeline on GCP):
In very basic terms:
resources.sh sets up the Google Cloud environment (buckets, network, etc.) and creates two Cromwell configuration related files, cromwell.conf and workflow_options.json, with some user specific parameters populated. These will be copied to specified locations on the Cromwell VM started on Google Cloud.
start.sh launches a VM and specifies that server_startup.py be run as part of the startup process. During this process, the specified version of Cromwell is installed and launched (using systemctl start cromwell).
manual-workflows/cromwell.service defines some parameters for how the Cromwell server is started, including the location of the Cromwell jar and cromwell.conf files.
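For completeness, a hedged example of interacting with that service on the Cromwell head node (standard systemd commands; the unit name cromwell matches the systemctl start cromwell call above):

```bash
# Check whether the Cromwell unit started cleanly after server_startup.py ran.
sudo systemctl status cromwell

# Follow the server log (where backend and permission errors like those above show up).
sudo journalctl -u cromwell -f
```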