Some helper scripts for interacting with the Cromwell server.
pull_outputs.py will query a Cromwell server for the outputs
associated with a workflow, and download them. For details on usage,
see python3 scripts/pull_outputs.py --help.
Requirements to run this script:
- able to reach the Cromwell server's endpoints
- authenticated by Google
- authorized to read files from specified GCS bucket
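For orientation, here is a minimal sketch of what pull_outputs.py does, assuming a Cromwell server reachable at a CROMWELL_URL of your choosing and gsutil available for the bucket reads. The function and its structure are illustrative, not the script's exact code; only the outputs endpoint is Cromwell's standard route.

```python
# Minimal sketch, not the real script: query Cromwell for a workflow's outputs
# and copy any gs:// paths locally with gsutil.
import subprocess
import requests

CROMWELL_URL = "http://localhost:8000"  # assumption: point this at your server

def pull_outputs(workflow_id, target_dir="."):
    # Cromwell's outputs endpoint returns {"id": ..., "outputs": {...}}
    resp = requests.get(f"{CROMWELL_URL}/api/workflows/v1/{workflow_id}/outputs")
    resp.raise_for_status()
    for name, value in resp.json()["outputs"].items():
        paths = value if isinstance(value, list) else [value]
        for path in paths:
            if isinstance(path, str) and path.startswith("gs://"):
                # relies on your Google auth for read access to the bucket
                subprocess.run(["gsutil", "cp", path, target_dir], check=True)
```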
cloudize-workflow.py will accept a workflow, its inputs, and a GCS
bucket, and prepare a new inputs file to run that workflow on the
cloud. The script assumes the workflow definition is cloud-ready, the
file parameters in the input file are all available, and you have
access to upload to a GCS bucket. For details on usage, see python3 scripts/cloudize-workflow.py --help.
Requirements to run this script:
- access to read all file paths specified in workflow inputs
- authenticated by Google
- authorized to write files to specified GCS bucket
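A rough sketch of the cloudize idea for cloudize-workflow.py, under the assumption that any string input value pointing at an existing local file should be uploaded and rewritten as a gs:// path. The real script is more careful than this, so treat it as illustration only.

```python
# Sketch only: upload local file inputs to GCS and write a new inputs file
# with gs:// paths substituted in.
import json
import os
import subprocess

def cloudize(bucket, inputs_path, output_path="cloud_inputs.json"):
    with open(inputs_path) as f:
        inputs = json.load(f)

    def transform(value):
        # assumption: every string that is an existing local file is an input file
        if isinstance(value, str) and os.path.isfile(value):
            dest = f"gs://{bucket}/{os.path.basename(value)}"
            subprocess.run(["gsutil", "cp", value, dest], check=True)
            return dest
        if isinstance(value, list):
            return [transform(v) for v in value]
        if isinstance(value, dict):
            return {k: transform(v) for k, v in value.items()}
        return value

    with open(output_path, "w") as f:
        json.dump(transform(inputs), f, indent=4)
```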
This script is the least user-ready of the set, but it's still available as needed. It essentially composes two steps: "zip the workflow dependencies" and "POST to the server with curl". The recommendation is not to use this script directly until a more user-friendly version is made, but to modify it, or extract from it, whatever is needed to zip workflow dependencies for your case.
Its current iteration is expected to be used as follows: given the location of a workflow definition, zip a pre-defined location for WDL workflows (ANALYSIS_WDLS), then curl the server with those inputs, the newly generated zip, and a pre-defined WORKFLOW_OPTIONS file.
This script generates a JSON file which estimates the cost of a
workflow run. It takes both a root workflow_id to estimate and a path to
a directory with a pile of JSON metadata files named <workflow_id>.json.
Example call:
python3 estimate_billing.py $WORKFLOW_ID /local/path/to/metadata > costs.json
The above call expects local copies of the metadata JSON files at
/local/path/to/metadata. If you don't have those and they're in a GCS
bucket, you can use the GCS path instead. It'll be slower, but it still
works.
python3 estimate_billing.py $WORKFLOW_ID gs://bucket/path/to/metadata > costs.json
Alternatively, just download them locally first.
gsutil cp -r gs://bucket/path/to/metadata /local/path/to/metadata
These metadata files are generated with persist_artifacts.py, which
saves them both locally and in GCS. estimate_billing.py works with
either local or GCS paths, but if you want to run it multiple times
you should probably get local copies first and run against those.
Estimation is done by crawling the metadata.json files of the root workflow, and any of its subworkflows, to find the VM characteristics for each call. Cost of a task is roughly
(cost_vm_cpu + cost_vm_ram + cost_disks) * duration
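As a concrete illustration of that formula, here is a hedged sketch of the per-task arithmetic. The per-unit prices below are placeholders, not Google's published rates; see the pricing links below for real numbers.

```python
# Illustration of (cost_vm_cpu + cost_vm_ram + cost_disks) * duration.
# The default prices are placeholder hourly rates, not actual GCP pricing.
def task_cost(cpus, ram_gb, disk_gb, duration_hours,
              cpu_price=0.03, ram_price=0.004, disk_price=0.0001):
    cost_vm_cpu = cpus * cpu_price      # price per vCPU-hour (placeholder)
    cost_vm_ram = ram_gb * ram_price    # price per GB-hour of RAM (placeholder)
    cost_disks = disk_gb * disk_price   # price per GB-hour of disk (placeholder)
    return (cost_vm_cpu + cost_vm_ram + cost_disks) * duration_hours

# e.g. a 4 CPU / 16 GB machine with a 100 GB disk running for 2.5 hours
print(task_cost(cpus=4, ram_gb=16, disk_gb=100, duration_hours=2.5))
```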
The script writes JSON to stdout, so you'll want to call it with > costs.json
or similar. Each workflow has its total cost, start/end
times, duration, and the costs of its calls (tasks + subworkflows). Each task
has its total cost, start/end times, duration, machine type, disks,
and prices.
See the Google docs for pricing information
- VM: https://cloud.google.com/compute/vm-instance-pricing
- disks: https://cloud.google.com/compute/disks-image-pricing#disk
This is both a top-level and a helper script for estimate_billing.py.
Used at the top level, it takes an input JSON file of workflow costs, generated by estimate_billing.py, and spits out a CSV-formatted version of the task costs within that workflow to stdout.
python3 costs_json_to_csv.py costs.json > costs.csv
This functionality is also wrapped into estimate_billing.py under the --csv flag.
python3 estimate_billing.py $WORKFLOW_ID /local/path/to/metadata --csv > costs.csv
I'd still run these separately just to have both, but if you're only after the CSV this may be more convenient.
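For reference, a sketch of the flattening step such a conversion performs, under the assumption that the costs JSON nests subworkflow calls under a "calls" key. The column names and fields here are illustrative, not the script's exact output format.

```python
# Sketch: walk a nested costs JSON and emit one CSV row per leaf task.
import csv
import json
import sys

def flatten_tasks(node, prefix=""):
    for name, call in node.get("calls", {}).items():
        full_name = f"{prefix}{name}"
        if "calls" in call:                      # a subworkflow: recurse into it
            yield from flatten_tasks(call, f"{full_name}.")
        else:                                    # a leaf task: emit a row
            yield {"task": full_name,
                   "cost": call.get("cost"),
                   "duration": call.get("duration"),
                   "machine_type": call.get("machine_type")}

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        costs = json.load(f)
    writer = csv.DictWriter(sys.stdout,
                            fieldnames=["task", "cost", "duration", "machine_type"])
    writer.writeheader()
    writer.writerows(flatten_tasks(costs))
```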
This script takes the output of costs_json_to_csv.py and collapses tasks that have been split into shards, giving one cost for the entire task. It outputs a CSV named costs_report_final.csv.
Use as follows:
python3 /opt/scripts/cost_script.py costs.tsv
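The collapsing step amounts to grouping rows by their task name with the shard index stripped and summing their costs. Here is a sketch under the assumptions that sharded rows differ only by a shard suffix and that the input has task and cost columns; both are assumptions for illustration, not the script's actual format.

```python
# Sketch: sum sharded task rows into one total per base task name.
import csv
import re
import sys
from collections import defaultdict

def collapse_shards(rows):
    totals = defaultdict(float)
    for row in rows:
        # assumption: shards are named like "align.shard-0", "align.shard-1", ...
        base_name = re.sub(r"\.shard-\d+$", "", row["task"])
        totals[base_name] += float(row["cost"] or 0)
    return totals

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        totals = collapse_shards(csv.DictReader(f))
    with open("costs_report_final.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["task", "total_cost"])
        writer.writerows(sorted(totals.items()))
```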
Some scripts use the logging library for status updates. The scripts
that do this can have their logging level changed by setting the
environment variable LOGLEVEL to one of DEBUG, INFO, WARNING, or
ERROR, depending on your need. The default value is INFO.
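The usual way this convention is wired up with the logging library looks roughly like the following; this shows the general pattern, not a copy of the scripts' exact setup.

```python
# General LOGLEVEL pattern: default to INFO unless the environment overrides it.
import logging
import os

logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))
logging.info("status updates show at INFO and above")
logging.debug("only visible when LOGLEVEL=DEBUG")
```

With that pattern, a run such as LOGLEVEL=DEBUG python3 scripts/pull_outputs.py --help would also show the debug-level messages.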