Some helper scripts for interacting with the Cromwell server.

pull_outputs.py

pull_outputs.py will query a Cromwell server for the outputs associated with a workflow, and download them. For details on usage, see python3 scripts/pull_outputs.py --help.

Requirements to run this script:

  • able to reach the Cromwell server's endpoints
  • authenticated with Google
  • authorized to read files from the specified GCS bucket
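
For example, one way to satisfy the Google authentication and bucket-access requirements (a sketch; your gcloud setup may differ, and gs://your-bucket is a placeholder):

gcloud auth login
gcloud auth application-default login
gsutil ls gs://your-bucket/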

cloudize-workflow.py

cloudize-workflow.py will accept a workflow, its inputs, and a GCS bucket, and prepare a new inputs file to run that workflow on the cloud. The script assumes the workflow definition is cloud-ready, that all file parameters in the inputs file are available, and that you have access to upload to the GCS bucket. For details on usage, see python3 scripts/cloudize-workflow.py --help.

Requirements to run this script:

  • access to read all file paths specified in the workflow inputs
  • authenticated with Google
  • authorized to write files to the specified GCS bucket
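
As a hypothetical example of the general shape of a call (the argument names and order here are illustrative assumptions, not the authoritative interface; check --help for the real one):

python3 scripts/cloudize-workflow.py gs://your-bucket path/to/workflow.wdl path/to/inputs.yaml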

submit_workflow.sh

This script is the least user-ready of the bunch, but it's still available as needed. It essentially composes two steps: "zip the workflow dependencies" and "POST to the server with curl". The recommendation is not to use this script directly until a more user-friendly version exists, but to modify it, or extract from it, whatever you need to zip workflow dependencies for your case.

Its current iteration is expected to be used as follows:

Given the location of a workflow definition, it zips a pre-defined location for WDL workflows (ANALYSIS_WDLS), then curls the server with the workflow inputs, the newly generated zip, and a pre-defined WORKFLOW_OPTIONS file.
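
As a rough sketch of those two steps (the paths, CROMWELL_URL, and file names below are placeholders rather than the script's actual variables; the -F field names are Cromwell's standard /api/workflows/v1 submission fields):

# Zip the directory of WDL dependencies (placeholder path)
cd $ANALYSIS_WDLS && zip -r /tmp/workflow_deps.zip .

# Submit to Cromwell with the definition, inputs, dependencies zip,
# and options file attached as multipart form fields
curl -X POST "$CROMWELL_URL/api/workflows/v1" \
    -F workflowSource=@/path/to/workflow.wdl \
    -F workflowInputs=@/path/to/inputs.json \
    -F workflowDependencies=@/tmp/workflow_deps.zip \
    -F workflowOptions=@$WORKFLOW_OPTIONS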

estimate_billing.py

This script generates JSON estimating the cost of a workflow run. It takes both the root workflow_id to estimate and a path to a directory containing metadata JSON files named <workflow_id>.json.

Example call:

python3 estimate_billing.py $WORKFLOW_ID /local/path/to/metadata > costs.json

The above call expects local copies of the metadata JSON files at /local/path/to/metadata. If you don't have them locally but they're in a GCS bucket, you can use the GCS path instead. It'll be slower, but it still works.

python3 estimate_billing.py $WORKFLOW_ID gs://bucket/path/to/metadata > costs.json

Alternatively, just download them locally first.

gsutil cp -r gs://bucket/path/to/metadata /local/path/to/metadata

These metadata files are generated by persist_artifacts.py, which saves them both locally and in GCS. estimate_billing.py works with either local or GCS paths, but if you plan to run it multiple times you should probably download local copies first and run against those.

Estimation is done by crawling the metadata JSON files of the root workflow and any of its subworkflows to find the VM characteristics for each call. The cost of a task is roughly

(cost_vm_cpu + cost_vm_ram + cost_disks) * duration
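
For example, with purely illustrative prices (assumptions for this sketch, not actual Google Cloud rates): a task that ran 2 hours on a 4-vCPU, 16 GB machine with a 100 GB disk, at $0.03 per vCPU-hour, $0.004 per GB-hour of RAM, and $0.00005 per GB-hour of disk, would be estimated at (4 * 0.03 + 16 * 0.004 + 100 * 0.00005) * 2 ≈ $0.38.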

The script writes JSON to stdout, so you'll want to redirect with > costs.json or similar. Each workflow entry has its total cost, start/end times, duration, and the costs of its calls (tasks and subworkflows). Each task entry has its total cost, start/end times, duration, machine type, disks, and prices.

See the Google Cloud documentation for pricing information.

costs_json_to_csv.py

This is both a standalone script and a helper for estimate_billing.py.

Used standalone, it takes a JSON file of workflow costs generated by estimate_billing.py and writes a CSV-formatted version of the task costs within that workflow to stdout.

python3 costs_json_to_csv.py costs.json > costs.csv

This functionality is also wrapped into estimate_billing.py under the --csv flag.

python3 estimate_billing.py $WORKFLOW_ID /local/path/to/metadata --csv > costs.csv

I'd still run these separately just to have both, but if you're only after the CSV this may be more convenient.

cost_script.py

This script takes the output of costs_json_to_csv.py and collapses tasks that were split into shards, giving one cost for the entire task. It writes a CSV named costs_report_final.csv.

Use as follows:

python3 /opt/scripts/cost_script.py costs.csv

Troubleshooting scripts

Some scripts use Python's logging library for status updates. For those scripts, the logging level can be changed by setting the environment variable LOGLEVEL to one of DEBUG, INFO, WARNING, or ERROR, depending on your need. The default is INFO.
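
For example, to get debug-level output from one of these scripts (reusing the estimate_billing.py call from above):

LOGLEVEL=DEBUG python3 estimate_billing.py $WORKFLOW_ID /local/path/to/metadata > costs.json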