Bunnies is a Python API for writing scalable and reproducible scientific workflows/pipelines. It shares many ideas with other data-driven pipeline frameworks such as Snakemake, Nextflow, and Luigi, but strives to achieve a far higher level of reproducibility. It is in the early stages of development, but it has already been used successfully to run bioinformatics pipelines on AWS.
Bunnies captures a snapshot of all the information involved, directly or indirectly, in the creation of a data file. This is necessary for reproducibility. It records the software versions (in the form of container images), scripts (git commits), data inputs (digests), and application-specific parameters (JSON config) involved at all stages of the pipeline.
Most existing frameworks are content with the definition of reproducibility above. The main objective of a "reproducible" pipeline is to allow multiple users of the pipeline to produce the same result (be it a file, a report, or a verdict).
We have found that in many cases this is insufficient. Bunnies aims to address several typical shortcomings of reproducible frameworks:
- **Detecting changes.** Pipelines that involve many core-years of computation can rarely be run from start to end in one go; there is an iterative aspect to their development. It is sometimes necessary to change parameters (or software) along the way. A good reproducible framework should detect a change and be able to determine precisely whether any existing previous result is affected by the change (and needs to be regenerated). Bunnies gives you that choice. Bunnies also allows multiple versions of the same data result to co-exist.
- **Data provenance.** At some point, the data generated by a pipeline inevitably leaves the framework: it is moved to a different service or storage, shared with other teams, or backed up. It then becomes important for anyone working with a data file to be able to determine exactly how it was produced. There shouldn't be any guesswork involved. This allows others, downstream, to use the data with confidence, and in our opinion it is the only way to achieve scientific results with fidelity.
- **Reusing existing results safely.** If a pipeline produces reproducible results, it must be possible for two people working on the same pipeline to share their partial results. Bunnies not only makes results reproducible, but also lets users take advantage of pre-existing results that satisfy exactly their own parameters. Objects generated by Bunnies have a predictable unique id, which allows storage caches to be used aggressively and safely. There is no need to re-compute an object that has previously been computed with the exact same set of parameters. Also, two different users can run pipelines with overlapping steps without stepping on each other's toes or corrupting each other's data.
- **Predicting costs.** A reproducible pipeline should be able to give you a hint about costs. Bunnies aims to record the time, resource usage, and costs associated with delivering all data objects. It can provide cost and time forecasts for obtaining a new result. It can also attribute a dollar value to each piece of data previously generated. Users can use this information to make an informed decision about whether to store an object long-term or regenerate it on demand.
Bunnies allows users to express data-driven pipelines as a graph assembled from plain Python3 objects. There is no new language to learn to use Bunnies if you already know Python. Conceptually, at a high level, you build a graph of dependencies using transformation objects, then make a call to compile a pipeline that will generate one or more objects of your choice in that graph. It works similarly to a Makefile (or SCons), except that Bunnies doesn't work at the granularity of files, but of objects. Each transformation can generate one or more files.
For each file generated by a transformation step, Bunnies fully captures the inputs, the software, and the software parameters that generated the file. This is recorded in a structured manifest document for each node of the graph.
One requirement for creating new Bunnies transformations is that it should be possible to determine a set of parameters that, when presented again to the transformation, will generate an equivalent output. The default list of parameters includes the docker image id, the inputs (and their parameters, recursively), and the parameters/flags for the transformation.
Changing parameters or inputs will cause the transformation to run again. Running with the same parameters, however, will reuse a previous result for that transformation, if one is available.
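For illustration, here is a minimal sketch of what such a graph might look like. The names used below (`bunnies.Transform`, `bunnies.S3Blob`, `bunnies.build_pipeline`, and all image ids and URLs) are hypothetical stand-ins rather than the actual Bunnies API; they only illustrate the ideas of transformation objects, inputs, and parameters described above.

```python
import bunnies  # names below are illustrative placeholders, not the real API

class AlignSample(bunnies.Transform):
    """One transformation node: aligns one set of reads against a reference."""

    def __init__(self, reference, reads, flags=None):
        # The identity of the output is derived from the container image,
        # the inputs (recursively), and the transformation parameters.
        super().__init__(
            name="align-sample",
            image="example.registry/aligner@sha256:...",   # pinned image id (placeholder)
            inputs={"reference": reference, "reads": reads},
            params={"flags": flags or []},
        )

# Build the dependency graph with plain Python objects.
reference = bunnies.S3Blob("s3://example-bucket/reference.fa")
reads = bunnies.S3Blob("s3://example-bucket/sample1.fastq.gz")
aligned = AlignSample(reference, reads, flags=["--very-sensitive"])

# Compile a pipeline that delivers the requested target(s) in the graph.
# Targets previously computed with the exact same parameters are reused.
pipeline = bunnies.build_pipeline(aligned)
pipeline.run()
```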
For each output generated, Bunnies records provenance:
- the collection of software that was used to generate the file. This takes the form of git commits and Docker image ids.
- the full list of parameters configuring the transformation.
- the input files which were fed to the transformation.
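As an illustration only (the exact manifest schema is not reproduced here), the provenance recorded for an output covers roughly this kind of information; all field names and values below are placeholders:

```python
# Hypothetical sketch of the provenance captured for one output; the field
# names and layout are illustrative, not the actual manifest schema.
manifest = {
    "transform": "align-sample",
    "software": {
        "docker_image": "example.registry/aligner@sha256:...",  # container image id
        "git_commit": "0123abcd...",                            # script version
    },
    "params": {"flags": ["--very-sensitive"]},                  # transformation parameters
    "inputs": {
        "reference": {"url": "s3://example-bucket/reference.fa", "digest": "..."},
        "reads": {"url": "s3://example-bucket/sample1.fastq.gz", "digest": "..."},
    },
}
```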
- create virtualenv:

  ```
  virtualenv -p python3 --prompt="(bunnies) " .venv
  ```

  Note: if you don't have virtualenv, you can install it first with `pip install virtualenv`, or use the builtin module in python3, i.e. `python3 -m venv .venv`

  Note: If you're on an older distribution with a different default python3 and/or you don't have root access to install packages, you can bootstrap its installation as a regular user with conda:

  ```
  ~/miniconda3/bin/conda create -n python36 python=3.6
  source ~/miniconda3/bin/activate python36

  # inside the conda environment, you have python3.6
  (python36) $ pip install --upgrade pip
  (python36) $ pip install virtualenv

  # create a python3.6 virtual env
  (python36) $ virtualenv --prompt="(bunnies) " -p python3.6 .venv

  # from this point on you no longer need the conda environment.
  # a copy of the python3.6 runtime was added to the .venv virtualenv folder
  source ~/miniconda3/bin/deactivate
  ```
- activate env:

  ```
  source .venv/bin/activate
  ```
- install python dependencies (includes awscli tools):

  ```
  # optional, but recommended before you install deps:
  pip install --upgrade pip

  # platform dependencies
  pip install -r requirements.txt
  ```
- Configure your AWS credentials. This is detailed elsewhere, but here's one way:

  ```
  mkdir -p ~/.aws
  ```

  Add a section in `~/.aws/config`:

  ```
  [profile reprod]
  region=us-west-2
  output=json
  ```

  Update your credentials file `~/.aws/credentials` (the section header syntax differs from the config file):

  ```
  [reprod]
  aws_access_key_id=AKIAIOSFODNN7EXAMPLE
  aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  ```

  It's a good idea to `chmod go-rw ~/.aws/credentials` too.

  Note: The region you pick here should be one where FARGATE is supported.

  While you're working on reprod, you may then wish to export the `AWS_PROFILE` environment variable to pick the desired account. If this is not customized, the aws cli tools will use the default account:

  ```
  export AWS_PROFILE=reprod
  ```
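  Once the profile is exported, you can verify that the credentials are picked up correctly, for instance by asking the AWS CLI who you are:

  ```
  aws sts get-caller-identity
  ```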
To get started, a few platform resources need to be created and configured in your AWS account.
- IAM roles and permissions for S3, EC2, ECS.
- S3 buckets
- EC2 VPCs
- API Gateway pipeline management endpoints.
More resources will be generated when the pipeline definitions are converted into AWS concepts:
- Lambdas
- ECS Tasks
- S3 Buckets to store temporary data
These resources are created using the scripts provided in `./scripts/`. FIXME: provide a more detailed description.
- `./scripts/setup-lambda.sh` creates roles with permissions for platform-created lambdas. Creates `./lambda-settings.json`.
- `./scripts/setup-network.sh` creates network configuration usable by tasks. Outputs the created ids in `./network-settings.json`.
- `./scripts/setup-tasks.sh` creates task configuration based on available tasks. It currently uses mostly hardcoded values, sufficient to drive the example. The created entities are saved in `cluster-settings.json`.
- You will need to create `./storage-settings.json` with the name of a bucket you intend to use as temporary storage. Example contents:

  ```
  {
    "storage": {
      "tmp_bucket": "reprod-temp-bucket",
      "build_bucket": "reprod-build-bucket"
    }
  }
  ```

- `./scripts/setup-key-pair.sh` creates the key pair that will be associated with the new instances. This is the key to use to ssh into the created VMs or containers. Outputs `./key-pair-settings.json` and the `key-pair.pem` private key.
- `python -m bunnies.environment setup` creates the Amazon roles and permissions necessary for scheduling instances and submitting jobs in the context of a compute environment.