The following tutorial explains how to set up a project that configures Google credentials and security settings for someone working at WASHU. It will only work for a user on the WASHU internet network (or VPN). However, it is designed to work from your personal computer and does not rely on compute1/storage1 access. The example analysis in this tutorial will download some protein sequences using `gget` and use `pVACbind` to perform neoantigen predictions on these sequences.
- google-cloud-sdk (for tools like `gcloud` and `gsutil`)
- git
- A Google user account associated with a WASHU billing account (which will use your WUSTL Key credentials)
If you are using Google resources for the first time you will need to have a Google account created.
Each lab generally has a Google Account that is connected to WASHU billing. Eric Suiter (WUIT) is the usual contact for getting this set up. Once a Google billing project is created and linked to a funding source (via a standing Purchase Order), users can be created.
Each user authorized in the Google Console for this project will be able to log into and use Google Cloud resources. The Google Project selected and user authentication will determine billing. WUIT can add new users. A lab manager or PI may also be granted permissions needed to create new users. Each user will log into the Google Console or authenticate the command line interface using their WUSTL Key email and password (and multi-factor authentication).
Bills will be issued monthly to the lab PI via "Burwood", a Google Cloud reseller with whom WASHU has a Business Agreement. The amounts of these bills should agree with what is seen in the Google Billing Console. Your lab's department or division purchasing administrator will generally help you approve these bills monthly and recharge the purchase order as needed.
The following environment variables are used mostly for convenience. Some should be customized to produce intuitive labeling for your own analysis; others should be left alone as indicated below:
# project name that must match your pre-configured Google Billing Project name
export GCS_PROJECT=griffith-lab
# variables that you should leave as stated here
export GCS_SERVICE_ACCOUNT=cromwell-server@$GCS_PROJECT.iam.gserviceaccount.com
export GCS_NETWORK=cloud-workflows
export GCS_SUBNET=cloud-workflows-default
# you might change to another valid Google compute zone, depending on your current location
export GCS_ZONE=us-central1-c
# variables that you can customize
export WORKING_BASE=~/gcp_adhoc
export GCS_BUCKET_NAME=griffith-lab-malachi-adhoc
export GCS_BUCKET_PATH=gs://griffith-lab-malachi-adhoc
export GCS_INSTANCE_NAME=malachi-adhoc
Note that the project name used above cannot be chosen arbitrarily. It must match a project created in the Google Console by WUIT and linked to a funding source.
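If you are unsure which project names your account can access, they can be listed from the command line (note that this requires the `gcloud` authentication performed below; the project IDs returned will be specific to your lab):

# list the Google projects your authenticated account can see
gcloud projects list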
The following directory on the local system will contain: (a) a git repository for this tutorial, (b) a git repository for tools that help provision and manage our workflow runs on the cloud, and (c) example data that we will download for this tutorial.
mkdir $WORKING_BASE
The following repositories contain this tutorial (`immuno_gcp_wdl`) and tools for provisioning and running workflows on the cloud (`cloud-workflows`). Note that the commands below will clone the main branch of each repo.
cd $WORKING_BASE
mkdir git
cd git
git clone [email protected]:griffithlab/immuno_gcp_wdl_compute1.git
git clone [email protected]:griffithlab/cloud-workflows.git
From the command line, you will need to authenticate your cloud access (using your WASHU Google Cloud account credentials). This generally only needs to be done once, though there is no harm in re-authenticating. The login command below will generate a custom URL to enter in your browser. Once you do this, you will be prompted to log into your Google account. If you have multiple Google accounts (e.g. one for your institution/company and a personal one), be sure to use the correct one. Once you have logged in, you will be presented with a long code. Enter this at the prompt generated by the login command below.
Finally, set your desired Google Project. This Project should correspond to a Google Billing account in the Google console.
If you are using Google Cloud for the first time, both billing and a project should be set up before proceeding. Configuring billing alerts is also probably wise at this point.
gcloud auth login
gcloud config set project $GCS_PROJECT
gcloud config list
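To confirm that authentication succeeded and that the expected account and project are active, a couple of read-only commands can be used as a quick sanity check:

# list credentialed accounts and show which one is active
gcloud auth list
# confirm the active project matches $GCS_PROJECT
gcloud config get-value project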
Run the following command and make note of the "Service Account" returned (e.g. "[email protected]").
This initialization step does several things: creates a few service accounts (`cromwell-server` and `cromwell-compute`), updates some user IAM permissions for these accounts, creates a network (`cloud-workflows`) and sub-network (`cloud-workflows-default`), and creates a Google bucket.
Note that the IP ranges ("CIDRs") specified below cover all the IP addresses that a user from WASHU might be coming from. A firewall rule is created based on these two ranges to limit access to only those users on the WASHU network. Details of this firewall rule will appear in the Google Cloud console under: VPC network -> Firewall -> `cloud-workflows-allow-ssh`.
Note that if any of these resources were already initialized (due to a previous run), the script will report errors; as long as the message says the resource already exists, this is fine.
cd $WORKING_BASE/git/cloud-workflows/manual-workflows/
bash resources.sh init-project --project $GCS_PROJECT --bucket $GCS_BUCKET_NAME --ip-range "128.252.0.0/16,65.254.96.0/19"
Note that much of the above is needed to run automated WDL workflows with Cromwell. We don't need that here, but we are taking advantage of the same setup functionality to make sure we have a Google bucket to store data and that the instance we set up will only be accessible from WASHU's network.
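If you would like to verify what the init script created, the following read-only commands should show the service accounts, network, firewall rule, and bucket described above (a quick sanity check, assuming the default resource names used in this tutorial):

# confirm the service accounts were created
gcloud iam service-accounts list --project $GCS_PROJECT
# confirm the network and firewall rule exist
gcloud compute networks list --project $GCS_PROJECT
gcloud compute firewall-rules describe cloud-workflows-allow-ssh --project $GCS_PROJECT
# confirm the Google bucket exists
gsutil ls -p $GCS_PROJECT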
You will probably want to customize a few of the settings below based on how much compute resource you need. For example, `e2-standard-8` will create a VM with 8 CPUs and 32 GB memory. To customize these settings you can add options like the following to your `gcloud compute` command below:
- `--machine-type=e2-standard-8`. Use an instance with 8 CPUs and 32 GB memory (default would be 8 GB for an `e2-standard-2` instance).
- `--boot-disk-size=250GB`. Increase the boot disk to 250 GB (default would be 10 GB).
- `--boot-disk-type=pd-ssd`. Use an SSD boot disk (default would be HDD).
For more options on configuration of the VM refer to: `gcloud compute instances create --help`. For more information on instance types and costs refer to the vm-instance-pricing guide.
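If it helps in choosing, the specifications of candidate machine types can be checked from the command line before creating the instance (a quick sketch; adjust the machine type and zone to match your choices):

# list e2-standard machine types available in the chosen zone
gcloud compute machine-types list --zones $GCS_ZONE --filter="name~e2-standard"
# show CPU and memory details for one machine type
gcloud compute machine-types describe e2-standard-8 --zone $GCS_ZONE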
Note on the Operating System: in the following example we will use this Google Image as a base: `ubuntu-2204-jammy-v20220712a` (image project `ubuntu-os-cloud`, image family `ubuntu-2204-lts`). To see a full list of available public images you can use: `gcloud compute images list`.
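Since the full image list is long, it can be filtered to the family used here (the filter value is just an example; adjust to whatever OS you prefer):

# list only images in the ubuntu-2204-lts family
gcloud compute images list --filter="family=ubuntu-2204-lts"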
gcloud compute instances create $GCS_INSTANCE_NAME --project $GCS_PROJECT \
--service-account=$GCS_SERVICE_ACCOUNT --scopes=cloud-platform \
--image-family ubuntu-2204-lts --image-project ubuntu-os-cloud \
--zone $GCS_ZONE --network=$GCS_NETWORK --subnet=$GCS_SUBNET \
--boot-disk-size=250GB --boot-disk-type=pd-ssd --machine-type=e2-standard-8
Note that you may see a warning about a "need to resize the root partition manually if the operating system does not support automatic resizing". This is not a concern; the operating system we are using here should do this automatic resizing for you.
To view the current list of instances associated with your Google Account: `gcloud compute instances list`.
In this step we will confirm that we can log into our Google VM with `gcloud compute ssh` and make sure it is ready for use. After logging in, use `journalctl` to see if the instance startup has completed.
gcloud compute ssh $GCS_INSTANCE_NAME --zone $GCS_ZONE
#confirm startup scripts have completed. Use <ctrl>-c to exit
journalctl -u google-startup-scripts -f
#check for expected disk space
df -h
#do some basic security updates (hit enter if prompted with any questions)
sudo apt update
sudo apt full-upgrade -y
sudo apt autoremove -y
sudo reboot
#wait a few seconds to allow reboot to complete and then login again
gcloud compute ssh $GCS_INSTANCE_NAME --zone $GCS_ZONE
If you know your analysis will not require running tasks within Docker, you can skip this step. However, the following steps take just a few minutes and will allow your user to run tools that are available as Docker images.
# set up repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# install docker engine
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
#add your user to the docker group so docker commands can run without sudo
sudo usermod -a -G docker $USER
sudo reboot
The reboot will kick you out of the instance. Log in again and test the Docker install:
gcloud compute ssh $GCS_INSTANCE_NAME --zone $GCS_ZONE
# test install
docker run hello-world
Install gget and update your PATH:
sudo apt install python3-pip -y
pip install --upgrade gget
echo "PATH=\$PATH:$HOME/.local/bin" >> ~/.bashrc
source ~/.bashrc
gget
Download a test protein sequence for `CTAG1B` (cancer/testis antigen 1B, aka CTAG; ESO1; CT6.1; CTAG1; LAGE-2; LAGE2B; NY-ESO-1). The MANE Select transcript for this gene is `ENST00000328435.3`. The protein ID for this transcript is `ENSP00000332602.2`.
mkdir -p $HOME/analysis/protein_seqs
cd $HOME/analysis/protein_seqs
gget seq --translate ENST00000328435 --out CTAG1B_aa.fa
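Before running predictions, it may be worth a quick sanity check of the downloaded FASTA file (the header should reference the transcript requested above):

# confirm the file contains a single FASTA record and inspect it
grep -c ">" CTAG1B_aa.fa
cat CTAG1B_aa.fa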
Start an interactive Docker session using the latest slim (no BLAST DB) version of the pvactools Docker image:
export WORKING_DIR=$HOME/analysis
docker pull griffithlab/pvactools:latest-slim
docker run -it -v $HOME/:$HOME/ --env WORKING_DIR griffithlab/pvactools:latest-slim /bin/bash
cd $WORKING_DIR
pvacbind run --help
mkdir -p $WORKING_DIR/pvacbind/
pvacbind run $WORKING_DIR/protein_seqs/CTAG1B_aa.fa sample1 HLA-A*02:01,HLA-A*01:06,HLA-B*08:02 all_class_i $WORKING_DIR/pvacbind/ -e1 9 --n-threads 2 --iedb-install-directory /opt/iedb/
#leave the interactive docker session
exit
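Back outside the container, the pVACbind results remain on the VM via the mounted home directory. The following is one way to take a quick look (a sketch; the exact report file names may vary by pVACtools version, so adjust based on what `find` shows):

# list the pVACbind output files written to the mounted directory
find $HOME/analysis/pvacbind/ -type f
# peek at the top of the main report (file name assumed for illustration)
head -n 5 $HOME/analysis/pvacbind/MHC_Class_I/sample1.all_epitopes.tsv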
The tool `gsutil` is used to perform all operations on Google storage. For example, to save our results and then confirm they are in the bucket, do the following:
# use the same bucket path that was set up near the beginning of this tutorial
export GCS_BUCKET_PATH=gs://griffith-lab-malachi-adhoc
cd $HOME
gsutil ls $GCS_BUCKET_PATH
gsutil cp -r analysis $GCS_BUCKET_PATH
gsutil ls $GCS_BUCKET_PATH/*
# leave the Google VM
exit
Once the analysis is done and results retrieved, destroy the Google VM on GCP to avoid wasting resources:
gcloud compute instances delete $GCS_INSTANCE_NAME --zone $GCS_ZONE
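To confirm the instance is gone (and that nothing else is left running and accruing charges), list your instances again:

# verify no instances remain for this project
gcloud compute instances list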
Note that any data you transferred to your Google bucket is still there (and will continue to accrue storage costs). You can use `gsutil` to transfer those results back to your local compute environment and delete the bucket when ready. You can also clean up storage using the Google Web Console.
For example:
cd $WORKING_BASE
gsutil cp -r $GCS_BUCKET_PATH/analysis .
find .
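Once the results are safely copied back, the bucket contents and then the bucket itself can be removed with `gsutil` (shown as a sketch; deletion is permanent, so only run this after confirming your local copy is complete):

# remove the copied analysis directory from the bucket
gsutil rm -r $GCS_BUCKET_PATH/analysis
# remove the now-empty bucket itself
gsutil rb $GCS_BUCKET_PATH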