BRAVO Data Pipeline

Processing data to power the BRowse All Variants Online (BRAVO) API

1. Build, download, or install dependencies.
   1. Compile custom tools
   2. Install external tools
   3. Download external data
2. Collect the data to be processed into a convenient location.
3. Modify the Nextflow configs to match paths on your system or cluster.
4. Run the Nextflow workflows.

Input Data

Naming: The pipeline depends on the input cram file names having the sample ID as the first part of the filename. Specifically, the ID is expected to precede the first . so that a call to getSimpleName() yields the ID.
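As a rough illustration of the naming rule, the sketch below mimics in Python what getSimpleName() returns for a path: the file name up to the first dot. The function name `sample_id_from_cram` and the example file name are hypothetical and are not part of the pipeline.

```python
from pathlib import Path


def sample_id_from_cram(path: str) -> str:
    """Return the part of the file name that precedes the first '.'.

    This approximates Nextflow's getSimpleName(); it is an illustration,
    not code taken from the pipeline.
    """
    return Path(path).name.split(".", 1)[0]


# Hypothetical input file name; everything after the first '.' is ignored.
print(sample_id_from_cram("/data/crams/NWD123456.recab.cram"))  # NWD123456
```

A file named NWD123456_recab.cram would instead yield NWD123456_recab, so any suffix should be separated from the sample ID by a dot.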

Sequence Data

Source cram files. Original sequences from which the variant calls were made.

Variant calls

Source bcf files. Generated by running the TOPMed variant calling pipeline.

Data Preparation Tools

Compile Custom Tools

The tools/ directory contains the tools and scripts that prepare your data for import into the Mongo database and for use in the BRAVO browser.

cd tools/cpp_tools
cget install .

This builds the executables in tools/cpp_tools/cget/bin

External Tools

The required BamUtil, VEP, and Loftee tools are described in dependencies.md

External Data

The required Gencode, Ensembl, dbSNP, and HUGO data are described in basis_data.md

Nextflow Scripts

In the workflows/ directory are three Nextflow configs and scripts used to prepare the runtime data for the BRAVO API.

The steps of the pipeline are described in detail in data_prep_steps.md.

The three nextflow pipelines are:

Prepare VCF Teddy
Sequences
Coverage

Downstream data for BRAVO API

The make_vignette_dir.sh script consolidates the results from the nextflow scripts into a data directory organized for the BRAVO API. It is designed for small data sets, and should be run after the three data pipelines complete.

The BRAVO API needs two data sets to run:

Runtime Data are flat files on disk that are read at runtime.
Basis Data are files that are processed and loaded into MongoDB.

Downstream data subdirectory notes

data/
├── cache
├── coverage
│   ├── bin_1
│   ├── bin_25e-2
│   ├── bin_50e-2
│   ├── bin_75e-2
│   └── full
├── crams
│   ├── sequences
│   ├── variant_map.tsv.gz
│   └── variant_map.tsv.gz.tbi
└── reference
    ├── hs38DH.fa
    └── hs38DH.fa.fai
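
The tree above can be checked mechanically before pointing the API at the directory. The following is a minimal sketch, assuming the entry names shown in the tree are the ones expected; `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, not part of the pipeline or the API.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (assumed layout).
EXPECTED_ENTRIES = [
    "cache",
    "coverage/bin_1",
    "coverage/bin_25e-2",
    "coverage/bin_50e-2",
    "coverage/bin_75e-2",
    "coverage/full",
    "crams/sequences",
    "crams/variant_map.tsv.gz",
    "crams/variant_map.tsv.gz.tbi",
    "reference/hs38DH.fa",
    "reference/hs38DH.fa.fai",
]


def missing_entries(data_dir: str) -> list[str]:
    """Return the expected entries that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    # Report anything missing from a local data/ directory.
    for rel in missing_entries("data"):
        print(f"missing: {rel}")
```

The check only tests for existence; it does not verify file contents or index validity.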

reference/ holds the reference fasta files for the genome.
The API's SEQUENCE_DIR config value must be the directory that contains the 'sequences' directory (crams/ in the tree above).
- The sequences directory name is hardcoded.
- The variant_map.tsv.gz file name is hardcoded.
- The variant_map.tsv.gz.tbi file name is hardcoded.
Under sequences/, the directory structure and filenames are prescribed.
- All two-character hex directories, 00 to ff, should exist as subdirectories.
- cram files must have a filename of the exact form sample_id.cram
- The subdirectory a cram belongs in is named by the first two characters of the md5 hexdigest of the sample_id.
  - E.g. foobar123.cram would be in directory "ae"
```
import hashlib

# First two hex characters of the md5 digest of the sample ID
hashlib.md5("foobar123".encode()).hexdigest()[:2]
```
  - This directory structure is produced by the Nextflow pipeline.
The coverage/ directory contents are taken from the result/ directory of the coverage workflow.
variant_map.tsv.gz is an output of RandomHetHom3.
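
The placement rules for cram files under sequences/ can be sketched as follows. This is an illustration of the rules stated above, not the pipeline's own code; the function names `cram_subdir`, `cram_path`, and `all_hex_subdirs` are hypothetical.

```python
import hashlib
from pathlib import Path


def cram_subdir(sample_id: str) -> str:
    """First two hex characters of the md5 hexdigest of the sample ID."""
    return hashlib.md5(sample_id.encode()).hexdigest()[:2]


def cram_path(sequences_dir: str, sample_id: str) -> Path:
    """Expected location of a sample's cram file under sequences/."""
    return Path(sequences_dir) / cram_subdir(sample_id) / f"{sample_id}.cram"


def all_hex_subdirs() -> list[str]:
    """The 256 subdirectory names, 00 through ff, that should all exist."""
    return [f"{i:02x}" for i in range(256)]


if __name__ == "__main__":
    # Assumed location of sequences/, taken from the tree above.
    sequences = Path("data/crams/sequences")
    for name in all_hex_subdirs():
        (sequences / name).mkdir(parents=True, exist_ok=True)
    print(cram_path(str(sequences), "foobar123"))
```

Creating every subdirectory up front satisfies the requirement that all of 00 to ff exist even when a small data set leaves some of them empty.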
