Fork from aws-samples repo: code referenced in the AWS blog post "Build a Genomics data lake on AWS".

c-BIG/aws-genomics-datalake

 
 

Build a Genomics Data Lake on AWS

This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".

ETL - contains the transformation scripts and the CloudFormation template that spins up the EMR cluster

EMRGenomics.py - Lambda function, triggered by the CloudFormation template, that creates the EMR cluster used to process VCFs.

EventEMRGenomics.py - Event trigger Lambda function

emr_config.json - JSON file with EMR configuration for this example. This file can be edited to change EMR configuration parameters.

vcfToParquetTransform.py - PySpark script that performs the VCF to Parquet transformation using the Hail API. It can be customized to perform any specific transformation steps required.

genomics_datalake_emr.template - CloudFormation template that you can deploy in your account to set up the solution.
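
As a rough illustration of how a Lambda function such as EMRGenomics.py might turn a JSON configuration like emr_config.json into an EMR cluster request, the sketch below builds the keyword arguments for boto3's `run_job_flow` call. The configuration keys (`cluster_name`, `release_label`, `instance_type`, `core_instance_count`), the example values, and the default role names are assumptions made for illustration. They are not the contents of the actual file, and the real script may be organized differently.

```python
import json

# Hypothetical configuration; the real emr_config.json may use different keys.
EXAMPLE_CONFIG = """
{
  "cluster_name": "genomics-etl",
  "release_label": "emr-5.29.0",
  "instance_type": "r5.4xlarge",
  "core_instance_count": 2
}
"""


def build_job_flow_request(config):
    """Map a flat config dict onto keyword arguments for EMR's run_job_flow."""
    return {
        "Name": config["cluster_name"],
        "ReleaseLabel": config["release_label"],
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": config["instance_type"],
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": config["instance_type"],
                    "InstanceCount": config["core_instance_count"],
                },
            ],
            # Let the cluster terminate once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Default EMR role names; an assumption, your account may differ.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(request):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_job_flow_request(json.loads(EXAMPLE_CONFIG))
print(request["Name"], len(request["Instances"]["InstanceGroups"]))
```

Keeping the request builder separate from the boto3 call means the mapping from configuration to cluster settings can be checked locally before anything is launched.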

1000Genomes.ipynb - Python notebook with sample queries
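
The VCF to Parquet step listed above can be sketched roughly as follows. This is not the repository's vcfToParquetTransform.py. It is a minimal sketch that assumes Hail and PySpark are available on the cluster, that only the variant-level (row) fields are exported, and that the VCF is block-gzipped and uses the GRCh38 reference genome. The function names and the column-renaming helper are hypothetical.

```python
def sanitize_column(name):
    """Replace dots from flattened struct fields (e.g. 'locus.contig')
    with underscores so the columns are easier to query."""
    return name.replace(".", "_")


def vcf_to_parquet(vcf_path, parquet_path, reference_genome="GRCh38"):
    """Read a VCF with Hail and write its variant-level fields as Parquet."""
    import hail as hl  # imported lazily; requires Hail and Spark

    hl.init()
    mt = hl.import_vcf(
        vcf_path,
        reference_genome=reference_genome,
        force_bgz=True,  # assumes a block-gzipped .vcf.gz input
    )
    # rows() keeps the per-variant fields (locus, alleles, info, ...)
    # and drops the per-sample genotype entries.
    df = mt.rows().to_spark(flatten=True)
    df = df.toDF(*[sanitize_column(c) for c in df.columns])
    df.write.mode("overwrite").parquet(parquet_path)


# The helper can be checked without a cluster.
print(sanitize_column("locus.contig"))
```

A call might look like `vcf_to_parquet("s3://my-bucket/input/sample.vcf.gz", "s3://my-bucket/output/variants/")`, where both paths are placeholders.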

For instructions on how to create the Glue Data Catalog tables for 1000 Genomes on the Registry of Open Data, see the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. That repo also has CloudFormation templates for ClinVar and gnomAD.
