Merge pull request #4 from ChristyPeterson/main
Update README.md and add CHANGELOG.md
Showing 2 changed files with 91 additions and 122 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# phac-nml/genomic_address_service: Changelog | ||
|
||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) | ||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). | ||
|
||
## v1.0dev - [date] | ||
|
||
Initial release of phac-nml/genomic_address_service | ||
|
||
### `Added` | ||
|
||
### `Fixed` | ||
|
||
Changed README format to standard DAAD README, added usage arguments. | ||
|
||
### `Dependencies` | ||
|
||
### `Deprecated` |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,66 +3,25 @@ | |
[![Conda](https://img.shields.io/conda/dn/bioconda/profile_dists?color=green)](https://anaconda.org/bioconda/genomic_address_service) | ||
[![License: Apache-2.0](https://img.shields.io/github/license/phac-nml/genomic_address_service)](https://www.apache.org/licenses/LICENSE-2.0) | ||
|
||
<img src="https://github.com/phac-nml/genomic_address_service/blob/main/logo.png?raw=true" width = "150" height="189"> | ||
|
||
## Genomic Address Service | ||
![alt text](https://github.com/phac-nml/genomic_address_service/blob/main/logo.png?raw=true) | ||
|
||
## Contents | ||
|
||
- [Introduction](#introduction) | ||
- [Installation](#installation) | ||
- [Usage](#usage) | ||
- [Quick Start](#quick-start) | ||
- [FAQ](#faq) | ||
- [Citation](#citation) | ||
- [Legal](#legal) | ||
- [Contact](#contact) | ||
|
||
## Introduction | ||
|
||
Surveillance and outbreak analysis of pathogens has been operationalized by multiple public health laboratories around | ||
the world using gene-by-gene, SNP and k-mer approaches to produce estimates of genetic distance between sets of samples. | ||
Standard phylogenetic or hierarchical clustering approaches group samples based on distances but need to be recalculated | ||
every time a new sample is added. The lack of repeatable genetic units or nomenclature between runs of clustering makes | ||
communication between different groups difficult and slow due to the poor scaling of these approaches to larger | ||
sample sizes. Enterobase implements [HierCC](https://github.com/zheminzhou/pHierCC) as an algorithm approach to assign | ||
new cgMLST profiles into existing multi-level cluster nomenclature from a minimum spanning tree, using single linkage in real time, | ||
along with some tools for evaluating the clusters when run in de novo mode. [SnapperDB](https://github.com/ukhsa-collaboration/snapperdb) | ||
used by the UKHSA utilizes single linkage clustering based on SNPs to produce a multi-level cluster nomenclature for outbreak and surveillance activities. | ||
[ReporTree](https://github.com/insapathogenomics/ReporTreer) will perform de novo clustering based on [SciPy](https://scipy.org/) | ||
linkage methods from a sequence alignment, VCF, allele profile, or distance matrix and so provides significant flexibility in inputs | ||
compared to the other two methods. Like Enterobase, ReporTree also provides tools for evaluating regions of cluster stability and | ||
can maintain cluster nomenclature between runs with the caveat that ReporTree has the potential to split and merge previous cluster assignments. | ||
This poses a significant issue in terms of Public Health Reporting and it is desirable to maintain existing cluster designations with | ||
occasional updates to the cluster assignments which must then be reported to all clients and partners to make the changes in | ||
their databases as well. HierCC can be run as a standalone piece of software and applied to new cgMLST schemes and organisms readily, whereas | ||
SnapperDB is a relatively complex pipeline where establishing databases for new organisms is rather complex. HierCC could | ||
be incorporated into other bioinformatics workflows relatively easily; SnapperDB, however, cannot. | ||
|
||
<br><br> | ||
While HierCC provides customization and flexibility in terms of thresholds and schemes, it is limited to single-linkage | ||
clustering which is known to have issues with stability and be prone to producing "scraggly" clusters due to chaining. | ||
ReporTree provides customization and reporting functionality which can be of use to public health labs; however, the | ||
potential for changing cluster assignments poses issues for distributed Public Health surveillance and outbreak activities | ||
due to the potential for partners becoming out of sync with each other. Within our public health partners, there is a need | ||
for a clustering service which can perform de novo clustering based on average, complete, and single linkage which can | ||
then be partitioned into clusters based on multiple thresholds. Additionally, there is a need to assign new samples into | ||
an existing clustering to provide stable nomenclature for communication between different partners and stakeholders. Any | ||
changes to the clustering assignments need to be made purposefully, with changes clearly communicated to partners and stakeholders. | ||
<br><br> | ||
To address needs of users within our team we have designed an integrated solution for calculating distance matrices and querying of genetically | ||
similar samples within a defined threshold to support outbreak and surveillance activities. We provide the flexibility to have standard text | ||
based outputs as well as parquet. It is implemented in pure Python and currently is only available in a single-threaded version but later | ||
refinements may include the support for multiprocessing. To facilitate integration of the tool into larger workflows it will also be implemented | ||
as a Nextflow workflow. | ||
|
||
As datasets grow, reading, parsing and writing text-based formats such as CSV and TSV accounts for a significant amount | ||
of runtime. Newer formats such as [parquet](https://parquet.apache.org/) support compression and are optimized for | ||
efficiency of storage and retrieval of data. | ||
|
||
|
||
|
||
## Installation | ||
# Genomic Address Service | ||
# Introduction | ||
|
||
Surveillance and outbreak analysis of pathogens has been operationalized by multiple public health laboratories around the world using gene-by-gene, SNP and k-mer approaches to produce estimates of genetic distance between sets of samples. Standard phylogenetic or hierarchical clustering approaches group samples based on distances but need to be recalculated each time a new sample is added. The lack of repeatable genetic units or nomenclature between runs of clustering makes communication between different groups difficult, especially with the poor scaling of these approaches to larger sample sizes. | ||
|
||
A number of different software packages and pipelines have been published to address the issues mentioned above. Firstly, Enterobase implements [HierCC](https://github.com/zheminzhou/pHierCC) as an algorithm approach to assign new cgMLST profiles into existing multi-level cluster nomenclature from a minimum spanning tree, using single linkage in real time, along with some tools for evaluating the clusters when run in de novo mode. [SnapperDB](https://github.com/ukhsa-collaboration/snapperdb) used by the UKHSA utilizes single linkage clustering based on SNPs to produce a multi-level cluster nomenclature for outbreak and surveillance activities. [ReporTree](https://github.com/insapathogenomics/ReporTreer) will perform de novo clustering based on [SciPy](https://scipy.org/) linkage methods from a sequence alignment, VCF, allele profile, or distance matrix and so provides significant flexibility in inputs compared to the other two methods. Like Enterobase, ReporTree also provides tools for evaluating regions of cluster stability and can maintain cluster nomenclature between runs with the caveat that ReporTree can split and merge previous cluster assignments, a significant issue for Public Health where it is desirable to maintain existing cluster designations. HierCC can be run as a standalone piece of software and applied to new cgMLST schemes and organisms readily as well as incorporated into other bioinformatic pipelines. Conversely, SnapperDB is a relatively complex pipeline where establishing databases for new organisms is rather complex, which makes it difficult to incorporate this tool into other workflows. While HierCC provides customization and flexibility in terms of thresholds and schemes, it is limited to single-linkage clustering which is known to have issues with stability and be prone to producing "scraggly" clusters due to chaining. 
ReporTree provides customization and reporting functionality which can be of use to public health labs; however, the potential for changing cluster assignments poses issues for distributed Public Health surveillance and outbreak activities due to the potential for partners becoming out of sync with each other. | ||
|
||
Within our public health partners, there is a need for a clustering service which can perform de novo clustering based on average, complete, and single linkage that can then be partitioned into clusters based on multiple thresholds. Additionally, there is a need to assign new samples into an existing clustering to provide stable nomenclature for communication between different partners and stakeholders. To address needs of users within our team we have designed an integrated solution for calculating distance matrices and querying genetically similar samples, within a defined threshold, to support outbreak and surveillance activities. We provide the flexibility to have standard text-based outputs and have included [parquet](https://parquet.apache.org/) format for high-throughput needs. It is implemented in pure Python and currently is only available in a single-threaded version but later refinements may include the support for multiprocessing. To facilitate integration of the tool into larger workflows it will also be implemented as a Nextflow workflow. | ||
|
||
## Citation | ||
|
||
Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. [https://github.com/phac-nml/genomic_address_service](https://github.com/phac-nml/genomic_address_service) | ||
## Contact | ||
|
||
For any questions, issues or comments please make a GitHub issue or reach out to [**James Robertson**]([email protected]). | ||
|
||
# Install | ||
|
||
Install the latest released version from conda: | ||
|
||
|
@@ -75,68 +34,66 @@ Install using pip: | |
Install the latest master branch version directly from Github: | ||
|
||
pip install git+https://github.com/phac-nml/genomic_address_service.git | ||
### Compatibility | ||
List out Dependencies and/or packages as appropriate | ||
|
||
|
||
|
||
# Getting Started | ||
## Usage | ||
If you run ``gas``, you should see the following usage statement: | ||
|
||
Usage: gas <command> [options] <required arguments> | ||
|
||
To get minimal usage for a command use: | ||
gas command | ||
`gas command args` | ||
|
||
To get full help for a command use one of: | ||
gas command -h | ||
gas command --help | ||
### Commands | ||
There are three commands that gas uses: | ||
|
||
1. **mcluster** - de novo nested multi-level clustering | ||
2. **call** - call genomic address based on existing clusterings | ||
3. **test** - test functionality on a small dataset | ||
|
||
Available commands: | ||
### Args | ||
There are a number of arguments that are specific for each command. They can be found directly by adding `--help` after each command. The following are common arguments: | ||
|
||
mcluster De novo nested multi-level clustering | ||
call Call genomic address based on existing clusterings | ||
test Test functionality on a small dataset | ||
- `-o`, `--outdir` - output directory to put cluster results | ||
- `-m`, `--method` - cluster method [single, complete, average (default)] | ||
- `-t`, `--thresholds` - comma-delimited thresholds (e.g. `10,5,0`); the resulting level columns will be treated in sequential order | ||
- `-V`, `--version` - print installed tool version | ||
- `-f`, `--force` - overwrite existing out directory | ||
|
||
Supported distance matrix formats | ||
===== | ||
**Square** | ||
#### mcluster specific args | ||
- `-i`, `--matrix` - TSV formatted distance matrix or parquet | ||
- `-d`, `--delimiter` - delimiter desired for nomenclature code [default="."] | ||
|
||
| id | S1 | S2 | S3 | S4 | S5 | S6 | | ||
| ----------- | ----------- |----------- | ----------- | ----------- |----------- | ----------- | | ||
| S1 | 0 | 0 | 3 | 3 | 9 | 9 | | ||
| S2 | 0 | 0 | 3 | 3 | 9 | 9 | | ||
| S3 | 3 | 3 | 0 | 0 | 9 | 9 | | ||
| S4 | 3 | 3 | 0 | 0 | 9 | 9 | | ||
| S5 | 9 | 9 | 9 | 9 | 0 | 0 | | ||
| S6 | 9 | 9 | 9 | 9 | 0 | 0 | | ||
#### call specific args | ||
- `-d`, `--dists` - a 3 column file [query_id, ref_id, dist] in TSV or parquet format | ||
- `-r`, `--rclusters` - existing cluster file in TSV or parquet format | ||
- `-j`, `--thresh_map` - JSON file of [colname:threshold] | ||
- `-u`, `--outfmt` - output format for assignments [text (default), parquet] | ||
- `-l`, `--delimiter` - delimiter desired for nomenclature code [default="."] | ||
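
As a rough illustration of the two `call` inputs listed above, the sketch below flattens a small square matrix into `[query_id, ref_id, dist]` rows and builds a threshold map. The helper names, the header row in the TSV, and the exact JSON layout (level column name mapped to threshold) are assumptions made for illustration; check `gas call --help` for the formats the tool actually expects.

```python
import json


def square_to_pairs(ids, matrix):
    """Flatten a square distance matrix into (query_id, ref_id, dist) rows.

    Hypothetical helper: every off-diagonal pair is emitted once per direction.
    """
    rows = []
    for i, query_id in enumerate(ids):
        for j, ref_id in enumerate(ids):
            if i != j:
                rows.append((query_id, ref_id, matrix[i][j]))
    return rows


def pairs_to_tsv(rows, header=("query_id", "ref_id", "dist")):
    """Render rows as TSV text; the header row is an assumption."""
    lines = ["\t".join(header)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


ids = ["S1", "S2", "S3"]
matrix = [[0, 0, 3], [0, 0, 3], [3, 3, 0]]
pairs = square_to_pairs(ids, matrix)
tsv_text = pairs_to_tsv(pairs)

# Assumed layout for the threshold map: level column name -> threshold.
thresh_map = {"level_1": 10, "level_2": 5, "level_3": 0}
thresh_map_json = json.dumps(thresh_map, indent=2)
```

In practice the query rows would normally hold new samples and the reference rows samples that already have cluster assignments; the all-against-all layout here only keeps the example short.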
|
||
- Distance matrix units can be of float or integer type, with the constraint that the diagonal must be 0 and the first | ||
line must be a header with all of the samples | ||
|
||
Quick start | ||
===== | ||
**De novo Multi-level Clustering** | ||
## Configuration and Settings: | ||
Thresholds must be configured when using GAS. These thresholds must be determined manually through testing and establishment of practical criteria for each pathogen of interest. | ||
|
||
Mcluster minimally accepts as input a distance matrix, output directory, and a set of thresholds delimited by a comma. | ||
For instance, PulseNet Canada has settled on '10,5,0' as the thresholds of choice for its pathogen surveillance program. [Publication ongoing] | ||
|
||
<br /> | ||
## Data Input/formats | ||
|
||
gas mcluster --matrix distance.matrix.text --outdir results --thresholds 10,5,0 | ||
### Square distance matrix | ||
|
||
This will produce a cluster file with the following header [id, nomenclature, level_1, level_2, level_3] | ||
| id | S1 | S2 | S3 | S4 | S5 | S6 | | ||
| --- | --- | --- | --- | --- | --- | --- | | ||
| S1 | 0 | 0 | 3 | 3 | 9 | 9 | | ||
| S2 | 0 | 0 | 3 | 3 | 9 | 9 | | ||
| S3 | 3 | 3 | 0 | 0 | 9 | 9 | | ||
| S4 | 3 | 3 | 0 | 0 | 9 | 9 | | ||
| S5 | 9 | 9 | 9 | 9 | 0 | 0 | | ||
| S6 | 9 | 9 | 9 | 9 | 0 | 0 | | ||
|
||
<br /> | ||
- Distance matrix units can be of float or integer type, with the constraint that the diagonal must be 0 and the first line must be a header with all of the samples | ||
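
The constraints above can be checked before running `gas mcluster`. The following is a minimal sketch using only the standard library; `read_square_matrix` is a hypothetical helper, not part of GAS, and the requirement that row labels follow the header order is an added assumption.

```python
import csv
import io


def read_square_matrix(handle, delimiter="\t"):
    """Parse a square distance matrix and check the documented constraints."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    ids = header[1:]  # first header cell is the "id" column label
    row_ids, matrix = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        row_ids.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    # Assumption: rows appear in the same order as the header columns.
    if row_ids != ids:
        raise ValueError("row labels must match the header sample order")
    for i, values in enumerate(matrix):
        if len(values) != len(ids):
            raise ValueError(f"row {ids[i]} has {len(values)} values, expected {len(ids)}")
        if values[i] != 0:
            raise ValueError(f"diagonal entry for {ids[i]} must be 0")
    return ids, matrix


example = "id\tS1\tS2\tS3\nS1\t0\t0\t3\nS2\t0\t0\t3\nS3\t3\t3\t0\n"
ids, matrix = read_square_matrix(io.StringIO(example))
```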
|
||
| id | nomenclature | level_1 | level_2 | level_3 | | ||
| ----------- | ----------- | ----------- | ----------- | ----------- | | ||
| S1 | 1.1.1 | 1 | 1 | 1 | | ||
| S2 | 1.1.1 | 1 | 1 | 1 | | ||
| S3 | 1.1.2 | 1 | 1 | 2 | | ||
| S4 | 1.1.2 | 1 | 1 | 2 | | ||
| S5 | 1.2.3 | 1 | 2 | 3 | | ||
| S6 | 1.2.3 | 1 | 2 | 3 | | ||
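
To see how the thresholds `10,5,0` could give rise to addresses like those in the table above, here is a small sketch that partitions the example matrix at each threshold and joins the per-level labels. It is not the GAS implementation: it uses single linkage only (connected components of pairs with distance at or below the threshold), whereas the tool's documented default method is average, and the function names and the labelling by order of first appearance are assumptions.

```python
def single_linkage_labels(matrix, threshold):
    """Label samples by connected components of pairs with dist <= threshold."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] <= threshold:
                parent[find(j)] = find(i)

    labels, seen = [], {}
    for i in range(n):
        root = find(i)
        if root not in seen:
            seen[root] = len(seen) + 1  # number clusters by first appearance
        labels.append(seen[root])
    return labels


def genomic_addresses(ids, matrix, thresholds, delimiter="."):
    """Join one cluster label per threshold into a delimited address."""
    levels = [single_linkage_labels(matrix, t) for t in thresholds]
    return {
        sample: delimiter.join(str(level[i]) for level in levels)
        for i, sample in enumerate(ids)
    }


ids = ["S1", "S2", "S3", "S4", "S5", "S6"]
matrix = [
    [0, 0, 3, 3, 9, 9],
    [0, 0, 3, 3, 9, 9],
    [3, 3, 0, 0, 9, 9],
    [3, 3, 0, 0, 9, 9],
    [9, 9, 9, 9, 0, 0],
    [9, 9, 9, 9, 0, 0],
]
addresses = genomic_addresses(ids, matrix, thresholds=[10, 5, 0])
```

For this matrix the sketch yields `1.1.1` for S1 and S2, `1.1.2` for S3 and S4, and `1.2.3` for S5 and S6, matching the table.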
|
||
**Outputs:** | ||
### Apache parquet | ||
- More information on the open-source column-oriented data format can be found [here](https://parquet.apache.org/). | ||
## Output/Results | ||
|
||
``` | ||
{Output folder name} | ||
|
@@ -147,24 +104,20 @@ This will produce a cluster file with the following header [id, nomenclature, le | |
└── run.json - Contains logging information for the run including parameters, newick tree, and threshold mapping info | ||
``` | ||
|
||
**Cluster assignment** | ||
|
||
Coming soon | ||
|
||
|
||
## Benchmarks | ||
|
||
Coming soon | ||
|
||
## FAQ | ||
# Troubleshooting and FAQs: | ||
|
||
Coming soon | ||
1. Mcluster fails due to missing scipy, with the following error: | ||
``` | ||
import scipy | ||
ModuleNotFoundError: No module named 'scipy' | ||
``` | ||
- This dependency is currently missing in the pip install. Use the following command to install scipy separately: `pip install scipy` | ||
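
A quick way to confirm whether the dependency is present before running `mcluster` is to ask Python whether the module can be found. This is a minimal sketch; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if not module_available("scipy"):
    print("scipy is not installed; run: pip install scipy")
```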
|
||
## Citation | ||
# Benchmarking | ||
|
||
Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. https://github.com/phac-nml/genomic_address_service | ||
Coming soon. | ||
|
||
## Legal | ||
# Legal and Compliance Information: | ||
|
||
Copyright Government of Canada 2023 | ||
|
||
|
@@ -181,7 +134,5 @@ under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR | |
CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
|
||
|
||
## Contact | ||
|
||
**James Robertson**: [email protected] | ||
# Updates and Release Notes: | ||
Please see the `CHANGELOG.md` |