Merge pull request #4 from ChristyPeterson/main
Update README.md and add CHANGELOG.md
ChristyPeterson authored Mar 20, 2024
2 parents 9a3ee05 + e1ca0b3 commit ffca4ed
Showing 2 changed files with 91 additions and 122 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,18 @@
# phac-nml/genomic_address_service: Changelog

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.0dev - [date]

Initial release of phac-nml/genomic_address_service

### `Added`

### `Fixed`

Changed README format to standard DAAD README, added usage arguments.

### `Dependencies`

### `Deprecated`
195 changes: 73 additions & 122 deletions README.md
@@ -3,66 +3,25 @@
[![Conda](https://img.shields.io/conda/dn/bioconda/profile_dists?color=green)](https://anaconda.org/bioconda/genomic_address_service)
[![License: Apache-2.0](https://img.shields.io/github/license/phac-nml/genomic_address_service)](https://www.apache.org/licenses/LICENSE-2.0)

<img src="https://github.com/phac-nml/genomic_address_service/blob/main/logo.png?raw=true" width = "150" height="189">

## Genomic Address Service
![alt text](https://github.com/phac-nml/genomic_address_service/blob/main/logo.png?raw=true)

## Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Usage](#usage)
- [Quick Start](#quick-start)
- [FAQ](#faq)
- [Citation](#citation)
- [Legal](#legal)
- [Contact](#contact)

## Introduction

Surveillance and outbreak analysis of pathogens has been operationalized by multiple public health laboratories around
the world using gene-by-gene, SNP and k-mer approaches to produce estimates of genetic distance between sets of samples.
Standard phylogenetic or hierarchical clustering approaches group samples based on distances but need to be recalculated
every time a new sample is added. The lack of repeatable genetic units or nomenclature between runs of clustering makes
communication between different groups difficult and slow due to the poor scaling of these approaches to larger
sample sizes. Enterobase implements [HierCC](https://github.com/zheminzhou/pHierCC) as an algorithm approach to assign
new cgMLST profiles into existing multi-level cluster nomenclature from a minimum spanning tree, using single linkage in real time,
along with some tools for evaluating the clusters when run in de novo mode. [SnapperDB](https://github.com/ukhsa-collaboration/snapperdb)
used by the UKHSA utilizes single linkage clustering based on SNPs to produce a multi-level cluster nomenclature for outbreak and surveillance activities.
[ReporTree](https://github.com/insapathogenomics/ReporTreer) will perform de novo clustering based on [SciPy](https://scipy.org/)
linkage methods from a sequence alignment, VCF, allele profile, or distance matrix and so provides significant flexibility in inputs
compared to the other two methods. Like Enterobase, ReporTree also provides tools for evaluating regions of cluster stability and
can maintain cluster nomenclature between runs with the caveat that ReporTree has the potential to split and merge previous cluster assignments.
This poses a significant issue in terms of Public Health Reporting and it is desirable to maintain existing cluster designations with
occasional updates to the cluster assignments which must then be reported to all clients and partners to make the changes in
their databases as well. HierCC can be run as a standalone piece of software and applied to new cgMLST schemes and organisms readily, whereas
SnapperDB is a relatively complex pipeline where establishing databases for new organisms is rather complex. HierCC could
be incorporated into other bioinformatics workflows relatively easily; SnapperDB, however, cannot.

<br><br>
While HierCC provides customization and flexibility in terms of thresholds and schemes, it is limited to single-linkage
clustering which is known to have issues with stability and be prone to producing "scraggly" clusters due to chaining.
ReporTree provides customization and reporting functionality which can be of use to public health labs, however the
potential for changing cluster assignments poses issues for distributed Public Health surveillance and outbreak activities
due to the potential for becoming out of sync with each other. Within our public health partners, there is a need
for a clustering service which can perform de novo clustering based on average, complete, and single linkage which can
then be partitioned into clusters based on multiple thresholds. Additionally, there is a need to assign new samples into
an existing clustering to provide stable nomenclature for communication between different partners and stakeholders. Any
changes to the clustering assignments need to be made purposefully, with changes clearly communicated to partners and stakeholders.
<br><br>
To address the needs of users within our team we have designed an integrated solution for calculating distance matrices and querying
genetically similar samples within a defined threshold to support outbreak and surveillance activities. We provide the flexibility to have standard text
based outputs as well as parquet. It is implemented in pure Python and is currently only available in a single-threaded version, but later
refinements may include support for multiprocessing. To facilitate integration of the tool into larger workflows it will also be implemented
as a Nextflow workflow.

As datasets grow, reading, parsing and writing text-based formats such as CSV and TSV accounts for a significant amount
of runtime. Newer formats such as [parquet](https://parquet.apache.org/) support compression and are optimized for
efficient storage and retrieval of data.



## Installation
# Genomic Address Service
# Introduction

Surveillance and outbreak analysis of pathogens has been operationalized by multiple public health laboratories around the world using gene-by-gene, SNP and k-mer approaches to produce estimates of genetic distance between sets of samples. Standard phylogenetic or hierarchical clustering approaches group samples based on distances but need to be recalculated each time a new sample is added. The lack of repeatable genetic units or nomenclature between runs of clustering makes communication between different groups difficult, especially with the poor scaling of these approaches to larger sample sizes.

A number of different software packages and pipelines have been published to address the issues mentioned above. Firstly, Enterobase implements [HierCC](https://github.com/zheminzhou/pHierCC) as an algorithm approach to assign new cgMLST profiles into existing multi-level cluster nomenclature from a minimum spanning tree, using single linkage in real time, along with some tools for evaluating the clusters when run in de novo mode. [SnapperDB](https://github.com/ukhsa-collaboration/snapperdb), used by the UKHSA, utilizes single linkage clustering based on SNPs to produce a multi-level cluster nomenclature for outbreak and surveillance activities. [ReporTree](https://github.com/insapathogenomics/ReporTreer) will perform de novo clustering based on [SciPy](https://scipy.org/) linkage methods from a sequence alignment, VCF, allele profile, or distance matrix and so provides significant flexibility in inputs compared to the other two methods. Like Enterobase, ReporTree also provides tools for evaluating regions of cluster stability and can maintain cluster nomenclature between runs with the caveat that ReporTree can split and merge previous cluster assignments, a significant issue for Public Health where it is desirable to maintain existing cluster designations. HierCC can be run as a standalone piece of software and applied to new cgMLST schemes and organisms readily as well as incorporated into other bioinformatic pipelines. Conversely, SnapperDB is a relatively complex pipeline in which establishing databases for new organisms is difficult, which makes it impractical to incorporate this tool into other workflows. While HierCC provides customization and flexibility in terms of thresholds and schemes, it is limited to single-linkage clustering which is known to have issues with stability and be prone to producing "scraggly" clusters due to chaining.
ReporTree provides customization and reporting functionality which can be of use to public health labs, however the potential for changing cluster assignments poses issues for distributed Public Health surveillance and outbreak activities due to the potential for becoming out of sync with each other.

Within our public health partners, there is a need for a clustering service which can perform de novo clustering based on average, complete, and single linkage that can then be partitioned into clusters based on multiple thresholds. Additionally, there is a need to assign new samples into an existing clustering to provide stable nomenclature for communication between different partners and stakeholders. To address the needs of users within our team we have designed an integrated solution for calculating distance matrices and querying genetically similar samples, within a defined threshold, to support outbreak and surveillance activities. We provide the flexibility to have standard text based outputs and have included [parquet](https://parquet.apache.org/) format for high-throughput needs. It is implemented in pure Python and is currently only available in a single-threaded version, but later refinements may include support for multiprocessing. To facilitate integration of the tool into larger workflows it will also be implemented as a Nextflow workflow.

## Citation

Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. [https://github.com/phac-nml/genomic_address_service](https://github.com/phac-nml/genomic_address_service)
## Contact

For any questions, issues or comments please open a GitHub issue or reach out to [**James Robertson**]([email protected]).

# Install

Install the latest released version from conda:

@@ -75,68 +34,66 @@ Install using pip:
Install the latest master branch version directly from Github:

pip install git+https://github.com/phac-nml/genomic_address_service.git
### Compatibility
List out Dependencies and/or packages as appropriate



# Getting Started
## Usage
If you run ``gas``, you should see the following usage statement:

Usage: gas <command> [options] <required arguments>

To get minimal usage for a command use:
gas command
`gas command args`

To get full help for a command use one of:
gas command -h
cladeomatic command --help
### Commands
There are three commands that gas uses:

1. **mcluster** - de novo nested multi-level clustering
2. **call** - call genomic address based on existing clusterings
3. **test** - test functionality on a small dataset

Available commands:
### Args
There are a number of arguments that are specific for each command. They can be found directly by adding `--help` after each command. The following are common arguments:

mcluster De novo nested multi-level clustering
call Call genomic address based on existing clusterings
test Test functionality on a small dataset
- `-o`, `--outdir` - output directory to put cluster results
- `-m`, `--method` - cluster method [single, complete, average (default)]
- `-t`, `--thresholds` - comma-delimited thresholds; columns will be treated in sequential order
- `-V`, `--version` - print installed tool version
- `-f`, `--force` - overwrite existing out directory

Supported distance matrix formats
=====
**Square**
#### mcluster specific args
- `-i`, `--matrix` - TSV formatted distance matrix or parquet
- `-d`, `--delimiter` - delimiter desired for nomenclature code [default="."]

| id | S1 | S2 | S3 | S4 | S5 | S6 |
| ----------- | ----------- |----------- | ----------- | ----------- |----------- | ----------- |
| S1 | 0 | 0 | 3 | 3 | 9 | 9 |
| S2 | 0 | 0 | 3 | 3 | 9 | 9 |
| S3 | 3 | 3 | 0 | 0 | 9 | 9 |
| S4 | 3 | 3 | 0 | 0 | 9 | 9 |
| S5 | 9 | 9 | 9 | 9 | 0 | 0 |
| S6 | 9 | 9 | 9 | 9 | 0 | 0 |
#### call specific args
- `-d`, `--dists` - a 3 column file [query_id, ref_id, dist] in TSV or parquet format
- `-r`, `--rclusters` - existing cluster file in TSV or parquet format
- `-j`, `--thresh_map` - JSON file of [colname:threshold]
- `-u`, `--outfmt` - output format for assignments [text (default), parquet]
- `-l`, `--delimiter` - delimiter desired for nomenclature code [default="."]

- Distance matrix units can be of float or integer type, with the constraint that the diagonal must be 0 and the first
line must be a header with all of the samples

Quick start
=====
**De novo Multi-level Clustering**
## Configuration and Settings:
Thresholds must be configured when using GAS. These thresholds must be determined manually through testing and establishment of practical criteria for each pathogen of interest.

Mcluster minimally accepts as input a distance matrix, output directory, and a comma-delimited set of thresholds.
For instance, PulseNet Canada has determined '10,5,0' to be the thresholds of choice for its pathogen surveillance program. [Publication ongoing]

<br />
## Data Input/formats

gas mcluster --matrix distance.matrix.text --outdir results --thresholds 10,5,0
### Square distance matrix

This will produce a cluster file with the following header [id, nomenclature, level_1, level_2, level_3]
| id | S1 | S2 | S3 | S4 | S5 | S6 |
| --- | --- | --- | --- | --- | --- | --- |
| S1 | 0 | 0 | 3 | 3 | 9 | 9 |
| S2 | 0 | 0 | 3 | 3 | 9 | 9 |
| S3 | 3 | 3 | 0 | 0 | 9 | 9 |
| S4 | 3 | 3 | 0 | 0 | 9 | 9 |
| S5 | 9 | 9 | 9 | 9 | 0 | 0 |
| S6 | 9 | 9 | 9 | 9 | 0 | 0 |

<br />
- Distance matrix units can be of float or integer type, with the constraint that the diagonal must be 0 and the first line must be a header with all of the samples
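
The constraints on the square matrix can be checked before running a command. The following is a minimal sketch, not part of GAS: the function name `read_square_matrix` is hypothetical, and it assumes a tab-delimited file whose row ids are the same set of samples as the header.

```python
import csv
import io


def read_square_matrix(text, delimiter="\t"):
    """Parse a square distance matrix and check the constraints described above.

    Hypothetical helper for illustration; this is not the GAS parser.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    samples = header[1:]  # the first header cell is the "id" column label

    # Assumption: every sample named in the header has exactly one row.
    if sorted(row[0] for row in body) != sorted(samples):
        raise ValueError("row ids must match the samples named in the header")

    matrix = {}
    for row in body:
        values = [float(value) for value in row[1:]]  # float or integer distances
        if len(values) != len(samples):
            raise ValueError(f"row {row[0]} has {len(values)} values, expected {len(samples)}")
        matrix[row[0]] = dict(zip(samples, values))

    for sample in samples:
        if matrix[sample][sample] != 0:
            raise ValueError(f"diagonal entry for {sample} must be 0")
    return samples, matrix


EXAMPLE = "id\tS1\tS2\tS3\nS1\t0\t0\t3\nS2\t0\t0\t3\nS3\t3\t3\t0\n"
samples, matrix = read_square_matrix(EXAMPLE)
print(samples, matrix["S1"]["S3"])
```

A matrix with a non-zero diagonal or a missing row raises `ValueError` here, so the problem surfaces before any clustering is attempted.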

| id | nomenclature | level_1 | level_2 | level_3 |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| S1 | 1.1.1 | 1 | 1 | 1 |
| S2 | 1.1.1 | 1 | 1 | 1 |
| S3 | 1.1.2 | 1 | 1 | 2 |
| S4 | 1.1.2 | 1 | 1 | 2 |
| S5 | 1.2.3 | 1 | 2 | 3 |
| S6 | 1.2.3 | 1 | 2 | 3 |
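
To illustrate how a nomenclature code such as `1.2.3` can arise from thresholds `10,5,0`, here is a rough sketch using single linkage only. It is not the GAS implementation (GAS defaults to average linkage and relies on SciPy); the function names are hypothetical, and clusters are simply numbered in order of first appearance, which is an assumption.

```python
def single_linkage_labels(samples, dist, threshold):
    """Flat single-linkage clusters: samples joined by any chain of distances <= threshold."""
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving keeps lookups short
            s = parent[s]
        return s

    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if dist[a][b] <= threshold:
                parent[find(a)] = find(b)

    # Number clusters 1, 2, 3, ... in order of first appearance (an assumption).
    codes, labels = {}, {}
    for s in samples:
        root = find(s)
        codes.setdefault(root, len(codes) + 1)
        labels[s] = codes[root]
    return labels


def multilevel_addresses(samples, dist, thresholds, delimiter="."):
    """Join one cluster label per threshold, in the order the thresholds are given."""
    levels = [single_linkage_labels(samples, dist, t) for t in thresholds]
    return {s: delimiter.join(str(level[s]) for level in levels) for s in samples}


SAMPLES = ["S1", "S2", "S3", "S4", "S5", "S6"]
ROWS = [
    [0, 0, 3, 3, 9, 9],
    [0, 0, 3, 3, 9, 9],
    [3, 3, 0, 0, 9, 9],
    [3, 3, 0, 0, 9, 9],
    [9, 9, 9, 9, 0, 0],
    [9, 9, 9, 9, 0, 0],
]
DIST = {a: dict(zip(SAMPLES, row)) for a, row in zip(SAMPLES, ROWS)}

addresses = multilevel_addresses(SAMPLES, DIST, [10, 5, 0])
for sample in SAMPLES:
    print(sample, addresses[sample])
```

For this six-sample matrix the sketch reproduces the codes in the table above (`1.1.1`, `1.1.2`, `1.2.3`). Average or complete linkage can give different partitions on less clear-cut data.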

**Outputs:**
### Apache parquet
- More information on the open-source column-oriented data format can be found [here](https://parquet.apache.org/).
## Output/Results

```
{Output folder name}
@@ -147,24 +104,20 @@ This will produce a cluster file with the following header [id, nomenclature, le
└── run.json - Contains logging information for the run including parameters, newick tree, and threshold mapping info
```

**Cluster assignment**

Coming soon


## Benchmarks

Coming soon

## FAQ
# Troubleshooting and FAQs:

Coming soon
1. Mcluster fails due to missing scipy, with the following error:
```
import scipy
ModuleNotFoundError: No module named 'scipy'
```
- This dependency is currently missing in the pip install. Use the following command to install scipy separately: `pip install scipy`
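
Whether the dependency is present can be checked from Python before running `gas mcluster`. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, not something GAS provides.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["scipy"])
if missing:
    print("Missing modules:", ", ".join(missing), "- try: pip install " + " ".join(missing))
else:
    print("scipy is available")
```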

## Citation
# Benchmarking

Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. https://github.com/phac-nml/genomic_address_service
Coming soon.

## Legal
# Legal and Compliance Information:

Copyright Government of Canada 2023

@@ -181,7 +134,5 @@ under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.


## Contact

**James Robertson**: [email protected]
# Updates and Release Notes:
Please see the `CHANGELOG.md`
