Update documentation

phac-nml · Oct 3, 2024 · 2cd6dae
1 parent b17c87b
Showing 3 changed files with 44 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,15 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Development
+
+### Changed
+
+- Added the ability to include a `sample_name` column in the input samplesheet.csv. This allows compatibility with the IRIDA-Next input configuration.
+  - `sample_name` special characters will be replaced with `"_"`
+  - If no `sample_name` is supplied, the value in the `sample` column will be used
+  - To avoid repeated `sample_name` values, all `sample_name` values will be suffixed with the unique `sample` value from the input file
+
 ## [0.2.3] - 2024/09/25
 
 ### `Changed`

diff --git a/README.md b/README.md
@@ -20,6 +20,16 @@ The structure of this file is defined in [assets/schema_input.json](assets/schem
 
 Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.
 
+## IRIDA-Next Optional Input Configuration
+
+`gasnomenclature` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which can contain an additional column: `sample_name`.
+
+`sample_name`: An **optional** column that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.
+
+`sample_name` allows more flexibility in naming output files and identifying samples. Unlike `sample`, `sample_name` is not required to contain unique values. `Nextflow` requires unique sample names, so if any `sample_name` values are repeated, the `sample` value will be appended as a suffix to `sample_name`. Non-alphanumeric characters (excluding `_`, `-`, `.`) will be replaced with `"_"`.
+
+An [example samplesheet](tests/data/samplesheets/samplesheet-sample_name.csv) has been provided with the pipeline.
+
 # Parameters
 
 The main parameters are `--input` as defined above and `--output` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.

diff --git a/docs/usage.md b/docs/usage.md
@@ -12,7 +12,7 @@ You will need to create a samplesheet with information about the samples you wou
 --input '[path to samplesheet file]'
 ```
 
-### Full samplesheet
+### Full Standard Samplesheet
 
 The input samplesheet must contain three columns: `sample`, `mlst_alleles`, `address`. The sample names within a samplesheet should be unique. All other columns will be ignored.
 
@@ -33,6 +33,28 @@ sampleF,sampleF.mlst.json,
 
 An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
 
+### IRIDA-Next Optional Samplesheet Configuration
+
+`gasnomenclature` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which contains the following columns: `sample`, `sample_name`, `mlst_alleles`, `address`. The sample IDs within a samplesheet should be unique.
+
+A final samplesheet file containing sample names, MLST allele files, and addresses may look something like the one below:
+
+```csv title="samplesheet.csv"
+sample,sample_name,mlst_alleles,address
+sampleA,S1,sampleA.mlst.json.gz,1.1.1
+sampleQ,S2,sampleQ.mlst.json.gz,2.2.2
+sampleF,,sampleF.mlst.json,
+```
+
+| Column         | Description                                                                                                                                                                                                                                                                                 |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `sample`       | Custom sample name. Samples should be unique within a samplesheet.                                                                                                                                                                                                                          |
+| `sample_name`  | Sample name used in outputs (filenames and sample names)                                                                                                                                                                                                                                    |
+| `mlst_alleles` | Full path to an MLST JSON file describing the loci/alleles for the sample against some MLST scheme. A way to generate this file is via [locidex]. File can optionally be gzipped and must have the extension ".mlst.json", ".mlst.subtyping.json" (or with an additional ".gz" if gzipped). |
+| `address`      | Hierarchal clustering address. If left empty for a sample, the pipeline will assign a cluster address.                                                                                                                                                                                      |
+
+An [example samplesheet](../tests/data/samplesheets/samplesheet-sample_name.csv) has been provided with the pipeline.
+
 ## Running the pipeline
 
 The typical command for running the pipeline is as follows:
@@ -185,3 +207,5 @@ We recommend adding the following line to your environment to limit this (typica
 ```bash
 NXF_OPTS='-Xms1g -Xmx4g'
 ```
+
+[locidex]: https://github.com/phac-nml/locidex
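
Note on the `sample_name` rules documented in this commit: the Python sketch below illustrates one possible reading of them. It is not the pipeline's implementation. The function name `resolve_output_names`, the `"_"` separator before the suffix, and the choice to suffix only repeated names (the README wording; the changelog wording could also mean every name is suffixed) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Characters outside letters, digits, "_", "-", "." get replaced with "_",
# following the README description.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.\-]")


def resolve_output_names(rows):
    """Map each unique `sample` ID to the name used in outputs.

    `rows` is a list of (sample, sample_name) tuples, where sample_name
    may be empty. Hypothetical helper, not part of gasnomenclature.
    """
    cleaned = {}
    for sample, sample_name in rows:
        name = sample_name.strip() if sample_name else ""
        # Fall back to `sample` when no sample_name is supplied.
        # Assumption: the fallback value is used as-is.
        cleaned[sample] = _DISALLOWED.sub("_", name) if name else sample

    # Assumption: repeats are detected after special characters are replaced.
    counts = Counter(cleaned.values())

    # Assumption: only repeated names are suffixed, joined with "_".
    return {
        sample: f"{name}_{sample}" if counts[name] > 1 else name
        for sample, name in cleaned.items()
    }


rows = [
    ("sampleA", "S 1"),
    ("sampleQ", "S#1"),
    ("sampleF", ""),
    ("sampleB", "S2"),
]
names = resolve_output_names(rows)
```

In this example, `"S 1"` and `"S#1"` both become `"S_1"` after replacement, so each is suffixed with its own `sample` value, while `sampleF` falls back to its `sample` ID and `S2` is left unchanged.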