Merge pull request #24 from phac-nml/add_sample_name

Enhanced pipeline logic to support user-defined `sample_name` input
phac-nml · Sep 25, 2024 · 513b58b
2 parents 2b8da30 + f6e6b0f
commit 513b58b

Showing 18 changed files with 505 additions and 115 deletions.
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -14,13 +14,12 @@ jobs:
   pre-commit:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
+      - uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
 
-      - name: Set up Python 3.11
-        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
+      - name: Set up Python 3.12
+        uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
         with:
-          python-version: 3.11
-          cache: "pip"
+          python-version: "3.12"
 
       - name: Install pre-commit
         run: pip install pre-commit
@@ -32,14 +31,14 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Check out pipeline code
-        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
+        uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v1
+        uses: nf-core/setup-nextflow@v2
 
-      - uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
+      - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
         with:
-          python-version: "3.11"
+          python-version: "3.12"
           architecture: "x64"
 
       - name: Install dependencies
@@ -60,7 +59,7 @@ jobs:
 
       - name: Upload linting log file artifact
         if: ${{ always() }}
-        uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4
+        uses: actions/upload-artifact@65462800fd760344b1a7b4382951275a0abb4808 # v4
         with:
           name: linting-logs
           path: |

diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml
@@ -11,7 +11,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Download lint results
-        uses: dawidd6/action-download-artifact@f6b0bace624032e30a85a8fd9c1a7f8f611f5737 # v3
+        uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
         with:
           workflow: linting.yml
           workflow_conclusion: completed

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,15 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Development
+
+### `Changed`
+
+- Added the ability to include a `sample_name` column in the input `samplesheet.csv`, for compatibility with the IRIDA-Next input configuration [PR24](https://github.com/phac-nml/speciesabundance/pull/24)
+  - Special characters in `sample_name` will be replaced with `"_"`
+  - If no `sample_name` is supplied in the column, `sample` will be used
+  - To avoid repeated `sample_name` values, all `sample_name` values will be suffixed with the unique `sample` value from the input file
+
 ## 2.1.1 - 2024/05/02
 
 ### `Changed`
@@ -36,3 +45,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### `Dependencies`
 
 ### `Deprecated`
+
+[2.0.0]: https://github.com/phac-nml/speciesabundance/releases/tag/2.0.0
+[2.1.0]: https://github.com/phac-nml/speciesabundance/releases/tag/2.1.0
+[2.1.1]: https://github.com/phac-nml/speciesabundance/releases/tag/2.1.1
diff --git a/README.md b/README.md
@@ -14,8 +14,26 @@ The input to the pipeline is a standard sample sheet (passed as `--input samples
 | ------- | --------------- | --------------- |
 | SampleA | file_1.fastq.gz | file_2.fastq.gz |
 
+An [example samplesheet](../assets/samplesheet_minimal.csv) has been provided with the pipeline.
+
 The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).
 
+## IRIDA-Next Optional Input Configuration
+
+`speciesabundance` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets which can contain an additional column: `sample_name`
+
+`sample_name`: An **optional** column that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.
+
+`sample_name` allows more flexibility in naming output files and identifying samples. Unlike `sample`, `sample_name` is not required to contain unique values. Nextflow requires unique sample names, so to handle repeated `sample_name` values, `sample` will be suffixed to `sample_name`. Non-alphanumeric characters (excluding `_`, `-`, `.`) will be replaced with `"_"`.
+
+The sample sheet, when including the optional `sample_name` column, should look like:
+
+| sample  | sample_name | fastq_1         | fastq_2         |
+| ------- | ----------- | --------------- | --------------- |
+| SampleA | A1          | file_1.fastq.gz | file_2.fastq.gz |
+
+An [example samplesheet](../tests/data/samplename_samplesheet.csv) has been provided with the pipeline, which includes the `sample_name` column.
+
 # Parameters
 
 ## Mandatory
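
The naming rules described in the README hunk above can be sketched in Python. This is only an illustration, not the pipeline's Nextflow code: the function name `resolve_ids` is hypothetical, the suffix format `<sample_name>_<sample>` is an assumption, and the sketch follows the README wording (suffix only repeated names), whereas the CHANGELOG entry says all `sample_name` values are suffixed. "Alphanumeric" is taken to mean ASCII letters and digits.

```python
import re
from collections import Counter


def resolve_ids(rows):
    """Return (sample, output_name) pairs for a list of samplesheet rows.

    Each row is a dict with a "sample" key and an optional "sample_name" key.
    """
    cleaned = []
    for row in rows:
        # Fall back to `sample` when `sample_name` is missing or empty.
        name = row.get("sample_name") or row["sample"]
        # Replace anything other than ASCII letters, digits, "_", "-", "." with "_".
        cleaned.append(re.sub(r"[^A-Za-z0-9_.-]", "_", name))

    counts = Counter(cleaned)
    resolved = []
    for row, name in zip(rows, cleaned):
        # Assumed behaviour: only repeated names get "_<sample>" appended.
        if counts[name] > 1:
            name = f"{name}_{row['sample']}"
        resolved.append((row["sample"], name))
    return resolved
```

Under these assumptions, two rows whose `sample_name` values both sanitize to the same string would end up with distinct output names, because each carries its own unique `sample` as a suffix.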

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,4 +1,4 @@
-sample,fastq_1,fastq_2
-SAMPLE1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R2.fastq.gz
-SAMPLE2,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
-SAMPLE3,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,
+sample,sample_name,fastq_1,fastq_2
+SAMPLE1,A1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R2.fastq.gz
+SAMPLE2,B2,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
+SAMPLE3,C3,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,
diff --git a/assets/samplesheet_minimal.csv b/assets/samplesheet_minimal.csv
@@ -0,0 +1,4 @@
+sample,fastq_1,fastq_2
+SAMPLE1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R2.fastq.gz
+SAMPLE2,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
+SAMPLE3,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -10,10 +10,15 @@
             "sample": {
                 "type": "string",
                 "pattern": "^\\S+$",
-                "meta": ["id"],
+                "meta": ["irida_id"],
                 "unique": true,
                 "errorMessage": "Sample name must be provided and cannot contain spaces"
             },
+            "sample_name": {
+                "type": "string",
+                "meta": ["id"],
+                "errorMessage": "Optional. Used to override sample when used in tools like IRIDA-Next."
+            },
             "fastq_1": {
                 "type": "string",
                 "pattern": "^\\S+\\.f(ast)?q(\\.gz)?$",
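
As a rough illustration of what the two `pattern` entries visible in this hunk accept, the regular expressions can be exercised directly in Python. The helper name `check_row` is hypothetical, and the sketch covers only the `sample` and `fastq_1` patterns shown here; the real validation is performed by nf-validation against the full schema.

```python
import re

# Patterns copied from the schema hunk, with JSON string escaping removed.
SAMPLE_PATTERN = re.compile(r"^\S+$")
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q(\.gz)?$")


def check_row(row):
    """Return the names of the fields in `row` that fail their pattern."""
    errors = []
    if not SAMPLE_PATTERN.match(row.get("sample", "")):
        errors.append("sample")
    if not FASTQ_PATTERN.match(row.get("fastq_1", "")):
        errors.append("fastq_1")
    return errors
```

For example, `check_row({"sample": "bad name", "fastq_1": "reads.fq"})` reports only `sample`, since the FASTQ pattern treats the `.gz` suffix as optional.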

diff --git a/conf/iridanext.config b/conf/iridanext.config
@@ -4,6 +4,7 @@ iridanext {
         path = "${params.outdir}/iridanext.output.json.gz"
         overwrite = true
         files {
+            idkey = "irida_id"
             global = [
                 "**/failure/failures_report.csv"
             ]

diff --git a/conf/test.config b/conf/test.config
@@ -20,6 +20,6 @@ params {
     max_time   = '1.h'
 
     // Input data
-    input  = 'https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/assets/samplesheet.csv'
+    input  = "${projectDir}/assets/samplesheet.csv"
     database = "${projectDir}/tests/data/minidb"
 }
diff --git a/docs/usage.md b/docs/usage.md
@@ -15,24 +15,40 @@ You will need to create a samplesheet with information about the samples you wou
 ### Full samplesheet
 
 The input samplesheet must contain three columns: `sample`, `fastq_1`, `fastq_2`. The sample IDs within a samplesheet should be unique. All other columns will be ignored.
+This pipeline does not support the processing of long-read sequencing data (Nanopore or PacBio).
 
 A final samplesheet file consisting of both single- and paired-end Illumina short read data may look something like the one below.
-This pipleine does not support the processing of long-read sequencing data (Nanopore or PacBio).
 
-```csv title="samplesheet.csv"
+```csv title="samplesheet_minimal.csv"
 sample,fastq_1,fastq_2
 SAMPLE1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
 SAMPLE2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
-SAMPLE3,sample1_R1.fastq.gz,
+SAMPLE3,sample3_R1.fastq.gz,
+```
+
+An [example samplesheet](../assets/samplesheet_minimal.csv) has been provided with the pipeline.
+
+### IRIDA-Next Optional Samplesheet Configuration
+
+`speciesabundance` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets which contain the following columns: `sample`, `sample_name`, `fastq_1`, and `fastq_2`. The sample IDs within a samplesheet should be unique.
+
+A final samplesheet file consisting of both single- and paired-end data may look something like the one below.
+
+```csv title="samplesheet.csv"
+sample,sample_name,fastq_1,fastq_2
+SAMPLE1,A1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
+SAMPLE2,B2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
+SAMPLE3,C3,sample3_R1.fastq.gz,
 ```
 
-| Column    | Description                                                                                                                |
-| --------- | -------------------------------------------------------------------------------------------------------------------------- |
-| `sample`  | Custom sample name. Samples should be unique within a samplesheet.                                                         |
-| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
+| Column        | Description                                                                                                                |
+| ------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| `sample`      | Custom sample name. Samples should be unique within a samplesheet.                                                         |
+| `sample_name` | Sample name used in outputs (filenames and sample names)                                                                   |
+| `fastq_1`     | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
+| `fastq_2`     | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
 
-An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
+An [example samplesheet](../tests/data/samplename_samplesheet.csv) has been provided with the pipeline, which includes the `sample_name` column.
 
 ## Running the pipeline
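
To make the samplesheet layout from the usage hunk above concrete, here is a minimal Python sketch that reads the IRIDA-Next style example and labels each row as single- or paired-end based on whether `fastq_2` is empty. The function name `read_layouts` and the `"single"`/`"paired"` labels are hypothetical and are not part of the pipeline.

```python
import csv
import io

# Example samplesheet content taken from the usage documentation hunk.
SHEET = """sample,sample_name,fastq_1,fastq_2
SAMPLE1,A1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
SAMPLE2,B2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
SAMPLE3,C3,sample3_R1.fastq.gz,
"""


def read_layouts(text):
    """Map each `sample` to "paired" or "single" depending on `fastq_2`."""
    layouts = {}
    for row in csv.DictReader(io.StringIO(text)):
        # An empty fastq_2 field marks a single-end sample.
        layouts[row["sample"]] = "paired" if row["fastq_2"] else "single"
    return layouts
```

Calling `read_layouts(SHEET)` on the example marks `SAMPLE3`, whose `fastq_2` field is empty, as single-end and the other two rows as paired-end.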
 

diff --git a/modules/local/topN/main.nf b/modules/local/topN/main.nf
@@ -38,7 +38,7 @@ process TOP_N {
     ${abundances} \\
     ${args} \\
     -n ${top_n} \\
-    -s ${meta.id} \\
+    -s ${meta.irida_id} \\
     > ${meta.id}_${taxonomic_level}_top_${top_n}.csv
 
     cat <<-END_VERSIONS > versions.yml

diff --git a/tests/data/error_samplesheet.csv b/tests/data/error_samplesheet.csv
@@ -1,5 +1,5 @@
-sample,fastq_1,fastq_2
-SAMPLE1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
-SAMPLE2,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-kraken_R1_001.fastq.gz,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-kraken_R2_001.fastq.gz
-SAMPLE3,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-bracken_R1_001.fastq.gz,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-bracken_R2_001.fastq.gz
-SAMPLE4,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R2.fastq.gz
+sample,sample_name,fastq_1,fastq_2
+SAMPLE1,A1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
+SAMPLE2,B2,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-kraken_R1_001.fastq.gz,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-kraken_R2_001.fastq.gz
+SAMPLE3,C3,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-bracken_R1_001.fastq.gz,https://raw.githubusercontent.com/phac-nml/speciesabundance/dev/tests/data/fastq/test-bracken_R2_001.fastq.gz
+SAMPLE4,D4,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R2.fastq.gz
diff --git a/tests/data/fail_samplesheet.csv b/tests/data/fail_samplesheet.csv
@@ -1,2 +1,2 @@
-sample,fastq_1,fastq_2
-SAMPLE1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz
+sample,sample_name,fastq_1,fastq_2
+SAMPLE1,A1,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/mag/test_data/test_minigut_sample2_R2.fastq.gz