fastforward to v1.3.7

dr-joe-wirth · Sep 16, 2024 · 322f246 · 322f246
2 parents e391763 + f3d0f90
commit 322f246
Show file tree

Hide file tree

Showing 14 changed files with 727 additions and 196 deletions.
diff --git a/.github/workflows/docker-build-push.yml b/.github/workflows/docker-build-push.yml
@@ -0,0 +1,41 @@
+name: Build and Push Docker Image
+
+on:
+  push:
+    tags:
+      - 'v*.*.*'
+
+env:
+  DOCKER_USERNAME: jwirth
+  DOCKER_IMAGE_NAME: primerforge
+
+jobs:
+  docker-build-push:
+    name: docker build and push
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v2
+
+      - name: Log in to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ env.DOCKER_USERNAME }}
+          password: ${{ secrets.DOCKER_PASSWORD }}
+
+      - name: Extract version from tag
+        run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV
+
+      - name: Build and push Docker image
+        uses: docker/build-push-action@v5
+        with:
+          context: ./docker
+          file: ./docker/Dockerfile
+          push: true
+          tags: |
+            ${{ env.DOCKER_USERNAME }}/${{ env.DOCKER_IMAGE_NAME }}:v${{ env.VERSION }}
+            ${{ env.DOCKER_USERNAME }}/${{ env.DOCKER_IMAGE_NAME }}:latest
diff --git a/README.md b/README.md
@@ -90,6 +90,21 @@ optional arguments:
     --debug              run in debug mode (default: False)
 ```
 
+## Results
+The results produced by `primerForge` list one primer pair per line, and these pairs are sorted based on the following criteria (in order of importance):
+
+  * largest difference from the ingroup PCR product range for outgroup PCR products
+  * lowest number of outgroup PCR products
+  * lowest G+C difference between the pair
+  * lowest Tm difference between the pair
+  * lowest Tm for heterodimers
+  * lowest Tm (sum) for homodimers
+  * lowest Tm (sum) for hairpins
+  * lowest variance in ingroup PCR product sizes
+  * largest median ingroup PCR product size
+
+See [the example](#examining-the-results) for details about the file format.
+
 ## Workflow
 
 ```mermaid
@@ -208,8 +223,8 @@ flowchart TB
     ispcr["filter pairs using isPcr"]
     final(["final primer pairs"])
     dump5[/"dump to file"/]
-    write[/"write pairs to file"/]
-
+    write[/"sort pairs and
+            write to file"/]
 
     noout --> ispcr
     ispcr --> final
@@ -223,7 +238,7 @@ This example assumes you have already installed `primerForge` [as described abov
 ### Motivation
 In this example, we will use `primerForge` to find pairs of primers between 18bp and 24bp that can be used to differentiate three strains of _Mycoplasma mycoides_ subspecies mycoides (the "ingroup") from two strains of _Mycoplasma mycoides_ subspecies capri (the "outgroup"). The primer pairs identified by `primerForge` are predicted to produce a single PCR product between 500bp and 2000bp in the ingroup. These same primer pairs are predicted to produce PCR products <300bp, >3000bp, or no PCR products at all in the outgroup.
 
-### preparing the workspace
+### Preparing the workspace
 In order to get started, create a directory called `mycoplasma_test` and move into it:
 
 ```bash
@@ -252,7 +267,7 @@ We will use the following flags to specify specific parameters for this example:
 
 You can get a list of all available flags using the command `primerForge --help`.
 
-Run `primerForge` using the following command (requires at least 3 Gb of RAM):
+Run `primerForge` using the following command (requires at least 2 Gb of RAM):
 
 ```bash
 primerForge --ingroup "./i*gbff" --outgroup "./o*gbff" --pcr_prod 500,2000 --bad_sizes 300,3000 --primer_len 18,24
@@ -261,43 +276,43 @@ primerForge --ingroup "./i*gbff" --outgroup "./o*gbff" --pcr_prod 500,2000 --bad
 After running the command, you should see something like this printed to the screen:
 
 ```text
+dumping Parameters to '_pickles/parameters.p' ... done 00:00:00.49
 identifying kmers suitable for use as primers in all 3 ingroup genome sequences
-    getting shared ingroup kmers that appear once in each genome ... done 00:01:19.32
-    dumping shared kmers to '_pickles/sharedKmers.p' ... done 00:00:07.34
-    evaluating kmers ... done 00:03:20.11
+    getting shared ingroup kmers that appear once in each genome ... done 00:01:36.15
+    dumping shared kmers to '_pickles/sharedKmers.p' ... done 00:00:11.22
+    evaluating 2430140 kmers ... done 00:01:51.73
     identified 30413 candidate kmers
-done 00:04:49.18
-    dumping candidate kmers to '_pickles/candidates.p' ... done 00:00:01.03
-identifying pairs of primers found in all ingroup sequences ... done 00:00:11.60
+    dumping candidate kmers to '_pickles/candidates.p' ... done 00:00:01.55
+done 00:03:42.94
+identifying pairs of primers found in all ingroup sequences ... done 00:00:11.62
     identified 16050 primer pairs shared in all ingroup sequences
-    dumping unfiltered pairs to '_pickles/pairs.p' ... done 00:00:00.54
+    dumping unfiltered pairs to '_pickles/pairs.p' ... done 00:00:00.56
 removing primer pairs present in the outgroup sequences
-    getting outgroup PCR products ... done 00:00:01.01
-    filtering primer pairs ... done 00:00:00.53
-    processing outgroup results ... done 00:00:00.52
+    getting outgroup PCR products ... done 00:00:01.03
+    filtering primer pairs ... done 00:00:00.54
+    processing outgroup results ... done 00:00:00.54
     removed 5905 pairs present in the outgroup (10145 pairs remaining)
-    dumping filtered pairs to '_pickles/pairs_noOutgroup.p' ... done 00:00:00.53
-validating primer pairs with isPcr ... done 00:00:01.97
-    removed 1665 pairs not validated by isPcr (8480 pairs remaining)
-    dumping validated pairs to '_pickles/pairs_noOutgroup_validated.p' ... done 00:00:00.53
-writing primer pairs to 'results.tsv' ... done 00:00:00.52
+    dumping filtered pairs to '_pickles/pairs_noOutgroup.p' ... done 00:00:00.54
+validating primer pairs with isPcr ... done 00:00:03.44
+    removed 3659 pairs not validated by isPcr (6486 pairs remaining)
+    dumping validated pairs to '_pickles/pairs_noOutgroup_validated.p' ... done 00:00:00.54
+writing 6486 primer pairs to 'results.tsv' ... done 00:00:11.62
 
-total runtime: 00:05:08.49
+total runtime: 00:04:14.50
 ```
 
-As we can see, `primerForge` found 30,413 kmers that were suitable for use as a primer in all three ingroup genomes. It then went on to identify 16,050 primer pairs that would produce PCR products between 500bp and 2000bp in the ingroup genomes. Next, it found that of those 16,050 pairs, 5,905 of them formed PCR products between 300bp and 3000bp in one or more of the outgroup genomes. Finally, it used `isPcr` to validate the remaining 10,145 primer pairs resulting in 8,480 primer pairs being written to file.
+As we can see, `primerForge` found 30,413 kmers that were suitable for use as a primer in all three ingroup genomes. It then went on to identify 16,050 primer pairs that would produce PCR products between 500bp and 2000bp in the ingroup genomes. Next, it found that of those 16,050 pairs, 5,905 of them formed PCR products between 300bp and 3000bp in one or more of the outgroup genomes. Finally, it used `isPcr` to validate the remaining 10,145 primer pairs resulting in 6,486 primer pairs being written to file.
 
+### Examining the results
 `primerForge` generated `results.tsv`, the file that contains the sequences and details for each primer pair, and `primerForge.log`. Here are a few lines from `results.tsv`:
 
 |fwd_seq|fwd_Tm|fwd_GC|rev_seq|rev_Tm|rev_GC|i1.gbff_contig|i1.gbff_length|i2.gbff_contig|i2.gbff_length|i3.gbff_contig|i3.gbff_length|o1.gbff_contig|o1.gbff_length|o2.gbff_contig|o2.gbff_length|
 |:------|:-----|:-----|:------|:-----|:-----|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
-|TATGCAACTAATCCCGAGTATCAC|56.1|41.7|TGTAAGTGGCGTTGTATCCC|55.5|50|NZ_LAUX01000130.1|521|NZ_LAUV01000074.1|521|NZ_LAUY01000118.1|521|NA|0|NA|0|
-|TGTTCCTTCACACTCAATAACAGC|57.3|41.7|AGAAGGAACAGTCGCTGAAG|55.4|50|NZ_LAUX01000081.1|1645|NZ_LAUV01000036.1|1645|NZ_LAUY01000073.1|1645|NZ_LS483503.1|3091|NZ_CP065583.1|3103|
-|AAGGAGAGTATCGCTTAGTTGATG|56|41.7|CAACAGCAGATGGTTTAGAAAGTG|56.4|41.7|NZ_LAUX01000079.1|788|NZ_LAUV01000035.1|788|NZ_LAUY01000072.1|788|NA|0|NA|0|
-|AAGGAGAGTATCGCTTAGTTGATG|56|41.7|ACTCCAATTGCTCTTCCTGAAG|56.2|45.5|NZ_LAUX01000079.1|1053|NZ_LAUV01000035.1|1053|NZ_LAUY01000072.1|1053|NA|0|NA|0|
-|TGAAATCACCAGCTATTGCATCAG|57.5|41.7|ACATTGCAACTCCTGAGATTTG|55.1|40.9|NZ_LAUX01000081.1|1998|NZ_LAUV01000036.1|1998|NZ_LAUY01000073.1|1998|NA|0|NA|0|
+|AGAAGCAAAGGATATGGGAACAAC|57.1|41.7|AAATCAACACCCTCAATAAGCTCC|57.1|41.7|NZ_LAUX01000078.1|804|NZ_LAUV01000035.1|804|NZ_LAUY01000092.1|804|NA|0|NA|0|
+|TCCATCTAATGAAGATCAACCAGG|55.9|41.7|CCCTAATTGTGATGAGTTGACAAC|55.9|41.7|NZ_LAUX01000010.1|710|NZ_LAUV01000042.1|713|NZ_LAUY01000078.1|710|NA|0|NA|0|
+|CATCAGCTGTTGTAAATAACCCAC|56.2|41.7|GTGGAGCTATGAAACCAATATCAG|55.3|41.7|NZ_LAUX01000117.1|1694|NZ_LAUV01000064.1|1694|NZ_LAUY01000018.1|1694|NZ_LS483503.1|3212|NA|0|
 
-The first six columns show the forward and reverse sequences (5' --> 3') as well as their melting temperatures and their G+C %. Next, for each genome it lists the contig and the PCR product size that is predicted be produced by this pair. For example, the first pair of primers are predicted to produce PCR products of 521bp the ingroup genomes, and the binding sites for this primer pair in the files `i1.gbff`, `i2.gbff`, and `i3.gbff` can be found in contigs `NZ_LAUX01000130.1`, `NZ_LAUV01000074.1`, and `NZ_LAUY01000118.1`, respectively. This same pair is not predicted to produce any PCR products in either outgroup genome. Similarly, the next pair is predicted to produce a PCR product size of 1,645bp in all three ingroup genomes and PCR products sizes of >3,000bp in both the outgroup genomes.
+The first six columns show the forward and reverse sequences (5' --> 3') as well as their melting temperatures and their G+C content (mol %). Next, for each genome it lists the contig and the PCR product size that is predicted be produced by this pair. For example, the first pair of primers are predicted to produce PCR products of 804bp the ingroup genomes, and the binding sites for this primer pair in the files `i1.gbff`, `i2.gbff`, and `i3.gbff` can be found in contigs `NZ_LAUX01000078.1`, `NZ_LAUV01000035.1`, and `NZ_LAUY01000092.1`, respectively. This same pair is not predicted to produce any PCR products in either outgroup genome. Similarly, the third pair is predicted to produce a PCR product size of 1,694bp in all three ingroup genomes, no products `o2.gbff`, and a PCR product >3,000bp in `o1.gbff`.
 
 If a primer pair is predicted to produce multiple products in an outgroup genome, then the contig column and the size column will list contigs and sizes in a comma-separated list linked by position. For example, if a primer pair was expected to produce a product of 1,990bp in `contig_1` and 2,024bp in `contig_2` in the genome file `example.gbff`, then the columns for this genome would look like this:
 

diff --git a/bin/Parameters.py b/bin/Parameters.py
@@ -12,7 +12,6 @@ class Parameters():
     _MIN_LEN = 10
     __ALLOWED_FORMATS = ('genbank', 'fasta')
     __ALL_CONTIGS_FNA = ".all_contigs.fna"
-    __QUERY_FN = ".ispcr_query.tsv"
     __PICKLE_DIR = "_pickles"
     _PARAMS = 0
     _SHARED = 1
@@ -69,7 +68,6 @@ def __init__(self, author:str, version:str, initializeLog:bool=True) -> Paramete
         self.pickles:dict[int,str]
         self.keepIntermediateFiles:bool
         self.allContigsFna:str
-        self.queryFn:str
 
         # save author and version as private attributes
         self.__author:str = author
@@ -93,9 +91,6 @@ def __init__(self, author:str, version:str, initializeLog:bool=True) -> Paramete
 
             # get the all contigs fasta filename
             self.allContigsFna = os.path.join(os.path.dirname(self.resultsFn), Parameters.__ALL_CONTIGS_FNA)
-
-            # get the ispcr query filename
-            self.queryFn = os.path.join(os.path.dirname(self.resultsFn), Parameters.__QUERY_FN)
 
     def __eq__(self, other:Parameters) -> bool:
         """equality overload
@@ -182,11 +177,9 @@ def __checkOutputFile(fn:str) -> None:
             if proceed == YN[1]:
                 raise FileExistsError(f"{WARN_MSG_A}{fn}{WARN_MSG_B}")
 
-        # make sure the file can be written
+        # make sure we can write the file to the output directory
         try:
-            fh = open(fn, 'w')
-            fh.close()
-            os.remove(fn)
+            os.access(os.path.dirname(fn), os.W_OK)
         except:
             raise ValueError(f"{ERR_MSG}{fn}")
 

diff --git a/bin/Primer.py b/bin/Primer.py
@@ -33,6 +33,10 @@ def __init__(self, seq:Seq, contig:str, start:int, length:int, strand:str) -> Pr
         self.strand:str = strand
         self.Tm:float = None
         self.gcPer:float = None
+        self.hairpinTm:float = None
+        self.rcHairpin:float = None
+        self.homodimerTm:float = None
+        self.rcHomodimer:float = None
         self.__length:int = length
 
         # flip start and end if on the minus strand
@@ -111,6 +115,12 @@ def reverseComplement(self) -> Primer:
             new = Primer(self.seq.reverse_complement(), self.contig, self.start, len(self), Primer.MINUS)
         else:
             new = Primer(self.seq.reverse_complement(), self.contig, self.end, len(self), Primer.PLUS)
+
+        # flip the hairpin/dimer Tms
+        new.hairpinTm = self.rcHairpin
+        new.rcHairpin = self.hairpinTm
+        new.homodimerTm = self.rcHomodimer
+        new.rcHomodimer = self.homodimerTm
 
         # make the new object
         return new

diff --git a/bin/Product.py b/bin/Product.py
@@ -0,0 +1,32 @@
+from __future__ import annotations
+from typing import Union
+
+class Product:
+    """class for storing PCR product data for a pair of primers
+    """
+    def __init__(self, contig:str, size:Union[int,str], fbin:int, rbin:int, dimerTm:float) -> Product:
+        """constructor
+
+        Args:
+            contig (str): the contig where the product resides
+            size (Union[int,str]): the contig size (will be str if multiple sizes)
+            fbin (int): the bin the forward primer belongs to
+            rbin (int): the bin the reverse primer belongs to
+            dimerTm (float): the heterodimer Tm for the primer pair
+
+        Returns:
+            Product: a Product object
+        """
+        # set attributes
+        self.contig:str = contig
+        self.size:Union[int,str] = size
+        self.dimerTm:float = dimerTm
+        self.fbin:int = fbin
+        self.rbin:int = rbin
+
+    # overloads
+    def __str__(self) -> str:
+        return f'{self.contig}: {self.size} bp'
+
+    def __repr__(self) -> str:
+        return str(self)
diff --git a/bin/getCandidateKmers.py b/bin/getCandidateKmers.py
@@ -316,18 +316,26 @@ def noLongRepeats(primer:Primer) -> bool:
     def noHairpins(primer:Primer) -> bool:
         """verifies that the primer does not form hairpins
         """
+        # save the hairpin Tms
+        primer.hairpinTm = primer3.calc_hairpin_tm(str(primer))
+        primer.rcHairpin = primer3.calc_hairpin_tm(str(primer.reverseComplement()))
+
         # hairpin tm should be less than (minTm - 5°); need to check both strands
-        fwdOk = primer3.calc_hairpin_tm(str(primer)) < (minTm - FIVE_DEGREES)
-        revOk = primer3.calc_hairpin_tm(str(primer.reverseComplement())) < (minTm - FIVE_DEGREES)
+        fwdOk = primer.hairpinTm < (minTm - FIVE_DEGREES)
+        revOk = primer.rcHairpin < (minTm - FIVE_DEGREES)
 
         return fwdOk and revOk
 
     def noHomodimers(primer:Primer) -> bool:
         """verifies that the primer does not form homodimers
         """
+        # save the homodimer Tms
+        primer.homodimerTm = primer3.calc_homodimer_tm(str(primer))
+        primer.rcHomodimer = primer3.calc_homodimer_tm(str(primer.reverseComplement()))
+
         # homodimer tm should be less than (minTm - 5°); need to check both strands
-        fwdOk = primer3.calc_homodimer_tm(str(primer)) < (minTm - FIVE_DEGREES)
-        revOk = primer3.calc_homodimer_tm(str(primer.reverseComplement())) < (minTm - FIVE_DEGREES)
+        fwdOk = primer.homodimerTm < (minTm - FIVE_DEGREES)
+        revOk = primer.rcHomodimer < (minTm - FIVE_DEGREES)
 
         return fwdOk and revOk
 
@@ -403,18 +411,34 @@ def __buildOutput(kmers:dict[str,dict[str,tuple[str,int,str]]], candidates:list[
         # identify which sequence is present in the genome
         try:
             entry = kmers[cand.seq]
+            hairpinTm = cand.hairpinTm
+            rcHairpin = cand.rcHairpin
+            homodimerTm = cand.homodimerTm
+            rcHomodimer = cand.rcHomodimer
+
         except:
             entry = kmers[cand.seq.reverse_complement()]
+            hairpinTm = cand.rcHairpin
+            rcHairpin = cand.hairpinTm
+            homodimerTm = cand.rcHomodimer
+            rcHomodimer = cand.homodimerTm
 
         # for each genome
         for name in entry.keys():
             # extract the data from the entry
             contig, start, strand = entry[name]
-
-            # save a Primer object in this contig's list
+
+            # create the new primer for this genome
+            primer = Primer(cand.seq, contig, start, len(cand.seq), strand)
+            primer.hairpinTm = hairpinTm
+            primer.rcHairpin = rcHairpin
+            primer.homodimerTm = homodimerTm
+            primer.rcHomodimer = rcHomodimer
+
+            # save the Primer in this contig's list
             out[name] = out.get(name, dict())
             out[name][contig] = out[name].get(contig, list())
-            out[name][contig].append(Primer(cand.seq, contig, start, len(cand.seq), strand))
+            out[name][contig].append(primer)
 
     return out