Skip to content

Commit

Permalink
fastforward to v1.3.7
Browse files Browse the repository at this point in the history
  • Loading branch information
dr-joe-wirth committed Sep 16, 2024
2 parents e391763 + f3d0f90 commit 322f246
Show file tree
Hide file tree
Showing 14 changed files with 727 additions and 196 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/docker-build-push.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
name: Build and Push Docker Image

on:
push:
tags:
- 'v*.*.*'

env:
DOCKER_USERNAME: jwirth
DOCKER_IMAGE_NAME: primerforge

jobs:
docker-build-push:
name: docker build and push
runs-on: ubuntu-latest

steps:
- name: Checkout code
uses: actions/checkout@v3

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2

- name: Log in to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ env.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}

- name: Extract version from tag
run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV

- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: ./docker
file: ./docker/Dockerfile
push: true
tags: |
${{ env.DOCKER_USERNAME }}/${{ env.DOCKER_IMAGE_NAME }}:v${{ env.VERSION }}
${{ env.DOCKER_USERNAME }}/${{ env.DOCKER_IMAGE_NAME }}:latest
69 changes: 42 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,21 @@ optional arguments:
--debug run in debug mode (default: False)
```

## Results
The results produced by `primerForge` list one primer pair per line, and these pairs are sorted based on the following criteria (in order of importance):

* largest difference from the ingroup PCR product range for outgroup PCR products
* lowest number of outgroup PCR products
* lowest G+C difference between the pair
* lowest Tm difference between the pair
* lowest Tm for heterodimers
* lowest Tm (sum) for homodimers
* lowest Tm (sum) for hairpins
* lowest variance in ingroup PCR product sizes
* largest median ingroup PCR product size

See [the example](#examining-the-results) for details about the file format.

## Workflow

```mermaid
Expand Down Expand Up @@ -208,8 +223,8 @@ flowchart TB
ispcr["filter pairs using isPcr"]
final(["final primer pairs"])
dump5[/"dump to file"/]
write[/"write pairs to file"/]
write[/"sort pairs and
write to file"/]
noout --> ispcr
ispcr --> final
Expand All @@ -223,7 +238,7 @@ This example assumes you have already installed `primerForge` [as described abov
### Motivation
In this example, we will use `primerForge` to find pairs of primers between 18bp and 24bp that can be used to differentiate three strains of _Mycoplasma mycoides_ subspecies mycoides (the "ingroup") from two strains of _Mycoplasma mycoides_ subspecies capri (the "outgroup"). The primer pairs identified by `primerForge` are predicted to produce a single PCR product between 500bp and 2000bp in the ingroup. These same primer pairs are predicted to produce PCR products <300bp, >3000bp, or no PCR products at all in the outgroup.

### preparing the workspace
### Preparing the workspace
In order to get started, create a directory called `mycoplasma_test` and move into it:

```bash
Expand Down Expand Up @@ -252,7 +267,7 @@ We will use the following flags to specify specific parameters for this example:

You can get a list of all available flags using the command `primerForge --help`.

Run `primerForge` using the following command (requires at least 3 Gb of RAM):
Run `primerForge` using the following command (requires at least 2 Gb of RAM):

```bash
primerForge --ingroup "./i*gbff" --outgroup "./o*gbff" --pcr_prod 500,2000 --bad_sizes 300,3000 --primer_len 18,24
Expand All @@ -261,43 +276,43 @@ primerForge --ingroup "./i*gbff" --outgroup "./o*gbff" --pcr_prod 500,2000 --bad
After running the command, you should see something like this printed to the screen:

```text
dumping Parameters to '_pickles/parameters.p' ... done 00:00:00.49
identifying kmers suitable for use as primers in all 3 ingroup genome sequences
getting shared ingroup kmers that appear once in each genome ... done 00:01:19.32
dumping shared kmers to '_pickles/sharedKmers.p' ... done 00:00:07.34
evaluating kmers ... done 00:03:20.11
getting shared ingroup kmers that appear once in each genome ... done 00:01:36.15
dumping shared kmers to '_pickles/sharedKmers.p' ... done 00:00:11.22
evaluating 2430140 kmers ... done 00:01:51.73
identified 30413 candidate kmers
done 00:04:49.18
dumping candidate kmers to '_pickles/candidates.p' ... done 00:00:01.03
identifying pairs of primers found in all ingroup sequences ... done 00:00:11.60
dumping candidate kmers to '_pickles/candidates.p' ... done 00:00:01.55
done 00:03:42.94
identifying pairs of primers found in all ingroup sequences ... done 00:00:11.62
identified 16050 primer pairs shared in all ingroup sequences
dumping unfiltered pairs to '_pickles/pairs.p' ... done 00:00:00.54
dumping unfiltered pairs to '_pickles/pairs.p' ... done 00:00:00.56
removing primer pairs present in the outgroup sequences
getting outgroup PCR products ... done 00:00:01.01
filtering primer pairs ... done 00:00:00.53
processing outgroup results ... done 00:00:00.52
getting outgroup PCR products ... done 00:00:01.03
filtering primer pairs ... done 00:00:00.54
processing outgroup results ... done 00:00:00.54
removed 5905 pairs present in the outgroup (10145 pairs remaining)
dumping filtered pairs to '_pickles/pairs_noOutgroup.p' ... done 00:00:00.53
validating primer pairs with isPcr ... done 00:00:01.97
removed 1665 pairs not validated by isPcr (8480 pairs remaining)
dumping validated pairs to '_pickles/pairs_noOutgroup_validated.p' ... done 00:00:00.53
writing primer pairs to 'results.tsv' ... done 00:00:00.52
dumping filtered pairs to '_pickles/pairs_noOutgroup.p' ... done 00:00:00.54
validating primer pairs with isPcr ... done 00:00:03.44
removed 3659 pairs not validated by isPcr (6486 pairs remaining)
dumping validated pairs to '_pickles/pairs_noOutgroup_validated.p' ... done 00:00:00.54
writing 6486 primer pairs to 'results.tsv' ... done 00:00:11.62
total runtime: 00:05:08.49
total runtime: 00:04:14.50
```

As we can see, `primerForge` found 30,413 kmers that were suitable for use as a primer in all three ingroup genomes. It then went on to identify 16,050 primer pairs that would produce PCR products between 500bp and 2000bp in the ingroup genomes. Next, it found that of those 16,050 pairs, 5,905 of them formed PCR products between 300bp and 3000bp in one or more of the outgroup genomes. Finally, it used `isPcr` to validate the remaining 10,145 primer pairs resulting in 8,480 primer pairs being written to file.
As we can see, `primerForge` found 30,413 kmers that were suitable for use as a primer in all three ingroup genomes. It then went on to identify 16,050 primer pairs that would produce PCR products between 500bp and 2000bp in the ingroup genomes. Next, it found that of those 16,050 pairs, 5,905 of them formed PCR products between 300bp and 3000bp in one or more of the outgroup genomes. Finally, it used `isPcr` to validate the remaining 10,145 primer pairs resulting in 6,486 primer pairs being written to file.

### Examining the results
`primerForge` generated `results.tsv`, the file that contains the sequences and details for each primer pair, and `primerForge.log`. Here are a few lines from `results.tsv`:

|fwd_seq|fwd_Tm|fwd_GC|rev_seq|rev_Tm|rev_GC|i1.gbff_contig|i1.gbff_length|i2.gbff_contig|i2.gbff_length|i3.gbff_contig|i3.gbff_length|o1.gbff_contig|o1.gbff_length|o2.gbff_contig|o2.gbff_length|
|:------|:-----|:-----|:------|:-----|:-----|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
|TATGCAACTAATCCCGAGTATCAC|56.1|41.7|TGTAAGTGGCGTTGTATCCC|55.5|50|NZ_LAUX01000130.1|521|NZ_LAUV01000074.1|521|NZ_LAUY01000118.1|521|NA|0|NA|0|
|TGTTCCTTCACACTCAATAACAGC|57.3|41.7|AGAAGGAACAGTCGCTGAAG|55.4|50|NZ_LAUX01000081.1|1645|NZ_LAUV01000036.1|1645|NZ_LAUY01000073.1|1645|NZ_LS483503.1|3091|NZ_CP065583.1|3103|
|AAGGAGAGTATCGCTTAGTTGATG|56|41.7|CAACAGCAGATGGTTTAGAAAGTG|56.4|41.7|NZ_LAUX01000079.1|788|NZ_LAUV01000035.1|788|NZ_LAUY01000072.1|788|NA|0|NA|0|
|AAGGAGAGTATCGCTTAGTTGATG|56|41.7|ACTCCAATTGCTCTTCCTGAAG|56.2|45.5|NZ_LAUX01000079.1|1053|NZ_LAUV01000035.1|1053|NZ_LAUY01000072.1|1053|NA|0|NA|0|
|TGAAATCACCAGCTATTGCATCAG|57.5|41.7|ACATTGCAACTCCTGAGATTTG|55.1|40.9|NZ_LAUX01000081.1|1998|NZ_LAUV01000036.1|1998|NZ_LAUY01000073.1|1998|NA|0|NA|0|
|AGAAGCAAAGGATATGGGAACAAC|57.1|41.7|AAATCAACACCCTCAATAAGCTCC|57.1|41.7|NZ_LAUX01000078.1|804|NZ_LAUV01000035.1|804|NZ_LAUY01000092.1|804|NA|0|NA|0|
|TCCATCTAATGAAGATCAACCAGG|55.9|41.7|CCCTAATTGTGATGAGTTGACAAC|55.9|41.7|NZ_LAUX01000010.1|710|NZ_LAUV01000042.1|713|NZ_LAUY01000078.1|710|NA|0|NA|0|
|CATCAGCTGTTGTAAATAACCCAC|56.2|41.7|GTGGAGCTATGAAACCAATATCAG|55.3|41.7|NZ_LAUX01000117.1|1694|NZ_LAUV01000064.1|1694|NZ_LAUY01000018.1|1694|NZ_LS483503.1|3212|NA|0|

The first six columns show the forward and reverse sequences (5' --> 3') as well as their melting temperatures and their G+C %. Next, for each genome it lists the contig and the PCR product size that is predicted be produced by this pair. For example, the first pair of primers are predicted to produce PCR products of 521bp the ingroup genomes, and the binding sites for this primer pair in the files `i1.gbff`, `i2.gbff`, and `i3.gbff` can be found in contigs `NZ_LAUX01000130.1`, `NZ_LAUV01000074.1`, and `NZ_LAUY01000118.1`, respectively. This same pair is not predicted to produce any PCR products in either outgroup genome. Similarly, the next pair is predicted to produce a PCR product size of 1,645bp in all three ingroup genomes and PCR products sizes of >3,000bp in both the outgroup genomes.
The first six columns show the forward and reverse sequences (5' --> 3') as well as their melting temperatures and their G+C content (mol %). Next, for each genome it lists the contig and the PCR product size that is predicted be produced by this pair. For example, the first pair of primers are predicted to produce PCR products of 804bp the ingroup genomes, and the binding sites for this primer pair in the files `i1.gbff`, `i2.gbff`, and `i3.gbff` can be found in contigs `NZ_LAUX01000078.1`, `NZ_LAUV01000035.1`, and `NZ_LAUY01000092.1`, respectively. This same pair is not predicted to produce any PCR products in either outgroup genome. Similarly, the third pair is predicted to produce a PCR product size of 1,694bp in all three ingroup genomes, no products `o2.gbff`, and a PCR product >3,000bp in `o1.gbff`.

If a primer pair is predicted to produce multiple products in an outgroup genome, then the contig column and the size column will list contigs and sizes in a comma-separated list linked by position. For example, if a primer pair was expected to produce a product of 1,990bp in `contig_1` and 2,024bp in `contig_2` in the genome file `example.gbff`, then the columns for this genome would look like this:

Expand Down
11 changes: 2 additions & 9 deletions bin/Parameters.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,6 @@ class Parameters():
_MIN_LEN = 10
__ALLOWED_FORMATS = ('genbank', 'fasta')
__ALL_CONTIGS_FNA = ".all_contigs.fna"
__QUERY_FN = ".ispcr_query.tsv"
__PICKLE_DIR = "_pickles"
_PARAMS = 0
_SHARED = 1
Expand Down Expand Up @@ -69,7 +68,6 @@ def __init__(self, author:str, version:str, initializeLog:bool=True) -> Paramete
self.pickles:dict[int,str]
self.keepIntermediateFiles:bool
self.allContigsFna:str
self.queryFn:str

# save author and version as private attributes
self.__author:str = author
Expand All @@ -93,9 +91,6 @@ def __init__(self, author:str, version:str, initializeLog:bool=True) -> Paramete

# get the all contigs fasta filename
self.allContigsFna = os.path.join(os.path.dirname(self.resultsFn), Parameters.__ALL_CONTIGS_FNA)

# get the ispcr query filename
self.queryFn = os.path.join(os.path.dirname(self.resultsFn), Parameters.__QUERY_FN)

def __eq__(self, other:Parameters) -> bool:
"""equality overload
Expand Down Expand Up @@ -182,11 +177,9 @@ def __checkOutputFile(fn:str) -> None:
if proceed == YN[1]:
raise FileExistsError(f"{WARN_MSG_A}{fn}{WARN_MSG_B}")

# make sure the file can be written
# make sure we can write the file to the output directory
try:
fh = open(fn, 'w')
fh.close()
os.remove(fn)
os.access(os.path.dirname(fn), os.W_OK)
except:
raise ValueError(f"{ERR_MSG}{fn}")

Expand Down
10 changes: 10 additions & 0 deletions bin/Primer.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,10 @@ def __init__(self, seq:Seq, contig:str, start:int, length:int, strand:str) -> Pr
self.strand:str = strand
self.Tm:float = None
self.gcPer:float = None
self.hairpinTm:float = None
self.rcHairpin:float = None
self.homodimerTm:float = None
self.rcHomodimer:float = None
self.__length:int = length

# flip start and end if on the minus strand
Expand Down Expand Up @@ -111,6 +115,12 @@ def reverseComplement(self) -> Primer:
new = Primer(self.seq.reverse_complement(), self.contig, self.start, len(self), Primer.MINUS)
else:
new = Primer(self.seq.reverse_complement(), self.contig, self.end, len(self), Primer.PLUS)

# flip the hairpin/dimer Tms
new.hairpinTm = self.rcHairpin
new.rcHairpin = self.hairpinTm
new.homodimerTm = self.rcHomodimer
new.rcHomodimer = self.homodimerTm

# make the new object
return new
Expand Down
32 changes: 32 additions & 0 deletions bin/Product.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
from __future__ import annotations
from typing import Union

class Product:
"""class for storing PCR product data for a pair of primers
"""
def __init__(self, contig:str, size:Union[int,str], fbin:int, rbin:int, dimerTm:float) -> Product:
"""constructor
Args:
contig (str): the contig where the product resides
size (Union[int,str]): the contig size (will be str if multiple sizes)
fbin (int): the bin the forward primer belongs to
rbin (int): the bin the reverse primer belongs to
dimerTm (float): the heterodimer Tm for the primer pair
Returns:
Product: a Product object
"""
# set attributes
self.contig:str = contig
self.size:Union[int,str] = size
self.dimerTm:float = dimerTm
self.fbin:int = fbin
self.rbin:int = rbin

# overloads
def __str__(self) -> str:
return f'{self.contig}: {self.size} bp'

def __repr__(self) -> str:
return str(self)
38 changes: 31 additions & 7 deletions bin/getCandidateKmers.py
Original file line number Diff line number Diff line change
Expand Up @@ -316,18 +316,26 @@ def noLongRepeats(primer:Primer) -> bool:
def noHairpins(primer:Primer) -> bool:
"""verifies that the primer does not form hairpins
"""
# save the hairpin Tms
primer.hairpinTm = primer3.calc_hairpin_tm(str(primer))
primer.rcHairpin = primer3.calc_hairpin_tm(str(primer.reverseComplement()))

# hairpin tm should be less than (minTm - 5°); need to check both strands
fwdOk = primer3.calc_hairpin_tm(str(primer)) < (minTm - FIVE_DEGREES)
revOk = primer3.calc_hairpin_tm(str(primer.reverseComplement())) < (minTm - FIVE_DEGREES)
fwdOk = primer.hairpinTm < (minTm - FIVE_DEGREES)
revOk = primer.rcHairpin < (minTm - FIVE_DEGREES)

return fwdOk and revOk

def noHomodimers(primer:Primer) -> bool:
"""verifies that the primer does not form homodimers
"""
# save the homodimer Tms
primer.homodimerTm = primer3.calc_homodimer_tm(str(primer))
primer.rcHomodimer = primer3.calc_homodimer_tm(str(primer.reverseComplement()))

# homodimer tm should be less than (minTm - 5°); need to check both strands
fwdOk = primer3.calc_homodimer_tm(str(primer)) < (minTm - FIVE_DEGREES)
revOk = primer3.calc_homodimer_tm(str(primer.reverseComplement())) < (minTm - FIVE_DEGREES)
fwdOk = primer.homodimerTm < (minTm - FIVE_DEGREES)
revOk = primer.rcHomodimer < (minTm - FIVE_DEGREES)

return fwdOk and revOk

Expand Down Expand Up @@ -403,18 +411,34 @@ def __buildOutput(kmers:dict[str,dict[str,tuple[str,int,str]]], candidates:list[
# identify which sequence is present in the genome
try:
entry = kmers[cand.seq]
hairpinTm = cand.hairpinTm
rcHairpin = cand.rcHairpin
homodimerTm = cand.homodimerTm
rcHomodimer = cand.rcHomodimer

except:
entry = kmers[cand.seq.reverse_complement()]
hairpinTm = cand.rcHairpin
rcHairpin = cand.hairpinTm
homodimerTm = cand.rcHomodimer
rcHomodimer = cand.homodimerTm

# for each genome
for name in entry.keys():
# extract the data from the entry
contig, start, strand = entry[name]

# save a Primer object in this contig's list

# create the new primer for this genome
primer = Primer(cand.seq, contig, start, len(cand.seq), strand)
primer.hairpinTm = hairpinTm
primer.rcHairpin = rcHairpin
primer.homodimerTm = homodimerTm
primer.rcHomodimer = rcHomodimer

# save the Primer in this contig's list
out[name] = out.get(name, dict())
out[name][contig] = out[name].get(contig, list())
out[name][contig].append(Primer(cand.seq, contig, start, len(cand.seq), strand))
out[name][contig].append(primer)

return out

Expand Down
Loading

0 comments on commit 322f246

Please sign in to comment.