
Releases: pairwise-alignment/pa-bench

A*PA2 evals

25 Mar 12:07

This release is a placeholder release for downloads related to the A*PA2 paper.

results.zip contains the results used by evals/astarpa2/evals.ipynb. See evals/astarpa2/README.md for more info.

A*PA evals

04 Aug 14:47

This release is a placeholder release for downloads related to the A*PA paper.

results.zip contains the results used by evals/astarpa/evals.ipynb. See evals/astarpa/README.md for more info.

Datasets

16 Jan 02:25

This release hosts the datasets used for benchmarking. Datasets are provided in the .seq format, in which each sequence pair to be aligned is stored on two consecutive lines, the first prefixed with > and the second with <, like so:

>CTGGGGTTACAGGCATGCACCAGCACGCC...
<CTGGGGTTACAGGCATGCACCAGCACGCC...
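As an illustration of the format above, here is a minimal Python sketch of a reader for .seq files. The function name `parse_seq_pairs` and the error handling are assumptions made for this example; they are not part of pa-bench, which may parse these files differently.

```python
from typing import Iterable, Iterator, Tuple


def parse_seq_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (first, second) sequence pairs from lines in the .seq format.

    Assumption: every pair is a '>' line immediately followed by a '<' line.
    Blank lines are skipped; anything else raises ValueError.
    """
    first = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if first is not None:
                raise ValueError("two '>' lines without a '<' line in between")
            first = line[1:]
        elif line.startswith("<"):
            if first is None:
                raise ValueError("'<' line without a preceding '>' line")
            yield first, line[1:]
            first = None
        else:
            raise ValueError(f"unexpected line prefix: {line[:1]!r}")
    if first is not None:
        raise ValueError("file ends with a '>' line that has no partner")
```

A file can be read with `with open(path) as f: pairs = list(parse_seq_pairs(f))`, where `path` points to any of the .seq files in the archives below.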

ont-500k.zip: ONT reads of length >500kbp @ 6.1% divergence

Contains 50 .seq files (seq01.seq .. seq50.seq) each containing a single alignment. This dataset contains only read errors.

This dataset was created by downloading some reads (this download, 300GB total) used for v1.1 of CHM13, and aligning them back to the reference.
See Snakefile for details.

ont-500k-genvar.zip: ONT reads of length >500kbp @ 7.2% divergence, including genetic variation

Contains 48 .seq files each containing a single alignment. This dataset includes genetic variation and large gaps.

This dataset is reused directly from BiWFA and is also available in the BiWFA repository. We provide it here for completeness, with the only change being that seq[1-9].seq have been renamed to seq[01-09].seq. It was generated by the BiWFA authors by taking ONT MinION reads from Bowden et al. (2019), filtering them for a length of at least 500kbp, and aligning them to the CHM13 v1.1 assembly.
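The renaming mentioned above amounts to zero-padding single-digit file numbers so that the files sort lexicographically. A minimal sketch follows; the helper name `zero_padded` is hypothetical and this is not necessarily how the files were actually renamed.

```python
import re


def zero_padded(name: str) -> str:
    """Map seq1.seq .. seq9.seq to seq01.seq .. seq09.seq; leave other names unchanged."""
    match = re.fullmatch(r"seq([1-9])\.seq", name)
    return f"seq0{match.group(1)}.seq" if match else name
```

Applied to a directory listing, this could be combined with `os.rename(old, zero_padded(old))` for each file name.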

ont-10k.zip ont-50k.zip: ONT reads of length <10k and <50k @ 12% divergence

These each contain 50 .seq files, with 100 and 200 sequence pairs per file, respectively. Pairs are sorted by edit distance, with the closest pairs in 00.seq.

These datasets were reused from BiWFA and only modified to split them into multiple files.

ont-1k.zip: ONT reads of length <1k @ 10% divergence

Contains 50 .seq files with ~250 sequence pairs each, sorted by increasing edit distance.

This dataset is reused from WFA.

sars-cov-2.zip: 10000 pairs of length 30k @ 1.5% divergence

This dataset was generated by downloading 500MB of SARS-CoV-2 genomes. We stripped all non-ACTG characters, sampled 10000 random pairs, sorted them by edit distance, and split them into 50 files, each containing 200 pairs.
Average divergence is 1.5%.
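The steps described above (strip, sample, sort, split) can be sketched in Python as follows. This is an illustration under stated assumptions, not the script that produced the dataset: the function names, the fixed random seed, sampling pairs of distinct genomes, and the textbook quadratic edit-distance routine are all choices made for this example.

```python
import random
import re
from typing import List, Sequence, Tuple


def strip_non_actg(genome: str) -> str:
    """Remove every character other than A, C, G, T (assumes upper-case input)."""
    return re.sub(r"[^ACGT]", "", genome)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the standard O(len(a) * len(b)) DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def make_dataset(genomes: Sequence[str], num_pairs: int, num_files: int,
                 seed: int = 0) -> List[List[Tuple[str, str]]]:
    """Sample random pairs, sort them by edit distance, split into equal files."""
    if num_pairs % num_files != 0:
        raise ValueError("num_pairs must be divisible by num_files")
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility here
    cleaned = [strip_non_actg(g) for g in genomes]
    pairs = [tuple(rng.sample(cleaned, 2)) for _ in range(num_pairs)]
    pairs.sort(key=lambda p: edit_distance(p[0], p[1]))
    per_file = num_pairs // num_files
    return [pairs[i * per_file:(i + 1) * per_file] for i in range(num_files)]
```

With `num_pairs=10000` and `num_files=50` this yields 50 groups of 200 pairs, matching the layout described above. The pure-Python edit distance is far too slow for 30k-long genomes at that scale, so in practice a fast aligner would be needed for the sorting step.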

illumina.seq.tar.bz: Illumina reads of length 100 @ 0.3% divergence

Contains a single .seq file with 100'000 pairs. Reused from WFA.