12 merge sv candidates #14

jonperdomo · 2023-11-21T15:17:09Z

Merge similar SV candidates.

jonperdomo · 2023-11-21T15:23:14Z

Tested CNV calling, benchmarked with Truvari, using ONT long reads for alignment-based calls (see #11) and Illumina short reads for SNP-based CNV predictions. SNPs were called using the DeepVariant v1.5.0 WGS model. The data are Illumina WGS 2x150bp at 300X per individual from GIAB HG002.

Recall is unchanged, but precision is slightly affected for both deletions and duplications. SV merging should improve this.

Duplications:

"TP-base": 53,
"TP-call": 53,
"FP": 31462,
"FN": 4,
"precision": 0.0016817388545137236,
"recall": 0.9298245614035088,
"f1": 0.0033574052958317492,

Deletions:

"TP-base": 82,
"TP-call": 82,
"FP": 14435,
"FN": 0,
"precision": 0.005648549975890336,
"recall": 1.0,
"f1": 0.011233646140146587,
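
For reference, the precision, recall, and F1 values above can be reproduced from the raw counts. This is a minimal sketch, assuming the usual Truvari convention that precision uses `TP-call` and recall uses `TP-base`; the function name `benchmark_metrics` is hypothetical and is not part of this PR.

```python
def benchmark_metrics(tp_base, tp_call, fp, fn):
    """Return (precision, recall, f1) from Truvari-style summary counts.

    Assumes precision = TP-call / (TP-call + FP) and
    recall = TP-base / (TP-base + FN).
    """
    called = tp_call + fp
    expected = tp_base + fn
    precision = tp_call / called if called else 0.0
    recall = tp_base / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from the duplication and deletion summaries above.
dup_metrics = benchmark_metrics(tp_base=53, tp_call=53, fp=31462, fn=4)
del_metrics = benchmark_metrics(tp_base=82, tp_call=82, fp=14435, fn=0)

print("DUP precision/recall/f1:", dup_metrics)
print("DEL precision/recall/f1:", del_metrics)
```

With recall near 1.0 and tens of thousands of false positives, F1 is dominated by precision, which is why merging candidates is the main lever here.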

jonperdomo · 2023-11-21T19:34:32Z

After implementing clipped base + read depth-based merging:

Duplications:

"TP-base": 40,
"TP-call": 40,
"FP": 2355,
"FN": 17,
"precision": 0.016701461377870562,
"recall": 0.7017543859649122,
"f1": 0.03262642740619902,

Deletions:

"TP-base": 78,
"TP-call": 78,
"FP": 1023,
"FN": 4,
"precision": 0.07084468664850137,
"recall": 0.9512195121951219,
"f1": 0.1318681318681319,

jonperdomo · 2023-11-28T21:06:38Z

Results with different DBSCAN parameters. The minimum number of samples is set to 2, and epsilon is tested from 1-99 (step size=1) and from 100-1000 (step size=10).
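
The sweep described here can be sketched as follows. This is an illustration only, not the code in this PR: it assumes the upper bound of 1000 is inclusive and that clustering runs on a single coordinate (for example SV start positions), which the comment does not state. In one dimension, DBSCAN with a minimum of 2 samples (counting the point itself, as scikit-learn does) reduces to chaining sorted points whose consecutive gaps are at most epsilon, so no library is needed. The names `epsilon_grid` and `dbscan_1d_min2` and the example positions are hypothetical.

```python
def epsilon_grid():
    """Epsilon values 1..99 (step 1) followed by 100..1000 (step 10)."""
    return list(range(1, 100)) + list(range(100, 1001, 10))


def dbscan_1d_min2(positions, eps):
    """Cluster 1D positions the way DBSCAN does with min_samples=2.

    Any point with a neighbor within eps is a core point, so clusters are
    runs of sorted points with consecutive gaps <= eps. Isolated points
    are reported as noise. Returns (clusters, noise).
    """
    clusters, noise, run = [], [], []
    for pos in sorted(positions):
        if run and pos - run[-1] > eps:
            # Gap too large: close the current run.
            if len(run) >= 2:
                clusters.append(run)
            else:
                noise.extend(run)
            run = []
        run.append(pos)
    if len(run) >= 2:
        clusters.append(run)
    else:
        noise.extend(run)
    return clusters, noise


# Hypothetical SV start positions, for illustration only.
starts = [1000, 1004, 1010, 5000, 5300, 9000]

grid = epsilon_grid()
clusters_per_eps = {eps: len(dbscan_1d_min2(starts, eps)[0]) for eps in grid}

print(len(grid), "epsilon values tested")
print("clusters at eps=10:", dbscan_1d_min2(starts, 10))
print("clusters at eps=300:", dbscan_1d_min2(starts, 300))
```

Larger epsilon values merge more candidates into each cluster, which reduces the callset size at the risk of merging distinct SVs.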

jonperdomo · 2023-12-01T19:26:15Z

After running benchmarking only in high-confidence intervals:

[NORMAL] Generating agglomerative clustering results
Input file path: agglo_tests/Agglo_Dth_100to1000.out
SV Type: DEL
Maximum Recall: 1.0
Maximum Precision at Maximum Recall: 0.010328756770374103
FP Count at Maximum Recall: 7857
FN Count at Maximum Recall: 0
Number of DELs in Callset: 8016
Number of DELs in Benchmark: 82
SV Type: DUP
Maximum Recall: 0.9298245614035088
Maximum Precision at Maximum Recall: 0.0037679510877292764
FP Count at Maximum Recall: 14013
FN Count at Maximum Recall: 4
Number of DUPs in Callset: 14069
Number of DUPs in Benchmark: 57

[HIGH-CONF] Generating agglomerative clustering results
Input file path: agglo_tests/Agglo_Dth_100to1000_HighConf.out
SV Type: DEL
Maximum Recall: 1.0
Maximum Precision at Maximum Recall: 0.01566265060240964
FP Count at Maximum Recall: 4085
FN Count at Maximum Recall: 0
Number of DELs in Callset: 4150
Number of DELs in Benchmark: 65
SV Type: DUP
Maximum Recall: 0.9736842105263158
Maximum Precision at Maximum Recall: 0.00378632828489562
FP Count at Maximum Recall: 9735
FN Count at Maximum Recall: 1
Number of DUPs in Callset: 9772
Number of DUPs in Benchmark: 38

[NORMAL] Generating DBSCAN clustering results
Input file path: dbscan_tests/E1toE100toE1000_M2.out
SV Type: DEL
Maximum Recall: 0.9878048780487805
Maximum Precision at Maximum Recall: 0.14111498257839722
FP Count at Maximum Recall: 493
FN Count at Maximum Recall: 1
Number of DELs in Callset: 580
Number of DELs in Benchmark: 82
SV Type: DUP
Maximum Recall: 0.8771929824561403
Maximum Precision at Maximum Recall: 0.030432136335970784
FP Count at Maximum Recall: 1593
FN Count at Maximum Recall: 7
Number of DUPs in Callset: 1643
Number of DUPs in Benchmark: 57

[HIGH-CONF] Generating DBSCAN clustering results
Input file path: dbscan_tests/E1toE100toE1000_M2_HighConf.out
SV Type: DEL
Maximum Recall: 0.9846153846153847
Maximum Precision at Maximum Recall: 0.48484848484848486
FP Count at Maximum Recall: 68
FN Count at Maximum Recall: 1
Number of DELs in Callset: 132
Number of DELs in Benchmark: 65
SV Type: DUP
Maximum Recall: 0.9210526315789473
Maximum Precision at Maximum Recall: 0.10144927536231885
FP Count at Maximum Recall: 310
FN Count at Maximum Recall: 3
Number of DUPs in Callset: 345
Number of DUPs in Benchmark: 38
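
The "Maximum Precision at Maximum Recall" lines in these logs amount to a two-step selection over the parameter sweep: find the highest recall, then take the best precision among the runs that reach it. A minimal sketch, assuming each sweep result is stored as a dict; the function name `best_at_max_recall`, the dict keys, and the example values are hypothetical and are not taken from the benchmarking script.

```python
def best_at_max_recall(results):
    """Pick the run with the highest precision among runs at maximum recall."""
    max_recall = max(r["recall"] for r in results)
    at_max = [r for r in results if r["recall"] == max_recall]
    return max(at_max, key=lambda r: r["precision"])


# Made-up sweep results, for illustration only.
sweep = [
    {"param": 10, "precision": 0.05, "recall": 1.0},
    {"param": 50, "precision": 0.14, "recall": 1.0},
    {"param": 200, "precision": 0.40, "recall": 0.95},
]

best = best_at_max_recall(sweep)
print("Optimal parameter:", best["param"])
print("Maximum Recall:", best["recall"])
print("Maximum Precision at Maximum Recall:", best["precision"])
```

This criterion never trades recall for precision, so a run with much higher precision but slightly lower recall (the third entry above) is not selected.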

jonperdomo · 2023-12-01T19:49:54Z

Precision is significantly improved for DELs using DBSCAN. Could also test increasing the distance threshold for agglomerative clustering.

jonperdomo · 2023-12-19T16:09:27Z

I have added some initial work on improving SV classification from split reads by removing some scenarios where duplications or deletions would not occur, including:
- the endpoints of the overlap (deletion only)
- the endpoints of the alignments in an overlap (duplication only)
- gap endpoints (duplication or deletion)
- gap alignment endpoints (duplication only)

After this initial work on duplication classification prior to merging:

Duplications:

"TP-base": 37,
"TP-call": 37,
"FP": 17669,
"FN": 1,
"precision": 0.0020896871117135436,
"recall": 0.9736842105263158,
"f1": 0.004170423805229938,

Deletions:

"TP-base": 65,
"TP-call": 65,
"FP": 7932,
"FN": 0,
"precision": 0.008128048018006753,
"recall": 1.0,
"f1": 0.01612503100967502,

FP count is significantly reduced, see #11 (comment)

jonperdomo · 2024-03-12T17:12:49Z

Completed a whole-genome run after threading bug fixes, with CNV data disabled for improved performance. Here are some metrics with 250G memory and 40 threads (cores):

Cores per node: 40
CPU Utilized: 2-10:18:32
CPU Efficiency: 36.92% of 6-13:56:00 core-walltime
Job Wall-clock time: 03:56:54
Memory Utilized: 31.34 GB
Memory Efficiency: 12.53% of 250.00 GB

Although wall-clock time is ~4 hours, the CPU time used is much higher at ~2.4 days (out of ~6.5 days of allocated core-walltime), so the run would take a long time without heavy multi-threading.
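
The efficiency figure in the job report can be checked by hand: core-walltime is the wall-clock time multiplied by the number of cores, and CPU efficiency is the CPU time used divided by that. A short sketch of the arithmetic, assuming the `[days-]HH:MM:SS` time format shown in the report; the helper name `to_seconds` is hypothetical.

```python
def to_seconds(timestamp):
    """Convert a '[days-]HH:MM:SS' string to seconds."""
    days, _, clock = timestamp.rpartition("-")
    hours, minutes, seconds = (int(x) for x in clock.split(":"))
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


cores = 40
cpu_used = to_seconds("2-10:18:32")        # CPU Utilized
wall_clock = to_seconds("03:56:54")        # Job Wall-clock time
core_walltime = cores * wall_clock         # 6-13:56:00 in the report
cpu_efficiency = 100 * cpu_used / core_walltime

print(f"CPU time used: {cpu_used / 86400:.2f} days")
print(f"Core-walltime: {core_walltime / 86400:.2f} days")
print(f"CPU efficiency: {cpu_efficiency:.2f}%")
```

The same CPU time on a single core would take roughly 58 hours, which is the practical cost of running without multi-threading.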

jonperdomo · 2024-03-25T15:34:45Z

Benchmark results using Truvari v3.5.0 with defaults, refdist=1000
This is after SV merging, and with no post-filtering step.

Data: ONT Kit14 R10.4.1 at 60-fold coverage
Benchmark: GIAB SV v0.6 in high-confidence regions (9705 SVs)
- 4261 deletions, 3673 insertions, 1771 duplications (only 6 > 50kb)

SV Caller	F1	Precision	Recall	TP	FP	FN	SV Caller SV Count
contextSV	0.66	0.51	0.965	9306	9093*	340	18399
Sniffles2	0.954	0.944	0.963	9290	547	356	9837
cuteSV	0.946	0.918	0.976	9419	839	227	11081
PBSV	0.942	0.928	0.956	9217	711	429	10582

jonperdomo added 4 commits November 17, 2023 12:10

Remove comments

ff5f650

Work on sv merger

358ec3d

Work on clipped base support

bd09fcf

Add long and short read alignment arguments

912cfdc

jonperdomo linked an issue Nov 21, 2023 that may be closed by this pull request

Merge SV candidates #12

Closed

Work on SV merging

39c7fec

jonperdomo added 3 commits November 22, 2023 11:48

Fix end positions

ad3de6d

Update coordinates code

5f4d850

Add DBScan testing plots

4285e73

jonperdomo added 4 commits November 28, 2023 16:07

Update DBScan tests

ea5b859

Test with agglomerative clustering

ea51aa1

Print maximum recall and corresponding precision

4fcd09e

Add SV counts

10e16ce

jonperdomo added 9 commits December 1, 2023 14:50

Print optimal parameter value

0434c8e

Add CNV whole genome predictions

f2e9bd1

Add whole genome sv calling

d66adbc

Add whole genome pfb support

9ed4688

Remove script file

fa221e0

Update gnomad filepaths

c554556

Add threading by chromosome

31e5086

Add mutexes

2c324a9

Work on duplication classification

e02ea60

jonperdomo added 2 commits December 22, 2023 14:37

Simplify split read evidence and tested with CNV data

664a965

Index AF VCF and add CNV types

45d77e3

jonperdomo added 13 commits February 23, 2024 14:57

Add SV surrounding SNP predictions

0f7534d

Update SNP window bins to non-overlapping

6cc4827

Add SV distribution plots, update genotypes

0a63ae7

Add option to extend CNV regions

8a92faf

Fix command argument typo

a2540e5

Update threading

e315133

Add plot argument and update threading

20549ca

Update test due to default PFB update

17234d9

Update split read detection

cbfd549

Fix pointer and parallelism issues with hmm

e3e6077

Region chunk bug fix

0f4f4dc

Efficiency update

ec76c17

Add option to disable extra cnv steps

3e831aa

jonperdomo added 3 commits March 18, 2024 16:40

CNV prediction updates and debugging

0bd76d6

sv merger efficiency update

9498d4d

Update merger

253a4a5

jonperdomo added 10 commits March 25, 2024 14:38

remove debugging code

79ea15c

Update readme and remove testing code

e651f1d

Add window size parameter

260f8d3

Update CNV plot style

12024d8

Fix vcf conventions

4fc7051

Update vcf

4d58004

Update README.md

ef16966

Update README

cab2926

Update readme and help

b47a72d

Add conda files

d02ac78

jonperdomo marked this pull request as ready for review April 3, 2024 21:32

jonperdomo merged commit 9b02f6e into main Apr 3, 2024
1 check passed
