12 merge sv candidates #14

Merged
merged 92 commits into from
Apr 3, 2024
Conversation

jonperdomo
Collaborator

Merge similar SV candidates.

@jonperdomo jonperdomo linked an issue Nov 21, 2023 that may be closed by this pull request
@jonperdomo
Collaborator Author

jonperdomo commented Nov 21, 2023

Benchmarked CNV calling with Truvari, using ONT long reads for alignment-based calls (see #11) and Illumina short reads for SNP-based CNV predictions. SNPs were called with DeepVariant v1.5.0 (WGS model). The data are Illumina WGS, 2x150 bp, 300X per individual, from GIAB HG002.

Recall is unchanged, but precision is slightly affected for both deletions and duplications. SV merging should improve this.

  • Duplications:

    "TP-base": 53,
    "TP-call": 53,
    "FP": 31462,
    "FN": 4,
    "precision": 0.0016817388545137236,
    "recall": 0.9298245614035088,
    "f1": 0.0033574052958317492,

  • Deletions:

    "TP-base": 82,
    "TP-call": 82,
    "FP": 14435,
    "FN": 0,
    "precision": 0.005648549975890336,
    "recall": 1.0,
    "f1": 0.011233646140146587,

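For reference, the precision, recall, and F1 values above can be reproduced from the reported counts. This is a minimal sketch, assuming the usual definitions (precision from TP-call and FP, recall from TP-base and FN), which are consistent with the numbers in this comment; the function name `sv_metrics` is hypothetical and not part of this PR.

```python
def sv_metrics(tp_base, tp_call, fp, fn):
    """Recompute benchmark summary metrics from raw counts.

    Assumed definitions (consistent with the values reported above):
      precision = TP-call / (TP-call + FP)
      recall    = TP-base / (TP-base + FN)
      f1        = harmonic mean of precision and recall
    """
    precision = tp_call / (tp_call + fp) if (tp_call + fp) else 0.0
    recall = tp_base / (tp_base + fn) if (tp_base + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts taken from the deletion and duplication results above.
deletions = sv_metrics(tp_base=82, tp_call=82, fp=14435, fn=0)
duplications = sv_metrics(tp_base=53, tp_call=53, fp=31462, fn=4)
print(deletions)
print(duplications)
```

With so many false positives, precision dominates F1 even when recall is at or near 1.0, which is why merging candidates is the main lever here.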
@jonperdomo
Collaborator Author

jonperdomo commented Nov 21, 2023

After implementing clipped base + read depth-based merging:

  • Duplications:

    "TP-base": 40,
    "TP-call": 40,
    "FP": 2355,
    "FN": 17,
    "precision": 0.016701461377870562,
    "recall": 0.7017543859649122,
    "f1": 0.03262642740619902,

  • Deletions:

    "TP-base": 78,
    "TP-call": 78,
    "FP": 1023,
    "FN": 4,
    "precision": 0.07084468664850137,
    "recall": 0.9512195121951219,
    "f1": 0.1318681318681319,

@jonperdomo
Collaborator Author

Results with different DBSCAN parameters. The minimum number of samples is set to 2, and I tested epsilon values from 1 to 99 (step size = 1) and from 100 to 1000 (step size = 10):

[Figures: Precision_Recall_DEL, F1_DEL, Precision_Recall_DUP, F1_DUP]
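The parameter sweep described above can be sketched as follows. This is an illustrative, pure-Python DBSCAN, not the implementation used in this PR: representing each SV candidate as a `(start, end)` point and using Euclidean distance are assumptions, and the names `epsilons`, `region_query`, and `dbscan` are hypothetical.

```python
from math import hypot

# Epsilon grid as described: 1..99 in steps of 1, then 100..1000 in steps of 10.
epsilons = list(range(1, 100)) + list(range(100, 1001, 10))
MIN_SAMPLES = 2


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if hypot(x - xi, y - yi) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_samples:
                queue.extend(more)  # j is a core point, keep expanding
    return labels


# Toy (start, end) candidates, made up for illustration only.
candidates = [(100, 200), (105, 205), (5000, 6000)]
for eps in (1, 10, 100):
    print(eps, dbscan(candidates, eps, MIN_SAMPLES))
```

With a minimum of 2 samples, any two candidates within epsilon of each other form a cluster, so epsilon alone controls how aggressively nearby calls are merged.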

@jonperdomo
Collaborator Author

After running benchmarking only in high-confidence intervals:

[NORMAL] Generating agglomerative clustering results
Input file path: agglo_tests/Agglo_Dth_100to1000.out
SV Type: DEL
Maximum Recall: 1.0
Maximum Precision at Maximum Recall: 0.010328756770374103
FP Count at Maximum Recall: 7857
FN Count at Maximum Recall: 0
Number of DELs in Callset: 8016
Number of DELs in Benchmark: 82
SV Type: DUP
Maximum Recall: 0.9298245614035088
Maximum Precision at Maximum Recall: 0.0037679510877292764
FP Count at Maximum Recall: 14013
FN Count at Maximum Recall: 4
Number of DUPs in Callset: 14069
Number of DUPs in Benchmark: 57

[HIGH-CONF] Generating agglomerative clustering results
Input file path: agglo_tests/Agglo_Dth_100to1000_HighConf.out
SV Type: DEL
Maximum Recall: 1.0
Maximum Precision at Maximum Recall: 0.01566265060240964
FP Count at Maximum Recall: 4085
FN Count at Maximum Recall: 0
Number of DELs in Callset: 4150
Number of DELs in Benchmark: 65
SV Type: DUP
Maximum Recall: 0.9736842105263158
Maximum Precision at Maximum Recall: 0.00378632828489562
FP Count at Maximum Recall: 9735
FN Count at Maximum Recall: 1
Number of DUPs in Callset: 9772
Number of DUPs in Benchmark: 38

[NORMAL] Generating DBSCAN clustering results
Input file path: dbscan_tests/E1toE100toE1000_M2.out
SV Type: DEL
Maximum Recall: 0.9878048780487805
Maximum Precision at Maximum Recall: 0.14111498257839722
FP Count at Maximum Recall: 493
FN Count at Maximum Recall: 1
Number of DELs in Callset: 580
Number of DELs in Benchmark: 82
SV Type: DUP
Maximum Recall: 0.8771929824561403
Maximum Precision at Maximum Recall: 0.030432136335970784
FP Count at Maximum Recall: 1593
FN Count at Maximum Recall: 7
Number of DUPs in Callset: 1643
Number of DUPs in Benchmark: 57

[HIGH-CONF] Generating DBSCAN clustering results
Input file path: dbscan_tests/E1toE100toE1000_M2_HighConf.out
SV Type: DEL
Maximum Recall: 0.9846153846153847
Maximum Precision at Maximum Recall: 0.48484848484848486
FP Count at Maximum Recall: 68
FN Count at Maximum Recall: 1
Number of DELs in Callset: 132
Number of DELs in Benchmark: 65
SV Type: DUP
Maximum Recall: 0.9210526315789473
Maximum Precision at Maximum Recall: 0.10144927536231885
FP Count at Maximum Recall: 310
FN Count at Maximum Recall: 3
Number of DUPs in Callset: 345
Number of DUPs in Benchmark: 38
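The "Maximum Precision at Maximum Recall" summary in the logs above can be read as a two-step selection over the parameter sweep: first find the highest recall, then take the most precise run among those reaching it. Below is a minimal sketch of that selection, assuming each run is summarized as a dict of counts; the function name, the dict keys, and the toy sweep values are hypothetical and not taken from this PR's scripts.

```python
from math import isclose


def best_at_max_recall(runs):
    """Return the run with the highest precision among the runs that
    achieve the maximum recall observed in the sweep."""
    def recall(r):
        return r["tp"] / (r["tp"] + r["fn"]) if (r["tp"] + r["fn"]) else 0.0

    def precision(r):
        return r["tp"] / (r["tp"] + r["fp"]) if (r["tp"] + r["fp"]) else 0.0

    max_recall = max(recall(r) for r in runs)
    at_max = [r for r in runs if isclose(recall(r), max_recall)]
    best = max(at_max, key=precision)
    return {**best, "recall": recall(best), "precision": precision(best)}


# Toy sweep results, made up for illustration only.
sweep = [
    {"eps": 10, "tp": 9, "fp": 90, "fn": 1},
    {"eps": 50, "tp": 10, "fp": 40, "fn": 0},
    {"eps": 100, "tp": 10, "fp": 30, "fn": 0},
    {"eps": 500, "tp": 8, "fp": 5, "fn": 2},
]
print(best_at_max_recall(sweep))
```

Selecting this way favors sensitivity first, so a run with much higher precision but slightly lower recall (such as the last toy entry) is not chosen.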

@jonperdomo
Collaborator Author

Precision is significantly improved for DELs using DBSCAN. I could also test increasing the distance threshold for agglomerative clustering:
[Figures: Precision_Recall_DEL (two plots)]

@jonperdomo
Collaborator Author

jonperdomo commented Dec 19, 2023

I have added some initial work on improving SV classification from split reads by removing scenarios where duplications or deletions would not occur. These include:

  • the endpoints of the overlap (deletion only)
  • the endpoints of the alignments in an overlap (duplication only)
  • gap endpoints (duplication or deletion)
  • gap alignment endpoints (duplication only)

Results after this initial work on duplication classification, prior to merging:

  • Duplications:

    "TP-base": 37,
    "TP-call": 37,
    "FP": 17669,
    "FN": 1,
    "precision": 0.0020896871117135436,
    "recall": 0.9736842105263158,
    "f1": 0.004170423805229938,

  • Deletions:

    "TP-base": 65,
    "TP-call": 65,
    "FP": 7932,
    "FN": 0,
    "precision": 0.008128048018006753,
    "recall": 1.0,
    "f1": 0.01612503100967502,

FP count is significantly reduced, see #11 (comment)

@jonperdomo
Collaborator Author

Completed whole-genome sequencing after threading bug fixes and with CNV data disabled for improved performance. Here are some metrics with 250 GB of memory and 40 threads (cores):

Cores per node: 40
CPU Utilized: 2-10:18:32
CPU Efficiency: 36.92% of 6-13:56:00 core-walltime
Job Wall-clock time: 03:56:54
Memory Utilized: 31.34 GB
Memory Efficiency: 12.53% of 250.00 GB

Although wall-clock time is ~4 hours, the allocated core-walltime is ~6.5 days (40 cores x ~4 hours), of which ~2.4 days of CPU time were actually used. The run would therefore take a long time without heavy multi-threading.
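The efficiency figures in the job report above follow from simple arithmetic on the `[D-]HH:MM:SS` durations. The sketch below recomputes them as a sanity check; the helper name `duration_to_seconds` is hypothetical, and the inputs are copied from the report.

```python
def duration_to_seconds(text):
    """Convert a '[D-]HH:MM:SS' duration string to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


cores = 40
cpu_used = duration_to_seconds("2-10:18:32")   # CPU Utilized
wall = duration_to_seconds("03:56:54")         # Job Wall-clock time
core_walltime = cores * wall                   # time available across all cores

cpu_efficiency = 100 * cpu_used / core_walltime
memory_efficiency = 100 * 31.34 / 250.00

print(f"core-walltime: {core_walltime / 86400:.2f} days")
print(f"CPU used:      {cpu_used / 86400:.2f} days")
print(f"CPU efficiency:    {cpu_efficiency:.2f}%")
print(f"Memory efficiency: {memory_efficiency:.2f}%")
```

The ~6.5 day figure is the core-walltime (40 cores times the wall-clock time), while the CPU time actually consumed is a little under 2.5 days.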

@jonperdomo
Collaborator Author

jonperdomo commented Mar 25, 2024

Benchmark results using Truvari v3.5.0 with defaults, refdist=1000
This is after SV merging, and with no post-filtering step.

  • Data: ONT Kit14 R10.4.1 at 60-fold coverage
  • Benchmark: GIAB SV v0.6 in high-confidence regions (9705 SVs)
    • 4261 deletions, 3673 insertions, 1771 duplications (only 6 > 50kb)
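
| SV Caller | F1 | Precision | Recall | TP | FP | FN | SV Caller SV Count |
|-----------|-------|-----------|--------|------|-------|-----|--------------------|
| contextSV | 0.66 | 0.51 | 0.965 | 9306 | 9093* | 340 | 18399 |
| Sniffles2 | 0.954 | 0.944 | 0.963 | 9290 | 547 | 356 | 9837 |
| cuteSV | 0.946 | 0.918 | 0.976 | 9419 | 839 | 227 | 11081 |
| PBSV | 0.942 | 0.928 | 0.956 | 9217 | 711 | 429 | 10582 |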

@jonperdomo jonperdomo marked this pull request as ready for review April 3, 2024 21:32
@jonperdomo jonperdomo merged commit 9b02f6e into main Apr 3, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Merge SV candidates
1 participant