-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tumor only variant calling #81
Comments
Yes, generally you can use this workflow to call variants on any type of sample setup, including the case where you only have a tumor sample. You can even specify a normal sample along-side it (for which you don't have data) and thus leverage a tumor sample purity value that you usually get for tumor sequencing data. You just have to create a calling scenario that describes your sample setup, have a look at this example: And for general infos on what you need to do in calling (and descriptions of all the possible entries in this scenario YAML file), have a look here: For any variant type, you'll need candidate calls. Currently, this pipeline supports generating those candidates with Freebayes (SNVs, MNVs, some short indels) and Delly (insertions and deletions), but varlociraptor itself also supports larger structural variants specified via their breakpoints. And for getting started with actually deploying the workflow, have a look at the deployment instructions for this workflow (they are also linked to in the main I hope this gives some orientation around the varlociraptor docs and the capabilities of the caller and the workflow. Don't hesitate to ask further questions via the issues here or in the varlociraptor repo. And feel free to close the issue, if this answers the current question for you. |
Thank you for your detailed answer! Is it always better to generate candidates with multiple callers (eg. Mutect + Vardict + Strelka + Octopus) and let Varlociraptor detect the true variants, or are there cases in which using only one caller yields better results? I do not care about runtimes. I care about Precision and Recall. |
For the candidate callers, all you should care about is sensitivity / recall, not precision, as varlociraptor will take care of that and you can easily filter results in a meaningful way using the false discovery rate or Bayesian odds: Thus, generally using more callers to generate candidates should improve the recall of your results, as every caller will yield different results. Unless one caller yields a clear subset of candidates of another caller, it makes sense to include both callers in candidates calling (if you REALLY don't care about runtimes;). Also, it makes sense to have a look at the possible parameters of candidate callers and turn off any automatic (heuristic) filtering of variants they might be doing. That said, I would say that for SNVs, using a tool like Freebayes without any filtering is good enough to detect any potential variants as sites with some non-reference nucleotide present. But for MNVs, insertions, deletions and structural variants, the callsets will differ a lot more among callers. So it might make sense to investigate callsets a bot among the respective callers, if you want to increase recall for those variant types. If you do have a look into this, we're interested in any feedback and suggestions you might come up with! Which candidate caller combination is best is still something to be investigated in detail, see e.g. this issue in the varlociraptor repo: |
I will try to investigate a few things about the Tumor-only case in which I need to detect both Germline and Somatic variants:
My Tumor-only samples are WES so I will not investigate SV callers like Delly. |
These are the preliminary results using MAF 0.01 (1%) and no filters: HaplotypeCaller: 75,727 variants detected Ensemble calls (each variant in at least 1 caller): 447,550 variants detected Ensemble calls (each variant in at least 1 caller) using PASS filter: ~220,000 variants detected As you can see Vardict detected most of the variants. I am really worried about keeping those variant that do not have a PASS filter but as you said we want RECALL and Varlociraptor will take care of the rest. Are you confident that I do not have to do a more strict analysis and filtering and Varlociraptor will handle it? |
Nice, thanks a lot for investigating and documenting this here! For freebayes, are you using the parameters as they are specified in this workflow? dna-seq-varlociraptor/workflow/rules/candidate_calling.smk Lines 14 to 17 in 52a139f
Namely, this turns off genotyping (speeding up what freebayes does) and basically calls everything with at least one alternative read. For further increasing sensitivity (for low-frequency somatic variants), and as you eventually don't care about runtimes, you should decrease the As for Vardict, those seem to be a lot of unique variants. But if you don't care about runtimes, varlociraptor should remove all junk calls reliably. I.e. I would not worry about false positives (precision), if you use the false discovery rate for filtering. So definitely no harder filtering necessary by varlociraptor, the false discovery rate should do the job rigorously. The only trouble you might have with tumor-only calling, might actually be that the lack of data from the normal sample that you specify for the purity, might actually reduce the eventual varlociraptor recall quite a bit. Especially if allele frequencies become ambiguous (for example if the purity is 0.5 and several alternative allele frequencies have multiple interpretations; for example an alternative allele frequency of 0.25 can mean a clonal somatic heterozygous variant, but also a germline heterozygous variant lost somatically). But you can only genuinely address that problem (with any caller) by actually sequencing a normal sample, which is what I would always try to do (or get collaborators to do). But I understand this is not an option in your setup (and have had similar situations myself)... |
I am running this code for freebayes:
Sorry for the long list of questions. I am just trying to get the most of this analysis. Thank you so much for addressing my conserns! |
Yes, the statistical model will be able to deal with that as long as the coverage is large enough. Of course, probabilities will get weaker and weaker the lower the fraction becomes if the coverage is not sufficient.
Actually for FDR control on tumor-only samples I would always provide multiple callsets. The first should control FDR over the somatic event only. The second over the germline event, and the third over the union of both events (the control-fdr command can take multiple events at the same time; it will then calculate the sum of their posteriors before filtering by FDR). This way, you also capture the unclear cases (which will be many). One could also consider just controling the third event and leave the distinguishing to manual inspection. Often this makes more sense (e.g. in a clinical setting).
You will need a proper estimate for the contamination, which one can retrieve from pathologists, or by looking at certain variants from which you know that they have to be somatic and clonal (e.g. from your tumor entity). TP53 can perhaps be such a candidate.
Bias estimation usually only needs to be excluded in case of amplicon sequencing (where sometimes all reads are e.g. from the same strand or only a single amplicon overlaps a locus). |
What should I expect from the FDR controled union of both events? What differencies may I observe between the somatic only and germline only? Does FDR control filtering Tag variants to events in the vcf file (in the union case)? eg. variant X SOMATIC |
The FDR control will simply output a filtered VCF (actually BCF) with only the variants that are selected (i.e. the FILTER column is not used). When you take the union of events, you will get both germline and somatic. But the records keep the posterior probabilities, so that you can see the uncertainty between the two cases. If you filter only for somatic or only for germline via the FDR control, cases where Varlociraptor cannot decide between the two will be lost, only the certain cases will remain. And naturally, since you don't have a normal sample in your case, there will be lots of uncertain cases. Hence the union is important to look at. |
Btw. a quick hint as your scenario sounds like you are in a clinical context: use local FDR control (--local) instead of global. This way, you also don't need the odds filtering. Global FDR control is more for explorative cases: it will fill up your callset with more uncertain cases up to the desired FDR. |
Please correct me if I say something wrong:
This means that --local and FDR control on Somatic only will give me a filtered bcf with the desired fdr control from cases that are at least favorable to be somatic (those that you call certain)? FDR control will then be applied only to those cases? And --local and fdr control on Union will treat ALL cases (even the not certain ones) and then apply fdr control on those? |
Yes |
One more question! The Union will contain uncertain cases (Somatic or Germline) that Varlociraptor is certain that are not sequencing errors etc? |
The short answer is But maybe this quick example helps for a good intuition of how the FDR control works: The posterior probabilities that varlociraptor provides for all the events you find in the output BCF add up to Thus, to generally call variants on a tumor-only sample, controlling for the false discovery rate of both jointly makes the most sense. And yes, this basically gives you all the variants that varlociraptor is sure are real (vs. some type of artifact). This should work with:
And then in some cases, the data will be clearer towards |
Your answer is exceptional! Can you please also give me an explanation of how --local behaves? |
The general FDR approach basically sorts The local FDR simply checks whether a site has a One key difference is, that with the general FDR approach, you can theoretically have I hope this gives a good intuition of when to use which. I'd opt for regular / global FDR when going for new discoveries that you then go and analyze further somehow. And especially if you have lots of certain variants you're looking at, and don't expect any such big jumps that fill up your callset, so usually what you'd expect from whole genome or whole exome callsets. If you're looking to identify certain variants from smaller callsets (capture kits, e.g.), I would definitely go for a local FDR. Similarly, if avoiding this "filling up" with uncertain variants is important, e.g. for diagnosis where such uncertain variants could be problematic (you can of course see this by simply looking at the |
Another exceptional answer! Thank you so much! |
Using FDR 5% GLOBAL on SOMATIC event got me around 109.000 variants on exome Tumor-only sample. So it seems that the GLOBAL 5% FDR gets me variants up to 12% LOCAL FDR to fill up the list of variants. 109.000 seem a lot of Somatic variants for an exome. Mutect2 with filtering gives me around 20.000 variants. So 55.380 variants from LOCAL seem to be reasonable. Do you have any prior expectations about the average number of somatic mutations in exome sequencing data? Number of total observations before any filtering: 3,134,662. Can you tell me your thoughts? I found 9,977 variants that exist in PON provided by GATK and FDR 5% LOCAL Somatic callset. Thank you for your time! |
Hi everyone,
Can I use this workflow for tumor only variant calling to detect both Somatic and Germline variants? Or should I first run my data with Mutect in tumor only mode and then use varlociraptor?
In general, do you have any recommendations for Tumor only variant calling?
Thank you for your time,
Konstantinos Kyriakidis
The text was updated successfully, but these errors were encountered: