Understanding sav benefits. #11
Comments
SAV will show the most improvement for datasets with large sample sizes that contain either WGS genotypes (GT-only) or imputed genotypes stored as DS or HDS. Non-sparse data can see improvements when enabling PBWT for those fields.
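To illustrate why PBWT can help non-sparse fields, here is a minimal Python sketch of the textbook positional Burrows-Wheeler transform for biallelic haplotypes. It is not savvy's implementation, and the function names (`pbwt_columns`, `pbwt_inverse`, `count_runs`) are hypothetical. At each site the haplotypes are emitted in an order determined by the preceding sites, so similar haplotypes end up adjacent and each column tends to contain longer runs, which a general-purpose compressor handles well.

```python
def pbwt_columns(haplotypes):
    """Return one reordered allele column per site (textbook PBWT, 0/1 alleles)."""
    order = list(range(len(haplotypes)))  # current haplotype ordering
    columns = []
    for k in range(len(haplotypes[0])):
        # Emit site k using the ordering derived from sites 0..k-1.
        col = [haplotypes[i][k] for i in order]
        columns.append(col)
        # Stable partition: haplotypes carrying 0 first, then those carrying 1.
        zeros = [i for i, a in zip(order, col) if a == 0]
        ones = [i for i, a in zip(order, col) if a != 0]
        order = zeros + ones
    return columns


def pbwt_inverse(columns):
    """Rebuild the original haplotypes; the ordering is recomputed from the data."""
    n = len(columns[0])
    order = list(range(n))
    haplotypes = [[0] * len(columns) for _ in range(n)]
    for k, col in enumerate(columns):
        for pos, i in enumerate(order):
            haplotypes[i][k] = col[pos]
        zeros = [i for i, a in zip(order, col) if a == 0]
        ones = [i for i, a in zip(order, col) if a != 0]
        order = zeros + ones
    return haplotypes


def count_runs(col):
    """Number of maximal runs of equal values in a column."""
    return sum(1 for i, v in enumerate(col) if i == 0 or v != col[i - 1])


# Toy input: two haplotype patterns interleaved across six haplotypes.
P, Q = [1, 0, 1, 1], [0, 1, 0, 0]
haps = [P, Q, P, Q, P, Q]
raw_runs = sum(count_runs([h[k] for h in haps]) for k in range(4))
pbwt_runs = sum(count_runs(c) for c in pbwt_columns(haps))
```

Because the ordering at each site depends only on earlier sites, a reader can recompute it while decoding, so no permutation needs to be stored in this sketch. Whether savvy stores or recomputes its ordering is not something this example claims.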
We are working on a manuscript that will provide a more exhaustive comparison with BCF and other file formats, but I can give you a sneak peek with 1000 Genomes data. While SAV does well at compressing 1000g, the improvements compared to BCF are much greater when you scale to hundreds of thousands of samples.
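As a rough illustration of why sparsity matters more as sample counts grow, here is a minimal Python sketch of a sparse vector. The names `to_sparse` and `to_dense` are hypothetical, and this is not savvy's API or on-disk layout. For a variant carried by only a handful of samples, a dense record grows with the number of samples, while the sparse form stores only the offsets and values of the non-zero entries.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (size, offsets, values)."""
    offsets = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in offsets]
    return len(dense), offsets, values


def to_dense(size, offsets, values):
    """Expand a sparse record back into a dense list, filling with zeros."""
    dense = [0] * size
    for i, v in zip(offsets, values):
        dense[i] = v
    return dense


# One record for 100,000 samples where only two samples carry a non-reference value.
record = [0] * 100_000
record[5] = 1
record[70_000] = 2
size, offsets, values = to_sparse(record)
```

Here the dense record holds 100,000 entries, while the sparse form holds two offsets and two values, and that count stays the same however many reference-only samples are added.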
@jonathonl Thank you! That's very helpful.
I'm trying to catch up on what's been going on in the world of alternative VCF representations, and I'm trying to understand what the benefits of savvy are vs. BCF. I've run into a few questions.
It seems like the big difference is the addition of a sparse vector type. The random VCF files I've tried savving haven't seen any appreciable size improvement from running `sav import` on them, though, so I was wondering if you had some examples of files that benefited from using savvy. I suspect I'm either using files that don't particularly benefit from the sparsity reduction, or I've misconfigured my import.
I don't understand how PBWT is used by sav files and what benefit that gives. Does it only apply to genotype fields? I tried looking in the code, but I couldn't find where it actually computes PBWT. It seems like it's just tagging fields as being PBWT sorted? Is this passing through something processed upstream and just acting as a marker for it? How is this intended to be used? I'm not really a C++ programmer, so I may have just missed something obvious.
From what I can tell, sav doesn't directly address the problem of encoding gVCF files efficiently (although they could probably benefit from the sparse vector type when encoding sparse PLs). Is that outside of the mandate of the sav format?
Thank you. Let me know if there's a better forum for asking general non-code questions about savvy.