Imputing population heterogeneous datasets #44
Comments
Hi Alba,
Minimac4 imputes each individual haplotype independently, so the imputation
result of one sample will not be affected by other samples. The two
approaches you mentioned should give you exactly the same results except
for R-squared itself.
The R-squared output by minimac4 is an estimate of the imputation accuracy
and is calculated based on the imputation dosages of all input samples.
R-squared = var(HDS) / (p(1-p)), where p = mean(HDS) and HDS is the vector
of haplotype dosages of the input samples at the marker. The only
difference between the two approaches you mentioned is how the R-squared is
calculated: the first approach calculates it over the haplotype dosages of
all samples, and the second approach is equivalent to splitting that vector
into pieces according to the ancestry of the samples and calculating the
R-squared for each piece.
Therefore, the actual imputation accuracy will be the same no matter which
approach you take, but the R-squared can be different.
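A minimal Python sketch of the formula above, to show how the same dosages can give different R-squared values when pooled versus split by ancestry. The function name `estimated_rsq`, the toy dosages, the use of population variance, and the choice to return 0 for monomorphic dosages are all assumptions made for illustration; this is not Minimac4's actual code.

```python
from statistics import mean, pvariance


def estimated_rsq(hds):
    """Estimated R-squared at one marker: var(HDS) / (p * (1 - p)), p = mean(HDS)."""
    p = mean(hds)
    if p <= 0.0 or p >= 1.0:
        # Monomorphic dosages: the ratio is undefined, so report 0 (assumed convention).
        return 0.0
    # Population variance is an assumption; the thread only says "var".
    return pvariance(hds) / (p * (1.0 - p))


# Hypothetical haplotype dosages at one marker for two ancestry groups.
group_a = [0.9, 0.9, 0.9, 0.9]  # allele common in group A
group_b = [0.1, 0.1, 0.1, 0.1]  # allele rare in group B

rsq_pooled = estimated_rsq(group_a + group_b)  # one vector over all samples
rsq_a = estimated_rsq(group_a)                 # each ancestry on its own
rsq_b = estimated_rsq(group_b)
print(rsq_pooled, rsq_a, rsq_b)
```

In this toy case the pooled value is close to 0.64 while each group alone gives 0, even though no dosage changed: the allele-frequency difference between groups contributes variance only when the vectors are pooled.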
Best,
Ketian
On Fri, Sep 17, 2021 at 7:05 AM albaicans ***@***.***> wrote:
Hello,
we are trying to optimize our pipeline of phasing and imputation using
Eagle and Minimac4 and the 1000 Genomes reference panel and we would like
to know your suggestions regarding the best strategy for imputing
heterogeneous datasets.
Our dataset contains individuals with different ancestries in different
proportions: most of the individuals have a European ancestry but we also
have a smaller group of admixed European-African and even smaller groups of
East Asian and African, as well as several individuals with admixed
American ancestry. We are interested in using the imputed data in an
association analysis including all ancestries.
Our first approach was to phase and impute all samples together but we
realized that the imputation accuracy (based on R squared distributions and
alternate allele dosages) was not as good as with homogeneous datasets. We
did some tests imputing different ancestries separately (still using the
whole reference panel) and we got better results for the populations with
big sample size (European and admixed European-African) but for the
populations with small sample size this is not clear. The overall accuracy
of the variants that pass the quality filter (R squared > 0.3) is higher if
we impute them alone compared to when imputing them together with all the
other samples, but we lose about half of the variants, probably because of
low MAF that translates to low R squared.
Based on your experience and your knowledge of the imputation algorithm
and the calculation of the accuracy, what’s the best approach when
phasing/imputing heterogeneous datasets? It looks like we are getting
better results when imputing different populations separately but we are
not sure how much a small sample size (let’s say 15 individuals) can affect
the imputation result and the accuracy estimation.
Thank you in advance!
Best,
Alba
Hi Ketian,
Thank you for your clarification. Eagle may augment the reference panel with inferred target haplotypes. I believe this feature is triggered by default when the number of target samples is larger than half of the reference sample size. So if your sample size is >= 1252 when phasing with 1000G, each sample's phasing could be affected by the other samples phased with it, which may decrease the accuracy for non-European samples (given that your samples are predominantly European). You could turn this feature off by setting --pbwtIters 1 when running Eagle2. However, I am not the right person to ask about phasing, so please confirm these details with the author of Eagle2.
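The threshold reasoning above can be sketched in Python as follows. This is only an illustration: the helper name `extra_eagle_args` is hypothetical, the reference size of 2504 is inferred from the 1252 figure quoted above, and the exact trigger condition should be confirmed against the Eagle2 documentation.

```python
def extra_eagle_args(n_target, n_reference=2504):
    """Return extra Eagle2 arguments for a given target sample count.

    Assumes, as suggested in this thread, that reference augmentation may
    kick in once the targets reach half the reference size (1252 for a
    2504-sample panel). Whether the boundary is strict is not confirmed.
    """
    if 2 * n_target >= n_reference:
        # The only flag named in the thread: restrict Eagle2 to a single
        # PBWT iteration so targets are not phased against each other.
        return ["--pbwtIters", "1"]
    return []


print(extra_eagle_args(1252))  # at the quoted threshold
print(extra_eagle_args(300))   # well below it
```

The returned list is meant to be appended to whatever Eagle2 command line the pipeline already uses.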
Best,
Ketian
Thanks a lot, this was very useful. Best, Alba
Hi again,
Could you share a bash or demo script showing how to run Eagle2 and Minimac4 for phasing and imputation? Thanks.