Imputing population heterogeneous datasets #44

Open
albaicans opened this issue Sep 17, 2021 · 6 comments

Comments

@albaicans

Hello,

We are trying to optimize our phasing and imputation pipeline, which uses Eagle and Minimac4 with the 1000 Genomes reference panel, and we would like your suggestions on the best strategy for imputing heterogeneous datasets.

Our dataset contains individuals with different ancestries in different proportions: most of the individuals have a European ancestry but we also have a smaller group of admixed European-African and even smaller groups of East Asian and African, as well as several individuals with admixed American ancestry. We are interested in using the imputed data in an association analysis including all ancestries.

Our first approach was to phase and impute all samples together, but we found that the imputation accuracy (based on R-squared distributions and alternate allele dosages) was not as good as with homogeneous datasets. We then ran tests imputing each ancestry separately (still using the whole reference panel). Results improved for the populations with large sample sizes (European and admixed European-African), but for the populations with small sample sizes the picture is less clear: the overall accuracy of the variants that pass the quality filter (R-squared > 0.3) is higher when we impute them alone than when we impute them together with all the other samples, but we lose about half of the variants, probably because low MAF translates into low R-squared.

Based on your experience and your knowledge of the imputation algorithm and the calculation of the accuracy, what’s the best approach when phasing/imputing heterogeneous datasets? It looks like we are getting better results when imputing different populations separately but we are not sure how much a small sample size (let’s say 15 individuals) can affect the imputation result and the accuracy estimation.

Thank you in advance!

Best,

Alba
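
The trade-off described above (higher accuracy among variants that pass the filter, but fewer variants passing) can be made explicit by summarizing both numbers for each run. The following is a minimal sketch, not part of the original pipeline: the function name `summarize_rsq` and the plain list of per-variant R-squared values are assumptions for illustration, and the values would have to be extracted from the imputation output beforehand.

```python
from statistics import mean


def summarize_rsq(rsq_values, threshold=0.3):
    """Summarize how many variants pass an R-squared filter.

    rsq_values: per-variant estimated R-squared values (hypothetical input,
    extracted from the imputation output by some earlier step).
    threshold: variants are kept only if R-squared is strictly greater.
    """
    passing = [r for r in rsq_values if r > threshold]
    n_total = len(rsq_values)
    return {
        "n_total": n_total,
        "n_pass": len(passing),
        # Fraction of variants retained after filtering.
        "frac_pass": len(passing) / n_total if n_total else 0.0,
        # Mean R-squared among retained variants only.
        "mean_rsq_pass": mean(passing) if passing else float("nan"),
    }


if __name__ == "__main__":
    # Made-up values for two hypothetical runs of the same sample group.
    together = [0.10, 0.35, 0.50, 0.90]
    alone = [0.05, 0.20, 0.80, 0.95]
    print("together:", summarize_rsq(together))
    print("alone:   ", summarize_rsq(alone))
```

Comparing `frac_pass` and `mean_rsq_pass` side by side shows whether a higher mean among passing variants is simply the result of more variants being filtered out.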

@yukt
Contributor

yukt commented Sep 17, 2021 via email

@albaicans
Author

Hi Ketian,
Thank you very much for your reply. I also thought the difference would be only in the estimated R-squared, so I compared the dosages of the samples of interest under the two strategies (imputed alone or together with the other samples). As a measure of uncertainty, I calculated the distance from each dosage to the closest integer, took the median distance for each variant, and then plotted the distribution of the medians. I got better accuracy (more variants with a median distance close to 0) when the samples had been imputed alone. Even though I realize there could be slight differences between the results of different imputation runs, the imputation of homogeneous populations was always more accurate in this sense.
I'm thinking that the difference could also come from the phasing step. We used Eagle with the 1000 Genomes reference panel to do phasing before imputation, and we tested the whole phasing-imputation pipeline with the same sets of samples. Do you know whether the phasing result can be affected by which samples are phased together, and whether this translates into different imputation results? Sorry, you might not be the right person to ask this.
Thank you!
Best,
Alba
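
The dosage comparison described in the comment above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the author's actual code: the function names and the dictionary layout (variant ID mapped to a list of per-sample dosages in the range 0 to 2) are hypothetical. Note that the distance to the nearest integer reflects how confident the imputed dosage is, which is a proxy for accuracy rather than a comparison against true genotypes.

```python
from statistics import median


def distance_to_nearest_integer(dosage):
    """Return how far a dosage (0..2) is from the closest integer genotype."""
    return abs(dosage - round(dosage))


def median_uncertainty_per_variant(dosages_by_variant):
    """Compute the median distance-to-integer for each variant.

    dosages_by_variant: dict mapping a variant ID to a list of per-sample
    dosages (hypothetical layout; variants with no samples are skipped).
    """
    return {
        variant_id: median(distance_to_nearest_integer(d) for d in dosages)
        for variant_id, dosages in dosages_by_variant.items()
        if dosages
    }


if __name__ == "__main__":
    # Made-up dosages for the same samples under the two strategies.
    imputed_alone = {"var1": [0.02, 1.01, 1.97], "var2": [0.10, 0.95, 2.00]}
    imputed_together = {"var1": [0.20, 1.30, 1.80], "var2": [0.40, 0.70, 1.90]}
    print("alone:   ", median_uncertainty_per_variant(imputed_alone))
    print("together:", median_uncertainty_per_variant(imputed_together))
```

The two resulting sets of per-variant medians can then be plotted as distributions; more mass near 0 means more dosages close to a hard genotype call.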

@yukt
Contributor

yukt commented Sep 17, 2021 via email

@albaicans
Author

Thanks a lot, this was very useful.

Best,

Alba

@albaicans
Author

Hi again,
I just wanted to post a follow-up on this issue in case someone reads it in the future. As stated by the Eagle2 author, the phasing algorithm will indeed produce different results depending on which samples are phased together, but the general rule of thumb is that phasing samples together tends to be no worse (and usually better) than separating samples by ancestry.
After finding a mistake in my code, I reran the tests comparing phasing all samples together with phasing them by ancestry cluster. The results were similar, with no significant increase in imputation accuracy when phasing separately. Consequently, we ended up phasing all samples together.
Sorry for the confusion!
Alba

@Shicheng-Guo

Could you share a bash or demo script showing how to run Eagle2 + Minimac4 for phasing and imputation? Thanks.
