methylation calling in non-CpG context (CHG and CHH) #6

Open
yusmiatiliau opened this issue Nov 15, 2022 · 8 comments

@yusmiatiliau

Hello there,
I am re-posting an issue I opened in DeepMod about future plans to support detection of other methylation motifs in plants. I hope this can be one of the new developments included in DeepMod2. Thanks a lot.

@kaichop

kaichop commented Nov 29, 2022

This is in principle very possible, as long as there is good training data paired with a gold standard for methylation within CHG and CHH motifs. If you are aware of such plant-specific datasets, we can train a model for detection in non-CpG contexts. Thank you.
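For readers unfamiliar with the three contexts, here is a minimal sketch of how a forward-strand cytosine can be assigned to CpG, CHG, or CHH, where H stands for A, C, or T. The function name `cytosine_context` is hypothetical and is not part of DeepMod2; reverse-strand cytosines (a G on the forward strand) would need the reverse complement first.

```python
from typing import Optional


def cytosine_context(seq: str, pos: int) -> Optional[str]:
    """Classify the forward-strand cytosine at seq[pos] as CpG, CHG, or CHH.

    Hypothetical helper for illustration only; not a DeepMod2 function.
    Returns None if seq[pos] is not C, if the two downstream bases are
    not both available, or if they contain an ambiguous base such as N.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        return None
    window = seq[pos + 1 : pos + 3]  # up to two downstream bases
    if len(window) >= 1 and window[0] == "G":
        return "CpG"
    if len(window) < 2 or any(base not in "ACGT" for base in window):
        return None
    # At this point window[0] is H (A, C, or T).
    return "CHG" if window[1] == "G" else "CHH"


if __name__ == "__main__":
    for seq, pos in [("ACGT", 1), ("ACAG", 1), ("ACTT", 1)]:
        print(seq, pos, cytosine_context(seq, pos))
```

CHG is symmetric across strands while CHH is not, which is one reason the two are usually labelled and evaluated separately.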

@yusmiatiliau
Author

Hi @kaichop,

Thank you very much for your response. I need to talk to my supervisor, but we should have data from both nanopore sequencing and whole-genome bisulfite sequencing. Would those be enough?
Also, our data are all quite recent and come from R10 flowcells. Would that be compatible with DeepMod2?
Thanks again

@kaichop

kaichop commented Nov 29, 2022

DeepMod2 can be trained on R10 flowcell data, but the resulting model will differ from the R9.4 models. We have tested it on HG002 (R10) and it works well.

@umahsn
Collaborator

umahsn commented Nov 29, 2022

Just to add some extra information: yes, whole-genome bisulfite sequencing and Nanopore sequencing should be sufficient. We have uploaded two CpG models trained on high-coverage Guppy-basecalled R10.4 and R9.4.1 reads from the ONT open datasets release, using one BS-seq replicate, and we have achieved very high performance.

@yusmiatiliau
Author

Thanks both.
I'll get back to you guys regarding the datasets for training.

@yusmiatiliau
Author

Hi @kaichop and @umahsn,

Sorry for the delay. It turns out we don't yet have matching ONT and WGBS datasets from the same sample, but we are looking forward to generating them. For the model training, are there any specifications for the dataset (e.g. coverage) that you would need?

Thanks again,
Cen

@umahsn
Collaborator

umahsn commented Mar 7, 2023

Hi,

For CpG, we achieved very high performance (~94% F1) with a ~30X NA12878 native ONT dataset, using the consensus of two WGBS replicates as ground truth, and slightly better performance (~95% F1) with a ~90X HG002 native ONT dataset, using a single WGBS sample.

On the other hand, we also achieved very good performance (~90-93% F1) when training on low-coverage synthetically methylated and unmethylated HG001 controls from Simpson. I believe both controls have less than 5X coverage.

I think the important thing in both cases is having a sufficient total number of reads whose methylation status is known with high confidence. This can come from 1) high coverage at a few sites that have high-confidence labels, or 2) low coverage at many sites that have extremely high-confidence labels. In case 1), for native ONT datasets, we trained models using ground truth labels from WGBS with very strict criteria: a minimum WGBS coverage of 5, and all replicates had to show 100% methylation for a site to be considered methylated, or 0% methylation for it to be considered unmethylated. Even with these strict criteria, there were ~850k-1M CpG sites for model training, and paired with at least 30X coverage, that's a lot of training data. For 2), using synthetic positive and negative controls, even though coverage was low, we used all ~50 million CpG sites for training, since we had great confidence that each site was methylated in the positive control and unmethylated in the negative control.
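The strict WGBS labelling rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the released models: the function name `label_site` and the input layout (one `(methylated_count, total_count)` pair per replicate) are hypothetical, and the minimum coverage of 5 is assumed here to apply to each replicate individually.

```python
from typing import Optional, Sequence, Tuple


def label_site(
    replicates: Sequence[Tuple[int, int]], min_cov: int = 5
) -> Optional[int]:
    """Assign a high-confidence training label to one site.

    replicates: (methylated_count, total_count) for each WGBS replicate.
    Returns 1 (methylated), 0 (unmethylated), or None (site excluded).
    Assumption: min_cov is enforced per replicate.
    """
    if not replicates:
        return None
    # Drop the site if any replicate is below the coverage threshold.
    if any(total < min_cov for _, total in replicates):
        return None
    # Methylated only if every replicate shows 100% methylation.
    if all(meth == total for meth, total in replicates):
        return 1
    # Unmethylated only if every replicate shows 0% methylation.
    if all(meth == 0 for meth, _ in replicates):
        return 0
    # Intermediate or conflicting sites are not used for training.
    return None


if __name__ == "__main__":
    print(label_site([(10, 10), (7, 7)]))  # fully methylated in both replicates
    print(label_site([(0, 8), (0, 5)]))    # fully unmethylated in both replicates
    print(label_site([(9, 10), (7, 7)]))   # one replicate below 100%
```

The same rule could be applied to CHG and CHH sites in a plant WGBS dataset, although how many sites would survive such a filter in those contexts is not something this thread establishes.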

In short, high-coverage ONT data will only help training if you can assign methylated or unmethylated labels to the reads with high confidence. That is why it is very important to place more emphasis on generating proper ground truth labels for the motifs you are interested in, whether via WGBS or synthetically. Please let me know if you have more questions.

@yusmiatiliau
Author

Thanks @umahsn, and apologies for the delay in responding.
I will get in touch again once we have the WGBS data in hand.
