BioInformaticsMapQMachineLearningMethods

Sample generated code for training and generating models for thesis.

Full write up can be found in the uploaded thesis.

Abstract:

Nowadays, one can now get their genomes sequenced for just over 1000 USD and ascertain their genomic individualities; ranging from ancestry to genetic diseases they may be prone to [1]. The process can be simpliﬁed into four basic steps. First, a donor gives a blood sample to a medical professional to get it sequenced. Second, the DNA is extracted from the blood and goes through a machine (i.e. sequencer) which breaks it up for processing and reads those pieces, referred to as sequences or reads. Third, these reads have to now be aligned to re-create a digital version of the users genome. Lastly, the genome is sifted through to look for any patterns in the DNA which can be associated with medical markers, for example variation in the BRCA set of genes which are good indicators of breast cancer in women [2]. The ﬁrst step can be performed by simply going to any local clinic. The second step has become automated with various sequencing machines and techniques. The last step is just a matter of passing the information obtained through an ever growing library to ﬁnd patterns. The third step should just be a simple matter of aligning the reads given in step 2, but it’s a bit more complicated than that. DNA is created from only four base pairs (i.e. A, C, G and T). With only 4 letters, a lot of redundancies begin to appear when trying to ﬁnd the exact position of a read. With the most commonly used Illumina length currently being only 150 base pairs long, and a genome of 3 billion base pairs to generate, diﬃculties in mapping begin to arise [3,4]. Currently all available methods try to piece together the presented pieces against a reference DNA [5]. Humans have about 1 in 1000 diﬀerent DNA letters, permutations, between one another, so after the tools ﬁnd a few positions where a read could be, it’s just a matter of taking the one deemed to be the most accurate [6]. Each tool uses its own method for determining the location, and then generates a score of how accurate it believes its prediction is. Being as accurate as possible is of paramount importance [19]. If a read is placed in the wrong location, the net result can be accidentally telling someone they have some hereditary disease in step 4, or worse yet, missing a preventable one. Current tools give inaccurate scores, as in, if one sums sum all the supposed ’90% chance of being right’ reads, they won’t obtain anywhere near 90% of them being correctly placed and 10% misplaced. This occurs on high ends where a tool will claim a score of 60, which is theoretically 99.9999%, but then proceed to only obtain 99.99%. This nulls the reason of having a wide range of scores. Worst of all, it results in thousands of misplaced reads even after one ﬁlters against the lower scores. When dealing with human lives, and an ever growing population of over 7 billion [7], this should be reduced to as close to 0 as possible. With no access to the mapping process itself, post mapping results can be evaluated and ﬁltered in order to obtain better predictions than the mappers themselves. Machine learning, which is the use of statistical techniques in computer science to teach a model to learn and infer results was perfect for such an application [8]. By learning which reads the mapper has diﬃculty with, and which it underestimates, a model can be generated which predicts better than the mapper itself. After attempting many diﬀerent machine learning methods, and running thousands of tests, I propose a tool with better scoring and accuracy, with results as high as 100% in some instances, which was not present in any other tool. The method used also demonstrates ﬂexibility in that it can be trained to be used for diﬀerent genomes, tools, or read lengths.

Conclusion:

In this thesis, a range of machine learning approaches were explored in their ability to improve MapQ scores beyond those of currently available methods. Logistic Regression, Random Forests, XGBoost, Support Vector Machines and Neural Networks were all evaluated, and the best results were used to procure the best feasible models. Previous methods of assigning scores to mapped reads (e.g. Bowtie2, BWA-MEM or SNAP) resulted in relatively inaccurate scores, with relatively high false positives both in the desired 20 through 60 MapQ score range, as well as in their best case. The methods had a tendency of overestimating the MapQ quality of reads, all at diﬀerent inconsistent extents. This leads to inconsistencies in overall score predictions, as well as leading to more false positives, which in turn results in many more errors in the genome that is being generated. This damages the subsequent and ﬁnal step of the genome re-construction pipeline , where the generated DNA is sent to be analysed. Ultimately, a neural network was developed to predict MapQ scores for reads of 150bp – without loss of generality – based on predictors obtained from the Gem3 mapper. This model had better accuracy in the best case – 20 through 60 region – and the overall. It also has more accurate MapQ scores, and the highest F1 score, thus outperforming previous methods in all desired criteria. In some tested samples, an accuracy of 100 percent was obtained with over 60 percent of a set of tested samples – unseen in any other tools. Given more and better predictors, the tool can theoretically be further improved to more consistently hit a 100 percent accuracy, with a wider net of true positives. Similarly, XGBoost showed immense promise with the highest MapQ score accuracy, but with more false positives. This can mostly be mitigated by combining the scores with the logistic regression results that Gem3 calculates. XGBoost is faster, and models are much more easily trained than neural networks, providing users with a fast and easy way to improve their map scores. Although the generated models outperform current methods for the test case of 150bp, better models can still be trained. To start with, these would require more, and better, predictors to be generated, so that the models can ﬁnd better patterns to discern between. Next, having the Gem3 mapper run longer so that it can output more mapping locations, as BWA does, and sending those to be chosen by the model could result in even better accuracy. Finally, the parameters used were optimised given the time available as well as the 150bp read goal. More parameter testing for possible small improvements, as well as generating models for each used base pair size, would allow for a powerful tool that is also ﬂexible. As a result of this work, all the code and tools to train models is publicly available. It can be used to generate models which adaptively calibrate MapQ scores to any genome of any species.Many sequencing-data analysis pipelines based on non-model organism can see a potential improvement in having custom MapQ models tailored for them. This will allow for improved accuracy and speciﬁcity ﬂexible around the experiment goals, read lengths, and samples. The model can also learn from, and become speciﬁc to, reads of any speciﬁc length; providing an increase in accuracy for a wide range of tools and mapping methods used today. Furthermore, the computational cost of generating MapQ scores using these models is negligible and does not add any signiﬁcant processing time when combined with any pre-existing analysis pipeline. I’ve thoroughly enjoyed my work on this project and hope I have contributed, however small, to the ﬁeld. I anticipate seeing more machine learning being introduced to the ﬁelds of bioinformatics and medicine, and I wish to continue to expand my knowledge and expertise so that I can be a pioneer as these ﬁelds continue to grow.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
1.TenBPLeewayComp.py		1.TenBPLeewayComp.py
1204.py		1204.py
1204v12.py		1204v12.py
1204v14.py		1204v14.py
1204v7.py		1204v7.py
1209.py		1209.py
2.goThroughAndFind.py		2.goThroughAndFind.py
3.removeUnwantedPredictors.py		3.removeUnwantedPredictors.py
4.mergedOf01AndPredictors.py		4.mergedOf01AndPredictors.py
5.convertToScore.py		5.convertToScore.py
6.SigmoidTool.py		6.SigmoidTool.py
7.goThroughAndAdd.py		7.goThroughAndAdd.py
8.makeOver60Into60.py		8.makeOver60Into60.py
9.comparisonTool.py		9.comparisonTool.py
BashScriptsUsedForDataPrep		BashScriptsUsedForDataPrep
LR100MThetasFor15Parameters		LR100MThetasFor15Parameters
LRwRThetas15Parameters		LRwRThetas15Parameters
MScBio__Adaptive_calibration_of_MAPQ_scores_using_machine_learning_techniques.pdf		MScBio__Adaptive_calibration_of_MAPQ_scores_using_machine_learning_techniques.pdf
MergeModelXGBoostAndLRwReg.py		MergeModelXGBoostAndLRwReg.py
NN1204Model.h5		NN1204Model.h5
NN1209Model.h5		NN1209Model.h5
README.md		README.md
SampleNeuralNetworkDeelLearningModel.py		SampleNeuralNetworkDeelLearningModel.py
XGBoost.R		XGBoost.R
XGLRMergeSuggestionA.py		XGLRMergeSuggestionA.py
XGLRMergeSuggestionB.py		XGLRMergeSuggestionB.py
costFunction.m		costFunction.m
costFunctionReg.m		costFunctionReg.m
finalPresentation2Roush2018_7_18P10_0_0.pdf		finalPresentation2Roush2018_7_18P10_0_0.pdf
logisticRegressionTool.m		logisticRegressionTool.m
logisticRegressionToolWithRegularization.m		logisticRegressionToolWithRegularization.m
randomForest.R		randomForest.R
sigmoid.m		sigmoid.m
supportVectors.R		supportVectors.R
xgboostmd4eta1nround10010M150bpReadsSeed2.model		xgboostmd4eta1nround10010M150bpReadsSeed2.model

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BioInformaticsMapQMachineLearningMethods

About

Releases

Packages

Languages

roushrsh/AdaptiveCalibrationOfMAPQScoresUsingMachineLearningTechniquesCode

Folders and files

Latest commit

History

Repository files navigation

BioInformaticsMapQMachineLearningMethods

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages