Sample code for training and generating the models used in the thesis.
The full write-up can be found in the uploaded thesis.
Abstract:
Nowadays, one can get their genome sequenced for just over 1000 USD and ascertain their genomic individualities, ranging from ancestry to the genetic diseases they may be prone to [1]. The process can be simplified into four basic steps. First, a donor gives a blood sample to a medical professional to get it sequenced. Second, the DNA is extracted from the blood and passed through a machine (i.e. a sequencer), which breaks it up for processing and reads those pieces, referred to as sequences or reads. Third, these reads have to be aligned to re-create a digital version of the donor's genome. Lastly, the genome is sifted through to look for any patterns in the DNA which can be associated with medical markers, for example variations in the BRCA set of genes, which are good indicators of breast cancer in women [2]. The first step can be performed by simply going to any local clinic. The second step has become automated with various sequencing machines and techniques. The last step is just a matter of passing the information obtained through an ever-growing library to find patterns.

The third step should just be a simple matter of aligning the reads produced in step two, but it is a bit more complicated than that. DNA is written with only four bases (i.e. A, C, G and T). With only four letters, a lot of redundancies begin to appear when trying to find the exact position of a read. With the most commonly used Illumina read length currently being only 150 base pairs, and a genome of 3 billion base pairs to reconstruct, difficulties in mapping begin to arise [3,4]. Currently, all available methods try to piece the reads together against a reference DNA [5]. Humans differ from one another in roughly 1 in every 1000 DNA letters, so after a tool finds a few positions where a read could belong, it is just a matter of taking the one deemed most accurate [6]. Each tool uses its own method for determining the location, and then generates a score of how accurate it believes its prediction is. Being as accurate as possible is of paramount importance [19]. If a read is placed in the wrong location, the net result can be accidentally telling someone in step four that they have a hereditary disease, or worse yet, missing a preventable one.

Current tools give poorly calibrated scores: if one gathers all the reads that supposedly have a '90% chance of being right', nowhere near 90% of them turn out to be correctly placed with 10% misplaced. The same happens at the high end, where a tool will claim a score of 60, which theoretically corresponds to 99.9999% accuracy, but then proceeds to obtain only 99.99%. This defeats the purpose of having a wide range of scores. Worst of all, it results in thousands of misplaced reads even after one filters out the lower scores. When dealing with human lives, and an ever-growing population of over 7 billion [7], this number should be reduced to as close to zero as possible. Even with no access to the mapping process itself, post-mapping results can be evaluated and filtered in order to obtain better predictions than the mappers themselves. Machine learning, the use of statistical techniques in computer science to teach a model to learn and infer results, is a perfect fit for such an application [8]. By learning which reads the mapper has difficulty with, and which it underestimates, a model can be generated which predicts better than the mapper itself.
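For context, MapQ scores follow the standard Phred-scaled convention, MapQ = -10 * log10(P(misplaced)), which is what makes a score of 60 correspond to 99.9999% confidence. The minimal sketch below (the helper names are mine, not from the thesis code) shows that conversion and its inverse, which is also how a model's predicted probability of correct placement can be turned back into a MapQ score.

```python
# Phred-scaled MapQ convention: MapQ = -10 * log10(P(read is misplaced)),
# so MapQ 60 corresponds to a 1-in-1,000,000 error rate (99.9999% confidence)
# and MapQ 20 to 1-in-100 (99%). Helper names here are illustrative only.
import math

def mapq_to_confidence(mapq: float) -> float:
    """Convert a Phred-scaled MapQ score to the implied probability of correct placement."""
    return 1.0 - 10.0 ** (-mapq / 10.0)

def confidence_to_mapq(p_correct: float, cap: int = 60) -> int:
    """Convert a probability of correct placement back to a (capped) MapQ score."""
    p_error = max(1.0 - p_correct, 1e-12)  # avoid log(0) for perfect predictions
    return min(cap, int(round(-10.0 * math.log10(p_error))))

print(mapq_to_confidence(60))       # 0.999999
print(mapq_to_confidence(20))       # 0.99
print(confidence_to_mapq(0.9999))   # 40
```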
After attempting many different machine learning methods and running thousands of tests, I propose a tool with better scoring and accuracy, with results as high as 100% in some instances, something no other tool achieved. The method is also flexible, in that it can be retrained for different genomes, tools, or read lengths.
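As an illustration of the idea (not the exact pipeline from the thesis), the sketch below trains a simple classifier on mapper-reported features of simulated reads, where the true origin of each read is known, and uses the predicted probability of correct placement as a recalibrated confidence. The feature set and synthetic data here are placeholders.

```python
# Minimal sketch of the recalibration idea: learn, from simulated reads,
# which mapper outputs indicate a correctly placed read. Feature meanings
# are hypothetical stand-ins (e.g. alignment score, gap to second-best hit,
# number of candidate positions, the mapper's own MapQ).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

X = rng.normal(size=(10_000, 4))                       # stand-in per-read features
# Stand-in labels: 1 if the read mapped to its true simulated origin, else 0.
y = (X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_correct = model.predict_proba(X_test)[:, 1]          # recalibrated placement confidence
print("held-out accuracy:", model.score(X_test, y_test))
print("mean predicted confidence:", p_correct.mean())
```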
Conclusion:
In this thesis, a range of machine learning approaches were explored for their ability to improve MapQ scores beyond those of currently available methods. Logistic Regression, Random Forests, XGBoost, Support Vector Machines and Neural Networks were all evaluated, and the best-performing approaches were used to build the best feasible models. Previous methods of assigning scores to mapped reads (e.g. Bowtie2, BWA-MEM or SNAP) resulted in relatively inaccurate scores, with relatively high false positives both in the desired 20 through 60 MapQ score range and in their best case. These methods had a tendency to overestimate the MapQ quality of reads, each to a different and inconsistent extent. This leads to inconsistencies in overall score predictions as well as to more false positives, which in turn result in many more errors in the genome being generated. This damages the subsequent and final step of the genome reconstruction pipeline, where the generated DNA is sent to be analysed.

Ultimately, a neural network was developed to predict MapQ scores for reads of 150 bp (without loss of generality) based on predictors obtained from the Gem3 mapper. This model had better accuracy both in the best case (the 20 through 60 region) and overall. It also produces more accurate MapQ scores and the highest F1 score, thus outperforming previous methods on all desired criteria. In some tests, an accuracy of 100 percent was obtained on over 60 percent of a set of tested samples, something unseen in any other tool. Given more and better predictors, the tool can theoretically be improved further to hit 100 percent accuracy more consistently, with a wider net of true positives. Similarly, XGBoost showed immense promise with the highest MapQ score accuracy, but with more false positives. This can mostly be mitigated by combining its scores with the logistic regression results that Gem3 calculates. XGBoost is faster, and its models are much more easily trained than neural networks, providing users with a fast and easy way to improve their MapQ scores.

Although the generated models outperform current methods for the test case of 150 bp, better models can still be trained. To start with, this would require more, and better, predictors to be generated, so that the models can find better patterns to discern between. Next, having the Gem3 mapper run longer so that it can output more mapping locations, as BWA does, and sending those to be chosen by the model could result in even better accuracy. Finally, the parameters used were optimised given the time available as well as the 150 bp read goal. More parameter testing for possible small improvements, as well as generating models for each read length in use, would allow for a tool that is both powerful and flexible.

As a result of this work, all the code and tools needed to train models are publicly available. They can be used to generate models which adaptively calibrate MapQ scores to any genome of any species. Many sequencing-data analysis pipelines based on non-model organisms could see a potential improvement from having custom MapQ models tailored to them. This allows for improved accuracy and specificity, flexible around the experiment goals, read lengths, and samples. The model can also learn from, and become specific to, reads of any specific length, providing an increase in accuracy for a wide range of tools and mapping methods used today.
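As a rough illustration of how such a model could be applied (the file names, column names and hyperparameters below are assumptions, not the released training scripts), the sketch fits an XGBoost classifier on per-read predictors exported from the mapper and converts its probabilities back into Phred-scaled MapQ values.

```python
# Illustrative sketch only: fit an XGBoost classifier on mapper-derived
# predictors and turn its probabilities into Phred-scaled MapQ scores.
# The CSV layout and the 'correct' label column are hypothetical.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

reads = pd.read_csv("gem3_predictors.csv")              # hypothetical per-read predictor table
features = [c for c in reads.columns if c != "correct"]

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(reads[features], reads["correct"])

# Recalibrated MapQ = -10 * log10(P(misplaced)), capped at 60 as is conventional.
# (In practice the model would be evaluated on held-out reads, not the training set.)
p_error = np.clip(1.0 - model.predict_proba(reads[features])[:, 1], 1e-6, 1.0)
reads["mapq_recalibrated"] = np.minimum(60, np.round(-10.0 * np.log10(p_error))).astype(int)
reads.to_csv("gem3_predictors_rescored.csv", index=False)
```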
Furthermore, the computational cost of generating MapQ scores using these models is negligible and does not add any significant processing time when combined with any pre-existing analysis pipeline. I've thoroughly enjoyed my work on this project and hope to have contributed, however small the contribution, to the field. I anticipate seeing more machine learning being introduced to the fields of bioinformatics and medicine, and I wish to continue to expand my knowledge and expertise so that I can be a pioneer as these fields continue to grow.