The function frequentWords() implements the algorithm for finding the most frequent k-mers in a DNA string. The complexity of this algorithm is O(n^2*k), where n is the length of the text.
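The quadratic cost comes from recounting every window against the whole text. The sketch below is only an illustration of that idea, not the repository's code; the names count_occurrences and frequent_words_naive are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlaps included. Costs O(n*k)."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def frequent_words_naive(text, k):
    """Return the set of most frequent k-mers by recounting every window."""
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    if not windows:
        return set()
    # One O(n*k) count per window, about n windows: O(n^2 * k) overall.
    counts = [count_occurrences(text, w) for w in windows]
    best = max(counts)
    return {w for w, c in zip(windows, counts) if c == best}


print(sorted(frequent_words_naive("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)))
# ['CATG', 'GCAT']
```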
The function fasterFrequentWords() performs the same job as frequentWords() but runs faster.
The functions patternToNumber() and patternToNumber2() solve the same problem: each transforms a k-mer pattern into an integer, which is the index of the pattern in the frequency array. The frequency array orders all 4^k k-mers lexicographically. The only difference between the two functions is that patternToNumber2() uses recursion.
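Lexicographic order over A, C, G, T is the same as reading the k-mer as a base-4 number with A=0, C=1, G=2, T=3. A minimal sketch of both styles, assuming that encoding (the names pattern_to_number_iter and pattern_to_number_rec are hypothetical, not the repository's):

```python
SYMBOL_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pattern_to_number_iter(pattern):
    """Iterative version: treat the k-mer as a base-4 number."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOL_VALUE[symbol]
    return number


def pattern_to_number_rec(pattern):
    """Recursive version: index(prefix) * 4 + value(last symbol)."""
    if not pattern:
        return 0
    return 4 * pattern_to_number_rec(pattern[:-1]) + SYMBOL_VALUE[pattern[-1]]


print(pattern_to_number_iter("ATGCAA"))  # 912
print(pattern_to_number_rec("ATGCAA"))   # 912
```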
Similar to frequentWords(), frequentWordsWithMismatch() finds the most frequent k-mers in a DNA string, but allows mismatches. The input is a DNA string and integers k and d (assuming k<=12 and d<=3). The output is all most frequent k-mers with up to d mismatches in the string.
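One way to count with mismatches is to let every window of the text vote for each k-mer in its d-neighborhood. The sketch below assumes that approach, which may differ from the repository's; the helper names neighbors_within and frequent_words_with_mismatches are hypothetical.

```python
from collections import Counter


def neighbors_within(pattern, d):
    """All strings within Hamming distance d of pattern (d rounds of one substitution)."""
    current = {pattern}
    for _ in range(d):
        expanded = set(current)
        for p in current:
            for i in range(len(p)):
                for symbol in "ACGT":
                    expanded.add(p[:i] + symbol + p[i + 1:])
        current = expanded
    return current


def frequent_words_with_mismatches(text, k, d):
    """Return the k-mers with the highest count when up to d mismatches are allowed."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window contributes one approximate occurrence to every neighbor.
        counts.update(neighbors_within(text[i:i + k], d))
    if not counts:
        return set()
    best = max(counts.values())
    return {p for p, c in counts.items() if c == best}


print(sorted(frequent_words_with_mismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)))
# ['ATGC', 'ATGT', 'GATG']
```

Note that a winning k-mer does not have to occur exactly in the text; it only needs many windows within distance d of it.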
Similar to frequentWords() and frequentWordsWithMismatch(), frequentWordsWithMismatch() also finds the most frequent k-mers in a DNA string, but with mismatches and reverse complements. The input is a DNA string Text and integers k and d. The output is all k-mers Pattern maximizing the sum Count(Text,Pattern) with d mismatches + Count(Text,Pattern-rc) with d mismatches over all possible k-mers, where Pattern-rc is the reverse complement of Pattern.
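The sum described above can be sketched by first computing the mismatch counts and then adding each k-mer's count to that of its reverse complement. This is an illustration under that assumption, with hypothetical names throughout, not the repository's implementation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]


def neighbors_within(pattern, d):
    """All strings within Hamming distance d of pattern."""
    current = {pattern}
    for _ in range(d):
        expanded = set(current)
        for p in current:
            for i in range(len(p)):
                for symbol in "ACGT":
                    expanded.add(p[:i] + symbol + p[i + 1:])
        current = expanded
    return current


def frequent_words_mismatch_rc(text, k, d):
    """k-mers maximizing Count_d(text, p) + Count_d(text, reverse_complement(p))."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        counts.update(neighbors_within(text[i:i + k], d))
    scores = {}
    for pattern in list(counts):
        rc = reverse_complement(pattern)
        total = counts[pattern] + counts[rc]  # a missing key counts as 0
        scores[pattern] = total
        scores[rc] = total  # the score is symmetric in pattern and rc
    if not scores:
        return set()
    best = max(scores.values())
    return {p for p, s in scores.items() if s == best}


print(sorted(frequent_words_mismatch_rc("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)))
# ['ACAT', 'ATGT']
```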
This program counts the frequency of a given pattern in a genome.
The function reverseComplementString() finds the reverse complement of a DNA string.
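A minimal sketch of the operation, assuming an uppercase string over A, C, G, T (the name reverse_complement is hypothetical): complement each base, then reverse the result.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```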
The function patternMatch() takes two strings, pattern and genome, and returns a collection of space-separated integers specifying all starting positions where pattern appears as a substring of genome.
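As a rough sketch of this behavior (the name pattern_positions is hypothetical, and positions are assumed to be 0-based), a sliding comparison is enough, and overlapping occurrences are reported:

```python
def pattern_positions(pattern, genome):
    """Return all 0-based start positions of pattern in genome, overlaps included."""
    k = len(pattern)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == pattern]


positions = pattern_positions("ATAT", "GATATATGCATATACTT")
print(" ".join(map(str, positions)))  # 1 3 9
```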
The function clumpFinding() outputs all distinct k-mers forming (L,t)-clumps in a genome, where L is the length of ori in the genome. A k-mer pattern forms an (L,t)-clump inside a genome if there is an interval of the genome of length L in which this k-mer appears at least t times. The input is a string genome and integers k, L, and t. A clump is also sometimes called a cluster.
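The clump definition can be checked with a sliding window: count the k-mers of the first window, then update the counts as the window moves one position at a time. The sketch below assumes this strategy, which is not necessarily the one used in the repository; find_clumps is a hypothetical name.

```python
from collections import Counter


def find_clumps(genome, k, L, t):
    """Return the set of k-mers that occur at least t times in some window of length L."""
    clumps = set()
    if L < k or len(genome) < L:
        return clumps
    # Count every k-mer in the first window.
    counts = Counter(genome[i:i + k] for i in range(L - k + 1))
    clumps.update(p for p, c in counts.items() if c >= t)
    for start in range(1, len(genome) - L + 1):
        # Drop the k-mer that left the window and add the one that entered it.
        counts[genome[start - 1:start - 1 + k]] -= 1
        entering = genome[start + L - k:start + L]
        counts[entering] += 1
        # Only the entering k-mer's count can have increased.
        if counts[entering] >= t:
            clumps.add(entering)
    return clumps


print(sorted(find_clumps("ATATATCCCCGGGG", 2, 6, 3)))  # ['AT', 'CC', 'GG']
```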
The function minimumSkew() takes a DNA string and outputs all integers i minimizing Skew_i(genome) among all values of i over the entire input genome. The hypothesis is that the position(s) where the skew diagram reaches its minimum mark the origin of DNA replication. For more details on the hypothesis and the theory, see the comments in minimum_skew.py.
The program can also plot the skew diagram if you uncomment the line plt.show().
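A minimal sketch of the skew computation, assuming the usual definition in which the skew at position i is the number of G's minus the number of C's among the first i nucleotides, with the skew at position 0 equal to 0. The name minimum_skew_positions is hypothetical and plotting is left out.

```python
def minimum_skew_positions(genome):
    """Return every position i at which the running G-minus-C skew is minimal."""
    skew = 0
    best = 0
    positions = [0]  # the skew at position 0 is 0 by definition
    for i, base in enumerate(genome, start=1):
        if base == "G":
            skew += 1
        elif base == "C":
            skew -= 1
        if skew < best:
            best = skew
            positions = [i]
        elif skew == best:
            positions.append(i)
    return positions


genome = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(minimum_skew_positions(genome))  # [11, 24]
```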
The function hammingDistance() finds the number of mismatches between two strings p and q (also known as the Hamming distance), where position i in k-mers p1...pk and q1...qk is a mismatch if pi does not equal qi.
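The definition translates almost directly into code. The sketch below assumes the two strings have equal length and raises an error otherwise; the name hamming_distance is hypothetical.

```python
def hamming_distance(p, q):
    """Count the positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
```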
The function approximatePatternMatch() finds where a pattern appears as a substring of text with at most d mismatches. The function takes pattern, text, and an integer d, and outputs the total count and all starting positions where pattern appears as a substring of text with at most d mismatches.
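A short sketch of this search, assuming 0-based positions and a (count, positions) return value; the name approximate_pattern_match is hypothetical. Each window of the text is compared with the pattern and kept when the number of mismatches is at most d.

```python
def approximate_pattern_match(pattern, text, d):
    """Return (count, positions) of windows within d mismatches of pattern."""
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[i:i + k]))
        if mismatches <= d:
            positions.append(i)
    return len(positions), positions


print(approximate_pattern_match("AAAAA", "AACAAGCATAAACATTAAAGAG", 1))
# (4, [0, 7, 9, 16])
```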
The function neighbors() generates the set of all k-mers whose Hamming distance from pattern does not exceed d.
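One common way to build this set is recursion on the suffix: a neighbor of the suffix that has not yet used all d mismatches can take any first symbol, while one already at distance d must keep the original first symbol. The sketch below assumes that recursive scheme, which may not match the repository's version; the names are hypothetical.

```python
def hamming_distance(p, q):
    """Count mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def neighbors_recursive(pattern, d):
    """Return the set of strings whose Hamming distance from pattern is at most d."""
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    suffix = pattern[1:]
    result = set()
    for candidate in neighbors_recursive(suffix, d):
        if hamming_distance(suffix, candidate) < d:
            # Mismatch budget remains: any first symbol is allowed.
            for symbol in "ACGT":
                result.add(symbol + candidate)
        else:
            # Budget is used up: the first symbol must match the pattern.
            result.add(pattern[0] + candidate)
    return result


print(sorted(neighbors_recursive("ACG", 1)))
```

For a 3-mer and d = 1 the set has 10 members: the pattern itself plus 3 substitutions at each of the 3 positions.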