entro.py is a module that calculates basic informational parameters for any input string. To use it, download the file "entro.py" into your working directory and import it like any other module. Define your input sequence as a string, or load it from a plaintext file:

```python
import entro

sequence = "etc"
# or load the sequence from a plaintext file
with open('etc.txt', 'r') as file:
    sequence = file.read().replace('\n', '')
```

(entro.py itself relies on the standard-library `math` module and `collections.Counter`.)
The theory underpinning the program can be found in "THEORY.md", which describes how and why each parameter is calculated. Familiarity with Boltzmann entropy is recommended, but not required.
```python
def max(sequence):
    return math.log2(len(set(Counter(sequence).elements())))
```
`entro.max(sequence)` calculates the maximum entropy of the alphabet, i.e. the entropy the sequence would have if every observed symbol were equiprobable. Its time complexity is O(n) in the length of the sequence, dominated by building the `Counter`. `sequence` is a string variable.
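A minimal, self-contained sketch of the same calculation (the name `max_entropy` is illustrative; the module itself exposes `entro.max`):

```python
import math
from collections import Counter

def max_entropy(sequence):
    # log2 of the number of distinct symbols observed in the sequence
    return math.log2(len(set(Counter(sequence).elements())))

print(max_entropy("AGCTTTTCA"))  # 4 distinct bases -> 2.0
```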
```python
def fi(sequence):
    fi = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    return fi
```
`entro.fi(sequence)` calculates the singlet frequencies, i.e. the relative frequency of each distinct symbol in the sequence. `sequence` is a string variable.
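For example, the frequencies of `"AGCTTTTCA"` come out to 1/9, 2/9, 2/9 and 4/9, in whatever order the set iterates (sketched here as a standalone function):

```python
from collections import Counter

def singlet_frequencies(sequence):
    # Relative frequency of each distinct symbol
    counts = Counter(sequence)
    return [counts[x] / len(sequence) for x in set(counts.elements())]

freqs = singlet_frequencies("AGCTTTTCA")
print(sorted(freqs))  # frequencies 1/9, 2/9, 2/9, 4/9
```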
```python
def h(sequence):
    fi = []
    h1 = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    for x in fi:
        h1.append(x*math.log2(x))
    return -(sum(h1))
```
`entro.h(sequence)` calculates the Shannon entropy, H1, of the sequence in bits. `sequence` is a string variable.
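As a sanity check, a string in which every symbol is equiprobable should come out to log2 of the alphabet size (a standalone sketch of the same formula):

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    counts = Counter(sequence)
    fi = [counts[x] / len(sequence) for x in set(counts.elements())]
    # H1 = -sum(f_i * log2(f_i))
    return -sum(f * math.log2(f) for f in fi)

print(shannon_entropy("ATCG"))  # four equiprobable symbols -> 2.0 bits
```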
```python
def d1(sequence):
    # Singlet Frequencies
    fi = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    # Shannon Entropy
    h1 = []
    for x in fi:
        h1.append(x*math.log2(x))
    H1 = -(sum(h1))
    # Maximum entropy of the observed alphabet
    H1max = math.log2(len(fi))
    return H1max - H1
```
`entro.d1(sequence)` calculates the divergence from equiprobability, D1 = H1max - H1. `sequence` is a string variable.
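A standalone sketch, taking H1max as `math.log2(len(fi))` (the log of the observed alphabet size): a uniform sequence has zero divergence, and any skew pushes D1 above zero.

```python
import math
from collections import Counter

def divergence_from_equiprobability(sequence):
    counts = Counter(sequence)
    fi = [counts[x] / len(sequence) for x in set(counts.elements())]
    h1 = -sum(f * math.log2(f) for f in fi)
    h1max = math.log2(len(fi))  # log2 of the observed alphabet size
    return h1max - h1

print(divergence_from_equiprobability("ATCG"))  # uniform -> 0.0
print(divergence_from_equiprobability("AAAT"))  # skewed -> positive
```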
```python
def fij(sequence):
    nn = []   # doublet labels (collected but not returned)
    fij = []
    for (x, y), count in Counter(zip(sequence, sequence[1:])).items():
        nn.append(x + y)
        fij.append(count / (len(sequence)-1))
    return fij
```
`entro.fij(sequence)` returns a 1D array of doublet frequencies: the frequency of each adjacent, overlapping pair of symbols, normalized by the doublet space `len(sequence) - 1`. `sequence` is a string variable.
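For `"AGCTTTTCA"` there are 8 doublet positions and 6 distinct doublets, and the frequencies sum to 1 (standalone sketch):

```python
from collections import Counter

def doublet_frequencies(sequence):
    # Count overlapping adjacent pairs, normalize by the doublet space
    pair_counts = Counter(zip(sequence, sequence[1:]))
    return [count / (len(sequence) - 1) for count in pair_counts.values()]

fij = doublet_frequencies("AGCTTTTCA")
print(len(fij), sum(fij))  # 6 1.0
```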
```python
def h2i(sequence):
    # Singlet Frequencies
    fi = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    # Doublet Frequencies
    nn = []
    fij = []
    for (x, y), count in Counter(zip(sequence, sequence[1:])).items():
        nn.append(x + y)
        fij.append(count / (len(sequence)-1))
    # Divergence from Independence
    h2ind = []
    for x in fi:
        for y in fi:
            h2ind.append((x*y)*math.log2(x*y))
    H2ind = -(sum(h2ind))
    h2 = []
    for x in fij:
        h2.append(x*math.log2(x))
    H2 = -(sum(h2))
    return H2ind - H2
```
`entro.h2i(sequence)` returns the divergence from independence, D2 = H2ind - H2, where H2ind is the doublet entropy the sequence would have if adjacent symbols were independent. `sequence` is a string variable.
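A standalone sketch; a strictly alternating string such as `"ATATATAT"` has far fewer distinct doublets than independence would predict, so D2 comes out large:

```python
import math
from collections import Counter

def divergence_from_independence(sequence):
    counts = Counter(sequence)
    fi = [counts[x] / len(sequence) for x in set(counts.elements())]
    # Entropy the doublets would have if adjacent symbols were independent
    h2ind = -sum((x * y) * math.log2(x * y) for x in fi for y in fi)
    fij = [c / (len(sequence) - 1)
           for c in Counter(zip(sequence, sequence[1:])).values()]
    h2 = -sum(f * math.log2(f) for f in fij)
    return h2ind - h2

print(round(divergence_from_independence("ATATATAT"), 3))  # ~1.015
```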
```python
def id(sequence):
    # State of Equiprobability
    H1max = math.log2(len(set(Counter(sequence).elements())))
    # Singlet Frequencies
    fi = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    # Shannon Entropy
    h1 = []
    for x in fi:
        h1.append(x*math.log2(x))
    H1 = -(sum(h1))
    # Divergence from Equiprobability
    D1 = H1max - H1
    # Doublet Frequencies
    nn = []
    fij = []
    for (x, y), count in Counter(zip(sequence, sequence[1:])).items():
        nn.append(x + y)
        fij.append(count / (len(sequence)-1))
    # Divergence from Independence
    h2ind = []
    for x in fi:
        for y in fi:
            h2ind.append((x*y)*math.log2(x*y))
    H2ind = -(sum(h2ind))
    h2 = []
    for x in fij:
        h2.append(x*math.log2(x))
    H2 = -(sum(h2))
    D2 = H2ind - H2
    return D1 + D2
```
`entro.id(sequence)` returns the total divergence, D1 + D2: the sum of the divergence from equiprobability and the divergence from independence. `sequence` is a string variable.
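Putting the two divergences together (standalone sketch): for `"ATATATAT"` the singlets are equiprobable, so D1 = 0 and the total comes entirely from D2.

```python
import math
from collections import Counter

def total_divergence(sequence):
    counts = Counter(sequence)
    fi = [counts[x] / len(sequence) for x in set(counts.elements())]
    h1 = -sum(f * math.log2(f) for f in fi)
    d1 = math.log2(len(fi)) - h1  # divergence from equiprobability
    fij = [c / (len(sequence) - 1)
           for c in Counter(zip(sequence, sequence[1:])).values()]
    h2ind = -sum((x * y) * math.log2(x * y) for x in fi for y in fi)
    h2 = -sum(f * math.log2(f) for f in fij)
    d2 = h2ind - h2               # divergence from independence
    return d1 + d2

print(round(total_divergence("ATATATAT"), 3))  # ~1.015, all from D2
```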
```python
def hm(sequence):
    # Singlet Frequencies
    fi = []
    for x in set(Counter(sequence).elements()):
        fi.append(Counter(sequence)[x] / len(sequence))
    # Shannon Entropy
    h1 = []
    for x in fi:
        h1.append(x*math.log2(x))
    H1 = -(sum(h1))
    # Doublet Frequencies
    nn = []
    fij = []
    for (x, y), count in Counter(zip(sequence, sequence[1:])).items():
        nn.append(x + y)
        fij.append(count / (len(sequence)-1))
    # Doublet Entropy
    h2 = []
    for x in fij:
        h2.append(x*math.log2(x))
    H2 = -(sum(h2))
    return H2 - H1
```
`entro.hm(sequence)` returns the Markov entropy, HM = H2 - H1, for a target string. Along the way it also calculates the singlet frequencies, the Shannon entropy, the doublet frequencies, and the doublet entropy. `sequence` is a string variable.
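A standalone sketch of the same difference, H2 - H1:

```python
import math
from collections import Counter

def markov_entropy(sequence):
    counts = Counter(sequence)
    fi = [counts[x] / len(sequence) for x in set(counts.elements())]
    h1 = -sum(f * math.log2(f) for f in fi)  # singlet (Shannon) entropy
    fij = [c / (len(sequence) - 1)
           for c in Counter(zip(sequence, sequence[1:])).values()]
    h2 = -sum(f * math.log2(f) for f in fij)  # doublet entropy
    return h2 - h1

print(round(markov_entropy("AGCTTTTCA"), 3))  # ~0.569
```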
TL;DR: The functions for calculating doublet frequencies and divergence from independence are both linear-time, since `Counter()` tallies the doublets in a single pass. It is also recommended to assign the output of each function to a variable in order to save computation time, as each function recalculates all of the intermediate parameters it needs from scratch.
In the original literature (published 1972), Gatlin references the nearest-neighbor experiments and uses those results to calculate the doublet frequencies. That method is not used here, as it is a bit antiquated: the first viable method of genome sequencing, Sanger sequencing, would not be invented until 1977, full-genome sequencing would not become commonplace until the turn of the century, and affordable genome sequencing arrived roughly a decade after that.
The point of my program, entro.py, is to bring her analysis into the 21st century by adapting the methods to work with full genome sequences (which may be obtained independently or from the NIH repositories). Briefly, I will describe the method used in this code to calculate the doublet frequencies. R.A. Elton (1975) describes a method not based on the nearest-neighbor experiments:
> The sequence can be represented by $(x_1, \ldots, x_{n+1})$, using the convention that each value $x_i$ is 1, 2, 3 or 4 according as the ith base in the sequence is U(T), C, A or G. The transition count is then a 4x4 matrix of frequencies {$f_{ij}$}, where $f_{ij}$ is the number of times that a base i in the sequence is followed by a base j.
Since we are using Python, we do not need the cited convention that each value $x_i$ be mapped to 1, 2, 3 or 4; instead, we can use the `Counter()` function to identify the number of times a doublet appears in a given sequence.
Just as one would find the frequency of a singlet by counting its occurrences in a given sequence and dividing by the total length of the sequence, we may find the doublet frequencies in a similar fashion. To determine the frequency of each doublet, its occurrence count must be divided by the "doublet space," i.e. the total number of possible doublet positions. Consider the sequence `genome = "AGCTTTTCA"`. Here `len(genome) = 9`, but the number of doublets is 8. This holds for even-length sequences too: `genome = "AGCTTTTC"` has `len(genome) = 8` with a doublet space of 7. So the doublet space is always `len(genome) - 1`. This can be seen more rigorously by noting that each base in the sequence may serve as the "root" of a doublet pair; moving left to right through the sequence, no matter the length, the final base could only pair with a "null" value outside the sequence, so the base at the end is always left out. This is accounted for by subtracting one from the sequence length.
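This off-by-one is easy to confirm with `zip`, which simply drops the unpaired final base:

```python
genome = "AGCTTTTCA"
doublets = list(zip(genome, genome[1:]))
print(len(genome), len(doublets))  # 9 8
```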
Therefore, we may construct the sample space for doublets:

```python
n = ["A", "T", "C", "G"]
nn = []
fij = []
for x in n:
    for y in n:
        nn.append(x + y)
        fij.append(genome.count(x + y) / (len(genome)-1))
```

where `n = ["A", "T", "C", "G"]` is the alphabet and `nn = []` collects the doublet labels. In the module itself the frequencies are instead obtained by tallying the actual adjacent pairs, `for (x, y), count in Counter(zip(sequence, sequence[1:])).items()`, and dividing each count by the aforementioned doublet space, `len(genome)-1`; note that `str.count()` misses overlapping occurrences (such as the two "TT" doublets in "TTT"), which is why the `Counter` approach is preferred. Either way, each result is appended to a list of doublet frequencies, `fij = []`. The nested for loops are in effect the same method described by R.A. Elton, but instead of keeping a 4x4 matrix, the results are appended into a 1-dimensional list.
As such, it is important to note that the time complexity of these two particular functions (doublet frequencies & divergence from independence) is linear in the sequence length, since the `Counter()` function builds its tallies in a single pass. Furthermore, each function calculates all values from scratch. To take one of the largest functions as an example, in order to calculate `entro.id(sequence)` the module recomputes the singlet frequencies, the Shannon entropy, the doublet frequencies, and both doublet entropies before returning D1 + D2; none of these intermediate results are cached, which is why assigning each function's output to a variable is recommended.