New Pollen Type for DNA Bases #62
Replies: 4 comments 1 reply
-
This definitely seems worthwhile! As always when we talk about data types, it's possible to decompose the issue into two different perspectives:
With that in mind, when we make a DSL, we should definitely be exposing a more constrained type like I also think it's worth asking why odgi doesn't already use a smaller sequence representation, at least for its in-memory data structure: The reason is probably because, on a CPU, accessing values smaller than a byte is really inconvenient. You end up having to load whole bytes and slice-and-dice them to extract out the unaligned chunks of 3 bits. Even if you save a lot of memory footprint, this extra slicing-and-dicing might outweigh that advantage (or it might not! It's just unclear). In hardware, however, that's absolutely not true: accessing 3 bits is exactly as (in)convenient as accessing 8 bits. And furthermore, every bit counts: a narrower datapath could save a lot of resources throughout a design. So it seems critical to avoid waste in this way. |
Beta Was this translation helpful? Give feedback.
-
Thank you for clarifying the different issues at play with respect to the abstract model vs. the physical implementation, this makes a lot of sense! I suppose that since Pollen generates a lot of data with abnormal bit widths, at some point we may want to consider if this reduces program performance on CPUs and address these (hypothetical) performance issues. Does calyx by any chance already have a way to deal with this memory/speed tradeoff? |
Beta Was this translation helpful? Give feedback.
-
It doesn't really do much about this at all: all bit widths are equally (in)efficient. That is, because the target is hardware, not software, it has no special affinity for 8-bit-aligned sizes. One place we could consider changing this is in the interpreter… but I think the only place we should worry about this is if we generate CPU code (not Calyx code) from the DSL as well. |
Beta Was this translation helpful? Give feedback.
-
Seems there is DNA4, DNA5, RNA4, RNA5, and these support different characters. For example, DNA4 has |
Beta Was this translation helpful? Give feedback.
-
Currently, we represent sequences - strands of DNA base pairs - as a String (equivalently, a List of characters). However, since there are only 5 possible base pairs in odgi (ATCGN), we can represent each base using only 3 bits (as opposed to the 7/8 required to represent characters). Additionally, representing a sequence as a String allows users to insert arbitrary values into a sequence, potentially introducing subtle, difficult-to-detect bugs or corrupting the data. Instead of representing a DNA base as a character, we could represent it using a unique type. Then we can enforce the invariant that the only base pairs are ATCGN, improve our memory footprint, and possible speed up data transfer rates in certain computations (?).
Considering what this might look like, I did some back-of-the-napkin calculations and found that the
chr8.pan.gfa
graph stores 259525394 base pairs, which takes up 216 MB in memory if we represent characters using 7 bits. This might not sound like a lot, but I understand that the typical FPGA can only store a few megabytes at a time, so reducing the amount of memory used to 92 MB by using 3 bits to represent each base pair could reduce the time needed to computecrush
for the entire graph by a factor of x2.3 (in a perfect world where the complexity ofcrush
scales directly with the size of a sequence).If we introduced a special
Base
type, the only difference to our model is that node sequences would no longer be strings:crush
would becomeWe could even take this a step further and give
List<Base>
a special name, e.g.,Strand
. The way I conceptual this, Strand and List are are not interchangeable (i.e., not subtypes of each other). Because Base is effectively a subtype ofChar
, I imagine Strand as behaving like String (except the values of its characters are constrained). For example, we would still useseq.push(c)
instead ofseq.append(c)
, using the syntax for Strings rather than Lists. In this case, the node type would be as follows:crush
:Thoughts on the introduction of these new types? Do we prefer Base but not Strand, or Strand but different from how I described it? Neither?
Beta Was this translation helpful? Give feedback.
All reactions