[doc] huge additions and fixes to the documentation
Showing 17 changed files with 449 additions and 60 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Introduction {#record_based_intro} | ||
|
||
Most files in bioinformatics are composed of *records*, i.e. multiple independent entries that each consist of one or | ||
more *fields*. | ||
For example, a FastA file contains one or more sequence records that each contain an ID field and sequence field. | ||
|
||
[TOC] | ||
|
||
``` | ||
>myseq1 | ||
ACGT | ||
>myseq2 | ||
GAGGA | ||
>myseq3 | ||
ACTA | ||
``` | ||
|
||
<center> | ||
↓↓↓↓↓↓↓ | ||
</center> | ||
|
||
|
||
| ID field | sequence field | | ||
|:----------:|:--------------:| | ||
| "myseq1" | "ACGT" | | ||
| "myseq2" | "GAGGA" | | ||
| "myseq3" | "ACTA" | | ||
|
||
Each row in this table is conceptually "a record", and each file is modeled as a series of these records. | ||
The process of "reading a file" is transforming the on-disk representation shown at the top into the "abstraction" shown in the table. | ||
The process of "writing a file" is the reverse. | ||
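
As a rough illustration of this round trip, the following sketch uses only the C++ standard library. The names `fasta_record`, `read_fasta` and `write_fasta` are hypothetical and invented for this example; they are not part of B.I.O., and the parsing is deliberately simplistic.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; this is NOT the library's bio::record.
struct fasta_record
{
    std::string id;
    std::string seq;
};

// "Reading": transform the on-disk text into a series of records.
std::vector<fasta_record> read_fasta(std::istream & in)
{
    std::vector<fasta_record> records;
    std::string               line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;
        if (line.front() == '>')
            records.push_back({line.substr(1), ""}); // header line starts a new record
        else if (!records.empty())
            records.back().seq += line;              // a sequence may span several lines
    }
    return records;
}

// "Writing": the reverse transformation.
std::string write_fasta(std::vector<fasta_record> const & records)
{
    std::string out;
    for (fasta_record const & rec : records)
    {
        assert(rec.id.find('\n') == std::string::npos); // an ID must fit on one line
        out += ">" + rec.id + "\n" + rec.seq + "\n";
    }
    return out;
}
```

Feeding the three-record example from above into `read_fasta` should yield three entries, and `write_fasta` should reproduce the original text.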
|
||
Details on how records are defined are available here: \ref record_faq | ||
|
||
## Readers | ||
|
||
So-called *readers* are responsible for detecting the format and decoding a file into a series of records: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file | ||
|
||
The reader is an *input range*, which is C++ terminology for "something that you can iterate over (once)". | ||
The last bit is important: it implies that once you reach the end, the reader will be "empty". To iterate over it again, you need to recreate it. | ||
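
The single-pass behaviour can be observed with plain standard-library streams. The sketch below is only an analogy: it uses `std::istream_iterator` rather than a B.I.O. reader, and `count_remaining` and `demo_single_pass` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Counts the whitespace-separated tokens that can still be read from the stream.
std::size_t count_remaining(std::istream & in)
{
    return static_cast<std::size_t>(
      std::distance(std::istream_iterator<std::string>{in}, std::istream_iterator<std::string>{}));
}

void demo_single_pass()
{
    std::istringstream in{"rec1 rec2 rec3"};
    assert(count_remaining(in) == 3); // the first pass sees every entry
    assert(count_remaining(in) == 0); // the second pass finds the range exhausted

    std::istringstream recreated{"rec1 rec2 rec3"}; // "recreate" the source to iterate again
    assert(count_remaining(recreated) == 3);
}
```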
|
||
<!-- Details on how readers are defined is available here: \ref reader_writer_faq --> | ||
|
||
## Writers | ||
|
||
TODO |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,156 @@ | ||
# Record FAQ {#record_faq} | ||
|
||
Records in B.I.O. are implemented as a specialisation of the bio::record template.¹ | ||
This behaves very similarly to a std::tuple, with the difference that a bio::field identifier is associated with every | ||
element and a corresponding member function is provided, so you can easily access the elements without knowing the order. | ||
|
||
<small>¹ With the exception of bio::plain_io which uses bio::plain_io::record.</small> | ||
|
||
[TOC] | ||
|
||
\note This page contains details on how records are defined. It is meant to provide a better understanding of the design and performance implications. We recommend starting with the snippets shown in the API (e.g. bio::seq_io::reader, bio::var_io::reader, …) and only returning to this page if you have questions or want to fine-tune things. | ||
|
||
## What is the full type of my record? {#record_type} | ||
|
||
Most records you interact with are produced by readers. | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file | ||
|
||
In this example, `rec` is the record and with each iteration of the loop, a new record is generated from the file. The exact type of the record depends on the reader. In the above example, it is: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file_type | ||
|
||
That is quite long and difficult to remember (even though definitions of X* and Y* are omitted here), | ||
so we write `auto &` instead. | ||
But it is important to know which fields are contained in the record (in this case ID, SEQ and QUAL). | ||
The documentation for the reader will tell you this, e.g. bio::seq_io::reader. | ||
|
||
## How can I access the fields? | ||
|
||
The easiest way to access a field is by calling the respective member function: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file | ||
|
||
Here, `.id()` (bio::record#id()) and `.seq()` (bio::record#seq()) are used to access the fields. Note that the | ||
documentation has entries for all field-accessor member functions, but it depends on the specific specialisation | ||
(used by the reader) whether that function is available. | ||
So, on the record defined by bio::seq_io::reader above, the members `.id()`, `.seq()`, `.qual()` are available, but | ||
the member `.pos()` would not be. | ||
|
||
When the number of fields in the record is low and you know the order, you can also use | ||
[structured bindings](https://en.cppreference.com/w/cpp/language/structured_binding) | ||
to decompose the record into its fields: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp decomposed | ||
|
||
Note that the order of the fields is fixed (in this case it is defined by bio::seq_io::default_field_ids). | ||
It is independent of the names you give to the bindings, so this syntax is error-prone when used with large records | ||
(e.g. those defined by bio::var_io::reader). | ||
|
||
In generic contexts, you can also access fields via `get<0>(rec)` (returns the 0-th field in the record) or | ||
`get<bio::field::id>(rec)` (the same as calling `rec.id()`); but most users will never need this. | ||
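
Why positional decomposition is error-prone can be seen with a plain `std::tuple`. In this sketch, `toy_record` is a hypothetical stand-in whose field order (ID, SEQ, QUAL) is an assumption made for the example; it does not use bio::record.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Stand-in for a record whose fields are ID, SEQ and QUAL, in that fixed order (assumed for this sketch).
using toy_record = std::tuple<std::string, std::string, std::string>;

toy_record make_toy_record()
{
    return toy_record{"myseq1", "ACGT", "IIII"};
}

void demo_bindings()
{
    toy_record rec = make_toy_record();

    // The bindings follow the field order, so these names happen to be correct.
    auto & [id, seq, qual] = rec;
    assert(id == "myseq1" && seq == "ACGT" && qual == "IIII");

    // Swapped names still compile, but "seq2" now refers to the ID field.
    auto & [seq2, id2, qual2] = rec;
    assert(seq2 == "myseq1" && id2 == "ACGT" && qual2 == "IIII");

    // Positional access, comparable to get<0>(rec) in generic code.
    assert(std::get<0>(rec) == "myseq1");
}
```

The compiler cannot detect the swapped names in the second decomposition, which is why the named member functions are the safer default.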
|
||
|
||
## Does my record own the data? (Shallow vs deep records) {#shallow_vs_deep} | ||
|
||
As shown above, every field has an identifier (e.g. bio::field::id) and a type (e.g. std::string_view). | ||
You may have wondered why std::string_view is used as a type and what these `transform_view`s are. | ||
These imply that the record is a *shallow* data structure, i.e. the fields *appear* like strings or vectors, but they | ||
are implemented more like references or pointers. | ||
See the SeqAn3 documentation for an in-depth [Tutorial on Ranges and Views](http://docs.seqan.de/seqan/3-master-user/tutorial_ranges.html). | ||
|
||
Shallow records imply fewer memory allocations and/or copy operations during reading. This results in a **better | ||
performance** but also in some important limitations: | ||
|
||
* Shallow records cannot be modified (as easily²). | ||
* Shallow records cannot be "stored"; they depend on internal caches and buffers of the reader and become invalid | ||
as soon as the next record is read from the file. | ||
|
||
|
||
If you need to change a record in-place and/or "store" the record for longer than one iteration of the reader, you need to use *deep records* instead. | ||
You can tell the reader that you want deep records by providing the respective options: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp options2 | ||
|
||
This snippet behaves similarly to the previous one, except that the type of `rec` is now the following: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp options2_type | ||
|
||
This allows you to call std::vector's `.push_back()` member function (which is not possible in the default case). | ||
Creating this kind of record is likely a bit slower than creating a shallow record. | ||
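
The difference between a shallow and a deep field can be sketched with `std::string_view` and `std::string`. The `toy_reader` below is hypothetical and much simpler than a real reader: it reuses one fixed-size buffer, so an old view stays readable but shows the new data. With a real reader, a stale shallow record must simply be treated as invalid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical toy reader: it keeps one internal buffer that is reused for every record.
class toy_reader
{
public:
    // Copies the next "record" into the internal buffer and returns a shallow view onto it.
    std::string_view next(std::string_view data)
    {
        assert(data.size() <= buffer_.size());
        std::memcpy(buffer_.data(), data.data(), data.size());
        size_ = data.size();
        return std::string_view{buffer_.data(), size_};
    }

private:
    std::array<char, 64> buffer_{};
    std::size_t          size_{};
};

void demo_shallow_vs_deep()
{
    toy_reader reader;

    std::string_view shallow = reader.next("ACGT");
    std::string      deep{shallow}; // deep copy: owns its characters

    reader.next("TTTT"); // reading the next record overwrites the buffer

    assert(shallow == "TTTT"); // the shallow field no longer shows the first record
    assert(deep == "ACGT");    // the deep field is unaffected

    deep.push_back('A');       // and it can be modified in place
    assert(deep == "ACGTA");
}
```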
|
||
**Summary** | ||
|
||
* The records generated by readers are *shallow* by default. | ||
* This setting has the best performance; but it is less flexible than a *deep* record. | ||
* Readers can be configured to produce *deep* records via the options. | ||
|
||
<small>² Some modifying operations are possible on views, too, but this depends on the exact types.</small> | ||
|
||
## How can I change the field types? | ||
|
||
In the previous section, we showed how to change the field types from being shallow to deep. | ||
For some readers, more options are available, e.g. bio::seq_io::reader assumes nucleotide data for the SEQ field by default, but you might want to read protein data instead. | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader.cpp options | ||
|
||
The snippet above illustrates how the alphabet can be changed (and how to provide another option at the same time). | ||
|
||
Instead of using these pre-defined `field_types`, you can also define them completely manually. You can even decide to read only a subset of the fields by changing the `.field_ids` member: | ||
|
||
\snippet test/snippet/seq_io/seq_io_reader_options.cpp example_advanced2 | ||
|
||
This code makes FASTA the only legal format and creates records with only the sequence field as a std::string. | ||
|
||
But you can also use this mechanism to make some fields shallow and other fields deep. It also allows | ||
you to choose different container types. | ||
See the API documentation of the respective `reader_options` for more advanced use-cases and the | ||
exact restrictions on allowed types. | ||
|
||
## How can I create record variables? | ||
|
||
There are various easy ways to create a bio::record that do not involve manually providing the template arguments: | ||
|
||
1. Deduce from the reader. | ||
2. Use an alias. | ||
3. Use bio::make_record or bio::tie_record. | ||
|
||
### Deduce from the reader {#record_type_from_reader} | ||
|
||
When iterating over a reader, it is easy to use `auto &` to deduce the record type, but sometimes you need | ||
the record type outside of the for-loop or in a separate context. | ||
|
||
This snippet demonstrates how to read an interleaved FastQ file and process the read pairs together (at every second iteration of the loop): | ||
|
||
\snippet test/snippet/detail/reader_base.cpp read_pair_processing | ||
|
||
To do this, you need to use deep records, because shallow records become invalid after the loop iteration. | ||
Note how it is possible to "ask" the reader for the type of its record to create the local variable. | ||
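
The control flow of such a paired loop can be sketched without the library. Here plain `std::string` values stand in for deep records, and `pair_up` is a hypothetical helper name; the point is only that the first mate must be kept in storage that survives one loop iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Pairs up consecutive entries of an interleaved input: (0,1), (2,3), ...
std::vector<std::pair<std::string, std::string>> pair_up(std::vector<std::string> const & interleaved)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    std::string first;     // "deep" storage that outlives a single iteration
    std::size_t count = 0;

    for (std::string const & rec : interleaved)
    {
        if (++count % 2 == 1)
            first = rec;                               // remember the first mate (a copy)
        else
            pairs.emplace_back(std::move(first), rec); // process the pair at every second iteration
    }

    assert(pairs.size() == interleaved.size() / 2); // a trailing unpaired entry is dropped
    return pairs;
}
```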
|
||
### Record type aliases {#record_aliases} | ||
|
||
When writing a file without reading a file previously, you can use one of the predefined aliases: | ||
|
||
* bio::var_io::default_record | ||
|
||
This longer example illustrates using an alias: | ||
|
||
\snippet test/snippet/var_io/var_io_writer.cpp creation | ||
\snippet test/snippet/var_io/var_io_writer.cpp simple_usage_file | ||
|
||
Here bio::var_io::default_record is the type that a bio::var_io::reader would generate if it is defined without any options, **except that the alias is deep by default.** | ||
This is based on the assumption that aliases are typically used to define local variables whose values you want to change. | ||
|
||
### Making and tying records {#record_make_tie} | ||
|
||
There are convenience functions for making and tying records, similar to std::make_tuple and std::tie: | ||
\snippet test/snippet/record.cpp make_and_tie_record | ||
|
||
The type of rec1 is: | ||
\snippet test/snippet/record.cpp make_and_tie_record_type_rec1 | ||
|
||
The type of rec2 is: | ||
\snippet test/snippet/record.cpp make_and_tie_record_type_rec2 | ||
|
||
When creating a record from existing variables, you can use bio::tie_record to avoid needless copies. | ||
Instead of manually entering the identifiers as a bio::vtag, you can use bio::seq_io::default_field_ids (or the respective defaults of another reader/writer). |
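
Since these functions are described as similar to std::make_tuple and std::tie, the copy-versus-reference distinction can be illustrated with those standard-library counterparts. This sketch does not call bio::make_record or bio::tie_record, and it makes no claim about their exact signatures.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>

void demo_make_vs_tie()
{
    std::string id  = "myseq1";
    std::string seq = "ACGT";

    auto copied = std::make_tuple(id, seq); // owns copies of the values
    auto tied   = std::tie(id, seq);        // holds references to the variables

    static_assert(std::is_same_v<decltype(copied), std::tuple<std::string, std::string>>);
    static_assert(std::is_same_v<decltype(tied), std::tuple<std::string &, std::string &>>);

    seq += "A"; // modify the original variable

    assert(std::get<1>(copied) == "ACGT"); // the copy is unaffected
    assert(std::get<1>(tied) == "ACGTA");  // the reference sees the change
    assert(&std::get<0>(tied) == &id);     // no copy was made
}
```

The tied variant avoids copies, but it is only valid as long as the referenced variables are alive.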