Merge pull request #35 from h-2/docdoc
[doc] lots of documentation
h-2 committed Mar 10, 2022
2 parents e39f274 + 651c4af commit 0d99ece
Showing 17 changed files with 449 additions and 60 deletions.
9 changes: 6 additions & 3 deletions doc/main_page.md
@@ -1,12 +1,15 @@
# Welcome {#mainpage}

Welcome to the documentation of the B.I.O. library.
This web-site contains the API reference (documentation of our interfaces) and more elaborate Tutorials and
How-Tos.
This web-site contains the API reference (documentation of our interfaces) and some small Tutorials and HowTos.

B.I.O makes use of SeqAn3 and it is recommended to have a look at [their documentation](https://docs.seqan.de) first.


## Overview

This section contains a very short overview of the most important parts of the library.


### General IO Utilities

@@ -20,7 +23,7 @@ The transparent streams can be used in place of the standard library streams. Th
compressions such as GZip, BZip2 and BGZip.


### Readers and Writers
### Record-based I/O


| Reader | Writer | Description |
50 changes: 50 additions & 0 deletions doc/record_based/1_introduction.md
@@ -0,0 +1,50 @@
# Introduction {#record_based_intro}

Most files in bioinformatics are composed of *records*, i.e. multiple independent entries that each consist of one or
more *fields*.
For example, a FastA file contains one or more sequence records that each contain an ID field and sequence field.

[TOC]

```
>myseq1
ACGT
>myseq2
GAGGA
>myseq3
ACTA
```

<center>
↓↓↓↓↓↓↓
</center>


| ID field | sequence field |
|:----------:|:--------------:|
| "myseq1" | "ACGT" |
| "myseq2" | "GAGGA" |
| "myseq3" | "ACTA" |

Each row in this table is conceptually "a record", and each file is modeled as a series of these records.
The process of "reading a file" is transforming the on-disk representation (the FastA text at the top) into the abstraction shown in the table.
The process of "writing a file" is the reverse.
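
The mapping between the two representations can be sketched in plain C++. This is only an illustration of the concept and not how B.I.O. implements it: `toy_record`, `read_toy_fasta` and `write_toy_fasta` are hypothetical names, and the parser ignores everything a real FastA reader has to handle (comments, format detection, compression).

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a record with an ID field and a sequence field.
// This is NOT bio::record; it only mirrors the table above.
struct toy_record
{
    std::string id;
    std::string seq;
};

// "Reading": transform the on-disk text into a series of records.
std::vector<toy_record> read_toy_fasta(std::istream & in)
{
    std::vector<toy_record> records;
    std::string             line;

    while (std::getline(in, line))
    {
        if (line.empty())
            continue;

        if (line.front() == '>') // a header line starts a new record
        {
            records.push_back({line.substr(1), {}});
        }
        else // a sequence line belongs to the most recent record
        {
            assert(!records.empty()); // sequence data must follow a header line
            records.back().seq += line;
        }
    }
    return records;
}

// "Writing": the reverse transformation, from records back to text.
void write_toy_fasta(std::ostream & out, std::vector<toy_record> const & records)
{
    for (toy_record const & rec : records)
        out << '>' << rec.id << '\n' << rec.seq << '\n';
}
```

Feeding the three-record example from the top of this page through `read_toy_fasta` and then `write_toy_fasta` reproduces the original text.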

Details on how records are defined are available here: \ref record_faq

## Readers

So-called *readers* are responsible for detecting the format and decoding a file into a series of records:

\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file

The reader is an *input range*, which is C++ terminology for "something that you can iterate over (once)".
The last bit is important: it implies that once you reach the end, the reader will be "empty". To iterate over it again, you need to recreate it.
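
The single-pass behaviour is not specific to B.I.O.; standard stream iterators behave the same way. The following sketch uses only the standard library (the helper `count_tokens` is a made-up name for this example) to show that a second pass over the same source yields nothing:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Counts whitespace-separated tokens by iterating over the stream exactly once.
// std::istream_iterator is an input iterator, so every token is consumed as it is read.
std::size_t count_tokens(std::istream & in)
{
    std::size_t count = 0;
    for (std::istream_iterator<std::string> it{in}, end{}; it != end; ++it)
        ++count;
    return count;
}
```

Calling `count_tokens` twice on the same `std::istringstream` returns the number of tokens the first time and zero the second time; only a freshly constructed stream can be iterated again. A reader has to be recreated for the same reason.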

<!-- Details on how readers are defined is available here: \ref reader_writer_faq -->

## Writers

TODO
156 changes: 156 additions & 0 deletions doc/record_based/2_record_faq.md
@@ -0,0 +1,156 @@
# Record FAQ {#record_faq}

Records in B.I.O. are implemented as a specialisation of the bio::record template.¹
This behaves very similarly to a std::tuple, with the difference that a bio::field identifier is associated with every
element and a corresponding member function is provided, so you can easily access the elements without knowing the order.

<small>¹ With the exception of bio::plain_io which uses bio::plain_io::record.</small>

[TOC]

\note This page contains details on how records are defined. It is meant to provide a better understanding of the design and performance implications. We recommend starting with the snippets shown in the API (e.g. bio::seq_io::reader, bio::var_io::reader, …) and returning to this page only if you have questions or want to fine-tune things.

## What is the full type of my record? {#record_type}

Most records you interact with are produced by readers.

\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file

In this example, `rec` is the record and with each iteration of the loop, a new record is generated from the file. The exact type of the record depends on the reader. In the above example, it is:

\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file_type

That is quite long and difficult to remember (even though the definitions of X* and Y* are omitted here),
so we write `auto &` instead.
But it is important to know which fields are contained in the record (in this case ID, SEQ and QUAL).
The documentation for the reader will tell you this, e.g. bio::seq_io::reader.

## How can I access the fields?

The easiest way to access a field is by calling the respective member function:

\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file

Here, `.id()` (bio::record#id()) and `.seq()` (bio::record#seq()) are used to access the fields. Note that the
documentation has entries for all field-accessor member functions, but it depends on the specific specialisation
(used by the reader) whether that function is available.
So, on the record defined by bio::seq_io::reader above, the members `.id()`, `.seq()`, `.qual()` are available, but
the member `.pos()` would not be.

When the number of fields in the record is low and you know the order, you can also use
[structured bindings](https://en.cppreference.com/w/cpp/language/structured_binding)
to decompose the record into its fields:

\snippet test/snippet/seq_io/seq_io_reader.cpp decomposed

Note that the order of the fields is fixed (in this case it is defined by bio::seq_io::default_field_ids).
It is independent of the names you give to the bindings, so this syntax is error-prone when used with large records
(e.g. those defined by bio::var_io::reader).

In generic contexts, you can also access fields via `get<0>(rec)` (returns the 0-th field in the record) or
`get<bio::field::id>(rec)` (the same as calling `rec.id()`); but most users will never need this.
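
To make the idea of identifier-based access more concrete, here is a minimal sketch of a tuple whose elements are labelled with identifiers. It assumes C++17 and is not the actual bio::record implementation; `toy_field`, `toy_tag`, `index_of` and `toy_record` are hypothetical names chosen for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical identifiers, standing in for bio::field.
enum class toy_field { id, seq, qual };

// Carries a list of identifiers as a type (a stand-in for the vtag idea).
template <toy_field... ids>
struct toy_tag
{};

// Position of `wanted` within `ids...`, or sizeof...(ids) if it is absent.
template <toy_field wanted, toy_field... ids>
constexpr std::size_t index_of()
{
    constexpr std::array<toy_field, sizeof...(ids)> all{ids...};
    for (std::size_t i = 0; i < all.size(); ++i)
        if (all[i] == wanted)
            return i;
    return all.size();
}

template <typename tag_t, typename... types>
struct toy_record;

// A toy record: a std::tuple plus one identifier per element.
template <toy_field... ids, typename... types>
struct toy_record<toy_tag<ids...>, types...>
{
    static_assert(sizeof...(ids) == sizeof...(types), "one identifier per field");

    std::tuple<types...> fields;

    // Access by identifier; does not compile if the record lacks that field.
    template <toy_field wanted>
    auto & get()
    {
        constexpr std::size_t i = index_of<wanted, ids...>();
        static_assert(i < sizeof...(ids), "field is not part of this record");
        return std::get<i>(fields);
    }

    // Named accessors; each is only instantiated (and checked) when it is called.
    auto & id() { return get<toy_field::id>(); }
    auto & seq() { return get<toy_field::seq>(); }
};
```

With a record declared as `toy_record<toy_tag<toy_field::seq, toy_field::id>, std::string, std::string>`, calling `.id()` returns the second element although the caller never states its position. This is the property that makes named access less error-prone than positional access or structured bindings.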


## Does my record own the data? (Shallow vs deep records) {#shallow_vs_deep}

As shown above, every field has an identifier (e.g. bio::field::id) and a type (e.g. std::string_view).
You may have wondered why std::string_view is used as a type and what these `transform_view`s are.
These imply that the record is a *shallow* data structure, i.e. the fields *appear* like strings or vectors, but they
are implemented more like references or pointers.
See the SeqAn3 documentation for an in-depth [Tutorial on Ranges and Views](http://docs.seqan.de/seqan/3-master-user/tutorial_ranges.html).

Shallow records imply fewer memory allocations and/or copy operations during reading. This results in a **better
performance** but also in some important limitations:

* Shallow records cannot be modified (as easily²).
* Shallow records cannot be "stored"; they depend on internal caches and buffers of the reader and become invalid
as soon as the next record is read from the file.


If you need to change a record in-place and/or "store" the record for longer than one iteration of the reader, you need to use *deep records* instead.
You can tell the reader that you want deep records by providing the respective options:

\snippet test/snippet/seq_io/seq_io_reader.cpp options2

This snippet behaves similarly to the previous one, except that the type of `rec` is now the following:

\snippet test/snippet/seq_io/seq_io_reader.cpp options2_type

This allows you to call std::vector's `.push_back()` member function (which is not possible in the default case).
Creating this kind of record is likely a bit slower than the shallow record.
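
The difference can be illustrated without B.I.O. at all. In the sketch below, `toy_reader` is a hypothetical class with a single fixed buffer that is reused for every "record"; it is an assumption made for this example and says nothing about how the real readers manage their buffers. The value it returns is shallow (a std::string_view into the buffer), while a std::string constructed from it is deep.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical reader with one fixed internal buffer that is reused for every record.
class toy_reader
{
public:
    // Copies `line` into the buffer and returns a shallow view of it.
    // The view refers to the buffer; it does not own any characters.
    std::string_view load(std::string_view line)
    {
        assert(line.size() <= buffer_.size()); // the toy buffer has a fixed capacity
        std::copy(line.begin(), line.end(), buffer_.begin());
        return std::string_view{buffer_.data(), line.size()};
    }

private:
    std::array<char, 64> buffer_{};
};
```

If you keep the view returned for "ACGT" and then load "TTTT", the view silently reads "TTTT", whereas a `std::string` copied from the view beforehand still holds "ACGT" and can be modified with `.push_back()`. Shallow records behave like the view, deep records like the copy.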

**Summary**

* The records generated by readers are *shallow* by default.
* This setting has the best performance; but it is less flexible than a *deep* record.
* Readers can be configured to produce *deep* records via the options.

<small>² Some modifying operations are possible on views, too, but this depends on the exact types.</small>

## How can I change the field types?

In the previous section, we showed how to change the field types from being shallow to deep.
For some readers, more options are available, e.g. bio::seq_io::reader assumes nucleotide data for the SEQ field by default, but you might want to read protein data instead.

\snippet test/snippet/seq_io/seq_io_reader.cpp options

The snippet above illustrates how the alphabet can be changed (and how to provide another option at the same time).

Instead of using these pre-defined `field_types`, you can also define them completely manually. You can even decide to read only a subset of the fields by changing the `.field_ids` member:

\snippet test/snippet/seq_io/seq_io_reader_options.cpp example_advanced2

This code makes FASTA the only legal format and creates records with only the sequence field as a std::string.

But you can also use this mechanism to make some fields shallow and other fields deep. It also allows
you to choose different container types.
See the API documentation of the respective `reader_options` for more advanced use-cases and the
exact restrictions on allowed types.

## How can I create record variables?

There are various easy ways to create a bio::record that do not involve manually providing the template arguments:

1. Deduce from the reader.
2. Use an alias.
3. Use bio::make_record or bio::tie_record.

### Deduce from the reader {#record_type_from_reader}

When iterating over a reader, it is easy to use `auto &` to deduce the record type, but sometimes you need
the record type outside of the for-loop or in a separate context.

This snippet demonstrates how to read an interleaved FastQ file and process the read pairs together (at every second iteration of the loop):

\snippet test/snippet/detail/reader_base.cpp read_pair_processing

To do this, you need to use deep records, because shallow records become invalid after the loop iteration.
Note how it is possible to "ask" the reader for the type of its record to create the local variable.
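
The control flow of such a loop can be sketched independently of the library. The example below is an assumption-laden stand-in: `toy_read` is a hypothetical deep record (it owns its strings), and a std::vector takes the place of the reader. It does not use the `record_type` member mentioned above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deep record: both fields own their data.
struct toy_read
{
    std::string id;
    std::string seq;
};

// Groups consecutive records of an interleaved input into pairs.
// A trailing unpaired record is ignored.
std::vector<std::pair<toy_read, toy_read>> pair_up(std::vector<toy_read> const & reads)
{
    std::vector<std::pair<toy_read, toy_read>> pairs;

    toy_read last_record;        // survives the loop iteration, so it must be deep
    bool     have_first = false; // true while the first mate is waiting for its partner

    for (toy_read const & rec : reads)
    {
        if (have_first)
            pairs.emplace_back(last_record, rec); // every second iteration: process the pair
        else
            last_record = rec; // back up the first mate (a copy of owned data)

        have_first = !have_first;
    }
    return pairs;
}
```

The variable declared outside the loop is the reason the record type has to be nameable: with a real reader you would declare it using the reader's record type rather than a hand-written struct.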

### Record type aliases {#record_aliases}

When writing a file without reading a file previously, you can use one of the predefined aliases:

* bio::var_io::default_record

This longer example illustrates using an alias:

\snippet test/snippet/var_io/var_io_writer.cpp creation
\snippet test/snippet/var_io/var_io_writer.cpp simple_usage_file

Here bio::var_io::default_record is the type that a bio::var_io::reader would generate if it is defined without any options, **except that the alias is deep by default.**
This is based on the assumption that aliases are typically used to define local variables whose values you want to change.

### Making and tying records {#record_make_tie}

There are convenience functions for making and tying records, similar to std::make_tuple and std::tie:
\snippet test/snippet/record.cpp make_and_tie_record

The type of rec1 is:
\snippet test/snippet/record.cpp make_and_tie_record_type_rec1

The type of rec2 is:
\snippet test/snippet/record.cpp make_and_tie_record_type_rec2

When creating a record from existing variables, you can use bio::tie_record to avoid needless copies.
Instead of manually entering the identifiers as a bio::vtag, you can use bio::seq_io::default_field_ids (or the respective defaults of another reader/writer).
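
Since these functions are modelled on std::make_tuple and std::tie, the copy-versus-reference distinction can be shown with the standard library alone. This sketch does not call bio::make_record or bio::tie_record; `make_copy` and `make_refs` are hypothetical helpers that only wrap the std:: counterparts.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Copies the arguments, like std::make_tuple: the result owns its own strings.
inline std::tuple<std::string, std::string> make_copy(std::string & id, std::string & seq)
{
    return std::make_tuple(id, seq);
}

// References the arguments, like std::tie: the result only refers to the variables.
inline std::tuple<std::string &, std::string &> make_refs(std::string & id, std::string & seq)
{
    return std::tie(id, seq);
}
```

Modifying an element of the tuple returned by `make_refs` changes the original variable, while modifying the tuple returned by `make_copy` leaves it untouched. The same trade-off applies to tied and made records: tying avoids copies, but the tied record is only valid while the original variables exist.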
15 changes: 14 additions & 1 deletion include/bio/detail/reader_base.hpp
@@ -37,6 +37,7 @@ namespace bio
// ----------------------------------------------------------------------------

/*!\brief This is a (non-CRTP) base-class for I/O readers.
* \ingroup bio
* \tparam options_t Type of the reader options.
* \details
*
@@ -72,7 +73,19 @@ class reader_base : public std::ranges::view_base
* \brief The exact type of the record depends on the options!
* \{
*/
//!\brief The type of the record, a specialisation of bio::record; acts as a tuple of the selected field types.
/*!\brief The type of the record, a specialisation of bio::record.
* \details
*
* ### Example
*
* This snippet demonstrates how to read an interleaved FastQ file and process the read pairs
* together (at every second iteration of the loop):
*
* \snippet test/snippet/detail/reader_base.cpp read_pair_processing
*
* To be able to easily back up the first record of a mate-pair, you need to create a temporary
* variable (`last_record`). This type alias helps define it.
*/
using record_type = record<decltype(options_t::field_ids), decltype(options_t::field_types)>;
//!\brief The iterator type of this view (an input iterator).
using iterator = detail::in_file_iterator<reader_base>;
2 changes: 2 additions & 0 deletions include/bio/misc.hpp
@@ -34,6 +34,8 @@ namespace bio
* Typically used to configure a class template to have members that are vectors/strings VS members that are views.
* The "shallow" version of such a class is typically cheap to copy (no dynamic memory) while the "deep" version
* is expensive to copy (holds dynamic memory).
*
* See \ref shallow_vs_deep on what this means in practice.
*/
enum class ownership
{
39 changes: 26 additions & 13 deletions include/bio/record.hpp
@@ -107,23 +107,22 @@ namespace bio
/*!\brief The class template that file records are based on; behaves like an std::tuple.
* \implements seqan3::tuple_like
* \ingroup bio
* \tparam field_types The types of the fields in this record as a seqan3::type_list.
* \tparam field_ids A vtag_t type with bio::field IDs corresponding to field_types.
* \tparam field_types The types of the fields in this record as a seqan3::type_list.
* \details
*
* This class template behaves just like an std::tuple, with the exception that it provides an additional
* This class template behaves like a std::tuple, with the exception that it provides an additional
* get-interface that takes a bio::field identifier. The traditional get interfaces (via index and
* via type) are also supported, but discouraged, because accessing via bio::field is unambiguous and
* better readable.
*
* ### Example
* In addition to the get()-interfaces, member accessors are provided with the same name as the fields.
*
* For input files this template is specialised automatically and provided by the file via its `record_type` member.
* For output files you may define it locally and pass instances of this to the output file's `push_back()`.
* See bio::seq_io::reader for how this data structure is used in practice.
*
* This is how it works:
* See #make_record() and #tie_record() for easy ways to create stand-alone record variables.
*
* \todo include test/snippet/io/record_2.cpp
* See the \ref record_faq for more details.
*/
template <typename field_ids_, typename field_types_>
struct record : seqan3::detail::transfer_template_args_onto_t<field_types_, std::tuple>
@@ -372,15 +371,22 @@ auto const && get(record<field_ids, field_types> const && r)
// make_record
//-------------------------------------------------------------------------------

/*!\brief Create a bio::record and deduce type from arguments (like std::make_tuple for std::tuple).
/*!\brief Create a deep bio::record from the arguments (like std::make_tuple for std::tuple).
* \param[in] tag A tag that specifies the identifiers of the subsequent arguments.
* \param[in] fields The arguments to put into the record.
* \returns A bio::record with copies of the field arguments.
* \details
*
* The record will contain copies of the arguments.
*
* For more information, see \ref record_type and \ref record_make_tie
*
* ### Example
*
* TODO
* \snippet test/snippet/record.cpp make_and_tie_record
*/
template <auto... field_ids, typename... field_type_ts>
constexpr auto make_record(vtag_t<field_ids...>, field_type_ts &&... fields)
constexpr auto make_record(vtag_t<field_ids...> BIO_DOXYGEN_ONLY(tag), field_type_ts &&... fields)
-> record<vtag_t<field_ids...>, seqan3::type_list<std::decay_t<field_type_ts>...>>
{
return {std::forward<field_type_ts>(fields)...};
@@ -390,15 +396,22 @@ constexpr auto make_record(vtag_t<field_ids...>, field_type_ts &&... fields)
// tie_record
//-------------------------------------------------------------------------------

/*!\brief Create a bio::record of references (like std::tie for std::tuple).
/*!\brief Create a shallow bio::record from the arguments (like std::tie for std::tuple).
* \param[in] tag A tag that specifies the identifiers of the subsequent arguments.
* \param[in] fields The arguments to represent in the record.
* \returns A bio::record with references to the field arguments.
* \details
*
* The record will contain references to the arguments.
*
* For more information, see \ref record_type and \ref record_make_tie
*
* ### Example
*
* TODO
* \snippet test/snippet/record.cpp make_and_tie_record
*/
template <auto... field_ids, typename... field_type_ts>
constexpr auto tie_record(vtag_t<field_ids...>, field_type_ts &... fields)
constexpr auto tie_record(vtag_t<field_ids...> BIO_DOXYGEN_ONLY(tag), field_type_ts &... fields)
-> record<vtag_t<field_ids...>, seqan3::type_list<field_type_ts &...>>
{
return {fields...};
5 changes: 5 additions & 0 deletions include/bio/seq_io/reader.hpp
@@ -92,6 +92,11 @@ namespace bio::seq_io
* at the first whitespace:
* \snippet test/snippet/seq_io/seq_io_reader.cpp options
*
* If you need to modify or store the records, request *deep records* from the reader:
* \snippet test/snippet/seq_io/seq_io_reader.cpp options2
*
* For more information on *shallow* vs *deep*, see \ref shallow_vs_deep
*
* For more advanced options, see bio::seq_io::reader_options.
*/
template <typename... option_args_t>
2 changes: 2 additions & 0 deletions include/bio/seq_io/reader_options.hpp
@@ -104,6 +104,8 @@ inline constinit auto field_types_protein = field_types<ownership::shallow, seqa
* \details
*
* Configures a shallow record where sequence and quality data are plain characters.
* This can be used in cases where the application needs to handle nucleotide *and*
* protein data.
*/
inline constinit auto field_types_char = field_types<ownership::shallow, char, char>;
//!\}
3 changes: 1 addition & 2 deletions include/bio/var_io/reader.hpp
@@ -55,8 +55,7 @@ namespace bio::var_io
* are returned by default also correspond to VCF specification (i.e. 1-based positions, string as strings and not
* as numbers) **with one exception:** the genotypes are not grouped by sample (as in the VCF format) but by
* genotype field (as in the BCF format).
* This results in a notably better performance when reading BCF files. See below for information on how to change
* this.
* This results in a notably better performance when reading BCF files.
*
* This reader supports the following formats:
*