diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
-{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-04-10T08:58:01","documenter_version":"1.1.0"}}
+{"documenter":{"julia_version":"1.10.3","generation_timestamp":"2024-05-18T09:29:44","documenter_version":"1.1.0"}}

diff --git a/dev/advising/index.html b/dev/advising/index.html
@@ -28,7 +28,7 @@
        result
    end
    (post ∘ writecommit, writefn, (output, info))
end)
source

Advisement points (standard join points)

Parsing and serialisation of data sets and collections

DataCollections, DataSets, and AbstractDataTransformers are advised at two stages during parsing:

  1. When calling fromspec on the Dict representation, at the start of parsing
  2. At the end of the fromspec function, calling identity on the object

Serialisation is performed through the tospec call, which is also advised.

The signatures of the advised function calls are as follows:

fromspec(DataCollection, spec::Dict{String, Any}; path::Union{String, Nothing})::DataCollection
 identity(collection::DataCollection)::DataCollection
 tospec(collection::DataCollection)::Dict
fromspec(DataSet, collection::DataCollection, name::String, spec::Dict{String, Any})::DataSet
 identity(dataset::DataSet)::DataSet
@@ -37,4 +37,4 @@
 tospec(adt::AbstractDataTransformer)::Dict

Processing identifiers

Both the parsing of an Identifier from a string, and the serialisation of an Identifier to a string are advised. Specifically, the following function calls:

parse_ident(spec::AbstractString)
 string(ident::Identifier)

The data flow arrows

The reading, writing, and storage of data may all be advised. Specifically, the following function calls:

load(loader::DataLoader, datahandle, as::Type)
 storage(provider::DataStorage, as::Type; write::Bool)
save(writer::DataWriter, datahandle, info)
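
Because each of these calls passes through the advising machinery, a plugin can rewrite the arguments before the call, or the result after it. As a rough sketch (the advice-function form follows the advise example at the top of this page; the lowercasing behaviour itself is purely illustrative, not part of DataToolkitBase):

# An advice function receives (post, func, args...) and returns a
# (post, func, args) tuple. This hypothetical advice lowercases the
# identifier string before parse_ident runs.
normalise_idents = DataToolkitBase.Advice(
    function (post::Function, f::typeof(parse_ident), spec::AbstractString)
        (post, f, (lowercase(spec),))
    end)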

Index of advised calls (all known join points)

There are 33 advised function calls, across 9 files, covering 12 functions (automatically detected).

Arranged by function

_read (2 instances)

fromspec (5 instances)

identity (3 instances)

init (1 instance)

lint (1 instance)

load (2 instances)

parse_ident (8 instances)

refine (1 instance)

save (1 instance)

storage (1 instance)

string (5 instances)

tospec (3 instances)

Arranged by file

display.jl (1 instance)
externals.jl (8 instances)
lint.jl (1 instance)
manipulation.jl (4 instances)
errors.jl (4 instances)
identification.jl (4 instances)
parameters.jl (2 instances)
parser.jl (6 instances)
writer.jl (3 instances)
diff --git a/dev/datatoml/index.html b/dev/datatoml/index.html
@@ -39,4 +39,4 @@
type=["a QualifiedType", ...] # probably optional
type="a QualifiedType" # single-value alternative form
priority=1 # (optional)
# other properties...

A data set is a top-level instance of an array of tables, with any name other than config. Data set names need not be unique, but should be able to be uniquely identified by the combination of their name and parameters.

Apart from data transformers, there is one recognised data property: uuid, a UUIDv4 string. Any number of additional properties may be given (so long as they do not conflict with the transformer names), they may have special behaviour based on plugins or extensions loaded, but will not be treated specially by DataToolkitBase.

A data set can have any number of data transformers, but at least two are needed for a functional data set. Data transformers are instances of an array of tables (like data sets), but directly under the data set table.

Structure of a data transformer

There are three data transformer types, with the following names:

  • storage
  • loader
  • writer

All transformers recognise three properties:

  • driver, the name of the driver to use (mandatory)
  • type, a QualifiedType (or list thereof) the transformer supports
  • priority, an integer used to rank alternatives

The driver property is mandatory. type and priority can be omitted, in which case they will adopt the default values. The default type value is either determined dynamically from the available methods, or set for that particular transformer.
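
To tie this together, here is a hedged sketch of a minimal Data.toml, parsed with the documented read(io, DataCollection) method (the data set name "mydata", the path, the driver names, and both UUIDs are illustrative, and suitable transformer drivers are assumed to be defined):

using DataToolkitBase

data_toml = """
data_config_version = 0
uuid = "84068340-1a5d-4d5e-97f6-3f67dd424671"

[[mydata]]
uuid = "d81bcd54-ce41-4a7b-b54f-38542f4b47eb"

    [[mydata.storage]]
    driver = "filesystem"
    path = "mydata.csv"

    [[mydata.loader]]
    driver = "csv"
"""

collection = read(IOBuffer(data_toml), DataCollection)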

diff --git a/dev/errors/index.html b/dev/errors/index.html
@@ -9,37 +9,37 @@
ERROR: UnresolveableIdentifier: "iris::Int" does not match any available data sets
Without the type restriction, however, the following data sets match:
    dataset:iris, which is available as a DataFrame, Matrix, CSV.File
Stacktrace: [...]
source
DataToolkitBase.AmbiguousIdentifierType
AmbiguousIdentifier(identifier::Union{String, UUID}, matches::Vector, [collection])

Searching for identifier (optionally within collection), found multiple matches (provided as matches).

Example occurrence

julia> d"multimatch"
 ERROR: AmbiguousIdentifier: "multimatch" matches multiple data sets
     ■:multimatch [45685f5f-e6ff-4418-aaf6-084b847236a8]
     ■:multimatch [92be4bda-55e9-4317-aff4-8d52ee6a5f2c]
Stacktrace: [...]
source

Package exceptions

DataToolkitBase.UnregisteredPackageType
UnregisteredPackage(pkg::Symbol, mod::Module)

The package pkg was asked for within mod, but has not been registered by mod, and so cannot be loaded.

Example occurrence

julia> @import Foo
 ERROR: UnregisteredPackage: Foo has not been registered by Main, see @addpkg for more information
Stacktrace: [...]
source
DataToolkitBase.MissingPackageType
MissingPackage(pkg::Base.PkgId)

The package pkg was asked for, but does not seem to be available in the current environment.

Example occurrence

julia> @addpkg Bar "00000000-0000-0000-0000-000000000000"
 Bar [00000000-0000-0000-0000-000000000000]
 
 julia> @import Bar
 [ Info: Lazy-loading Bar [00000000-0000-0000-0000-000000000001]
 ERROR: MissingPackage: Bar [00000000-0000-0000-0000-000000000001] has been required, but does not seem to be installed.
Stacktrace: [...]
source

Data Operation exceptions

DataToolkitBase.CollectionVersionMismatchType
CollectionVersionMismatch(version::Int)

The version of the collection currently being acted on is not supported by the current version of DataToolkitBase.

Example occurrence

julia> fromspec(DataCollection, SmallDict{String, Any}("data_config_version" => -1))
 ERROR: CollectionVersionMismatch: -1 (specified) ≠ 0 (current)
   The data collection specification uses the v-1 data collection format, however
   the installed DataToolkitBase version expects the v0 version of the format.
   In the future, conversion facilities may be implemented, for now though you
   will need to manually upgrade the file to the v0 format.
Stacktrace: [...]
source
DataToolkitBase.EmptyStackErrorType
EmptyStackError()

An attempt was made to perform an operation on a collection within the data stack, but the data stack is empty.

Example occurrence

julia> getlayer(nothing) # with an empty STACK
 ERROR: EmptyStackError: The data collection stack is empty
Stacktrace: [...]
source
DataToolkitBase.ReadonlyCollectionType
ReadonlyCollection(collection::DataCollection)

Modification of collection is not viable, as it is read-only.

Example occurrence

julia> lockedcollection = DataCollection(SmallDict{String, Any}("uuid" => Base.UUID(rand(UInt128)), "config" => SmallDict{String, Any}("locked" => true)))
 julia> write(lockedcollection)
 ERROR: ReadonlyCollection: The data collection unnamed#298 is locked
Stacktrace: [...]
source
DataToolkitBase.TransformerErrorType
TransformerError(msg::String)

A catch-all for issues involving data transformers, with details given in msg.

Example occurrence

julia> emptydata = DataSet(DataCollection(), "empty", SmallDict{String, Any}("uuid" => Base.UUID(rand(UInt128))))
 DataSet empty
 
 julia> read(emptydata)
 ERROR: TransformerError: Data set "empty" could not be loaded in any form.
Stacktrace: [...]
source
DataToolkitBase.UnsatisfyableTransformerType
UnsatisfyableTransformer{T}(dataset::DataSet, types::Vector{QualifiedType})

A transformer (of type T) that could provide any of types was asked for, but there is no transformer that satisfies this restriction.

Example occurrence

julia> emptydata = DataSet(DataCollection(), "empty", SmallDict{String, Any}("uuid" => Base.UUID(rand(UInt128))))
 DataSet empty
 
 julia> read(emptydata, String)
 ERROR: UnsatisfyableTransformer: There are no loaders for "empty" that can provide a String. The defined loaders are as follows:
Stacktrace: [...]
source
DataToolkitBase.OrphanDataSetType
OrphanDataSet(dataset::DataSet)

The data set (dataset) is no longer a child of its parent collection.

This error should not occur, and is intended as a sanity check should something go quite wrong.

source
diff --git a/dev/index.html b/dev/index.html
@@ -1,2 +1,2 @@
Introduction · DataToolkitBase.jl

Introduction

The problem with the current state of affairs

Data is beguiling. It can initially seem simple to deal with: "here I have a file, and that's it". However as soon as you do things with the data you're prone to be asked tricky questions like:

  • where's the data?
  • how did you process that data?
  • how can I be sure I'm looking at the same data as you?

This is no small part of the replication crisis.

image

Further concerns arise as soon as you start dealing with large quantities of data, or computationally expensive derived data sets. For example:

  • Have I already computed this data set somewhere else?
  • Is my generated data up to date with its sources/dependencies?

Generic tools exist for many parts of this problem, but there are some benefits that can be realised by creating a Julia-specific system, namely:

  • Having all pertinent environmental information for the data processing contained in a single Project.toml
  • Improved convenience in data loading and management, compared to a generic solution
  • Allowing datasets to be easily shared with a Julia package

In addition, the Julia community seems to have a strong tendency to NIH[NIH] tools, so we may as well get ahead of this and try to make something good 😛.

Pre-existing solutions

DataLad

  • Does a lot of things well
  • Puts information on how to create data in git commit messages (bad)
  • No data file specification

Kedro data catalog

Snakemake

  • Workflow manager, with remote file support
  • Snakemake Remote Files
  • Good list of possible file locations to handle
  • Drawback is that you have to specify the location you expect (S3, HTTP, FTP, etc.)
  • No data file specification

Nextflow

  • Workflow manager, with remote file support
  • Docs on files and IO
  • Docs on S3
  • You just call file() and Nextflow figures out, under the hood, whether it should pull it from S3, HTTP, FTP, or a local file.
  • No data file specification
  • NIH: Not Invented Here, a tendency to "reinvent the wheel" to avoid using tools from external origins — it would of course be better if you (re)made it.
diff --git a/dev/libinternal/index.html b/dev/libinternal/index.html
@@ -3,10 +3,10 @@
                 ╵               ▼
Storage ◀────▶ Data          Information
                 ▲               ╷
                 ╰────writer─────╯

There are three subtypes:

  • DataStorage
  • DataLoader
  • DataWriter

Each subtype takes a Symbol type parameter designating the driver which should be used to perform the data operation. In addition, each subtype has the following fields:

  • dataset::DataSet, the data set the method operates on
  • type::Vector{<:QualifiedType}, the Julia types the method supports
  • priority::Int, the priority with which this method should be used, compared to alternatives. Lower values have higher priority.
  • parameters::SmallDict{String, Any}, any parameters applied to the method.
source

Advice Amalgamation

DataToolkitBase.AdviceAmalgamationType

A collection of Advices sourced from available Plugins.

Like individual Advices, an AdviceAmalgamation can be called as a function. However, it also supports the following convenience syntax:

(::AdviceAmalgamation)(f::Function, args...; kargs...) # -> result

Constructors

AdviceAmalgamation(adviseall::Function, advisors::Vector{Advice},
                    plugins_wanted::Vector{String}, plugins_used::Vector{String})
 AdviceAmalgamation(plugins::Vector{String})
AdviceAmalgamation(collection::DataCollection)
source
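
As a small usage sketch (the plugin name is illustrative), an amalgamation can be built from a list of plugin names and then used to advise an arbitrary call:

amalg = AdviceAmalgamation(String["defaults"])
amalg(identity, 1)  # runs identity(1) through any applicable advices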

Qualified Types

DataToolkitBase.QualifiedTypeType

A representation of a Julia type that does not need the type to be defined in the Julia session, and can be stored as a string. This is done by storing the type name and the module it belongs to as Symbols.

Warning

While QualifiedType is quite capable, it is not currently able to express the full gamut of Julia types. In future this will be improved, but it will likely always be restricted to a certain subset.

Subtyping

While the subtype operator cannot work on QualifiedTypes (<: is a built-in), when the Julia types are defined the subset operator can be used instead. This works by simply converting the QualifiedTypes to the corresponding Type and then applying the subtype operator.

julia> QualifiedType(:Base, :Vector) ⊆ QualifiedType(:Core, :Array)
 true
 
 julia> Matrix ⊆ QualifiedType(:Core, :Array)
@@ -17,9 +17,9 @@
 
 julia> QualifiedType(:Base, :Foobar) ⊆ AbstractVector
 false

Constructors

QualifiedType(parentmodule::Symbol, typename::Symbol)
QualifiedType(t::Type)

Parsing

A QualifiedType can be expressed as a string as "$parentmodule.$typename". This can be easily parsed as a QualifiedType, e.g. parse(QualifiedType, "Core.IO").

source
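
For instance (a hedged sketch; the exact printed forms may vary):

qt = parse(QualifiedType, "Core.IO")
string(qt)  # "Core.IO"
qt ⊆ IO     # true, by the subtyping rules described above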

Global variables

DataToolkitBase.STACKConstant

The set of data collections currently available.

source
DataToolkitBase.PLUGINSConstant

The set of plugins currently available.

source
DataToolkitBase.EXTRA_PACKAGESConstant

The set of packages loaded by each module via @addpkg, for import with @import.

More specifically, when a module M invokes @addpkg pkg id then EXTRA_PACKAGES[M][pkg] = id is set, and then this information is used with @import to obtain the package from the root module.

source
DataToolkitBase.DATA_CONFIG_RESERVED_ATTRIBUTESConstant

The data specification TOML format constructs a DataCollection, which itself contains DataSets, comprised of metadata and AbstractDataTransformers.

DataCollection
 ├─ DataSet
 │  ├─ AbstractDataTransformer
 │  └─ AbstractDataTransformer
 ├─ DataSet
⋮

Within each scope, there are certain reserved attributes. They are listed in this Dict under the following keys:

  • :collection for DataCollection
  • :dataset for DataSet
  • :transformer for AbstractDataTransformer
source
diff --git a/dev/newtransformer/index.html b/dev/newtransformer/index.html
@@ -1,6 +1,6 @@
Transformer backends · DataToolkitBase.jl

Creating a new data transformer

As mentioned before, there are three types of data transformer:

  • storage
  • loader
  • writer

The three corresponding Julia types are:

  • DataStorage
  • DataLoader
  • DataWriter

All three types accept a driver (symbol) type parameter. For example, a storage transformer using a "filesystem" driver would be of the type DataStorage{:filesystem}.

Adding support for a new driver is as simple as adding method implementations for the three key data transformer methods:

DataToolkitBase.loadFunction
load(loader::DataLoader{driver}, source::Any, as::Type)

Using a certain loader, obtain information in the form of as from the data given by source.

This fulfils this component of the overall data flow:

  ╭────loader─────╮
   ╵               ▼
Data          Information

When the loader produces nothing this is taken to indicate that it was unable to load the data for some reason, and that another loader should be tried if possible. This can be considered a soft failure. Any other value is considered valid information.

source
DataToolkitBase.storageFunction
storage(storer::DataStorage, as::Type; write::Bool=false)

Fetch a storer in form as, appropriate for reading from or writing to (depending on write).

By default, this just calls getstorage or putstorage (when write=true).

This executes this component of the overall data flow:

Storage ◀────▶ Data
source
DataToolkitBase.saveFunction
save(writer::DataWriter{driver}, destination::Any, information::Any)

Using a certain writer, save the information to the destination.

This fulfils this component of the overall data flow:

Data          Information
   ▲               ╷
  ╰────writer─────╯
source
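
Putting these together, here is a hedged sketch of a toy driver (the :passthrough name and its "value" parameter are invented for illustration): it stores a string in the transformer's parameters and loads it as a String.

using DataToolkitBase

# Storage: hand back the string kept in the transformer's parameters.
# (storage() falls back to getstorage by default, as noted above.)
function DataToolkitBase.getstorage(storage::DataStorage{:passthrough}, ::Type{String})
    get(storage.parameters, "value", nothing)
end

# Loader: the data handle is already a String, so pass it through.
# Returning nothing would instead signal a soft failure, per the load
# docstring above.
function DataToolkitBase.load(::DataLoader{:passthrough}, from::String, ::Type{String})
    from
end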
diff --git a/dev/packages/index.html b/dev/packages/index.html
@@ -1,12 +1,12 @@
Packages · DataToolkitBase.jl

Using Packages

In the course of writing a package that provides a custom data transformer, it is entirely likely that extra packages will be needed.

Every possibly desired package could be shoved into the list of dependencies, but this is a somewhat crude approach. A more granular approach is enabled with two macros, @addpkg and @import.

Letting DataToolkitBase know about extra packages

DataToolkitBase.@addpkgMacro
@addpkg name::Symbol uuid::String

Register the package identified by name with UUID uuid. This package may now be used with @import $name.

All @addpkg statements should lie within a module's __init__ function.

Example

@addpkg CSV "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
source
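
Concretely (a hedged sketch of the __init__ placement described above; the module name is hypothetical):

module MyDataPlugin

using DataToolkitBase

function __init__()
    # Register CSV so it can later be fetched with @import CSV.
    @addpkg CSV "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
end

end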

Using extra packages

DataToolkitBase.@importMacro
@import pkg1, pkg2...
 @import pkg1 as name1, pkg2 as name2...
 @import pkg: foo, bar...
 @import pkg: foo as bar, bar as baz...

Fetch modules previously registered with @addpkg, and import them into the current namespace. This macro tries to largely mirror the syntax of using.

For the sake of type inference, it is assumed that all bindings that start with a lower case letter are functions, and bindings that start with an upper case letter are types. Exceptions must be manually annotated with type assertions.

If a required package had to be loaded for the @import statement, a PkgRequiredRerunNeeded singleton will be returned.

Example

@import pkg
 pkg.dothing(...)
 # Alternative form
 @import pkg: dothing
dothing(...)
source

Example

module DataToolkitExample
 
 using DataToolkitBase
 using DataFrame
@@ -30,4 +30,4 @@
     result
 end
 
end
diff --git a/dev/repl/index.html b/dev/repl/index.html
@@ -13,7 +13,7 @@
data> help cmd subcmd  # Obtain the help for subcmd.
data> ?cmd subcmd      # Obtain the help for subcmd.
data> cmd help subcmd  # Obtain the help for subcmd.
data> cmd ?subcmd      # Obtain the help for subcmd.

Extending the Data REPL

Registering commands

To register a command, one simply needs to push a ReplCmd onto REPL_CMDS.

DataToolkitBase.REPL_CMDSConstant

The set of commands available directly in the Data REPL.

source
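
For instance, a minimal registration sketch (reusing the echo example from the ReplCmd docstring below):

push!(REPL_CMDS, ReplCmd(:echo, "print the argument", identity))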
DataToolkitBase.ReplCmdType

A command that can be used in the Data REPL (accessible through '}').

A ReplCmd must have a:

  • name, a symbol designating the command keyword.
  • trigger, a string used as the command trigger (defaults to String(name)).
  • description, a short overview of the functionality as a string or displayable object.
  • execute, either a list of sub-ReplCmds, or a function which will perform the command's action. The function must take a single argument, the rest of the command as an AbstractString (for example, 'cmd arg1 arg2' will call the execute function with "arg1 arg2").

Constructors

ReplCmd{name::Symbol}(trigger::String, description::Any, execute::Function)
 ReplCmd{name::Symbol}(description::Any, execute::Function)
 ReplCmd(name::Union{Symbol, String}, trigger::String, description::Any, execute::Function)
 ReplCmd(name::Union{Symbol, String}, description::Any, execute::Function)

Examples

ReplCmd(:echo, "print the argument", identity)
@@ -22,18 +22,18 @@
     [ReplCmd(:add, "a + b + ...", nums -> sum(parse.(Int, split(nums))))],
      ReplCmd(:mul, "a * b * ...", nums -> prod(parse.(Int, split(nums)))))

Methods

help(::ReplCmd) # -> print detailed help
 allcompletions(::ReplCmd) # -> list all candidates
completions(::ReplCmd, sofar::AbstractString) # -> list relevant candidates
source

Completion

As hinted by the ReplCmd docstring, completions can be provided by implementing completions(::ReplCmd{:CMD_ID}, sofar::AbstractString) or allcompletions.

DataToolkitBase.completionsFunction
completions(r::ReplCmd, sofar::AbstractString)

Obtain a list of String completion candidates based on sofar. All candidates should begin with sofar.

Should this function not be implemented for the specific ReplCmd r, allcompletions(r) will be called and filtered down to the candidates that begin with sofar.

If r has subcommands, then the subcommand prefix will be removed and completions re-called on the relevant subcommand.

source
DataToolkitBase.allcompletionsFunction
allcompletions(r::ReplCmd)

Obtain all possible String completion candidates for r. This defaults to the empty vector String[].

allcompletions is only called when completions(r, sofar::AbstractString) is not implemented.

source
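
As a hedged sketch (the :cowsay command matches the example at the end of this page; the candidate list is illustrative), fixed completion candidates can be provided like so:

DataToolkitBase.allcompletions(::ReplCmd{:cowsay}) = ["moo", "fortune"]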

Helper functions

To create a pleasant user interface, a number of utility functions are provided.

DataToolkitBase.promptFunction
prompt(question::AbstractString, default::AbstractString="",
        allowempty::Bool=false, cleardefault::Bool=true,
        multiline::Bool=false)

Interactively ask question and return the response string, optionally with a default value. If multiline is true, RET must be pressed twice consecutively to submit a value.

Unless allowempty is set, an empty response is not accepted. If cleardefault is set, then an initial backspace will clear the default value.

The prompt supports the following line-edit-y keys:

  • left arrow
  • right arrow
  • home
  • end
  • delete forwards
  • delete backwards

Example

julia> prompt("What colour is the sky? ")
 What colour is the sky? Blue
-"Blue"
source
DataToolkitBase.prompt_charFunction
prompt_char(question::AbstractString, options::Vector{Char},
            default::Union{Char, Nothing}=nothing)

Interactively ask question, only accepting options keys as answers. All keys are converted to lower case on input. If default is not nothing and 'RET' is hit, then default will be returned.

Should '^C' be pressed, an InterruptException will be thrown.

source
DataToolkitBase.confirm_ynFunction
confirm_yn(question::AbstractString, default::Bool=false)

Interactively ask question and accept y/Y/n/N as the response. If any other key is pressed, then default will be taken as the response. A " [y/n]: " string will be appended to the question, with y/n capitalised to indicate the default value.

Example

julia> confirm_yn("Do you like chocolate?", true)
+"Blue"
source
DataToolkitBase.prompt_charFunction
prompt_char(question::AbstractString, options::Vector{Char},
+            default::Union{Char, Nothing}=nothing)

Interactively ask question, only accepting options keys as answers. All keys are converted to lower case on input. If default is not nothing and 'RET' is hit, then default will be returned.

Should '^C' be pressed, an InterruptException will be thrown.

source
DataToolkitBase.confirm_ynFunction
confirm_yn(question::AbstractString, default::Bool=false)

Interactively ask question and accept y/Y/n/N as the response. If any other key is pressed, then default will be taken as the response. A " [y/n]: " string will be appended to the question, with y/n capitalised to indicate the default value.

Example

julia> confirm_yn("Do you like chocolate?", true)
 Do you like chocolate? [Y/n]: y
true
source
DataToolkitBase.peelwordFunction
peelword(input::AbstractString)

Read the next 'word' from input. If input starts with a quote, this is the unescaped text between the opening and closing quote. Otherwise, this is simply the next word.

Returns a tuple of the form (word, rest).

Example

julia> peelword("one two")
 ("one", "two")
 
 julia> peelword("\"one two\" three")
("one two", "three")
source

Simple example

In the below example we will extend the Data REPL by adding a command cowsay, which simply calls the (assumed to be installed) system cowsay executable.

function cowsay_repl(input::AbstractString)
+("one two", "three")
source

Simple example

In the below example we will extend the Data REPL by adding a command cowsay which simply call the (assumed to be installed) system cowsay executable.

function cowsay_repl(input::AbstractString)
     if isempty(input)
         confirm_yn("Are you ready to hear your fortune?", true) &&
             cowsay_repl(read(`fortune`, String))
@@ -76,4 +76,4 @@
             (__)\       )\/\
                 ||----w |
                 ||     ||

diff --git a/dev/usage/index.html b/dev/usage/index.html
@@ -1,5 +1,5 @@
Usage · DataToolkitBase.jl

Usage

Identifying a dataset

Reading datasets

Base.readFunction
read(filename::AbstractString, DataCollection; writer::Union{Function, Nothing})

Read the entire contents of a file as a DataCollection.

The default value of writer is self -> write(filename, self).

source
read(io::IO, DataCollection; path::Union{String, Nothing}=nothing, mod::Module=Base.Main)

Read the entirety of io, as a DataCollection.

source
read(dataset::DataSet, as::Type)
 read(dataset::DataSet) # as default type

Obtain information from dataset in the form of as, with the appropriate loader and storage provider automatically determined.

This executes this component of the overall data flow:

                 ╭────loader─────╮
                  ╵               ▼
 Storage ◀────▶ Data          Information

The loader and storage provider are selected by identifying the highest priority loader that can be satisfied by a storage provider. What this looks like in practice is illustrated in the diagram below.

      read(dataset, Matrix) ⟶ ::Matrix ◀╮
@@ -15,6 +15,6 @@
   ─ the load path used
   ┄ an option not taken
 
TODO explain further
source
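
As a brief hedged sketch (this assumes a loaded collection with an "iris" data set, a DataFrame-capable loader, and that the exported dataset function resolves the name, as in the error examples):

using DataFrames

iris = read(dataset("iris"), DataFrame)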

Writing datasets

Base.writeFunction
write(dataset::DataSet, info::Any)

TODO write docstring

source

Accessing the raw data

Base.openFunction
open(dataset::DataSet, as::Type; write::Bool=false)

Obtain the data of dataset in the form of as, with the appropriate storage provider automatically selected.

A write flag is also provided, to help the driver pick a more appropriate form of as.

This executes this component of the overall data flow:

                 ╭────loader─────╮
                  ╵               ▼
Storage ◀────▶ Data          Information
source
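
For example (a hedged sketch, assuming an "iris" data set whose storage driver can provide an IO handle):

io = open(dataset("iris"), IO)
read(io, String)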