Data model

Dávid Benko edited this page Aug 29, 2022 · 4 revisions

DP³ data model

The basic elements of the DP³ data model are entities (or objects). Each entity record (object instance) has a set of attributes. Each attribute has a value (associated with a particular entity), optionally accompanied by a timestamp (so that a history of previous values can be stored) and a confidence value.

There can also be relations between entities. A relation can have attributes of its own.

TODO scheme

TODO make clear difference between entity type (object class) and entity (object instance), etc.

TODO example

Attributes

There are three main types of attributes supported by DP³, each handled quite differently:

  • Plain attributes

    • Common attributes with only one value of some data type.
    • No history is stored.
    • Confidence can be stored optionally.
  • Observations

    • A history of attribute values is stored as tuples containing the value and observation time (or time interval), optionally with confidence estimation.
    • A mechanism to derive the most probable value (and its confidence) of the attribute at any given time is provided.
    • These attributes may be single-value or multi-value.
      • TODO: describe multi-value
  • Timeseries

    • Regular or irregular timeseries, i.e. a sequence of timestamped numerical data.
    • Multiple values per time instant are supported (multivariate timeseries).
    • Types of timeseries:
      • regular - regularly-sampled timeseries, i.e. time is divided into intervals of a fixed length and exactly one value (or one set of values) is assigned to each interval. For example, a temperature measured every 5 minutes. If no data are received for an interval, it's filled with N/A (nan). (TODO make it configurable, zero or nan?)
      • irregular - irregularly-sampled timeseries, i.e. a timestamp is explicitly attached to each value (or set of values) and the gaps between these timestamps are generally not equal.
      • irregular_intervals - same as irregular, but an interval (two timestamps) is attached to each value instead of a single timestamp. The intervals may overlap.

Configuration

Each attribute is specified by the following set of parameters:

| param | for types | data-type | default value | description |
|---|---|---|---|---|
| id | all | string (identifier) | (mandatory) | Short string identifying the attribute, its machine name (must match the regex [a-zA-Z_][a-zA-Z0-9_-]*; most importantly, it can't contain a dot). Lower-case only is recommended. TODO: maybe allow some special symbols as prefixes? |
| type | all | string | (mandatory) | Type of attribute. One of plain, observations or timeseries. |
| name | all | string | same as id | Attribute name for humans. |
| description | all | string | "" | Longer description of the attribute, if needed. |
| color | all | #xxxxxx | None | Color to use in the GUI (useful mostly for tag values); not used currently. |
| data_type | plain/observations | string, one of the types below | (mandatory) | Data type of the attribute value; see below for the list of supported data types. |
| categories | plain/observations | array of strings | None | List of categories if data_type=category and the set of possible values is known in advance and should be enforced. If not specified, any string can be stored as the attribute value, but only a small number of unique values is expected (which is important for display/search in the GUI, for example). |
| confidence | plain/observations | bool | false | Whether a confidence value should be stored along with the data value. |
| multi_value | observations | bool | false | Whether multiple values can be set at the same time (can be enabled for all data types except "tag" and "binary"). |
| history_params | observations | object, see below | (mandatory) | History and time aggregation parameters. A subobject with fields described in the table below. |
| history_force_graph | observations | bool | false | By default, if the data type of the attribute is an array, its history is shown in the web interface as a table. This option forces a tag-like graph with the comma-joined values of that array as tags. |
| editable | plain/observations | bool | false | Whether the value of this attribute is editable via the web interface. |
| timeseries_type | timeseries | string | (mandatory) | One of: regular, irregular or irregular_intervals. |
| timeseries_params | timeseries | object, see below | None | History parameters for timeseries. A subobject with fields described in the table below. |
| series | timeseries | object of objects, see below | (mandatory) | Configuration of the series of data represented by this timeseries. |
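
The mandatory parameters and the id regex from the table above can be checked mechanically. The following is only a minimal sketch, assuming the attribute specification has already been loaded into a Python dict; the function `check_attr_spec` and the example attribute `device_type` are hypothetical and not part of DP³.

```python
import re

# Regex for attribute ids, quoted from the configuration table.
ATTR_ID_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")

# Parameters marked "(mandatory)" in the table, grouped by attribute type.
MANDATORY = {
    "plain": {"id", "type", "data_type"},
    "observations": {"id", "type", "data_type", "history_params"},
    "timeseries": {"id", "type", "timeseries_type", "series"},
}


def check_attr_spec(spec: dict) -> list[str]:
    """Return a list of problems found in an attribute spec (hypothetical helper)."""
    attr_type = spec.get("type")
    if attr_type not in MANDATORY:
        return [f"unknown or missing type: {attr_type!r}"]
    missing = MANDATORY[attr_type] - set(spec)
    problems = [f"missing mandatory parameter: {name}" for name in sorted(missing)]
    attr_id = spec.get("id")
    # fullmatch() is used so the whole id must match, which also rejects dots.
    if isinstance(attr_id, str) and not ATTR_ID_RE.fullmatch(attr_id):
        problems.append(f"invalid id: {attr_id!r}")
    return problems


# Hypothetical plain attribute with an enforced set of categories.
device_type = {
    "id": "device_type",
    "type": "plain",
    "data_type": "category",
    "categories": ["router", "server", "workstation"],
}
print(check_attr_spec(device_type))
```

A spec whose id contains a dot, or a timeseries spec without `series`, would be reported by this check.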

History params

| param | type/format | default value | description |
|---|---|---|---|
| max_age | <int><s/m/h/d> (e.g. 30s, 12h, 7d) | None | How many seconds/minutes/hours/days of history to keep (older data-points/intervals are removed). |
| max_items | int (>0) | None | How many data-points/intervals to store (the oldest ones are removed when the limit is exceeded). |
| expire_time | <int><s/m/h/d> or inf | inf | How long after the end time (t2) the last value is still considered valid (i.e. is used as the "current value"). Zero (0) means to strictly follow t1,t2. Zero can be specified without a unit (s/m/h/d). |

Note: At least one of max_age and max_items SHOULD be defined, otherwise the amount of stored data can grow unbounded.
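
To illustrate the `<int><s/m/h/d>` format used by max_age and expire_time, here is a small parsing sketch. It is an assumption about how such strings could be interpreted, not the parser DP³ uses; the function name `parse_duration` is hypothetical. Per the table, `inf` and a bare `0` are only meaningful for expire_time.

```python
import math
import re

# Seconds per unit suffix in the <int><s/m/h/d> format.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION_RE = re.compile(r"(\d+)([smhd])")


def parse_duration(text: str) -> float:
    """Convert e.g. '30s', '12h', '7d', '0' or 'inf' to seconds (hypothetical helper)."""
    text = text.strip()
    if text == "inf":
        return math.inf
    if text == "0":  # zero may be given without a unit
        return 0.0
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    amount, unit = match.groups()
    return float(int(amount) * _UNIT_SECONDS[unit])


for example in ("30s", "12h", "7d", "0", "inf"):
    print(example, parse_duration(example))
```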

Timeseries params

| param | type/format | default value | description |
|---|---|---|---|
| max_age | <int><s/m/h/d> (e.g. 30s, 12h, 7d) | None | How many seconds/minutes/hours/days of history to keep (older data-points/intervals are removed). |

Note: max_age SHOULD be defined, otherwise the amount of stored data can grow unbounded.

Series

The key of each entry in the series object is its id, a short string identifying the series (e.g. bytes, temperature, parcels).

| param | type/format | default value | description |
|---|---|---|---|
| type | string | (mandatory) | Data type of the series. Only int and float are allowed (also time, but that is used internally, see below). |

A time sub-series (axis) is added implicitly by DP³; which one is added depends on the selected timeseries_type:

  • regular: "time": { "data_type": "time" }
  • irregular: "time": { "data_type": "time" }
  • irregular_intervals: "time_first": { "data_type": "time" }, "time_last": { "data_type": "time" }
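
The implicit time sub-series can be pictured as a lookup keyed by timeseries_type (regular, irregular, irregular_intervals). The sketch below only illustrates that idea; the function `with_implicit_time` is hypothetical, and rejecting user-defined series that reuse a reserved name is an assumption, not documented DP³ behaviour.

```python
# Implicit time sub-series per timeseries_type, copied from the list above.
IMPLICIT_TIME_SERIES = {
    "regular": {"time": {"data_type": "time"}},
    "irregular": {"time": {"data_type": "time"}},
    "irregular_intervals": {
        "time_first": {"data_type": "time"},
        "time_last": {"data_type": "time"},
    },
}


def with_implicit_time(timeseries_type: str, series: dict) -> dict:
    """Return the configured series extended with the implicit time sub-series."""
    implicit = IMPLICIT_TIME_SERIES[timeseries_type]
    clash = implicit.keys() & series.keys()
    if clash:
        # Assumption: reserved names may not be redefined by the user.
        raise ValueError(f"series ids reserved for time: {sorted(clash)}")
    return {**implicit, **series}


# Hypothetical user-defined series with a single integer sub-series.
print(with_implicit_time("irregular_intervals", {"bytes": {"type": "int"}}))
```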

Data ingestion (datapoint API)

Data-points

All data are written to DP³ in the form of data-points. A data-point sets the value of a given attribute of a given entity. It is a JSON-encoded object with the set of keys defined in the table below. The presence of some keys depends on the primary type of the attribute (plain/observations/timeseries).

| key | description | data-type | required? | plain | observations | timeseries |
|---|---|---|---|---|---|---|
| type | Entity type | string | mandatory | | | |
| id | Entity identification | string | mandatory | | | |
| attr | Attribute name | string | mandatory | | | |
| v | The value to set, depends on attr. type and data-type, see below | -- | mandatory | | | |
| t1 | Start time of the observation interval | string (RFC 3339 format) | mandatory | -- | | |
| t2 | End time of the observation interval | string (RFC 3339 format) | optional, default=t1 | -- | | |
| c | Confidence | float (0.0-1.0) | optional, default=1.0 | | | |
| src | Identification of the information source | string | optional, default="" | | | |
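
The defaults listed in the table (t2 defaults to t1, c to 1.0, src to an empty string) can be applied before a data-point is serialized. This is a minimal sketch under the assumption that data-points are handled as Python dicts on the client side; `normalize_datapoint` is a hypothetical helper, not a DP³ function.

```python
import json

# Keys marked "mandatory" for every attribute type in the table.
MANDATORY_KEYS = ("type", "id", "attr", "v")


def normalize_datapoint(dp: dict) -> dict:
    """Check mandatory keys and fill in the documented defaults (hypothetical helper)."""
    missing = [key for key in MANDATORY_KEYS if key not in dp]
    if missing:
        raise ValueError(f"missing mandatory keys: {missing}")
    out = dict(dp)  # do not modify the caller's dict
    out.setdefault("c", 1.0)
    out.setdefault("src", "")
    if "t1" in out:
        out.setdefault("t2", out["t1"])  # t2 defaults to t1
    if not 0.0 <= out["c"] <= 1.0:
        raise ValueError(f"confidence out of range: {out['c']}")
    return out


dp = normalize_datapoint({
    "type": "ip",
    "id": "192.168.0.1",
    "attr": "open_ports",
    "v": [22, 80, 443],
    "t1": "2022-08-01T12:00:00",
})
print(json.dumps(dp, indent=2))
```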

Further details depend on the particular type of the attribute:

Plain

TODO

Example:

{
  "type": "ip",
  "id": "192.168.0.1",
  "attr": "note",
  "v": "My home router",
  "src": "web_gui"
}

Observations

TODO (existing data-points)

Example:

{
  "type": "ip",
  "id": "192.168.0.1",
  "attr": "open_ports",
  "v": [22, 80, 443],
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:10:00",
  "src": "open_ports_module"
}

Timeseries

Timeseries are sent to DP³ in "chunks", short timeseries that can later be joined together. Each chunk bears value(s) for one or more time instants.

The time-series datapoint looks like the other ones, but its value (v) is an object (dictionary) whose values are arrays containing values of sub-series.

All arrays must have the same length.

t1 and t2 of the data-point should specify the observation period covered by this chunk. All times within v must lie between t1 and t2.

In case of irregular (or irregular_intervals) timeseries, there are implicit time (irregular) or time_first and time_last (irregular_intervals) sub-series to store time information.

In regular time-series, time is not passed explicitly. The first value of each sub-series is the value of the interval starting at t1, the second is that of the next interval (t1 + time_step), etc. If t2 is given, it must equal t1 + n*time_step, where n is the number of items in the sub-series (t2 can be omitted, in which case it is computed automatically).

For regular timeseries, the intervals of individual chunks must not overlap. Any gaps between intervals will be filled by "N/A" values (or zeros, depending on configuration - TODO).
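
The rule that chunks must not overlap and that gaps are filled can be sketched as follows. This only illustrates the described behaviour for a single sub-series, assuming NaN as the fill value and chunks aligned to time_step; `join_regular_chunks` is a hypothetical helper, not DP³ code.

```python
import math
from datetime import datetime, timedelta


def join_regular_chunks(chunks, time_step: timedelta) -> list:
    """Join (t1, values) chunks of one sub-series, filling gaps with NaN."""
    joined = []
    cursor = None  # end of the interval covered so far
    for t1, values in sorted(chunks, key=lambda chunk: chunk[0]):
        if cursor is not None:
            if t1 < cursor:
                raise ValueError(f"chunk starting at {t1} overlaps previous data")
            # Number of whole missing intervals between the two chunks.
            gap = (t1 - cursor) // time_step
            joined.extend([math.nan] * gap)
        joined.extend(values)
        cursor = t1 + len(values) * time_step
    return joined


step = timedelta(minutes=5)
chunks = [
    (datetime(2022, 8, 1, 12, 0), [1, 3]),   # covers 12:00-12:10
    (datetime(2022, 8, 1, 12, 20), [2]),     # covers 12:20-12:25
]
print(join_regular_chunks(chunks, step))
```

Here the two intervals between 12:10 and 12:20 have no data, so two NaN values are inserted between the chunks.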

Example of regularly sampled timeseries:

{
  ...
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:20:00", // assuming time_step = 5 min
  "v": {
    "a": [1, 3, 0, 2]
  }
}
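
The relation t2 = t1 + n*time_step from the text can be checked with a few lines of Python. This is a sketch under the assumption that time_step is known to the client (it is 5 minutes in the example above); `regular_chunk_times` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def regular_chunk_times(t1: str, n_items: int, time_step: timedelta):
    """Return the start time of each interval and the resulting t2 of a regular chunk."""
    start = datetime.fromisoformat(t1)
    starts = [start + i * time_step for i in range(n_items)]
    t2 = start + n_items * time_step
    return starts, t2


values = [1, 3, 0, 2]  # sub-series "a" from the example above
starts, t2 = regular_chunk_times("2022-08-01T12:00:00", len(values), timedelta(minutes=5))
for interval_start, value in zip(starts, values):
    print(interval_start.isoformat(), value)
print("t2 =", t2.isoformat())
```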

In irregular time-series, timestamps must always be present.

Example of irregular timeseries:

{
  ...
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:05:00",
  "v": {
    "time": ["2022-08-01T12:00:00", "2022-08-01T12:01:10", "2022-08-01T12:01:15", "2022-08-01T12:03:30"],
    "x": [0.5, 0.8, 1.2, 0.7],
    "y": [-1, 3, 0, 0]
  }
}
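
The two constraints on an irregular chunk (all arrays have the same length, all times lie between t1 and t2) can be verified before sending. The sketch below assumes that "between" includes both endpoints, which the text does not state; `check_irregular_chunk` is a hypothetical helper.

```python
from datetime import datetime


def check_irregular_chunk(dp: dict) -> list[str]:
    """Return a list of problems found in an irregular timeseries data-point."""
    problems = []
    series = dp["v"]
    lengths = {name: len(values) for name, values in series.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"sub-series lengths differ: {lengths}")
    t1 = datetime.fromisoformat(dp["t1"])
    t2 = datetime.fromisoformat(dp["t2"])
    for stamp in series.get("time", []):
        # Assumption: the bounds t1 and t2 are inclusive.
        if not t1 <= datetime.fromisoformat(stamp) <= t2:
            problems.append(f"timestamp outside [t1, t2]: {stamp}")
    return problems


chunk = {
    "t1": "2022-08-01T12:00:00",
    "t2": "2022-08-01T12:05:00",
    "v": {
        "time": ["2022-08-01T12:00:00", "2022-08-01T12:01:10",
                 "2022-08-01T12:01:15", "2022-08-01T12:03:30"],
        "x": [0.5, 0.8, 1.2, 0.7],
        "y": [-1, 3, 0, 0],
    },
}
print(check_irregular_chunk(chunk))
```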