
Data model

Václav Bartoš edited this page Aug 19, 2022 · 4 revisions

DP³ data model

The basic elements of the DP³ data model are entities (or objects). Each entity record (object instance) has a set of attributes. Each attribute has a value (associated with a particular entity), optionally accompanied by a timestamp (so that a history of previous values can be stored) and a confidence value.

There can also be relations between entities. A relation can itself have attributes associated with it.

TODO scheme

TODO make clear difference between entity type (object class) and entity (object instance), etc.

TODO example

Attributes

There are three main types of attributes supported by DP³, each handled quite differently:

  • Plain attributes

    • Common attributes with only one value of some data type.
    • No history is stored.
    • Confidence can be stored optionally.
  • Observations

    • A history of attribute values is stored as tuples containing the value and observation time (or time interval), optionally with confidence estimation.
    • A mechanism to derive the most probable value (and its confidence) of the attribute at any given time is provided.
    • These attributes may be single-value or multi-value.
      • TODO: describe multi-value
  • Timeseries

    • Regular or irregular timeseries, i.e. a sequence of timestamped numerical data.
    • Multiple values per time instant are supported (multivariate timeseries).
    • Types of timeseries:
      • regular - regularly-sampled timeseries, i.e. time is divided into intervals of a fixed length and exactly one value (or one set of values) is assigned to each interval. For example, a temperature measured every 5 minutes. If no data are received for an interval, it's filled with N/A (nan). (TODO make it configurable, zero or nan?)
      • irregular - irregularly-sampled timeseries, i.e. a timestamp is explicitly attached to each value (or set of values) and the gaps between these timestamps are generally not equal.
      • irregular_intervals - same as irregular, but an interval (two timestamps) is attached to each value instead of a single timestamp. The intervals may overlap.

Configuration

TODO

Data ingestion (datapoint API)

Data-points

All data are written to DP³ in the form of data-points. A data-point sets the value of a given attribute of a given entity. It is a JSON-encoded object with the set of keys defined in the table below. Whether some keys are present depends on the primary type of the attribute (plain/observations/timeseries).

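| key | description | data-type | required? (plain / observations / timeseries) |
|-----|-------------|-----------|-----------------------------------------------|
| type | Entity type | string | mandatory |
| id | Entity identification | string | mandatory |
| attr | Attribute name | string | mandatory |
| v | The value to set; depends on the attribute type and data-type, see below | -- | mandatory |
| t1 | Start time of the observation interval | string (RFC 3339 format) | mandatory, or -- (not used), depending on the attribute type |
| t2 | End time of the observation interval | string (RFC 3339 format) | optional (default=t1), or -- (not used), depending on the attribute type |
| c | Confidence | float (0.0-1.0) | optional, default=1.0 |
| src | Identification of the information source | string | optional, default="" |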
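
The mandatory keys and defaults listed above can be checked on the client side before a data-point is sent. The following is only a minimal sketch: `normalize_datapoint` is a hypothetical helper, not part of DP³, and it assumes the data-point is handled as a plain Python dict.

```python
# Hypothetical helper, not part of DP³: checks the keys listed in the table
# and fills in the documented defaults.
MANDATORY_KEYS = ("type", "id", "attr", "v")
DEFAULTS = {"c": 1.0, "src": ""}


def normalize_datapoint(dp: dict) -> dict:
    """Return a copy of dp with defaults applied; raise ValueError if invalid."""
    missing = [k for k in MANDATORY_KEYS if k not in dp]
    if missing:
        raise ValueError(f"missing mandatory keys: {missing}")

    out = {**DEFAULTS, **dp}

    # Confidence must lie in the documented range 0.0-1.0.
    if not 0.0 <= out["c"] <= 1.0:
        raise ValueError("c must be between 0.0 and 1.0")

    # t2 defaults to t1 when only the start time is given.
    if "t1" in out:
        out.setdefault("t2", out["t1"])
    return out


dp = normalize_datapoint({
    "type": "ip",
    "id": "192.168.0.1",
    "attr": "note",
    "v": "My home router",
})
print(dp["c"], repr(dp["src"]))
```

The helper does not decide whether t1 is required for a given attribute; that depends on the attribute type, which is not known from the data-point alone.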

Further details depend on the particular type of the attribute:

Plain

TODO

Example:

{
  "type": "ip",
  "id": "192.168.0.1",
  "attr": "note",
  "v": "My home router",
  "src": "web_gui"
}

Observation

TODO (existing data-points)

Example:

{
  "type": "ip",
  "id": "192.168.0.1",
  "attr": "open_ports",
  "v": [22, 80, 443],
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:10:00",
  "src": "open_ports_module"
}

Timeseries

Timeseries are sent to DP³ in "chunks", short timeseries that can later be joined together. Each chunk bears value(s) for one or more time instants.

//The time-series datapoint looks like the other ones, but its value (v) is a 2D array - an array of arrays containing values of sub-series. In case of irregular (or irregular_intervals) timeseries, the first one (or two) arrays are the timestamps. All arrays must have the same length.

TODO: wouldn't it be better to have this as a dict of arrays, so that it is clear which array is which sub-series? Relying on the order alone seems too error-prone.

t1 and t2 of the data-point should specify the observation period covered by this chunk. All times within v must lie between t1 and t2.

In regular time-series, time is not passed explicitly. The first value of each sub-series is the value for the interval starting at t1, the second is for the next interval (starting at t1 + time_step), etc. If t2 is given, it must equal t1 + n*time_step, where n is the number of items in each sub-series (t2 can be omitted, in which case it is computed automatically).
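
The arithmetic behind this rule can be sketched as follows. `regular_chunk_times` is a hypothetical helper written for illustration, not a DP³ function; it assumes t1 is an ISO-formatted string and time_step is a `timedelta`.

```python
from datetime import datetime, timedelta


def regular_chunk_times(t1: str, n: int, time_step: timedelta):
    """Return (interval start times, t2) for a regular chunk of n values."""
    start = datetime.fromisoformat(t1)
    # The i-th value belongs to the interval starting at t1 + i*time_step.
    starts = [start + i * time_step for i in range(n)]
    # The chunk ends where the last interval ends: t1 + n*time_step.
    t2 = start + n * time_step
    return starts, t2


starts, t2 = regular_chunk_times("2022-08-01T12:00:00", 4, timedelta(minutes=5))
print([s.isoformat() for s in starts])
print(t2.isoformat())  # 2022-08-01T12:20:00
```

For four values and a 5-minute step starting at 12:00, the intervals start at 12:00, 12:05, 12:10 and 12:15, so t2 is 12:20.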

For regular timeseries, the intervals of individual chunks must not overlap. Any gaps between intervals will be filled by "N/A" values (or zeros, depending on configuration - TODO).
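
A possible way to join chunks under these two rules is sketched below. This is an illustration only, not the DP³ implementation: `join_regular_chunks` is a hypothetical name, chunks are simplified to `(t1, values)` pairs with a single sub-series, gaps are filled with nan, and every chunk is assumed to start on the time_step grid.

```python
import math
from datetime import datetime, timedelta


def join_regular_chunks(chunks, time_step: timedelta):
    """Join (t1, values) chunks into one series, filling gaps with nan.

    Raises ValueError if two chunks overlap.
    """
    joined = []
    next_start = None  # start of the first interval not yet covered
    for t1, values in sorted(chunks, key=lambda c: c[0]):
        if next_start is not None:
            if t1 < next_start:
                raise ValueError(f"chunk starting at {t1} overlaps the previous one")
            # Number of whole intervals missing between the two chunks.
            missing = (t1 - next_start) // time_step
            joined.extend([math.nan] * missing)
        joined.extend(values)
        next_start = t1 + len(values) * time_step
    return joined


step = timedelta(minutes=5)
chunks = [
    (datetime(2022, 8, 1, 12, 0), [1, 3, 0, 2]),  # covers 12:00-12:20
    (datetime(2022, 8, 1, 12, 30), [5, 4]),       # starts after a 10-minute gap
]
series = join_regular_chunks(chunks, step)
print(series)
```

The 10-minute gap between the two chunks corresponds to two missing intervals, so two nan values are placed between the values of the first and the second chunk.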

Example of regularly sampled timeseries:

{
  ...
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:20:00", // t1 + 4 * time_step, assuming time_step = 5 min
  "v": {
    "a": [1, 3, 0, 2]
  }
}

In irregular time-series, timestamps must always be passed explicitly, as in the example below.

Example of irregular timeseries:

{
  ...
  "t1": "2022-08-01T12:00:00",
  "t2": "2022-08-01T12:05:00",
  "v": {
    "time": ["2022-08-01T12:00:00", "2022-08-01T12:01:10", "2022-08-01T12:01:15", "2022-08-01T12:03:30"],
    "x": [0.5, 0.8, 1.2, 0.7],
    "y": [-1, 3, 0, 0]
  }
}
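
The constraints on an irregular chunk (all sub-series have the same length, and every timestamp lies between t1 and t2) can be checked before the data-point is sent. This is a minimal sketch: `check_irregular_chunk` is a hypothetical helper, and the "time" key is taken from the example above rather than from a confirmed specification.

```python
from datetime import datetime


def check_irregular_chunk(dp: dict) -> None:
    """Raise ValueError if the chunk breaks the rules described in this section."""
    t1 = datetime.fromisoformat(dp["t1"])
    t2 = datetime.fromisoformat(dp["t2"])
    series = dp["v"]

    # All arrays, including the timestamps, must have the same length.
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1:
        raise ValueError("all sub-series must have the same length")

    # Every timestamp must lie within the period covered by the chunk.
    for ts in series["time"]:
        if not t1 <= datetime.fromisoformat(ts) <= t2:
            raise ValueError(f"timestamp {ts} lies outside t1..t2")


chunk = {
    "t1": "2022-08-01T12:00:00",
    "t2": "2022-08-01T12:05:00",
    "v": {
        "time": ["2022-08-01T12:00:00", "2022-08-01T12:01:10",
                 "2022-08-01T12:01:15", "2022-08-01T12:03:30"],
        "x": [0.5, 0.8, 1.2, 0.7],
        "y": [-1, 3, 0, 0],
    },
}
check_irregular_chunk(chunk)  # passes silently for the example data
```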