How To: DKB instance from scratch
This page provides information on the development and establishment of a new Data Knowledge Base instance for a non-ATLAS use case -- from the analysis of the use-case specifics to recommendations on technical solutions.
- Metadata are Big Data in terms of the 3Vs:
- Volume, Velocity (original storages are optimized for daily routine operations, not for analytical tasks);
- Variety (metadata are spread across multiple sources);
- Analytical tasks requirements cannot be fulfilled with the original infrastructure (or require new development for every new task).
- Develop the information model: concepts, relationships, attributes.
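For illustration only, the information model can first be captured as plain data structures, independent of any storage technology; the concepts, attributes and relationship below are hypothetical placeholders (a similar hypothetical model is reused in the sketches further down this page):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical concepts for illustration; the real concepts, relationships
# and attributes must come from the use-case analysis.

@dataclass
class Dataset:
    name: str                    # primary key
    data_format: str             # attribute
    n_events: int = 0            # attribute


@dataclass
class Task:
    task_id: str                 # primary key
    status: str                  # attribute (also usable as an update marker)
    end_time: str                # attribute (ISO 8601 timestamp)
    # one-to-many relationship: a task produces several datasets
    output_datasets: List[Dataset] = field(default_factory=list)
```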
- Analyze metadata usage scenarios (in terms of the model; example queries for two of the scenarios are sketched after this list):
- object(s) selection:
- lookup by primary key;
- search by pre-defined (set of) attributes;
- search by arbitrary (set of) attributes;
- search by links between objects (one-to-one, one-to-many, many-to-many);
- attribute(s) values aggregation over selected objects;
- time series aggregation;
- (aggregated) time series analysis;
- ...
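To make the scenarios more concrete, here is a sketch of how two of them (lookup by primary key and time series aggregation) might look as queries, assuming purely for illustration that an Elasticsearch-like document store is chosen later on (as in the ATLAS DKB); the endpoint, index and field names are hypothetical:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Scenario: lookup by primary key -- fetch a single document by its ID.
task = requests.get(f"{ES}/tasks/_doc/task_123").json()

# Scenario: time series aggregation -- number of events per day for finished tasks.
query = {
    "size": 0,
    "query": {"term": {"status": "done"}},
    "aggs": {
        "per_day": {
            # "calendar_interval" is ES >= 7.x syntax ("interval" in older versions)
            "date_histogram": {"field": "end_time", "calendar_interval": "1d"},
            "aggs": {"events": {"sum": {"field": "n_events"}}},
        }
    },
}
resp = requests.post(f"{ES}/tasks/_search", json=query).json()
for bucket in resp["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["events"]["value"])
```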
- Estimate metadata volumes.
- Choose metadata storage/query technology(-ies).
- Develop the storage/indexing schema(s):
- denormalizing the information model with respect to the expected usage scenarios.
- Install.
- Configure (load schema(s), etc.).
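Under the same illustrative assumptions (Elasticsearch-like store, hypothetical index and field names), "load schema(s)" may amount to creating an index with a denormalized mapping, where related objects are embedded into the main document so that the common usage scenarios need no joins:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Denormalized "tasks" index: output dataset attributes are embedded
# into the task document (ES >= 7.x mapping, no document types).
mapping = {
    "mappings": {
        "properties": {
            "task_id":  {"type": "keyword"},
            "status":   {"type": "keyword"},
            "end_time": {"type": "date"},
            "n_events": {"type": "long"},
            "output_datasets": {
                "type": "nested",
                "properties": {
                    "name":        {"type": "keyword"},
                    "data_format": {"type": "keyword"},
                },
            },
        }
    }
}
requests.put(f"{ES}/tasks", json=mapping).raise_for_status()
```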
- Analyze metadata sources.
- Identify primary source(s) and their relationships.
- Develop the ETL process(es)' scheme (steps definition):
- (E)xtraction of new/updated records from the primary source;
- (T)ransformations:
- extraction of related information from a secondary source;
- calculation of derived values and surrogate keys generation;
- format conversion;
- (L)oad to the final storage.
- Install the pyDKB library (see instructions).
- Implement ETL steps (independently, as standalone programs sharing only the input/output data format; a minimal stage sketch is given after this list):
- for E-stage: see example (Oracle Connector);
- for T-stage: see pyDKB quickstart guide;
- for L-stage: see example (ES data load).
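Below is a minimal sketch of such a standalone stage written without the pyDKB helpers (the pyDKB quickstart guide shows the library-based way): it reads newline-delimited JSON records from stdin and writes transformed records to stdout, so the only contract shared between the steps is the data format; the field names are hypothetical.

```python
#!/usr/bin/env python
"""Minimal T-stage sketch: add a derived attribute to every input record."""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Derived value: flag tasks that produced no events.
    record["is_empty"] = record.get("n_events", 0) == 0
    sys.stdout.write(json.dumps(record) + "\n")
```

Stages written this way can be chained with an ordinary shell pipeline (e.g. `extract.py | transform.py | load.py`), which is what the next item refers to.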
- Chain steps (according to the ETL process(es) scheme(s)):
- for shell pipeline scenario: see example (ETL process start scenario).
- Schedule ETL processes execution.
- Develop ET[L] process(es) for consistency checks:
- steps:
- (E)xtraction of identifiers and (minimal set of) update marker attribute(s) (timestamp, status, ...) of new/updated items:
- make sure it can skip items updated after the latest run of the main ETL process (such items are not expected to be in the final storage yet);
- (T)ransformation:
- extension with values from the final storage;
- consistency check (filter for items with inconsistent values; a sketch of this check is given after the list);
- (L)oad to the administrator notification system (can be omitted to rely on the cron daemon's e-mail notifications instead);
- chain:
- for a shell pipeline scenario relying on cron daemon notifications: see example (consistency check scenario).
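A sketch of the core of the consistency check (the filtering T-step above), with hypothetical inputs: given update markers taken from the primary source and from the final storage, it reports the inconsistent items; in the simplest setup, anything printed to stdout is delivered to the administrator by the cron daemon's e-mail notification.

```python
def find_inconsistent(source_markers, stored_markers):
    """Yield IDs whose update marker in the final storage differs from the source.

    Both arguments are dicts {item_id: update_marker} (e.g. a timestamp or status),
    taken from the primary source and from the final storage respectively.
    """
    for item_id, marker in source_markers.items():
        if stored_markers.get(item_id) != marker:
            yield item_id


if __name__ == "__main__":
    # Hypothetical example input; in a real process these values come from
    # the E-step and from the final storage.
    source = {"task_1": "2020-10-01T12:00:00", "task_2": "2020-10-02T09:30:00"}
    stored = {"task_1": "2020-10-01T12:00:00"}  # task_2 is missing
    for item_id in find_inconsistent(source, stored):
        print(item_id)  # non-empty output -> cron mails it to the administrator
```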
- Schedule checks execution.
- Wrap common metadata usage scenarios into a set of parametric requests.
- Implement the requests (e.g. as REST API server methods).
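A minimal sketch of wrapping one parametric request as a REST method; Flask and the hypothetical Elasticsearch index from the sketches above are assumptions made for illustration, not requirements:

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)
ES = "http://localhost:9200"  # assumed Elasticsearch endpoint


@app.route("/task/<task_id>")
def get_task(task_id):
    """Parametric request: lookup of a single task by its primary key."""
    resp = requests.get(f"{ES}/tasks/_doc/{task_id}")
    if resp.status_code == 404:
        return jsonify({"error": "not found"}), 404
    return jsonify(resp.json()["_source"])


if __name__ == "__main__":
    app.run(port=5080)
```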
- Add a Web GUI, if required (most likely there already are some GUIs that can be extended with the new functionality, or that can use it to improve the performance of existing pages).