How To: DKB instance from scratch
This page provides information on the development and establishment of a new Data Knowledge Base instance for a non-ATLAS use case -- from the analysis of the use-case specifics to recommendations on technical solutions.
- Metadata are Big Data in terms of the 3Vs:
- Volume, Velocity (original storages are optimized for daily routine operations, not for analytical tasks);
- Variety (metadata are spread across multiple sources);
- Analytical tasks requirements cannot be fulfilled with the original infrastructure (or require new development for every new task).
- Develop the information model: concepts, relationships, attributes.
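For illustration only, the information model can first be captured as plain data structures, independent of any storage technology; the concepts, attributes and relationship below are hypothetical placeholders (a similar hypothetical model is reused in the sketches further down this page):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical concepts for illustration; the real concepts, relationships
# and attributes must come from the use-case analysis.

@dataclass
class Dataset:
    name: str                    # primary key
    data_format: str             # attribute
    n_events: int = 0            # attribute


@dataclass
class Task:
    task_id: str                 # primary key
    status: str                  # attribute (also usable as an update marker)
    end_time: str                # attribute (ISO 8601 timestamp)
    # one-to-many relationship: a task produces several datasets
    output_datasets: List[Dataset] = field(default_factory=list)
```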
- Analyze metadata usage scenarios (in terms of the model; example queries for two of the scenarios are sketched after this list):
- object(s) selection:
- lookup by primary key;
- search by pre-defined (set of) attributes;
- search by arbitrary (set of) attributes;
- search by links between objects (one-to-one, one-to-many, many-to-many);
- attribute(s) values aggregation over selected objects;
- time series aggregation;
- (aggregated) time series analysis;
- ...
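To make the scenarios more concrete, here is a sketch of how two of them (lookup by primary key and time series aggregation) might look as queries, assuming purely for illustration that an Elasticsearch-like document store is chosen later on (as in the ATLAS DKB); the endpoint, index and field names are hypothetical:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Scenario: lookup by primary key -- fetch a single document by its ID.
task = requests.get(f"{ES}/tasks/_doc/task_123").json()

# Scenario: time series aggregation -- number of events per day for finished tasks.
query = {
    "size": 0,
    "query": {"term": {"status": "done"}},
    "aggs": {
        "per_day": {
            # "calendar_interval" is ES >= 7.x syntax ("interval" in older versions)
            "date_histogram": {"field": "end_time", "calendar_interval": "1d"},
            "aggs": {"events": {"sum": {"field": "n_events"}}},
        }
    },
}
resp = requests.post(f"{ES}/tasks/_search", json=query).json()
for bucket in resp["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["events"]["value"])
```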
- Estimate metadata volumes.
- Choose metadata storage/query technology(-ies).
- Develop the storage/indexing schema(s):
- denormalizing the information model with respect to the expected usage scenarios.
- Install.
- Configure (load schema(s), etc.).
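Under the same illustrative assumptions (Elasticsearch-like store, hypothetical index and field names), "load schema(s)" may amount to creating an index with a denormalized mapping, where related objects are embedded into the main document so that the common usage scenarios need no joins:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Denormalized "tasks" index: output dataset attributes are embedded
# into the task document (ES >= 7.x mapping, no document types).
mapping = {
    "mappings": {
        "properties": {
            "task_id":  {"type": "keyword"},
            "status":   {"type": "keyword"},
            "end_time": {"type": "date"},
            "n_events": {"type": "long"},
            "output_datasets": {
                "type": "nested",
                "properties": {
                    "name":        {"type": "keyword"},
                    "data_format": {"type": "keyword"},
                },
            },
        }
    }
}
requests.put(f"{ES}/tasks", json=mapping).raise_for_status()
```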
- Analyze metadata sources.
- Identify primary source(s) and their relationships.
- Develop the ETL process(es)' scheme (steps definition):
- (E)xtraction of new/updated records from the primary source;
- (T)ransformations:
- extraction of related information from a secondary source;
- calculation of derived values and surrogate keys generation;
- format conversion;
- (L)oad to the final storage.
- Install the pyDKB library (see instructions).
- Implement ETL steps (independently, as standalone programs sharing only the input/output data format; a minimal stage sketch is given after this list):
- for E-stage: see example (Oracle Connector);
- for T-stage: see pyDKB quickstart guide;
- for L-stage: see example (ES data load).
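Below is a minimal sketch of such a standalone stage written without the pyDKB helpers (the pyDKB quickstart guide shows the library-based way): it reads newline-delimited JSON records from stdin and writes transformed records to stdout, so the only contract shared between the steps is the data format; the field names are hypothetical.

```python
#!/usr/bin/env python
"""Minimal T-stage sketch: add a derived attribute to every input record."""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Derived value: flag tasks that produced no events.
    record["is_empty"] = record.get("n_events", 0) == 0
    sys.stdout.write(json.dumps(record) + "\n")
```

Stages written this way can be chained with an ordinary shell pipeline (e.g. `extract.py | transform.py | load.py`), which is what the next item refers to.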
- Chain steps (according to the ETL process(es) scheme(s)):
- for shell pipeline scenario: see example (ETL process start scenario).
- Schedule ETL processes execution.
- Develop ET[L] process(es) for consistency checks:
- steps:
- (E)xtraction of identifiers and (minimal set of) update marker attribute(s) (timestamp, status, ...) of new/updated items:
- make sure it can skip items updated after the latest run of the main ETL process (such items are not expected to be in the final storage yet);
- (T)ransformation:
- extension with values from the final storage;
- consistency check (filter for items with inconsistent values; a sketch of this check is given after the list);
- (L)oad to the administrator notification system (can be omitted to rely on the cron daemon's e-mail notifications instead);
- chain:
- for a shell pipeline scenario relying on cron daemon notifications: see example (consistency check scenario).
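A sketch of the core of the consistency check (the filtering T-step above), with hypothetical inputs: given update markers taken from the primary source and from the final storage, it reports the inconsistent items; in the simplest setup, anything printed to stdout is delivered to the administrator by the cron daemon's e-mail notification.

```python
def find_inconsistent(source_markers, stored_markers):
    """Yield IDs whose update marker in the final storage differs from the source.

    Both arguments are dicts {item_id: update_marker} (e.g. a timestamp or status),
    taken from the primary source and from the final storage respectively.
    """
    for item_id, marker in source_markers.items():
        if stored_markers.get(item_id) != marker:
            yield item_id


if __name__ == "__main__":
    # Hypothetical example input; in a real process these values come from
    # the E-step and from the final storage.
    source = {"task_1": "2020-10-01T12:00:00", "task_2": "2020-10-02T09:30:00"}
    stored = {"task_1": "2020-10-01T12:00:00"}  # task_2 is missing
    for item_id in find_inconsistent(source, stored):
        print(item_id)  # non-empty output -> cron mails it to the administrator
```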
- Schedule checks execution.
- Wrap common metadata usage scenarios into a set of parametric requests.
- Implement the requests (e.g. as REST API server methods).
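A minimal sketch of wrapping one parametric request as a REST method; Flask and the hypothetical Elasticsearch index from the sketches above are assumptions made for illustration, not requirements:

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)
ES = "http://localhost:9200"  # assumed Elasticsearch endpoint


@app.route("/task/<task_id>")
def get_task(task_id):
    """Parametric request: lookup of a single task by its primary key."""
    resp = requests.get(f"{ES}/tasks/_doc/{task_id}")
    if resp.status_code == 404:
        return jsonify({"error": "not found"}), 404
    return jsonify(resp.json()["_source"])


if __name__ == "__main__":
    app.run(port=5080)
```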
- Add a Web GUI, if required (most likely there already are some GUIs that can be extended with the new functionality, or that can use it to improve the performance of existing pages).