ETL: glossary
Marina Golosova edited this page Oct 21, 2020
·
2 revisions
- Control symbol
- ASCII symbol used for flow control (see Internal communication protocol specification).
- Data flow
- Sequence of messages produced by the E-stage and passing through all T-stages to the L-stage.
- Dataflow topology
- Shape of the data flow from the main source to the final storage.
- ETL module, stage
- Logical unit of the ETL process, implementing a single operation on the data flow (extraction, transformation or load). Consists of a supervisor and a worker.
- ETL process, Dataflow
- Combination of independent modules, or stages, orchestrated into a dataflow topology connecting the main source and the final storage. Its purpose is to monitor the main source for new or updated data and push that data into the final storage in a transformed form, prepared for further use.
- E-, T-, L-stage
- ETL stage responsible for the corresponding ETL operation (extraction, transformation or load).
- Final storage
- Storage, indexing and access system for integrated representation of objects of interest, produced by the ETL process.
- Main (primary) source
- One of the original metadata sources, containing an update mark attribute for the object(s) of interest (preferably in a queryable form).
- Message
- Object of interest metadata in a pre-defined format (e.g. JSON), possibly containing service fields and ending with the control symbol EOM (end-of-message).
- Secondary source
- One of the original metadata sources, used to get additional information about the object of interest by the object's ID.
- Supervisor
- Stage component responsible for data flow control: reading from and writing to the flow, marking data in the flow as processed, etc. Can have the same implementation for all stages of the same type (E-, T- or L-type stages).
- Topology description
- Contains the list of ETL stages, their start instructions and the linking scheme (which stage's output is which stage's input).
- Worker
- Stage component responsible for case-specific operations on data: querying an external metadata source (main or secondary), transforming data, or pushing the processing results to the final storage. Usually has an individual implementation for each case-specific operation, but some workers can be reused in similar situations (e.g. for format conversion stages).
- Worker run instructions
- Instructions on how to run a given worker instance (e.g. a command line).
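
To make the relationship between message, supervisor and worker more concrete, here is a minimal Python sketch. It is an illustration only, not the project's implementation: the `EOM` value is a placeholder (the actual control symbols are defined in the Internal communication protocol specification), and all function names (`encode_message`, `decode_messages`, `supervise`, `add_title_length`) are hypothetical.

```python
import json
from typing import Callable, Iterator

# Assumption: placeholder control symbol. The real EOM value is defined
# by the Internal communication protocol specification, not here.
EOM = "\n"


def encode_message(metadata: dict) -> str:
    """Serialize object metadata as JSON and terminate it with EOM."""
    # json.dumps escapes newlines inside strings, so the payload
    # never contains a raw EOM character.
    return json.dumps(metadata) + EOM


def decode_messages(flow: str) -> Iterator[dict]:
    """Split a data flow into messages on EOM and parse each one."""
    for chunk in flow.split(EOM):
        if chunk:  # skip the empty tail after the last EOM
            yield json.loads(chunk)


def supervise(flow: str, worker: Callable[[dict], dict]) -> str:
    """Hypothetical supervisor: handles the flow, delegates the
    case-specific operation to the worker, and writes results back."""
    return "".join(encode_message(worker(msg)) for msg in decode_messages(flow))


def add_title_length(metadata: dict) -> dict:
    """Hypothetical T-stage worker: adds one derived field."""
    return {**metadata, "title_length": len(metadata.get("title", ""))}


if __name__ == "__main__":
    flow = encode_message({"id": 1, "title": "abc"})
    print(supervise(flow, add_title_length), end="")
```

The split mirrors the definitions above: `supervise` knows only about the flow and could be shared by every T-stage, while `add_title_length` is the part that would change from one stage to the next.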