Normalization

Lando edited this page Jul 27, 2017 · 5 revisions

Normalization is the first step of the structured data import. At the end of the normalization, the data from the new structured data source has been transformed to the subject scheme and saved to a temporary table.

Import

First, the data has to be imported into the database. For that, a new Import-Job has to be added to the dataimport package. The Import-Job has to extract the data from a resource file, transform it into a data-source-dependent entity type, and finally filter the entities so that only those representing German businesses remain.
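The three steps of an Import-Job (extract, transform, filter) can be sketched as follows. This is a minimal, plain-Python illustration and not the project's actual job code: the CSV resource format, the `SourceEntity` type, its field names, and the rule "country code equals DE" are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class SourceEntity:
    """Hypothetical data-source-dependent entity type (field names are assumed)."""
    name: str
    country: str
    city: str


def parse_entities(resource_text: str) -> List[SourceEntity]:
    """Extract rows from a CSV resource and transform them into entities."""
    reader = csv.DictReader(io.StringIO(resource_text))
    return [
        SourceEntity(name=row["name"], country=row["country"], city=row["city"])
        for row in reader
    ]


def is_german_business(entity: SourceEntity) -> bool:
    """Assumed filter rule: the country code is 'DE', ignoring case and whitespace."""
    return entity.country.strip().upper() == "DE"


def run_import(resource_text: str) -> List[SourceEntity]:
    """Extract, transform, then keep only the entities for German businesses."""
    return [e for e in parse_entities(resource_text) if is_german_business(e)]


# Made-up sample resource, used only to exercise the sketch.
SAMPLE = (
    "name,country,city\n"
    "Acme GmbH,DE,Berlin\n"
    "Globex Inc,US,Springfield\n"
    "Initech AG,de,Potsdam\n"
)
imported = run_import(SAMPLE)
```

In a real job the filtered entities would be written to the database rather than kept in a list; how a German business is recognized depends on the attributes the data source actually provides.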

Normalization

After the data from the data source has been saved to the database, it has to be transformed to the subject scheme. To do so, another job, a DataLakeImport-Job, has to be implemented. The implementation should inherit from the DataLakeImportImplementation class and override the necessary methods. The transformation includes normalizing the data source's original attributes to the database's uniform attributes. The normalized attributes can be found here.
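The inherit-and-override pattern described above might look roughly like the following sketch. It does not reproduce the real DataLakeImportImplementation API: `SubjectImportBase`, `Subject`, the method names, and the attribute keys (`geo_city`, `geo_postal`, `url`) are hypothetical stand-ins chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Subject:
    """Hypothetical, simplified subject: a name plus normalized properties."""
    name: Optional[str] = None
    properties: Dict[str, List[str]] = field(default_factory=dict)


class SubjectImportBase(ABC):
    """Hypothetical stand-in for the shared base class of normalization jobs."""

    @abstractmethod
    def attribute_mapping(self) -> Dict[str, str]:
        """Map source attribute names to the uniform attribute names."""

    def normalize_value(self, target: str, value: str) -> str:
        """Default normalization: trim surrounding whitespace."""
        return value.strip()

    def translate(self, entity: Dict[str, str]) -> Subject:
        """Transform one source entity into a subject with uniform attributes."""
        subject = Subject(name=entity.get("name"))
        for source_key, target_key in self.attribute_mapping().items():
            raw = entity.get(source_key)
            if raw:  # skip missing or empty source attributes
                normalized = self.normalize_value(target_key, raw)
                subject.properties.setdefault(target_key, []).append(normalized)
        return subject


class ExampleSourceImport(SubjectImportBase):
    """Source-specific job: overrides only what differs for this source."""

    def attribute_mapping(self) -> Dict[str, str]:
        return {"city": "geo_city", "zip": "geo_postal", "homepage": "url"}

    def normalize_value(self, target: str, value: str) -> str:
        value = value.strip()
        # Assumed rule for this example: URLs without a scheme get "http://".
        if target == "url" and not value.startswith(("http://", "https://")):
            return "http://" + value
        return value


subject = ExampleSourceImport().translate(
    {"name": "Acme GmbH", "city": " Berlin ", "zip": "10115", "homepage": "acme.example"}
)
```

Keeping the mapping and the per-attribute normalization in overridable methods means a new data source only has to describe its own attributes, while the shared base class handles building the subject.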

The resulting subjects are then saved to a temporary table.


Next step: Duplicate Detection