
Duplicate Detection


Duplicate detection is part of the structured data import and is essential for maintaining a clean database. To detect duplicates between the newly imported data source and the existing subjects, a configuration file has to be created. Three different settings have to be set up for a working configuration:

Blocking

For the blocking process, two parameters have to be set: filterUndefined and maxBlockSize. Both parameters influence the blocking's filter behavior. filterUndefined is a boolean; if it is set to true, all undefined blocks are sorted out. maxBlockSize defines the maximum allowed number of subjects within a block; all blocks containing more subjects than maxBlockSize are sorted out as well.
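
In the configuration file, this part could look like the following sketch. The surrounding blocking element is an assumption about the schema; only the two parameters themselves are taken from the description above, and the values are samples:

<blocking>
  <filterUndefined>true</filterUndefined>  <!-- sort out all undefined blocks -->
  <maxBlockSize>1000</maxBlockSize>        <!-- sort out blocks containing more than 1000 subjects -->
</blocking>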

Confidence

The confidence parameter has to be set to a real number within the interval [0, 1]. It describes the minimum similarity score a pair of subjects has to reach to be declared duplicates.
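
Expressed in the configuration file, this could be a single entry such as the following (the element name is an assumption and 0.9 is only a sample threshold):

<confidence>0.9</confidence>  <!-- pairs scoring below 0.9 are not declared duplicates -->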

Similarity Score

The similarity score settings are represented by a nested value that configures how the similarity score is computed. For each attribute that should be considered when comparing subjects, an attribute tag should be created, containing the name of the attribute, a weight representing its relevance, and a list of similarity measures.

The nested configuration can look like this:

<attribute>
  <key>Name of the Attribute</key>
  <weight>Relevance of the Attribute</weight>
  <feature>
    <similarityMeasure>Suitable Similarity Measure</similarityMeasure>
    <weight>Relevance of the Similarity Measure</weight>
    <scale>Scale for the Similarity Calculation (optional)</scale>
  </feature>
  <feature>
    <similarityMeasure>...</similarityMeasure>
    ...
  </feature>
  ...
</attribute>
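
As a filled-in sketch, such an entry might read as follows. The attribute and measure names (name, JaroWinkler, ExactMatch) and the weights are placeholders rather than a list of the actually supported measures:

<attribute>
  <key>name</key>
  <weight>0.7</weight>
  <feature>
    <similarityMeasure>JaroWinkler</similarityMeasure>
    <weight>0.6</weight>
  </feature>
  <feature>
    <similarityMeasure>ExactMatch</similarityMeasure>
    <weight>0.4</weight>
  </feature>
</attribute>

Assuming the weights act as a weighted sum, a JaroWinkler score of 0.9 and an ExactMatch score of 0.0 would give 0.6 · 0.9 + 0.4 · 0.0 = 0.54 for the name attribute; the attribute scores are then combined with their attribute weights in the same way, and the resulting total is compared against the confidence threshold.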

Another example can be found here.


Next step: Data Merging
