
This page lists use cases that benefit from having a metadata management system (MDMS).

Database reverse engineering

Database reverse engineering describes the process of recapturing the schema and semantics of a dataset. According to the literature, this process draws in particular on application code (e.g., by analyzing existing queries), on domain experts (e.g., by interviewing them), and on other external artifacts. Arguably, these resources are often not available, for instance due to staff attrition or because the dataset was obtained from the internet. In those cases, database reverse engineering can only resort to the raw data and, consequently, to metadata/data profiles derived from it. Hence, it is important to allow users to gain as much knowledge from that metadata as possible. In the following, we take a closer look at two phases of database reverse engineering, namely schema reconstruction and comprehension.

Schema reconstruction. Database reverse engineering typically entails reconstructing the schema of a given dataset. Amongst others, the following tasks have to be tackled:

  • Datatype discovery: For each column of a given dataset, propose a suitable (SQL) datatype. One can take this a step further and also annotate the kind of information contained in each column, such as JSON documents or monetary amounts (a sketch follows this list).
  • Constraint discovery: Propose suitable (SQL) constraints for a dataset. Primary keys and foreign keys are of paramount importance. Also interesting are not-null constraints, value ranges, and text patterns.
  • Naming: To work with and maintain a dataset, it is important to give its tables and columns meaningful names.
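
To make datatype discovery more concrete, here is a minimal sketch in Python, assuming the column values arrive as strings (e.g., from a CSV export); the function name, the parsing rules, and the VARCHAR fallback are illustrative choices, not part of any existing MDMS API.

```python
# Minimal sketch of datatype discovery: propose a SQL datatype per column
# by test-parsing the observed (string) values. Names, parsing rules, and
# the VARCHAR fallback are illustrative, not part of any existing MDMS API.
from datetime import date


def infer_sql_type(values):
    """Propose a SQL datatype for a column given its values as strings."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "VARCHAR(255)"  # no evidence at all: fall back to text

    def all_parse(parser):
        try:
            for v in non_null:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "DOUBLE PRECISION"
    if all_parse(date.fromisoformat):  # ISO dates such as 2016-02-13
        return "DATE"
    return "VARCHAR({})".format(max(len(v) for v in non_null))


# Example: propose one datatype per column of a small, made-up table.
columns = {
    "id": ["1", "2", "3"],
    "price": ["9.99", "12.50", ""],
    "created": ["2016-02-13", "2016-02-14", "2016-02-15"],
    "comment": ["ok", "needs review", "n/a"],
}
for name, values in columns.items():
    print(name, "->", infer_sql_type(values))
```

Constraint discovery could be approached in a similar spirit, e.g., by testing each column for uniqueness (primary key candidates) and for the absence of empty values (not-null candidates).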

Comprehension. Besides reconstructing the technical schema of a dataset, database reverse engineering also requires building up an understanding of that dataset. In particular, obtaining an entity-relationship diagram for a relational schema has been researched in the past. While ER diagrams are indeed helpful in that they abstract away technical details of relational schemata and introduce concepts instead, they still leave a number of questions open:

  • Utility: Datasets are often dirty and sparse. That is, a dataset might expose certain attributes of its entities and yet provide so little data, or data of such poor quality, for those attributes that it is virtually impossible to make use of them.
  • Regularities: Large databases in particular are likely to exhibit certain regularities in their data modeling, such as table templates, naming schemes, and design patterns. Discovering such regularities can drastically facilitate the understanding of a dataset.
  • Relevance: In general, not all tables of a dataset are equally relevant. Some tables model central business objects, some model rare special cases, and some are merely "meta tables" that tie together the actual data (think of link tables or change histories). Pointing out the most relevant tables of a schema can help to separate the wheat from the chaff.
  • Categorization: Breaking down a large matter into smaller pieces is usually beneficial for understanding. This technique can also be applied to schemata by categorizing their tables. We note that there are multiple dimensions along which a schema can be decomposed: topical (tables addressing the same topic and/or holding similar data), topological (neighboring tables connected via few FKs), and structural (tables with a similar structure); a sketch of the structural dimension follows this list.
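
To illustrate the structural dimension of such a categorization, here is a minimal sketch that groups tables whose column-name sets overlap strongly (Jaccard similarity); the example schema and the 0.5 threshold are made-up assumptions for demonstration only.

```python
# Minimal sketch of structural categorization: group tables whose column-name
# sets are similar. The example schema and the 0.5 threshold are made up.
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two column-name sets."""
    return len(a & b) / len(a | b)


schema = {
    "customer":       {"id", "name", "address", "created_at"},
    "supplier":       {"id", "name", "address", "created_at"},
    "order":          {"id", "customer_id", "total", "created_at"},
    "order_item":     {"order_id", "product_id", "quantity"},
    "change_history": {"table_name", "row_id", "changed_at", "old_value"},
}

# Single-link grouping: two tables share a category if their column sets
# are sufficiently similar (directly or via intermediate tables).
groups = {table: {table} for table in schema}
for t1, t2 in combinations(schema, 2):
    if jaccard(schema[t1], schema[t2]) >= 0.5:
        merged = groups[t1] | groups[t2]
        for table in merged:
            groups[table] = merged

for category in {frozenset(group) for group in groups.values()}:
    print(sorted(category))
```

In this toy schema, customer and supplier end up in the same category because they share all their column names, while the remaining tables stay in singleton categories.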