Data quality management ensures the accuracy of data during integration and processing, and it is a core component of DataOps. DataVines is an easy-to-use data quality service platform that supports multiple metrics.
Requires Maven 3.6.1 or later:
$ mvn clean package -Prelease -DskipTests
- Obtain data source metadata regularly to build a data catalog
- Regular monitoring of metadata changes
- Tag management with support for metadata
- 27 built-in data quality check rules
- Support 4 data quality check rule types
  - Single Table-Column Check
  - Single Table Custom SQL Check
  - Cross Table Accuracy Check
  - Two Table Value Comparison Check
- Support scheduled check tasks
- Support SLA alerts on check results
- Support scheduled data profiling runs that output a data profile report
- Support automatic identification of column types to match appropriate data profile indicators
- Support table row count trend monitoring
- Support data distribution view
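To make the check types listed above more concrete, here is a minimal sketch of a single-table column check and a two-table value comparison expressed as plain SQL. It is not DataVines code: the function names, table names and the use of an in-memory SQLite database are assumptions made purely for illustration.

```python
import sqlite3


def null_count(conn, table, column):
    """Single-table column check: count rows where `column` is NULL."""
    # Identifiers are interpolated directly; acceptable in a sketch,
    # not for untrusted input.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]


def value_mismatch(conn, actual_sql, expected_sql):
    """Two-table value comparison: compare one scalar computed on each side."""
    actual = conn.execute(actual_sql).fetchone()[0]
    expected = conn.execute(expected_sql).fetchone()[0]
    return actual != expected, actual, expected


def demo():
    # Hypothetical tables standing in for a source and a target of a sync job.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_orders (id INTEGER, amount REAL);
        CREATE TABLE dst_orders (id INTEGER, amount REAL);
        INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0), (3, NULL);
        INSERT INTO dst_orders VALUES (1, 10.0), (2, 20.0);
    """)
    nulls = null_count(conn, "src_orders", "amount")
    mismatch, actual, expected = value_mismatch(
        conn,
        "SELECT COUNT(*) FROM dst_orders",
        "SELECT COUNT(*) FROM src_orders",
    )
    conn.close()
    return nulls, mismatch, actual, expected


if __name__ == "__main__":
    print(demo())
```

Both checks reduce to a scalar query, so the same idea should carry over to any data source reachable through a SQL connection.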
The platform is built on a plug-in design, and the following modules support user-defined plug-ins for extension:
- Data Source: MySQL, Impala, StarRocks, Doris, Presto, Trino, ClickHouse and PostgreSQL are already supported
- Check Rules: 27 built-in check rules, such as null value check, non-null check and enumeration check
- Job Execution Engine: two execution engines, Spark and Local, are supported. The Spark engine currently supports only Spark 2.4, and the Local engine is a local execution engine built on JDBC that does not rely on other execution engines
- Alert Channel: Email is supported
- Error Data Storage: MySQL and local files are supported (only with the Local execution engine)
- Registry: MySQL, PostgreSQL and ZooKeeper are supported
- Provide a web page to configure check jobs, run jobs, and view job execution logs, error data and check results
- Support generating job scripts online and submitting jobs through datavines-submit.sh, so it can be used together with a scheduling system
- Less platform dependency, easy to deploy
  - At minimum, only MySQL is required to start the project and run data quality checks
- Support horizontal expansion, automatic fault tolerance
  - Decentralized design: Server nodes support horizontal expansion to improve performance
  - Automatic job fault tolerance ensures that jobs are not lost or repeated
- Java runtime environment: JDK 8
- If the data volume is small, or the goal is only functional verification, you can use the JDBC engine
- If you want to run DataVines on Spark, make sure Spark is installed on your server
Click Document for more information
You can submit any ideas as pull requests or as GitHub issues.
If you're new to posting issues, we ask that you read How To Ask Questions The Smart Way (this guide does not provide actual support services for this project!) and How to Report Bugs Effectively prior to posting. Well-written bug reports help us help you!
Thank you to all the people who already contributed to Datavines!
Datavines is licensed under the Apache License 2.0. Datavines relies on some third-party components whose open source licenses are also Apache License 2.0 or compatible with it. In addition, Datavines directly references or modifies some code from Apache DolphinScheduler, SeaTunnel and Dubbo, all of which are licensed under Apache License 2.0. Thanks to these projects for their contributions.
- WeChat Official Account (in Chinese, scan the QR code to follow)