Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DSIP-78][Data Quality] Remove data quality module #16794

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from

Conversation

SbloodyS
Copy link
Member

Purpose of the pull request

close #16728

Brief change log

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

Pull Request Notice

If your pull request contain incompatible change, you should also add it to docs/docs/en/guide/upgrede/incompatible.md

@davidzollo
Copy link
Contributor

-1,there're no conclusion for now, please don't do this op

@SbloodyS
Copy link
Member Author

-1,there're no conclusion for now, please don't do this op

Based on the discussion of issue #16728, we can draw the following conclusions.

  1. More than half of the maintainers agreed.
  2. Half of the users who disagree don't understand the difference between DS and Hadoop and data-quality. It's meaningless.
  3. The remaining half users want to refactor the data-quality module instead of removing it.

No one among the maintainers on the current community is willing to refactor this module. And data-quality module has seriously blocked the progress of #16098 which is is very important for the next version's release. So I think the conclusion of removal is obvious.

@davidzollo
Copy link
Contributor

-1,there're no conclusion for now, please don't do this op

Based on the discussion of issue #16728, we can draw the following conclusions.

  1. More than half of the maintainers agreed.
  2. Half of the users who disagree don't understand the difference between DS and Hadoop and data-quality. It's meaningless.
  3. The remaining half users want to refactor the data-quality module instead of removing it.

No one among the maintainers on the current community is willing to refactor this module. And data-quality module has seriously blocked the progress of #16098 which is is very important for the next version's release. So I think the conclusion of removal is obvious.

Apache emphasizes achieving consensus. If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

@SbloodyS
Copy link
Member Author

Apache emphasizes achieving consensus.

First of all, the current consensus does not require all people to reach an agreement, but only more than half. For more info you can take a look at apache/comdev-site#189

If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Like I said, no one among the maintainers on the current community is willing to refactor this module. How to achieve this without anyone willing to take responsibility?

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

I think issue and dev mail list have the same meaning on this issue. And all active PMC/Committer have been included in this issue.

@davidzollo
Copy link
Contributor

Apache emphasizes achieving consensus.

First of all, the current consensus does not require all people to reach an agreement, but only more than half. For more info you can take a look at apache/comdev-site#189

If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Like I said, no one among the maintainers on the current community is willing to refactor this module. How to achieve this without anyone willing to take responsibility?

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

I think issue and dev mail list have the same meaning on this issue. And all active PMC/Committer have been included in this issue.

Github Issue can't instead of mail, especially for big event.
I think we can try to find some contributors who'd like to refactor this module if this function is finally kept.

By the way, i don't disagree with removing this module; I’m just concerned about the significant impact it will have on users. What I see is that many users want to keep this module, while the maintainers are inclined to remove it to make refactoring easier. Therefore, this decision requires great caution. If a vote does not take place in the dev mailing list, then it should happen in private, not in an issue thread.

@SbloodyS
Copy link
Member Author

SbloodyS commented Nov 14, 2024

Github Issue can't instead of mail, especially for big event. I think we can try to find some contributors who'd like to refactor this module if this function is finally kept.

By the way, i don't disagree with removing this module; I’m just concerned about the significant impact it will have on users. What I see is that many users want to keep this module, while the maintainers are inclined to remove it to make refactoring easier. Therefore, this decision requires great caution. If a vote does not take place in the dev mailing list, then it should happen in private, not in an issue thread.

I don't think this module is a big event. This feature was introduced in PR #4830 Since Feb 21, 2021 without any mailing or github issue discussion. For more than three years, apart from this PR author, no contributor has contributed to this function, and there are endless issue of bugs and improvement, and there is no substantial code change in this function. This author also doesn't want to maintain this function and vote +1 for removal.

@davidzollo
Copy link
Contributor

davidzollo commented Nov 14, 2024

At that time, there was no DSIP mechanism in place, and much of the communication took place during community meetings. The feature I’m referring to took 8-9 months to develop and implement. If this is not a major feature, I believe that statement would be inaccurate.

@SbloodyS
Copy link
Member Author

At that time, there was no DSIP mechanism in place, and much of the communication took place during community meetings. The feature I’m referring to took 8-9 months to develop and implement. If this is not a major feature, I believe that statement would be inaccurate.

I don't agree with this opinion. Since dolphinscheduler is focusing the modern data orchestration platform. Data-Quality is focusing accuracy and consistency of data. The relationship between two of them is equivalent to Flink and Flink-CDC. There are many mature examples at present.

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add ddl to drop the dq table.

@SbloodyS
Copy link
Member Author

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

@zhongjiajie
Copy link
Member

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

@SbloodyS
Copy link
Member Author

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

I think it will be more traversal for some users to provide users with operating DLLs intuitively since this is a dangerous operation for drop table...

@zhongjiajie
Copy link
Member

For proposal only, it seems inconsistency exists, although some of the PMCs in #16728 already agreed to remove it, David is right and he could challenge it and ask to vote in the dev mailing list. So maybe should vote in dev mail thread, And I think the vote result should be the ONLY result of this PR continue or not instead of personal emotions.
I found out author @zixi0825 also voted +1 in #16728 (comment), and @qingwli, who has maintained data quality for some time, voted -1 for removing it.

And for this feature, I think it should better act as a plugin instead of a built-in function. Especially since it has many bugs and CVEs and not team member want to maintain it. So I personally would vote +1 for removing it.

And BTW, the time cost for the feature development should not as a standard to measure the importance or not.

@zhongjiajie
Copy link
Member

zhongjiajie commented Nov 14, 2024

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

I think it will be more traversal for some users to provide users with operating DLLs intuitively since this is a dangerous operation for drop table...

I mean we should add new bash script like https://github.com/apache/dolphinscheduler/blob/dev/dolphinscheduler-tools/src/main/bin/migrate-lineage.sh which is separated form upgrade-schema.sh, for easy use of end users. But it not important thing, the exists data should be keep and we can provide document to notice our users

@SbloodyS
Copy link
Member Author

I mean we should add new bash script like https://github.com/apache/dolphinscheduler/blob/dev/dolphinscheduler-tools/src/main/bin/migrate-lineage.sh which is separated form upgrade-schema.sh, for easy use of end users. But it not important thing, the exists data should be keep and we can provide document to notice our users

Ok.

@SbloodyS
Copy link
Member Author

For proposal only, it seems inconsistency exists, although some of the PMCs in #16728 already agreed to remove it, David is right and he could challenge it and ask to vote in the dev mailing list. So maybe should vote in dev mail thread, And I think the vote result should be the ONLY result of this PR continue or not instead of personal emotions. I found out author @zixi0825 also voted +1 in #16728 (comment), and @qingwli, who has maintained data quality for some time, voted -1 for removing it.

And for this feature, I think it should better act as a plugin instead of a built-in function. Especially since it has many bugs and CVEs and not team member want to maintain it. So I personally would vote +1 for removing it.

And BTW, the time cost for the feature development should not as a standard to measure the importance or not.

I will raise a vote in dev mail list.

@zhongjiajie
Copy link
Member

FYI, vote mail in https://lists.apache.org/thread/0tldm33skkbrfgbt01bvd610z5zmb725

@qingwli
Copy link
Member

qingwli commented Nov 15, 2024

I voted -1 for removing Data quality before, but the author gave us another option about the plugin instead of the built-in function. I think it's a good way.

Actually, ds's data quality is not an out-of-the-box module. The author builds a framework but still needs lots of work that can be used and used simply. The author has another github repo about the greater and more functional data quality project. If we enhance the data quality in DS, it means we will do the same work.

So I agree to remove the data quality module in ds, and use a plugin to let users continue to use the better data quality check functions.

About removing tables in ds, I agree @zhongjiajie new bash scrip way.

@fuchanghai
Copy link
Member

I voted -1 for removing Data quality before, but the author gave us another option about the plugin instead of the built-in function. I think it's a good way.

Actually, ds's data quality is not an out-of-the-box module. The author builds a framework but still needs lots of work that can be used and used simply. The author has another github repo about the greater and more functional data quality project. If we enhance the data quality in DS, it means we will do the same work.

So I agree to remove the data quality module in ds, and use a plugin to let users continue to use the better data quality check functions.

About removing tables in ds, I agree @zhongjiajie new bash scrip way.

agree .

@Gallardot
Copy link
Member

Gallardot commented Nov 15, 2024

I voted -1 for removing Data quality before, but the author gave us another option about the plugin instead of the built-in function. I think it's a good way.

Actually, ds's data quality is not an out-of-the-box module. The author builds a framework but still needs lots of work that can be used and used simply. The author has another github repo about the greater and more functional data quality project. If we enhance the data quality in DS, it means we will do the same work.

So I agree to remove the data quality module in ds, and use a plugin to let users continue to use the better data quality check functions.

About removing tables in ds, I agree @zhongjiajie new bash scrip way.

Agree. I voted +1.
Dolphinscheduler should focus on workflow orchestration and task scheduling. It should then provide a good plugin mechanism to integrate with other projects, thereby building a better ecosystem. Not just data quality, I believe Flink real-time stream processing tasks should also be removed from dolphinscheduler.

@qingwli
Copy link
Member

qingwli commented Nov 15, 2024

About the Flink real-time stream processing task, we have rewritten it for our platform. And our platform looks like a big data schedule system. @davidzollo told want to have a quick meet about the Real-time part in Ds, We have finished some doc about introducing our Flink, Do you have some time so we can discuss this part? CC @Gallardot

@SbloodyS
Copy link
Member Author

I mean we should add new bash script like https://github.com/apache/dolphinscheduler/blob/dev/dolphinscheduler-tools/src/main/bin/migrate-lineage.sh which is separated form upgrade-schema.sh, for easy use of end users. But it not important thing, the exists data should be keep and we can provide document to notice our users

We remove data-quality related table in init sql but not add it in upgrade sql. It will cause the schema check in CI failed... @zhongjiajie @ruanwenjun

Copy link

sonarcloud bot commented Nov 15, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DSIP-78][Data Quality] Suggest remove data quality module
7 participants