
ddl: fix a caught panic and add comment for DDL functions #54685

Merged
merged 8 commits into from
Jul 23, 2024

Conversation

lance6716
Contributor

@lance6716 lance6716 commented Jul 17, 2024

What problem does this PR solve?

Issue Number: close #54687 ref #54436

Problem Summary:

What changed and how does it work?

  • add comments for some DDL functions
  • add updateRawArgs as a return value of runOneJobStep, to decide more accurately whether RawArgs needs to be updated
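The second bullet can be sketched as follows. This is a minimal, self-contained Go sketch with hypothetical `Job` and state names; the real signature lives in TiDB's ddl worker and is more involved:

```go
package main

import "fmt"

// Job is a hypothetical, simplified stand-in for model.Job.
type Job struct {
	State   string
	RawArgs []byte
}

// runStep is a placeholder for the real step execution.
func runStep(job *Job) (int64, error) {
	job.State = "done"
	return 1, nil
}

// runOneJobStep sketches the changed signature: besides the schema version
// and error, it now also reports whether the caller should re-marshal and
// persist the job's RawArgs, instead of the caller guessing from the error.
func runOneJobStep(job *Job) (schemaVer int64, updateRawArgs bool, err error) {
	prevState := job.State
	schemaVer, err = runStep(job)
	// Persist RawArgs when the step succeeded, or when the job just flipped
	// from running to rolling back (its arguments may have been rewritten).
	updateRawArgs = err == nil ||
		(prevState == "running" && job.State == "rollingback")
	return schemaVer, updateRawArgs, err
}

func main() {
	job := &Job{State: "running", RawArgs: []byte(`["t1"]`)}
	ver, update, err := runOneJobStep(job)
	fmt.Println(ver, update, err)
}
```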

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 17, 2024

tiprow bot commented Jul 17, 2024

Hi @lance6716. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: lance6716 <[email protected]>
@lance6716 lance6716 changed the title [WIP]ddl: fix a caught panic ddl: fix a caught panic and add comment for DDL functions Jul 17, 2024
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 17, 2024

codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 85.29412% with 10 lines in your changes missing coverage. Please review.

Project coverage is 56.2786%. Comparing base (2108661) to head (8a1c803).
Report is 54 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #54685         +/-   ##
=================================================
- Coverage   74.6249%   56.2786%   -18.3464%     
=================================================
  Files          1551       1673        +122     
  Lines        362640     614388     +251748     
=================================================
+ Hits         270620     345769      +75149     
- Misses        72390     245222     +172832     
- Partials      19630      23397       +3767     
Flag Coverage Δ
integration 37.1281% <64.7058%> (?)
unit 71.7111% <83.8235%> (-1.8255%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9656% <ø> (-2.2339%) ⬇️
parser ∅ <ø> (∅)
br 52.4680% <ø> (+4.8364%) ⬆️

@lance6716
Contributor Author

/retest


tiprow bot commented Jul 17, 2024

@lance6716: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -899,7 +907,27 @@ func (w *worker) prepareTxn(job *model.Job) (kv.Transaction, error) {
return txn, err
}

func (w *worker) HandleDDLJobTable(d *ddlCtx, job *model.Job) (int64, error) {
// runOneJobStep runs one step of the DDL job and persist the states change. One
// *step* is defined as the following reason:
Contributor

other steps: 1. reorg also has its own state changes; 2. reorg runs asynchronously, and it will enter/exit this function to check whether the async routine is done — there is no state change during this time

Contributor Author

@lance6716 lance6716 Jul 17, 2024


CI shows the step of onLockTables is not a schema state change. LOCK TABLES may have many table arguments, and in each step it updates one table's lock state and persists the new TableInfo. I'll come up with a better comment tomorrow and align the code to fix the UT 😂 Maybe I should keep the old needUpdateRawArgs behaviour (if there is no runErr, we should marshal RawArgs) and fix the wrong runErr nilness to close #54687
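The per-table stepping described here can be illustrated with a hedged sketch; `tableLock` and `lockOneTablePerStep` are hypothetical stand-ins, not the real `onLockTables` implementation:

```go
package main

import "fmt"

// tableLock is a hypothetical stand-in for a table's lock state in TableInfo.
type tableLock struct {
	Name   string
	Locked bool
}

// lockOneTablePerStep models one job step: it advances the lock state of
// exactly one not-yet-locked table (where the real code would persist the
// new TableInfo) and reports whether the whole job is finished.
func lockOneTablePerStep(tables []tableLock) (done bool, out []tableLock) {
	for i := range tables {
		if !tables[i].Locked {
			tables[i].Locked = true // persist point in the real code
			return i == len(tables)-1, tables
		}
	}
	return true, tables
}

func main() {
	tables := []tableLock{{Name: "t1"}, {Name: "t2"}}
	done := false
	steps := 0
	for !done {
		done, tables = lockOneTablePerStep(tables)
		steps++
	}
	fmt.Println(steps) // one step per table argument
}
```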

Contributor

reorg runs asynchronously; it will enter/exit this function to check whether the async routine is done, and there is no state change during this time

Please add this one too.

Contributor Author

Updated some function comments. f782a52

reorg is a bit too complex to serve as a brief example in the function comments, so I only added onLockTables. Please check if it's clear enough.

Comment on lines 924 to 926
// - We may need to use caller `runOneJobStepAndWaitSync` to make sure other node
// is synchronized before change the job state. So an extra job state *step* is
// added.
Contributor

You mean the wait change above runOneJobStep? It's for failover.

Contributor Author

@lance6716 lance6716 Jul 17, 2024


Yes, the wait change above runOneJobStep. This item is meant to describe the job state changes, for example from JobStateDone to JobStateSynced.

I'm not sure why JobStateDone -> JobStateSynced is relevant to failover. My understanding is that the current node (the one the user is connected to) must wait for all other nodes to finish synchronizing the job state before it tells the user the DDL is finished; otherwise it breaks linearizability (the user-connected node shows the DDL is finished, but a slow node shows it is not).
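The linearizability argument above can be sketched roughly as follows; `allSynced` and the node/version names are hypothetical, not TiDB's actual schema-sync API:

```go
package main

import "fmt"

// allSynced sketches the check the owner would make before answering the
// user: every node's reported schema version must have reached the version
// produced by the DDL job.
func allSynced(nodeVersions map[string]int64, target int64) bool {
	for _, v := range nodeVersions {
		if v < target {
			return false // a slow node would still serve the old schema
		}
	}
	return true
}

func main() {
	versions := map[string]int64{"tidb-0": 5, "tidb-1": 4}
	fmt.Println(allSynced(versions, 5)) // false: tidb-1 lags behind
	versions["tidb-1"] = 5
	fmt.Println(allSynced(versions, 5)) // true: safe to report success
}
```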

Contributor

done->synced might not go through the wait change above runOneJobStep; the wait part might have been done in waitSchemaChanged below.

JobStateDone -> JobStateSynced is relevant to failover.

Suppose the owner changes during the wait: with this state change, we can catch it and wait again on the new owner, i.e. the wait change above runOneJobStep.

Comment on lines 920 to 922
// correctness through failover, this function will decide and persist the
// arguments of a job as a separate *step*. These steps will reuse "schema state"
// changes, see onRecoverTable as an example.
Contributor

You include done->sync in this part? It's not for failover; it must be done in a separate step after waiting for the schema version.

Contributor Author

@lance6716 lance6716 Jul 17, 2024


No, "done->sync" is not included here; it is covered in the item from your comment above.

This part is like the reasoning in the comments of onRecoverTable. If the job will change system states (like disabling GC) and revert them afterward, it must save the GC state before it starts running. Otherwise, if the job fails between changing the states and reverting them, we can't recover the original states.

Maybe another example is: the job should decide the TS and persist it first. Otherwise, if it runs twice and chooses two different TSes due to a node crash, some problems will occur.
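The "decide and persist first" pattern in this comment can be sketched as a small idempotency example; `job`, `ensureTS`, and `allocTS` are hypothetical names, not TiDB's actual code:

```go
package main

import "fmt"

// job is a hypothetical job that must decide a timestamp exactly once and
// persist it before acting on it, so that a retry after a crash reuses the
// persisted value instead of picking a new one.
type job struct {
	decidedTS int64 // zero means "not decided yet"
}

var nextTS int64 = 100

// allocTS stands in for fetching a fresh timestamp from PD.
func allocTS() int64 { nextTS++; return nextTS }

// ensureTS is idempotent: the first step decides (and, in real code,
// persists into RawArgs) the TS; any re-run after failover reuses it.
func ensureTS(j *job) int64 {
	if j.decidedTS == 0 {
		j.decidedTS = allocTS()
	}
	return j.decidedTS
}

func main() {
	j := &job{}
	first := ensureTS(j)
	second := ensureTS(j) // simulated retry after a crash
	fmt.Println(first == second)
}
```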

@D3Hunter D3Hunter mentioned this pull request Jul 17, 2024
54 tasks
@lance6716 lance6716 changed the title ddl: fix a caught panic and add comment for DDL functions [WIP]ddl: fix a caught panic and add comment for DDL functions Jul 17, 2024
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 17, 2024
// but they make use of caller transitOneJobStep to persist job changes.
//
// - We may need to use caller transitOneJobStepAndWaitSync to make sure all
// other node is synchronized to provide linearizability. So an extra job state
Contributor

Please add an example. If I were a newcomer, I would ask what the extra job state is.


Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
@lance6716 lance6716 changed the title [WIP]ddl: fix a caught panic and add comment for DDL functions ddl: fix a caught panic and add comment for DDL functions Jul 18, 2024
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 18, 2024
Signed-off-by: lance6716 <[email protected]>
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jul 18, 2024
updateRawArgs = err == nil
// if job changed from running to rolling back, arguments may be changed
if prevState == model.JobStateRunning && job.IsRollingback() {
updateRawArgs = true
Contributor


We should move the runtime args change into something like a job ctx: we actively mark it when those args change and fill them back based on the mark.

Right now we are checking whether the args changed passively.

Contributor Author

@lance6716 lance6716 Jul 18, 2024


Yes, I started with the active way yesterday, and found there are too many cases and too many function signatures would need to change 😂

Saving it in the job seems to avoid many changes; I'll try it in this PR or future ones. It's OK if this PR is merged before I finish that work. No need to hold this PR.
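The "active marking" idea suggested above can be sketched as follows; `jobCtx`, `markArgsChanged`, and `runHandler` are hypothetical names for illustration, not an existing TiDB API:

```go
package main

import "fmt"

// jobCtx is a hypothetical per-job context: instead of inferring from the
// error/state whether RawArgs changed, handlers set a dirty flag, and the
// caller marshals RawArgs only when the flag is set.
type jobCtx struct {
	argsDirty bool
}

func (c *jobCtx) markArgsChanged() { c.argsDirty = true }

// runHandler models a DDL step handler that knows whether it mutated the
// job's arguments and marks the context accordingly.
func runHandler(c *jobCtx, changesArgs bool) {
	if changesArgs {
		c.markArgsChanged()
	}
}

func main() {
	c := &jobCtx{}
	runHandler(c, false)
	fmt.Println(c.argsDirty) // false: no need to re-marshal RawArgs
	runHandler(c, true)
	fmt.Println(c.argsDirty) // true: caller persists RawArgs
}
```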


ti-chi-bot bot commented Jul 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, tangenta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jul 23, 2024

ti-chi-bot bot commented Jul 23, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-07-18 08:47:06.53689772 +0000 UTC m=+516448.527839228: ☑️ agreed by D3Hunter.
  • 2024-07-23 04:08:38.948885337 +0000 UTC m=+931740.939826807: ☑️ agreed by tangenta.


tiprow bot commented Jul 23, 2024

@lance6716: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow 8a1c803 link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot merged commit f774ef6 into pingcap:master Jul 23, 2024
22 of 23 checks passed
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Caught panic when canceling DDL jobs
3 participants