Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add graceful shutdown to loki.write draining the WAL #5804

Merged
merged 14 commits into from
Dec 11, 2023

Conversation

thepalbi
Copy link
Contributor

@thepalbi thepalbi commented Nov 17, 2023

PR Description

This PR is a follow up after some discussions in #5770.

The idea of this PR is to allow loki.write to do a graceful shutdown when the component is stopped. This graceful shutdown will attempt to drain the WAL, making the watcher consume all of the segments it can, before a timeout occurs. Note that two timeouts will managet the WAL enabled loki.write component: the WAL drain timeout, and the queue client drain timeout.

Also, the Watcher is refactored a bit to simlify the "possible states" in which this operates.

Which issue(s) this PR fixes

Related to https://github.com/grafana/cloud-onboarding/issues/5407

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Nov 17, 2023
@mattdurham mattdurham self-assigned this Nov 17, 2023
Copy link
Contributor

@clayton-cornell clayton-cornell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc change is 👍

@thepalbi thepalbi force-pushed the pablo/loki-write-drain-on-stop branch from 3514f0f to 6c33067 Compare November 21, 2023 13:10
@@ -90,19 +91,22 @@ func TestUnmarshallWalAttrributes(t *testing.T) {
MaxSegmentAge: wal.DefaultMaxSegmentAge,
MinReadFrequency: wal.DefaultWatchConfig.MinReadFrequency,
MaxReadFrequency: wal.DefaultWatchConfig.MaxReadFrequency,
DrainTimeout: wal.DefaultWatchConfig.DrainTimeout,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have one more config , For Ex: MaxSegmentSize which will deletes the logs when it touches the disk space size. which will be useful in some situations where the following error throws
msg=failed to write entry component=loki.write.remote err=write /tmp/agent/loki.write.remote/wal/00000569: no space left on device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the device that backs the WAL running out of space is a smell that something is happening, between:

  • the agent scraping more that it can send
  • the WAL volume is too small
  • one should be able to make the WAL delete all segments more rapidly with max_segment_age, which makes sense since the WAL needs to be thought more of keeping the last X minutes of data rather than X bytes of data

Can you share a bit more of your use case, to understand why the WAL volume is reaching it's capacity?

@thepalbi thepalbi force-pushed the pablo/loki-write-drain-on-stop branch from f6ea92c to 5cde5fc Compare December 6, 2023 20:46
@thepalbi
Copy link
Contributor Author

thepalbi commented Dec 6, 2023

@mattdurham all comments should be resolved

@mattdurham mattdurham enabled auto-merge (squash) December 11, 2023 17:40
Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mattdurham mattdurham merged commit e8fd587 into main Dec 11, 2023
10 checks passed
@mattdurham mattdurham deleted the pablo/loki-write-drain-on-stop branch December 11, 2023 20:24
BarunKGP pushed a commit to BarunKGP/grafana-agent that referenced this pull request Feb 20, 2024
* drain test and routing

* refactoring watcher to use state

* drain working

* fine-tuned test case

* added short timeout test

* prompt exit test

* prompt exit passing

* refactoring watcher

* some comments

* map river configs

* add docs

* splitting apart Stop and Drain

* minimize manager stop time
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. type/docs Docs Squad label across all Grafana Labs repos
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants