Remove replace directive for golang.org/x/exp #5972

Merged · 17 commits · Jan 5, 2024

Conversation

@ptodev ptodev commented Dec 13, 2023

This PR removes the replace directive for golang.org/x/exp from the Agent's go.mod file.

This is necessary because on a separate branch I am upgrading Agent to a new OpenTelemetry version, and it requires a new version of github.com/grafana/loki/pkg/push which needs the latest golang.org/x/exp.
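For readers unfamiliar with replace directives, this is roughly the shape of the line being dropped (the pinned pseudo-version below is illustrative, not the exact one from the Agent's go.mod):

```
// go.mod (illustrative pseudo-version)
replace golang.org/x/exp => golang.org/x/exp v0.0.0-20230124195608-d38c7dcee874
```

Removing it lets `go mod tidy` resolve golang.org/x/exp to whatever version the dependency graph (including github.com/grafana/loki/pkg/push) actually requires, instead of pinning it.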

golang.org/x/exp has been problematic because the comparison function passed to SortFunc changed its return type from bool (in the old version) to int (in the new version), so modules pinned to different versions can't compile together. Apparently some packages like to use golang.org/x/exp because it's more performant.
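As a concrete illustration of the incompatibility (not code from this PR; the slice contents are made up), newer golang.org/x/exp versions expect a cmp function returning int, whereas older versions expected a less function returning bool:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/exp/slices"
)

func main() {
	names := []string{"loki", "agent", "mimir"}

	// Newer x/exp: SortFunc takes cmp(a, b) int (negative/zero/positive).
	slices.SortFunc(names, func(a, b string) int {
		return strings.Compare(a, b)
	})

	// Older x/exp instead expected less(a, b) bool, e.g.
	//   slices.SortFunc(names, func(a, b string) bool { return a < b })
	// which is why dependencies pinned to different x/exp versions clash.

	fmt.Println(names) // [agent loki mimir]
}
```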

Fixes #5921

@ptodev ptodev marked this pull request as ready for review December 13, 2023 18:28
ptodev commented Dec 13, 2023

The replace directive for prometheus/prometheus can be removed, because v2.48.0 contains the two PRs we need:
prometheus/prometheus#12677
prometheus/prometheus#12729

@mattdurham mattdurham left a comment

Looks like we got rid of a lot of cruft! Awesome

ptodev commented Dec 13, 2023

@mattdurham unfortunately the Linux build doesn't work because Pyroscope's eBPF module has a replace directive for the exp module 🙀 I'll try to change their go.mod.

ptodev commented Dec 14, 2023

I raised a PR for Pyroscope.

@ptodev ptodev force-pushed the ptodev/remove-exp-replace branch 2 times, most recently from fc61ae3 to ea5d929 on December 18, 2023 16:38
ptodev commented Dec 18, 2023

I had to upgrade our Prometheus dependency, and this became a much bigger PR than expected.

Name | Type | Description | Default | Required
---- | ---- | ----------- | ------- | --------
`enable_http2` | `bool` | Whether HTTP2 is supported for requests. | `true` | no
`honor_labels` | `bool` | Indicator whether the scraped metrics should remain unmodified. | `false` | no
`honor_timestamps` | `bool` | Indicator whether the scraped timestamps should be respected. | `true` | no
`track_timestamps_staleness` | `bool` | Indicator whether to track the staleness of the scraped timestamps. | `false` | no
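For context, a hedged sketch of how these documented arguments would appear in a Flow `prometheus.scrape` block (the component label, target address, and `forward_to` reference are made up for illustration):

```river
prometheus.scrape "example" {
  targets    = [{"__address__" = "demo.example.com:9090"}]
  forward_to = [prometheus.remote_write.default.receiver]

  enable_http2     = true  // whether HTTP2 is supported for requests (default true)
  honor_labels     = false // whether scraped metrics should remain unmodified (default false)
  honor_timestamps = true  // whether scraped timestamps should be respected (default true)

  // track_timestamps_staleness is the new argument documented above; per the
  // discussion below, it was later taken out of this PR.
}
```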
ptodev (Contributor Author) commented:

@bboreham Regarding #5921 - for now I intend to default this to false for a few reasons:

  • IIUC, track_timestamps_staleness = true only makes sense if out of order ingestion is allowed in the back end database.
  • Apparently there are implications to querying and alerting.

Please let me know if you disagree. I'm open to changing this to default to true prior to merging. If it causes a backwards-incompatible change for the users, we'll have to mention it explicitly in our changelog and write docs with steps on what users need to change to get the behaviour they need (e.g. how to change their alerts or dashboards).

We usually list backwards incompatible changes here:
https://grafana.com/docs/agent/latest/flow/release-notes/


@ptodev ptodev Dec 18, 2023

Quoting @bboreham's reply:

> > IIUC, track_timestamps_staleness = true only makes sense if out of order ingestion is allowed in the back end database.
>
> Not as far as I know. This is to fix the long-standing irritation where cAdvisor metrics linger on for 5 minutes after the pod has gone.
>
> The blog you cite relates to the last change in staleness handling 5 years ago; I don't think it is relevant here.

@bboreham I think the assumption behind setting track_timestamps_staleness = true is that if the scraper didn't get any samples then that must be because there aren't any. But is this a good assumption to make in the general case? If a sample is exposed via explicit timestamps, and if there is no new value to report, then what sample should be exposed for its series the next time a scrape happens? Is the convention to just report the same value with a new timestamp?

I think it makes sense to default track_timestamps_staleness to true if:

  • The convention is that the absence of a sample is considered enough evidence to decide that the series is stale.
  • There is no need to enable out of order ingestion. This is so that we prevent a situation where a timestamp arrives late (e.g. because it takes a long time to generate) but it can't be ingested in the TSDB because there is already a staleness marker with a more recent timestamp.

This explicit timestamp feature seems like a way to "push" metrics.... so I'm not sure what assumptions are ok to make. I suspect that Agents in "push" systems like OTel don't just declare a series as stale if no samples were pushed in a certain time.

ptodev (Contributor Author) commented:

Hi @bboreham, would you mind getting back to us on the comment above please?

bboreham (Contributor) commented:

If no sample is supplied for a timestamp, PromQL (at query time) will use the preceding value up to 5 minutes old.
This creates a long-standing issue with cAdvisor (i.e. Kubelet container metrics).

It's got nothing to do with "push".

There is no expectation that explicit timestamps come out of order. This never worked historically in Prometheus, and there is no reason to suppose people started sending them.

> But is this a good assumption to make in the general case?

Yes, it is the standard behaviour when exporters do not supply the timestamp. Which is the vast majority.

bboreham (Contributor) commented:

By the way, I don't think we should change this default in the middle of an 800-line PR doing other things.
I can make a separate PR to fix it.

ptodev (Contributor Author) commented:

Ok, thank you, I'll take the track_timestamps_staleness parameter out of this PR. We can introduce it in a different PR. I don't want to add it now and change its default value later, because prometheus.scrape is a stable component and its defaults aren't meant to change often.


@wildum wildum left a comment


looks good, just a few nits, thanks for taking care of this :)

component/discovery/aws/lightsail.go
component/discovery/ovhcloud/ovhcloud.go
- `SERVICE`: The OVHcloud service of the targets to retrieve.
- `PROMETHEUS_REMOTE_WRITE_URL`: The URL of the Prometheus remote_write-compatible server to send metrics to.
- `USERNAME`: The username to use for authentication to the remote_write API.
- `PASSWORD`: The password to use for authentication to the remote_write API.

wildum (Contributor) commented:

nit: for the example, I would suggest also setting refresh_interval and endpoint, and using some realistic-looking data instead of the placeholders (the placeholders are already used in the ## Usage part).

ptodev (Contributor Author) commented:

Yes, I agree. I also don't like how we repeat the definitions of the arguments here. I did it this way because it's consistent with other discovery components, but I agree we should change this for all discovery components at a later point.
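A hedged sketch of what such an example might look like, assuming the argument names mirror the upstream Prometheus OVHcloud SD config; the endpoint, credentials, service type, and remote_write URL below are all made up:

```river
discovery.ovhcloud "example" {
  endpoint           = "ovh-eu"
  application_key    = "EXAMPLE_APP_KEY"
  application_secret = "EXAMPLE_APP_SECRET"
  consumer_key       = "EXAMPLE_CONSUMER_KEY"
  service            = "vps"
  refresh_interval   = "60s"
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    basic_auth {
      username = "example-user"
      password = "example-password"
    }
  }
}
```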

@tpaschalis tpaschalis left a comment


LGTM, this was a chunky one! Approving this to unblock you so we can move ahead once CI is green, and the discussion with Bryan has settled.

CHANGELOG.md
go.mod
@ptodev ptodev requested a review from wildum December 19, 2023 13:36
@clayton-cornell clayton-cornell added the type/docs label Dec 19, 2023
@mattdurham mattdurham left a comment


lgtm

ptodev commented Jan 4, 2024

@tpaschalis @mattdurham @wildum I am re-requesting a review because I rebased the branch and, as discussed, removed track_timestamps_staleness from prometheus.scrape. The track_timestamps_staleness argument was added to Prometheus recently. Not having it means that Flow no longer has feature parity with Static mode. However, I think this is ok for two reasons:

  • track_timestamps_staleness is a new argument which is probably not widely used yet.
  • There is already a precedent for not supporting some arguments in Flow. For example, no_proxy in the HTTP client config.

If you agree that it's ok to not support this argument for now, please feel free to approve the PR again.

@tpaschalis tpaschalis left a comment


LGTM! Let's follow up with discussion for the new argument in a new PR.

@ptodev ptodev merged commit 404423b into main Jan 5, 2024
10 checks passed
@ptodev ptodev deleted the ptodev/remove-exp-replace branch January 5, 2024 10:53
hainenber pushed a commit to hainenber/agent that referenced this pull request Jan 6, 2024
* Remove replace directive for golang.org/x/exp

* Update pyroscope/ebpf from 0.4.0 to 0.4.1

* Fill in missing docs about HTTP client options.
Fix missing defaults.
Add an "unsupported" converter diagnostic for keep_dropped_targets.
Add HTTP client options to AWS Lightsail SD.

* Add discovery.ovhcloud

* Add a converter for discovery.ovhcloud

* Update cloudwatch_exporter docs

* Fix converter tests

* Mention Prometheus update in the changelog.

---------

Co-authored-by: William Dumont <[email protected]>
@bboreham bboreham mentioned this pull request Feb 6, 2024
BarunKGP pushed a commit to BarunKGP/grafana-agent that referenced this pull request Feb 20, 2024
@github-actions github-actions bot added the frozen-due-to-age label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Successfully merging this pull request may close these issues.

Add track_timestamps_staleness to prometheus.scrape
6 participants