
When scraping in Alloy clustering mode with 3 or more replicas, a duplicate label error occurs #1006

Closed

itjobs-levi opened this issue Jun 10, 2024 · 7 comments
Labels: bug, frozen-due-to-age

@itjobs-levi commented Jun 10, 2024

What's wrong?

I am running in an AWS environment, where an Alloy agent on each of roughly 100 servers exports metrics through the unix, process, and cadvisor exporters.

The Alloy v1.1.0 agents in Kubernetes perform ec2 discovery -> relabel (to keep EC2 tags) -> scrape -> Mimir remote write.

  • unix scrape interval: 1 minute
  • process scrape interval: 5 minutes
  • cadvisor (Docker) scrape interval: 5 minutes

The Alloy agents, deployed as a StatefulSet in Kubernetes, collect without any errors at a replica count of 2.
However, when the replica count becomes 3, duplicate label errors start to occur on the last replica pod.
The metrics still appear to be collected, but the error is logged constantly and looks like a serious problem to me.

Reference Link
#784

Steps to reproduce

  1. Install the Alloy agent on the EC2 application servers and export unix and process metrics.
  2. Deploy Alloy and mimir-distributed on EKS via their Helm charts as StatefulSets.
  3. Configure Alloy (ec2 discovery -> relabel -> scrape with clustering -> Mimir remote write) with 2 Alloy replicas.
  4. Change the Alloy replicas from 2 to 3 (this is where the problem appears).
  5. After a while, the last Alloy pod that was added starts logging many err-mimir-duplicate-label-names errors.

System information

Linux 6.1.84-99.169.amzn2023.x86_64

Software version

Alloy v1.1.0

Configuration

alloy:
  configMap:
    create: true
    content: |-
      prometheus.remote_write "mimir_a" {
        endpoint {
          url = "http://mimir-distributed-gateway/api/v1/push"
          remote_timeout = "1m"
          queue_config {
            capacity = 50000
            retry_on_http_429 = false
            sample_age_limit = "5m"
          }
        }
        wal {
          truncate_frequency = "2h"
          min_keepalive_time = "5m"
          max_keepalive_time = "4h"
        }
      }

      prometheus.remote_write "mimir_b" {
        endpoint {
          url = "http://mimir-distributed-gateway/api/v1/push"
          remote_timeout = "1m"
          queue_config {
            capacity = 50000
            retry_on_http_429 = false
            sample_age_limit = "5m"
          }
        }
        wal {
          truncate_frequency = "2h"
          min_keepalive_time = "5m"
          max_keepalive_time = "4h"
        }
      }


      discovery.ec2 "a" {
        refresh_interval = "5m"
        port = 12345
      }

      discovery.relabel "a_keep_label" {
        targets    = discovery.ec2.a.targets

        rule {
          source_labels = ["__meta_ec2_private_ip"]
          target_label  = "private_ip"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_id"]
          target_label  = "instance_id"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_type"]
          target_label  = "instance_type"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_tag_Name"]
          target_label  = "tag_Name"
          action        = "replace"
        }
      }

      prometheus.scrape "a_scrape_unix" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-unix"
        scrape_interval = "1m"
        scrape_timeout = "50s"
        metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "a_scrape_process" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-process"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.process.process/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "a_scrape_docker" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-docker"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.cadvisor.docker/metrics"
        clustering {
          enabled = true
        }
      }

      discovery.ec2 "b" {
        refresh_interval = "5m"
        port = 12345
      }

      discovery.relabel "b_keep_label" {
        targets    = discovery.ec2.b.targets

        rule {
          source_labels = ["__meta_ec2_private_ip"]
          target_label  = "private_ip"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_id"]
          target_label  = "instance_id"
          action        = "replace"
        }
        
        
        rule {
          source_labels = ["__meta_ec2_instance_type"]
          target_label  = "instance_type"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_tag_Name"]
          target_label  = "tag_Name"
          action        = "replace"
        }
        
      }

      prometheus.scrape "b_scrape_unix" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-unix"
        scrape_interval = "1m"
        scrape_timeout = "50s"
        metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "b_scrape_process" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-process"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.process.process/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "b_scrape_docker" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-docker"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.cadvisor.docker/metrics"
        clustering {
          enabled = true
        }
      }

    name: null
    key: null

  clustering:
    enabled: true



  extraPorts:
    - name: "otlp-grpc"
      port: 4317
      targetPort: 4317
      protocol: "TCP"
    - name: "otlp-http"
      port: 4318
      targetPort: 4318
      protocol: "TCP"


  securityContext: {}

  resources: 
    limits:
      memory: 3Gi
    requests:
      cpu: 700m
      memory: 3Gi

image:
  registry: "docker.io"
  repository: grafana/alloy
  tag: 'v1.1.0'
  digest: null
  pullPolicy: IfNotPresent
  pullSecrets: []

controller:
  type: 'statefulset'

  replicas: 2

  parallelRollout: true

  dnsPolicy: ClusterFirst

  nodeSelector: {
    Environment: production
  }


Logs

ts=2024-06-10 level=error msg="non-recoverable error" component_path=/ component_id=prometheus.remote_write.mimir subcomponent=rw remote_name=40abfa url=http://mimir-distributed-gateway/api/v1/push count=550 exemplarCount=0 err="server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'tag_Name' series: 'node_memory_Cached_bytes{instance=\"192.168.xxx.xxx:12345\", instance_id=\"i-xxxx\", instance_type=\"xxxx\", job=\"a-unix\",  private_i' (err-mimir-duplicate-label-names)"
@itjobs-levi (Author)

This problem occurs when using Alloy clustering mode with 3 replicas.

@thampiotr (Contributor)

This may not be related to the issue (I still want to look into it more deeply), but I've noticed that you are reaching into an internal metrics path of an Alloy exporter with this:

metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"

This is not advised, as it relies on an internal implementation detail. Could you try the supported way, similar to the examples in our documentation?

To be specific, you shouldn't need to set metrics_path; you can just run the exporter and scrape it within one agent instance, like this:

prometheus.exporter.process "example" {
  ... // your config
}

// Configure a prometheus.scrape component to collect process_exporter metrics.
prometheus.scrape "demo" {
  targets    = prometheus.exporter.process.example.targets
  forward_to = [...]
}

@itjobs-levi (Author)

I prefer the Prometheus pull method.
The Alloy agent on each collection target server acts only as an exporter,
and the configuration written above discovers and scrapes those target servers.
To pull, I had no choice but to use the metrics path,
because the target servers are separate EC2 instances.

The guide you provided seems to describe a push-style setup, where metrics are delivered to Mimir directly from each collection target server.

Sorry if I misunderstood. I am pulling from the target servers via the Alloy agent.
(https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push)

@thampiotr (Contributor)

I prefer the prometheus pulling method.

The example I included in my previous comment uses the pull method: only targets are passed to prometheus.scrape, and prometheus.scrape then performs the metrics pulling, so you still have a pull-based metrics pipeline. I'd recommend you try the supported approach, as /api/v0/component/prometheus.exporter.unix.unix/metrics is an internal implementation detail.

BTW, you may also be affected by this issue: #1009 - but there is a simple workaround for it, so try that too :)
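
For completeness, here is a rough sketch of that supported pattern as a full pipeline running in one agent, reusing the remote_write endpoint from the configuration above (the component labels here are placeholders, not names from this thread):

prometheus.exporter.unix "local" { }

// prometheus.scrape still pulls the metrics; the exporter simply lives in the
// same Alloy process instead of being reached via an internal metrics path.
prometheus.scrape "local_unix" {
  targets         = prometheus.exporter.unix.local.targets
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "1m"
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributed-gateway/api/v1/push"
  }
}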

@itjobs-levi (Author)

I may have misunderstood.

However, I would like to ask one more question.

On my collection target EC2 instances (group A and group B), the Alloy agents run in non-cluster mode,
and each exports metrics through the unix, process, and cadvisor exporters.

On a separate EKS cluster, there are two Alloy agent pods (group X) running in cluster mode.

If I want the Alloy agent pods (group X) on EKS to collect the unix, process, and cadvisor metrics from the collection target EC2 instances (groups A and B), do I have any choice other than declaring metrics_path?

After ec2 discovery, I confirmed that prometheus.scrape reads only /metrics, and /metrics only exposes Alloy's own metrics; the unix, process, and cadvisor metrics are not there.

That's why I declared separate scrapes for the unix, process, and cadvisor paths in addition to /metrics.

I understand the method you mentioned to be possible only when the exporter and the scrape are on the same server, within the same Alloy agent process:
prometheus.exporter.process "example" {
  ... // your config
}

// Configure a prometheus.scrape component to collect process_exporter metrics.
prometheus.scrape "demo" {
  targets    = prometheus.exporter.process.example.targets
  forward_to = [...]
}

If I'm wrong or you have different design guidelines, please let me know and I'd really appreciate it.

@itjobs-levi (Author) commented Jun 10, 2024

@thampiotr
discovery.relabel "replace_instance" { targets = discovery.file.targets.targets rule { action = "replace" source_labels = ["instance"] target_label = "instance" replacement = "alloy-cluster" } }

I applied this, and I am no longer getting any errors with 3 replicas.
It has not been long, but it seems to be fixed.

However, I do not understand why the instance label affects clustering
(I checked the PR you provided).

My flow is: ec2 discovery -> relabel -> scrape.
It seems that the instance label is set to the ip:12345 of the collection target discovered by ec2 at scrape time.
I can see how this might affect the hash calculation, but shouldn't the calculation be the same across the 3 Alloy pods in cluster mode if each collection target carries the same instance label?
If you could explain in detail, it would help others understand as well.

Lastly, as you commented above, is there any other way to collect metrics from the collection targets besides using the metrics path as I did, given that the collection targets (non-cluster mode) and the collectors (cluster mode) are installed separately on different EC2 instances?

@thampiotr (Contributor) commented Jun 27, 2024

Thanks for closing this, I'm happy it worked eventually!

However, I do not understand why the instance label affects clustering

I have described this failure mode in more detail in this issue. The instance label would be different between instances, and thus the hashing would be different too, breaking an important assumption in clustering.

For anyone encountering this or a similar problem in the future: check this issue for a workaround and a potential future fix: #1009
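
For reference, a minimal sketch of how that workaround could be folded into the ec2 pipeline from this report, assuming the same component names and labels as the configuration above (illustrative only, not a verified fix):

discovery.relabel "a_keep_label" {
  targets = discovery.ec2.a.targets

  // Existing rules that copy EC2 metadata into labels (private_ip,
  // instance_id, instance_type, tag_Name) stay as they are.
  rule {
    source_labels = ["__meta_ec2_private_ip"]
    target_label  = "private_ip"
    action        = "replace"
  }

  // ... remaining rules from the original configuration ...

  // Workaround from #1009: pin the instance label to a constant so every
  // cluster peer computes the same hash for a given target.
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

Since private_ip and instance_id are already kept by the earlier rules, series should remain distinguishable even though instance is no longer unique per target.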
