
When scraping in Alloy clustering mode with 3 or more replicas, a duplicate label error occurs #1006

Closed

itjobs-levi opened this issue Jun 10, 2024 · 7 comments
Labels: bug, frozen-due-to-age

@itjobs-levi commented Jun 10, 2024

What's wrong?

I am running in an AWS environment, where an Alloy agent on each of roughly 100 servers exports metrics through the unix, process, and cadvisor exporters.

The Alloy v1.1.0 agents in Kubernetes perform ec2 discovery -> relabel (to keep EC2 tags) -> scrape -> Mimir remote write.

  • unix scrape interval: 1 minute
  • process scrape interval: 5 minutes
  • cadvisor (Docker) scrape interval: 5 minutes

The Alloy agents, deployed as a StatefulSet in Kubernetes, collect without any errors at a replica count of 2.
However, when the replica count becomes 3, duplicate label errors start to occur on the last replica pod.
The metrics still appear to be collected, but the error is logged constantly and looks like a serious problem to me.

Reference Link
#784

Steps to reproduce

  1. Install the Alloy agent on the EC2 application servers and export unix and process metrics.
  2. Deploy Alloy and mimir-distributed on EKS via their Helm charts as StatefulSets.
  3. Configure Alloy (ec2 discovery -> relabel -> scrape with clustering -> Mimir remote write) with 2 Alloy replicas.
  4. Change the Alloy replicas from 2 to 3 (this is where the problem appears).
  5. After a while, the last Alloy pod that was added starts logging many err-mimir-duplicate-label-names errors.

System information

Linux 6.1.84-99.169.amzn2023.x86_64

Software version

Alloy v1.1.0

Configuration

alloy:
  configMap:
    create: true
    content: |-
      prometheus.remote_write "mimir_a" {
        endpoint {
          url = "http://mimir-distributed-gateway/api/v1/push"
          remote_timeout = "1m"
          queue_config {
            capacity = 50000
            retry_on_http_429 = false
            sample_age_limit = "5m"
          }
        }
        wal {
          truncate_frequency = "2h"
          min_keepalive_time = "5m"
          max_keepalive_time = "4h"
        }
      }

      prometheus.remote_write "mimir_b" {
        endpoint {
          url = "http://mimir-distributed-gateway/api/v1/push"
          remote_timeout = "1m"
          queue_config {
            capacity = 50000
            retry_on_http_429 = false
            sample_age_limit = "5m"
          }
        }
        wal {
          truncate_frequency = "2h"
          min_keepalive_time = "5m"
          max_keepalive_time = "4h"
        }
      }


      discovery.ec2 "a" {
        refresh_interval = "5m"
        port = 12345
      }

      discovery.relabel "a_keep_label" {
        targets    = discovery.ec2.a.targets

        rule {
          source_labels = ["__meta_ec2_private_ip"]
          target_label  = "private_ip"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_id"]
          target_label  = "instance_id"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_type"]
          target_label  = "instance_type"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_tag_Name"]
          target_label  = "tag_Name"
          action        = "replace"
        }
      }

      prometheus.scrape "a_scrape_unix" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-unix"
        scrape_interval = "1m"
        scrape_timeout = "50s"
        metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "a_scrape_process" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-process"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.process.process/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "a_scrape_docker" {
        targets    = discovery.relabel.a_keep_label.output
        forward_to = [prometheus.remote_write.mimir_a.receiver]
        job_name   = "a-docker"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.cadvisor.docker/metrics"
        clustering {
          enabled = true
        }
      }

      discovery.ec2 "b" {
        refresh_interval = "5m"
        port = 12345
      }

      discovery.relabel "b_keep_label" {
        targets    = discovery.ec2.b.targets

        rule {
          source_labels = ["__meta_ec2_private_ip"]
          target_label  = "private_ip"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_instance_id"]
          target_label  = "instance_id"
          action        = "replace"
        }
        
        
        rule {
          source_labels = ["__meta_ec2_instance_type"]
          target_label  = "instance_type"
          action        = "replace"
        }
        
        rule {
          source_labels = ["__meta_ec2_tag_Name"]
          target_label  = "tag_Name"
          action        = "replace"
        }
        
      }

      prometheus.scrape "b_scrape_unix" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-unix"
        scrape_interval = "1m"
        scrape_timeout = "50s"
        metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "b_scrape_process" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-process"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.process.process/metrics"
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "b_scrape_docker" {
        targets    = discovery.relabel.b_keep_label.output
        forward_to = [prometheus.remote_write.mimir_b.receiver]
        job_name   = "b-docker"
        scrape_interval = "5m"
        scrape_timeout = "4m"
        metrics_path = "/api/v0/component/prometheus.exporter.cadvisor.docker/metrics"
        clustering {
          enabled = true
        }
      }

    name: null
    key: null

  clustering:
    enabled: true



  extraPorts:
    - name: "otlp-grpc"
      port: 4317
      targetPort: 4317
      protocol: "TCP"
    - name: "otlp-http"
      port: 4318
      targetPort: 4318
      protocol: "TCP"


  securityContext: {}

  resources: 
    limits:
      memory: 3Gi
    requests:
      cpu: 700m
      memory: 3Gi

image:
  registry: "docker.io"
  repository: grafana/alloy
  tag: 'v1.1.0'
  digest: null
  pullPolicy: IfNotPresent
  pullSecrets: []

controller:
  type: 'statefulset'

  replicas: 2

  parallelRollout: true

  dnsPolicy: ClusterFirst

  nodeSelector: {
    Environment: production
  }


Logs

ts=2024-06-10 level=error msg="non-recoverable error" component_path=/ component_id=prometheus.remote_write.mimir subcomponent=rw remote_name=40abfa url=http://mimir-distributed-gateway/api/v1/push count=550 exemplarCount=0 err="server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'tag_Name' series: 'node_memory_Cached_bytes{instance=\"192.168.xxx.xxx:12345\", instance_id=\"i-xxxx\", instance_type=\"xxxx\", job=\"a-unix\",  private_i' (err-mimir-duplicate-label-names)"
@itjobs-levi (Author)

This problem occurs when using Alloy clustering mode with 3 replicas.

@thampiotr (Contributor)

This may not be related to the issue (I still want to look into it more deeply), but I've noticed that you are reaching into an internal metrics path of an Alloy exporter with this:

metrics_path = "/api/v0/component/prometheus.exporter.unix.unix/metrics"

This is not advised, as it relies on an internal implementation detail. Could you try the supported way, similar to the examples in our documentation?

To be specific, you shouldn't need to set metrics_path; you can just run the exporter and scrape it within one agent instance, like this:

prometheus.exporter.process "example" {
  ... // your config
}

// Configure a prometheus.scrape component to collect process_exporter metrics.
prometheus.scrape "demo" {
  targets    = prometheus.exporter.process.example.targets
  forward_to = [...]
}

@itjobs-levi (Author)

I prefer the Prometheus pull method.
The Alloy agent on each collection target server acts only as an exporter,
and the configuration written above discovers and scrapes those target servers.
To pull, I had no choice but to use the metrics path,
because the target servers are separate EC2 instances.

The guide you provided seems to describe a push-style setup, where metrics are delivered to Mimir directly from each collection target server.

Sorry if I misunderstood. I am pulling from the target servers via the Alloy agent.
(https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push)

@thampiotr (Contributor)

I prefer the prometheus pulling method.

The example I included in my previous comment uses the pull method: only targets are passed to prometheus.scrape, and prometheus.scrape then performs the metrics pulling, so you still have a pull-based metrics pipeline. I'd recommend you try the supported approach, as /api/v0/component/prometheus.exporter.unix.unix/metrics is an internal implementation detail.

BTW, you may also be affected by this issue: #1009 - but there is a simple workaround for it, so try that too :)
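
For completeness, here is a rough sketch of that supported pattern as a full pipeline running in one agent, reusing the remote_write endpoint from the configuration above (the component labels here are placeholders, not names from this thread):

prometheus.exporter.unix "local" { }

// prometheus.scrape still pulls the metrics; the exporter simply lives in the
// same Alloy process instead of being reached via an internal metrics path.
prometheus.scrape "local_unix" {
  targets         = prometheus.exporter.unix.local.targets
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "1m"
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributed-gateway/api/v1/push"
  }
}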

@itjobs-levi (Author)

I may have misunderstood.

However, I would like to ask one more question.

On my collection target EC2 instances (group A and group B), the Alloy agents run in non-cluster mode,
and each exports metrics through the unix, process, and cadvisor exporters.

On a separate EKS cluster, there are two Alloy agent pods (group X) running in cluster mode.

If I want the Alloy agent pods (group X) on EKS to collect the unix, process, and cadvisor metrics from the collection target EC2 instances (groups A and B), do I have any choice other than declaring metrics_path?

After ec2 discovery, I confirmed that prometheus.scrape reads only /metrics, and /metrics only exposes Alloy's own metrics; the unix, process, and cadvisor metrics are not there.

That's why I declared separate scrapes for the unix, process, and cadvisor paths in addition to /metrics.

I understand the method you mentioned to be possible only when the exporter and the scrape are on the same server, within the same Alloy agent process:
prometheus.exporter.process "example" {
  ... // your config
}

// Configure a prometheus.scrape component to collect process_exporter metrics.
prometheus.scrape "demo" {
  targets    = prometheus.exporter.process.example.targets
  forward_to = [...]
}

If I'm wrong or you have different design guidelines, please let me know and I'd really appreciate it.

@itjobs-levi (Author) commented Jun 10, 2024

@thampiotr
discovery.relabel "replace_instance" { targets = discovery.file.targets.targets rule { action = "replace" source_labels = ["instance"] target_label = "instance" replacement = "alloy-cluster" } }

I applied this, and I am no longer getting any errors with 3 replicas.
It has not been long, but it seems to be fixed.

However, I do not understand why the instance label affects clustering
(I checked the PR you provided).

My flow is: ec2 discovery -> relabel -> scrape.
It seems that the instance label is set to the ip:12345 of the collection target discovered by ec2 at scrape time.
I can see how this might affect the hash calculation, but shouldn't the calculation be the same across the 3 Alloy pods in cluster mode if each collection target carries the same instance label?
If you could explain in detail, it would help others understand as well.

Lastly, as you commented above, is there any other way to collect metrics from the collection targets besides using the metrics path as I did, given that the collection targets (non-cluster mode) and the collectors (cluster mode) are installed separately on different EC2 instances?

@thampiotr (Contributor) commented Jun 27, 2024

Thanks for closing this, I'm happy it worked eventually!

However, I do not understand why the instance label affects clustering

I have described this failure mode in more detail in this issue. The instance label would be different between instances, and thus the hashing would be different too, breaking an important assumption in clustering.

For anyone encountering this or a similar problem in the future: check this issue for a workaround and a potential future fix: #1009
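
For reference, a minimal sketch of how that workaround could be folded into the ec2 pipeline from this report, assuming the same component names and labels as the configuration above (illustrative only, not a verified fix):

discovery.relabel "a_keep_label" {
  targets = discovery.ec2.a.targets

  // Existing rules that copy EC2 metadata into labels (private_ip,
  // instance_id, instance_type, tag_Name) stay as they are.
  rule {
    source_labels = ["__meta_ec2_private_ip"]
    target_label  = "private_ip"
    action        = "replace"
  }

  // ... remaining rules from the original configuration ...

  // Workaround from #1009: pin the instance label to a constant so every
  // cluster peer computes the same hash for a given target.
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

Since private_ip and instance_id are already kept by the earlier rules, series should remain distinguishable even though instance is no longer unique per target.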
