TestLogIngestionFleetManaged/Monitoring_logs_are_shipped is flaky #3741

Open

cmacknz opened this issue Nov 10, 2023 · 17 comments · Fixed by #3765 or #4632

Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Member

cmacknz commented Nov 10, 2023

Flaky Test

Stack Trace

=== RUN   TestLogIngestionFleetManaged/Monitoring_logs_are_shipped
    logs_ingestion_test.go:106: Making sure metricbeat logs are populated
    logs_ingestion_test.go:110: metricbeat: Got 0 documents
    logs_ingestion_test.go:111: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/logs_ingestion_test.go:111
        	            				/home/ubuntu/agent/testing/integration/logs_ingestion_test.go:90
        	Error:      	Should not be zero, but was 0
        	Test:       	TestLogIngestionFleetManaged/Monitoring_logs_are_shipped
--- FAIL: TestLogIngestionFleetManaged/Monitoring_logs_are_shipped (15.19s)
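
For context, the check failing here is essentially a non-zero assertion on the number of monitoring documents returned from Elasticsearch. A minimal sketch of that pattern, assuming an illustrative query helper and data stream pattern rather than the actual test code:

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// assertMonitoringLogsShipped sketches the failing check: query Elasticsearch
// for the component's monitoring logs and require a non-zero hit count.
// findDocs stands in for whatever query helper the real test uses, and the
// data stream pattern is an assumption.
func assertMonitoringLogsShipped(t *testing.T, findDocs func(pattern string) (int, error)) {
	t.Log("Making sure metricbeat logs are populated")

	count, err := findDocs("logs-elastic_agent.metricbeat-*")
	require.NoError(t, err)

	t.Logf("metricbeat: Got %d documents", count)
	require.NotZero(t, count, "expected metricbeat monitoring logs to be ingested")
}
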
cmacknz added the Team:Elastic-Agent and flaky-test labels on Nov 10, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member Author

cmacknz commented Nov 10, 2023

I believe this new flakiness was introduced in #3465.

@cmacknz
Member Author

cmacknz commented Nov 10, 2023

Reverting the change while we investigate the failure: #3743

@rdner
Member

rdner commented Mar 27, 2024

Another failure https://buildkite.com/elastic/elastic-agent/builds/7991#018e803c-05c8-4470-9091-0013cd21917f

This time it's in the new code introduced by the fix #3765

rdner reopened this on Mar 27, 2024
@cmacknz
Member Author

cmacknz commented Mar 27, 2024

Looking at the logs, it appears the filestream-monitoring instance either never connected to Elasticsearch or legitimately didn't publish events: there are no acked events or output stats, and the number of published events equals the number of active events in the pipeline.

{
  "log.level": "info",
  "@timestamp": "2024-03-27T14:50:08.758Z",
  "message": "Non-zero metrics in the last 30s",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-monitoring"
  },
  "log.logger": "monitoring",
  "log.origin": {
    "file.line": 187,
    "file.name": "log/log.go",
    "function": "github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"
  },
  "service.name": "filebeat",
  "monitoring": {
    "ecs.version": "1.6.0",
    "metrics": {
      "beat": {
        "cpu": {
          "system": {
            "ticks": 250,
            "time": {
              "ms": 47
            }
          },
          "total": {
            "ticks": 609,
            "time": {
              "ms": 125
            },
            "value": 609
          },
          "user": {
            "ticks": 359,
            "time": {
              "ms": 78
            }
          }
        },
        "info": {
          "ephemeral_id": "d50a0fbb-c0b8-4697-8a6f-bd24a1146e0d",
          "uptime": {
            "ms": 210130
          },
          "version": "8.14.0"
        },
        "memstats": {
          "gc_next": 41174472,
          "memory_alloc": 23165520,
          "memory_total": 77048088,
          "rss": 82186240
        },
        "runtime": {
          "goroutines": 61
        }
      },
      "filebeat": {
        "events": {
          "active": 50,
          "added": 53,
          "done": 3
        },
        "harvester": {
          "open_files": 4,
          "running": 4
        }
      },
      "libbeat": {
        "config": {
          "module": {
            "running": 1
          }
        },
        "output": {
          "events": {
            "active": 0
          },
          "write": {
            "latency": {
              "histogram": {
                "count": 1,
                "max": 1483,
                "mean": 1483,
                "median": 1483,
                "min": 1483,
                "p75": 1483,
                "p95": 1483,
                "p99": 1483,
                "p999": 1483,
                "stddev": 0
              }
            }
          }
        },
        "pipeline": {
          "clients": 4,
          "events": {
            "active": 44,
            "filtered": 9,
            "published": 44,
            "total": 53
          }
        }
      },
      "registrar": {
        "states": {
          "current": 0
        }
      },
      "system": {
        "handles": {
          "open": -2
        }
      }
    }
  },
  "ecs.version": "1.6.0"
}

@cmacknz
Member Author

cmacknz commented Mar 27, 2024

Never mind, there was an earlier point where the writes happened:

{
  "log.level": "info",
  "@timestamp": "2024-03-27T14:47:08.765Z",
  "message": "Non-zero metrics in the last 30s",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-monitoring"
  },
  "service.name": "filebeat",
  "monitoring": {
    "ecs.version": "1.6.0",
    "metrics": {
      "beat": {
        "cpu": {
          "system": {
            "ticks": 187,
            "time": {
              "ms": 187
            }
          },
          "total": {
            "ticks": 421,
            "time": {
              "ms": 421
            },
            "value": 421
          },
          "user": {
            "ticks": 234,
            "time": {
              "ms": 234
            }
          }
        },
        "info": {
          "ephemeral_id": "d50a0fbb-c0b8-4697-8a6f-bd24a1146e0d",
          "name": "filebeat",
          "uptime": {
            "ms": 30135
          },
          "version": "8.14.0"
        },
        "memstats": {
          "gc_next": 41157728,
          "memory_alloc": 20457360,
          "memory_sys": 49289496,
          "memory_total": 70755320,
          "rss": 92114944
        },
        "runtime": {
          "goroutines": 61
        }
      },
      "filebeat": {
        "events": {
          "active": 0,
          "added": 216,
          "done": 216
        },
        "harvester": {
          "open_files": 4,
          "running": 4,
          "started": 4
        }
      },
      "libbeat": {
        "config": {
          "module": {
            "running": 1,
            "starts": 1
          }
        },
        "output": {
          "events": {
            "acked": 57,
            "active": 0,
            "batches": 1,
            "total": 57
          },
          "read": {
            "bytes": 4344,
            "errors": 1
          },
          "type": "elasticsearch",
          "write": {
            "bytes": 12790,
            "latency": {
              "histogram": {
                "count": 1,
                "max": 1483,
                "mean": 1483,
                "median": 1483,
                "min": 1483,
                "p75": 1483,
                "p95": 1483,
                "p99": 1483,
                "p999": 1483,
                "stddev": 0
              }
            }
          }
        },
        "pipeline": {
          "clients": 4,
          "events": {
            "active": 0,
            "filtered": 159,
            "published": 57,
            "retry": 57,
            "total": 216
          },
          "queue": {
            "acked": 57,
            "max_events": 3200
          }
        }
      },
      "registrar": {
        "states": {
          "current": 0
        }
      },
      "system": {
        "cpu": {
          "cores": 4
        },
        "handles": {
          "open": 330
        }
      }
    }
  },
  "log.logger": "monitoring",
  "log.origin": {
    "file.line": 187,
    "file.name": "log/log.go",
    "function": "github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"
  },
  "ecs.version": "1.6.0"
}

@rdner
Member

rdner commented Apr 5, 2024

@rdner
Member

rdner commented Apr 8, 2024

@belimawr
Contributor

belimawr commented Apr 8, 2024

That's pretty odd: in both cases the logs stop before the Beats finish starting up, and there are no shutdown or failure logs. It looks like the Elastic-Agent process is being killed abruptly. I'll try to reproduce it.

@rdner
Member

rdner commented Apr 13, 2024

@rdner
Member

rdner commented Apr 29, 2024

Lots of failures in https://buildkite.com/elastic/elastic-agent/builds/8549#018f2728-f289-4c37-9d0b-c922580c3e80

 logs_ingestion_test.go:290: 
        	Error Trace:	/home/rhel/agent/testing/integration/logs_ingestion_test.go:290
        	            				/home/rhel/agent/testing/integration/logs_ingestion_test.go:231
        	Error:      	Should be empty, but was [{.ds-logs-elastic_agent.metricbeat-nd1rpyfwjgb7tjcrv4h9ujuvtwsqt2tinojxhcr1om-2024.04.29-000001 3.4849067 map[@timestamp:2024-04-29T00:37:13.216Z agent:map[ephemeral_id:a6e52050-f4d1-4e40-90c0-79b3b321c043 id:c16bca15-aad4-46a9-9030-784103d8b2ba name:ogc-linux-amd64-rhel-8-rpm-ac1f type:filebeat version:8.14.0] cloud:map[account:map[id:elastic-platform-ingest] availability_zone:us-central1-a instance:map[id:1962762493583644331 name:ogc-linux-amd64-rhel-8-rpm-ac1f] machine:map[type:e2-standard-2] project:map[id:elastic-platform-ingest] provider:gcp region:us-central1 service:map[name:GCE]] component:map[binary:metricbeat dataset:elastic_agent.metricbeat id:system/metrics-default type:system/metrics] data_stream:map[dataset:elastic_agent.metricbeat namespace:nd1rpyfwjgb7tjcrv4h9ujuvtwsqt2tinojxhcr1om type:logs] ecs:map[version:8.0.0] elastic_agent:map[id:c16bca15-aad4-46a9-9030-784103d8b2ba snapshot:true version:8.14.0] event:map[agent_id_status:verified dataset:elastic_agent.metricbeat ingested:2024-04-29T00:37:29Z] host:map[architecture:x86_64 containerized:false hostname:ogc-linux-amd64-rhel-8-rpm-ac1f id:fafceb6b665133ff13f7c253592ca0c6 ip:[10.128.0.148 fe80::b7a7:5439:a61b:ba0f] mac:[42-01-0A-80-00-94] name:ogc-linux-amd64-rhel-8-rpm-ac1f os:map[codename:Ootpa family:redhat kernel:4.18.0-513.24.1.el8_9.x86_64 name:Red Hat Enterprise Linux platform:rhel type:linux version:8.9 (Ootpa)]] input:map[type:filestream] log:map[file:map[device_id:2050 inode:84110223 path:/var/lib/elastic-agent/data/elastic-agent-8.14.0-SNAPSHOT-5503e4/logs/elastic-agent-20240429.ndjson] offset:15675 source:system/metrics-default] log.level:error log.logger:add_cloud_metadata log.origin:map[file.line:190 file.name:add_cloud_metadata/providers.go function:github.com/elastic/beats/v7/libbeat/processors/add_cloud_metadata.(*addCloudMetadata).fetchMetadata] message:add_cloud_metadata: received error failed fetching EC2 Identity Document: not found, Signing service.name:metricbeat]}]
        	Test:       	TestRpmLogIngestFleetManaged/Monitoring_logs_are_shipped

@belimawr
Contributor

This seems to be a new error from the add_cloud_metadata processor:

add_cloud_metadata: received error failed fetching EC2 Identity Document: not found, Signing

I created a PR to not consider it an error any more: #4632
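
For reference, "not consider it an error" here means treating that specific message as benign when the test scans error-level logs. A rough sketch of the idea, with an illustrative helper and message list rather than the actual change in #4632:

package integration

import "strings"

// knownBenignErrors lists error-log messages the test should tolerate.
// The entries and the helper below are illustrative, not the exact #4632 change.
var knownBenignErrors = []string{
	"add_cloud_metadata: received error failed fetching EC2 Identity Document",
}

// filterBenign drops messages matching a known-benign substring and returns
// only the messages that should still fail the test.
func filterBenign(messages []string) []string {
	var remaining []string
	for _, msg := range messages {
		benign := false
		for _, known := range knownBenignErrors {
			if strings.Contains(msg, known) {
				benign = true
				break
			}
		}
		if !benign {
			remaining = append(remaining, msg)
		}
	}
	return remaining
}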

@rdner
Member

rdner commented Apr 30, 2024

@belimawr the team that maintains the processor suggested we add this to our CI scripts:

export BEATS_ADD_CLOUD_METADATA_PROVIDERS="gcp"

so that the processor does not get confused. However, this would most likely require another change in the test as well.

They also told me the message

add_cloud_metadata: received error failed fetching EC2 Identity Document: not found, Signing

is safe to ignore.

@ycombinator
Contributor

This failure is still happening, e.g. https://buildkite.com/elastic/elastic-agent-extended-testing/builds/2544#0191c7da-22fd-4113-aa2f-e64740ff7ebb (from #5450). Reopening...

@cmacknz
Member Author

cmacknz commented Sep 6, 2024

The most recent failure is Failed to connect to backoff(elasticsearch(http://127.0.0.1:9200)): Get "http://127.0.0.1:9200": dial tcp 127.0.0.1:9200: connect: connection refused which seems like a legitimate problem.

We should never be trying to connect to localhost.

@cmacknz
Member Author

cmacknz commented Sep 6, 2024

We can add that error to the ignore list as a quick way to mitigate this.
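
Continuing the illustrative sketch above, adding this to the ignore list would just be another entry in the hypothetical knownBenignErrors slice:

// Illustrative only: extend the hypothetical list from the earlier sketch
// with the localhost connection-refused message.
var knownBenignErrors = []string{
	"add_cloud_metadata: received error failed fetching EC2 Identity Document",
	"Failed to connect to backoff(elasticsearch(http://127.0.0.1:9200))",
}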

@belimawr
Contributor

belimawr commented Sep 9, 2024

This failure is still happening, e.g. https://buildkite.com/elastic/elastic-agent-extended-testing/builds/2544#0191c7da-22fd-4113-aa2f-e64740ff7ebb (from #5450). Reopening...

Looking at #5450, it does not contain the fix commit 6695324, so I believe we can close this issue.

The most recent failure is Failed to connect to backoff(elasticsearch(http://127.0.0.1:9200)): Get "http://127.0.0.1:9200": dial tcp 127.0.0.1:9200: connect: connection refused which seems like a legitimate problem.

We should never be trying to connect to localhost.

It happens on the deb version. I didn't dig into it, but I imagine the installation process ends up starting the agent with an empty/default configuration. Anyway, I agree this should not happen: if the agent is started with the correct configuration, Beats should never try to connect to the default ES configuration.
