[SOLVED] can't specify service_freshness_check_interval for an individual service #468
-
It seems that service_freshness_check_interval is hardcoded to 60 seconds, which explains why (after a lot of poking around through the options and docs) I was seeing errors in a service using the freshness option. I could refresh the page every few seconds, and as the elapsed time approached 60s the check_command would be invoked, even though I have freshness_threshold set to 600 (maybe something else is wrong also, idk). I guess this is by design, but I find it a little frustrating that it cannot be set per service like other options.
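For context, a minimal sketch of the kind of service definition being discussed, assuming a purely passive service (host, contact, and command names are invented; check_dummy stands in for whatever command should run when the result goes stale):

```
define service {
    host_name               somehost            ; illustrative
    service_description     passive_example     ; illustrative
    check_command           check_dummy!2!"result is stale"
    active_checks_enabled   0                   ; results arrive only passively
    passive_checks_enabled  1
    check_freshness         1                   ; enable freshness checking
    freshness_threshold     600                 ; seconds, as set above
    max_check_attempts      1
    check_period            24x7
    notification_period     24x7
    contacts                someadmin           ; illustrative
}
```

When a result goes stale, naemon runs the service's check_command as an active check, which is why the check_command was being invoked here.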
-
I had max_check_attempts set to 1, but now it is set to 2 and the errors disappeared. Since the passive checks come in steadily, once per minute plus a tiny bit more for network overhead, it stands to reason that there would be a failure: service_freshness_check_interval is set to 60s (the default, which I have not touched), and 60s is not enough time for a result that takes just a tiny bit over 60s to arrive. This is why it would be nice to be able to adjust this parameter for just this one service. Other services are working just fine as they are, and I don't really want to adjust all of them to accommodate this one. I have set the freshness threshold to 90s, long enough for the minutely passive checks to arrive, but not so long that longer delays go unnoticed. What I did not expect is that even though my freshness threshold was previously set to 10 minutes, the freshness check was being run just after the one minute was up (when max_check_attempts was set to 1). I might be overlooking something here, or just not fully comprehend how freshness checking works (yes, I've read the doc on freshness checks, many times). Marking this "SORTOF" rather than "SOLVED" because of this.
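The two adjustments described above, relative to the sketch in the first post (excerpt only; the other directives stay the same, and the SOFT-state comment reflects my understanding of the behavior described):

```
define service {
    ; ...same directives as in the earlier sketch, with these two changed:
    freshness_threshold     90    ; one-minute passive interval plus slack for delivery
    max_check_attempts      2     ; a single stale result is now a SOFT state,
                                  ; so the next on-time passive result can clear it
}
```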
-
Sadly, this problem is not solved, or even sort of solved. I am seeing messages in the logs for the service, as well as in the naemon log, that the service is stale... but only for a few seconds, and then the service goes back to the OK state it was in prior to these incidents. These incidents occur regularly, every minute. The passive checks arrive on the naemon host about :00-:02 seconds into the minute. I can see the spool files for the passive check each time, for each instance of the service (it runs on more than one of the monitored systems). The status of the service does not change (it remains in the OK state), but then at about :51 seconds into the same minute the state goes to CRITICAL. I am also noticing that the naemon.log shows messages like
and, in another instance for another service
In both cases, the services have only been "stale" for a short time, not exceeding the threshold. But the fact is that the services are receiving check results, which show up clearly in the logs. Incidentally, these two examples are (obfuscated) logs from two different Thruk installations, and they are not copies. Again, this could be my lack of full understanding of how it is supposed to work; I might have overlooked something in my configuration.
-
I can only give you my ideas and a sort-of workaround, based on my experience. The timestamp you put into the spoolfile, is this timestamp from the passive system? If so, you have to make sure that the time on the naemon server and on the remote system are exactly the same; you can use NTP for this. Having the same time on all systems is really important, and I can't stress this enough. The global config parameter service_freshness_check_interval controls how often naemon scans for stale services. You can combine this with the per-host / per-service parameters such as check_freshness and freshness_threshold.
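A sketch of how these settings fit together, using the standard Naemon option names (values here are illustrative, not the original poster's):

```
# naemon.cfg -- global freshness settings
check_service_freshness=1              # master switch for service freshness checking
service_freshness_check_interval=60    # seconds between naemon's scans for stale results

# and per service, inside the service definition:
#   check_freshness      1      ; enable freshness checking for this service
#   freshness_threshold  360    ; how old a result may be before it counts as stale
```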
I have never used service_freshness_check_interval myself. In your case, you need to increase the value of freshness_threshold. Our default freshness_threshold value for distributed systems is check_interval + 300 seconds, which works in most situations for systems around the globe or even in space. Hope this helps.
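As a worked example of that rule of thumb (assuming the default interval_length of 60 seconds, so check_interval 1 means one minute):

```
# rule of thumb: freshness_threshold = check_interval (in seconds) + 300
#
# for a result expected once per minute:
#   check_interval       1          -> 1 * interval_length (60s) = 60s
#   freshness_threshold  60 + 300   -> 360 seconds
```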
-
Turns out that one of the two programs that generate the passive checks was putting the wrong time in the start_time parameter. I didn't realize it was doing things differently from the other program. Once I fixed that, everything settled down. Thanks for your help, Daniel.
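For anyone hitting the same thing: this is roughly what a service checkresult spool file looks like. The field names follow the standard Naemon/Nagios checkresult format; host and service names are invented. The crucial part is that start_time (and finish_time) must be real epoch timestamps from when the check actually ran:

```
### Passive Service Check Result ###
# dropped into naemon's checkresults directory; an empty companion
# file with a .ok suffix tells naemon the file is complete
file_time=1700000000

host_name=somehost
service_description=passive_example
check_type=1
check_options=0
scheduled_check=0
latency=0.0
start_time=1700000000.000000
finish_time=1700000000.200000
early_timeout=0
exited_ok=1
return_code=0
output=OK - everything fine
```

A start_time far in the past (or from an unsynchronized clock) makes the result look old the moment it arrives, which matches the stale-then-OK flapping described earlier in this thread.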