[SOLVED] can't specify service_freshness_check_interval for an individual service #468

systemdlete · 2024-06-16T09:38:05Z

systemdlete
Jun 16, 2024

It seems that the service_freshness_check_interval can only be set in the naemon configuration, but not in a service definition. I would like to override it for just one service without impacting other services.

It is hardcoded to 60 seconds, which explains why (after a lot of poking around through the options and docs) I was seeing errors in a service using the freshness option. I could refresh the page every few seconds, and as the duration approached 60s, the check_command would be invoked, even though I have the freshness_threshold set to 600 (maybe something else is wrong also, idk).

I guess this is by design, but I find it a little frustrating that it cannot be handled like other options.

Answered by systemdlete

Jun 28, 2024

Turns out that one of the two programs that generates the passive checks was putting the wrong time for the start_time parameter. I didn't realize that it was doing things differently than the other program. Once I fixed that, everything settled down. Thanks for your help, Daniel.

View full answer

systemdlete · 2024-06-22T04:02:49Z

systemdlete
Jun 22, 2024
Author

I had max attempts set to 1, but now it is set to 2 and the errors disappeared. Since the passive checks come in, steadily, once per minute + a tiny bit more for network overhead, it stands to reason that there would be a failure. The service_freshness_check_interval is set to 60s, which is the default and I have not mucked with. So 60s is not enough time for a check that takes just a tiny bit over 60s to arrive.

This is why it would be nice to have the ability to adjust this parameter only for this one service. Other services are working just fine this way, and I don't really want to have to adjust all of those to accommodate this one service.

I have set the freshness check to 90s to allow enough time for the minutely passive checks to arrive, but not so long that longer delays go unnotified.

What I did not expect is that even though my freshness check was set to 10 minutes previously, the freshness check was being run just after the one minute was up (when max attempts was set to 1). I might be overlooking something here, or just not fully comprehend how freshness checking works (yes, I've read the doc on freshness checks, many times).

Marking this "SORTOF" rather than "SOLVED" because of this.

0 replies

systemdlete · 2024-06-24T12:10:55Z

systemdlete
Jun 24, 2024
Author

Sadly, this problem is not solved, or even sort of solved.

I am seeing messages in the logs for the service as well as in the naemon log that the service is stale... but only for a few seconds, and the service goes back to OK state as prior to these incidents. These occur regularly, every minute.

The passive checks arrive on the naemon host about :00-:02 seconds into the minute. I can see the spool files for the passive check each time for each instance of the service (it runs on more than one of the monitored systems). The status of the service does not change (remains OK state), but then at about :51 seconds into the same minute, the state goes to CRITICAL.

I am also noticing that the naemon.log shows messages like

Warning: The results of service 'mysvc1' on host 'myhost1' are stale by 0d 0h 1m 0s (threshold=0d 0h 5m 30s).  I'm forcing an immediate check of the service.

and, in another instance for another service

Warning: The results of service 'mysvc2' on host 'myhost2' are stale by 0d 0h 0m 6s (threshold=0d 6h 0m 0s).  I'm forcing an immediate check of the service.

In both cases, the services have only been "stale" for a short time, not exceeding the threshold. But the fact is that the services are receiving checks which show up clearly in the logs. Incidentally, these 2 examples are (obfuscated) logs from 2 different thruk installations, and they are not copies.

Again, this could be my lack of full understanding of how it is supposed to work. I might have overlooked something in my configuration.

0 replies

nook24 · 2024-06-24T14:06:00Z

nook24
Jun 24, 2024
Collaborator

I only can give you my ideas and a sort-of work around to this, based on my experiences.

The timestamp you put into the spoolfile, is this timestamp from the passive system? If so you have to make sure that the time on the naemon server and on the remote system are exactly the same. You can use NTP for this.

Having the same time on all systems is relay important and I can't stress this enough.

The global config parameter service_freshness_check_interval=60 determines how frequently Naemon should execute the freshness check. But this is not the value of "when is a check result to old".

You can combine this with the per host / per service parameters such as:

check_freshness - This directive is used to determine whether or not freshness checks are enabled for this host.This directive is used to determine whether or not freshness checks are enabled for this host.
freshness_threshold - This directive is used to specify the freshness threshold (in seconds) for this host. If you set this directive to a value of 0, Naemon will determine a freshness threshold to use automatically.

I have never used freshness_threshold=0 I have to admit.

In your case, you need to increase the value of freshness_threshold, probably to 70.
(60 seconds interval + 2 seconds network + 8 seconds clock shift) - depends on how exact you have to be.

Our default freshness_threshold value for distributed systems is check_interval + 300 seconds which works in most situations for systems around the globe or even in space.

Hope this helps.

2 replies

systemdlete Jun 24, 2024
Author

Thank you for your feedback.

The timestamp in the spoolfile is the time transmitted from the "passive system" as you call it (I call it the remote system). All of my systems are on NTP, and I agree that having inconsistent time across the network creates all kinds of havoc, especially with things like ssl.

I have read the definitions of all the attributes re freshness checking and (I think) I understand them. I've played around with them, trying various things. I have the threshold set to 90s, which is ample time for a minutely service check.

What do you make of those messages I am seeing in the naemon.log above? They seem to be off, to me.

nook24 Jun 24, 2024
Collaborator

I call it the remote system
That's a much better name :)

The calculation looks a bit wild, but has not changed for the last 10 years. So I assume it has no major bugs in it.

naemon-core/src/naemon/checks_service.c

Lines 1368 to 1436 in 22f0fb6

    
           	/* calculate expiration time */ 
        
           	/* 
        
           	 * CHANGED 11/10/05 EG - 
        
           	 * program start is only used in expiration time calculation 
        
           	 * if > last check AND active checks are enabled, so active checks 
        
           	 * can become stale immediately upon program startup 
        
           	 */ 
        
           	/* 
        
           	 * CHANGED 02/25/06 SG - 
        
           	 * passive checks also become stale, so remove dependence on active 
        
           	 * check logic 
        
           	 */ 
        
           	if (temp_service->has_been_checked == FALSE) 
        
           		expiration_time = (time_t)(event_start + freshness_threshold); 
        
           	/* 
        
           	 * CHANGED 06/19/07 EG - 
        
           	 * Per Ton's suggestion (and user requests), only use program start 
        
           	 * time over last check if no specific threshold has been set by user. 
        
           	 * Problems can occur if Nagios is restarted more frequently that 
        
           	 * freshness threshold intervals (services never go stale). 
        
           	 */ 
        
           	/* 
        
           	 * CHANGED 10/07/07 EG: 
        
           	 * Only match next condition for services that 
        
           	 * have active checks enabled... 
        
           	 */ 
        
           	else if (temp_service->checks_enabled == TRUE && event_start > temp_service->last_check && temp_service->freshness_threshold == 0) 
        
           		expiration_time = (time_t)(event_start + freshness_threshold); 
        
           	else 
        
           		expiration_time = (time_t)(temp_service->last_check + freshness_threshold); 
        
           	/* 
        
           	 * If the check was last done passively, we assume it's going 
        
           	 * to continue that way and we need to handle the fact that 
        
           	 * Nagios might have been shut off for quite a long time. If so, 
        
           	 * we mustn't spam freshness notifications but use event_start 
        
           	 * instead of last_check to determine freshness expiration time. 
        
           	 * The threshold for "long time" is determined as 61.8% of the normal 
        
           	 * freshness threshold based on vast heuristical research (ie, "some 
        
           	 * guy once told me the golden ratio is good for loads of stuff"). 
        
           	 */ 
        
           	if (temp_service->check_type == CHECK_TYPE_PASSIVE) { 
        
           		if (temp_service->last_check < event_start && 
        
           		    event_start - last_program_stop > freshness_threshold * 0.618) { 
        
           			expiration_time = event_start + freshness_threshold; 
        
           		} 
        
           	} 
        
           	log_debug_info(DEBUGL_CHECKS, 2, "HBC: %d, PS: %lu, ES: %lu, LC: %lu, CT: %lu, ET: %lu\n", temp_service->has_been_checked, (unsigned long)program_start, (unsigned long)event_start, (unsigned long)temp_service->last_check, (unsigned long)current_time, (unsigned long)expiration_time); 
        
           	/* the results for the last check of this service are stale */ 
        
           	if (expiration_time < current_time) { 
        
           		get_time_breakdown((current_time - expiration_time), &days, &hours, &minutes, &seconds); 
        
           		get_time_breakdown(freshness_threshold, &tdays, &thours, &tminutes, &tseconds); 
        
           		/* log a warning */ 
        
           		if (log_this == TRUE) 
        
           			nm_log(NSLOG_RUNTIME_WARNING, 
        
           			       "Warning: The results of service '%s' on host '%s' are stale by %dd %dh %dm %ds (threshold=%dd %dh %dm %ds).  I'm forcing an immediate check of the service.\n", temp_service->description, temp_service->host_name, days, hours, minutes, seconds, tdays, thours, tminutes, tseconds); 
        
           		log_debug_info(DEBUGL_CHECKS, 1, "Check results for service '%s' on host '%s' are stale by %dd %dh %dm %ds (threshold=%dd %dh %dm %ds).  Forcing an immediate check of the service...\n", temp_service->description, temp_service->host_name, days, hours, minutes, seconds, tdays, thours, tminutes, tseconds); 
        
           		return FALSE; 
        
           	} 
        
           	log_debug_info(DEBUGL_CHECKS, 1, "Check results for service '%s' on host '%s' are fresh.\n", temp_service->description, temp_service->host_name); 
        
           	return TRUE; 
        
           }

I just scraped through this a bit and i think as long as you do not restart the Naemon process, the calculation of expiration_time comes down to

expiration_time = (time_t)(temp_service->last_check + freshness_threshold);

for passive checks.

First of all you should check that active_checks_enabled is set to 0 for your passive services.
Also make sure to generate your spool files correctly, like the fractional part of start_time you have mentioned in here: #469

Also make sure that the spool file for a given services got processed, before you create a new file. Naemon does not have any order while processing the files, so I guess if the first file contains a timestamp of 13:00 and the second file 12:00, Naemon will also set the value for last_check to 12:00.

The line

Warning: The results of service 'mysvc2' on host 'myhost2' are stale by 0d 0h 0m 6s (threshold=0d 6h 0m 0s).  I'm forcing an immediate check of the service.

says that the the last_check result Naemon processed for this services is 6h + 6 seconds and it will execute the freshness check command (defined as check_command in the service.cfg if the passive service)

I would suggest to add some logging to your spool file generator to see when you generate which files, for which service with a readable date time.

Alternatively you can go with the query handler and ignore the spool directory at all

#command run [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;host1;service1;0;This is an example plugin output|PERFDATA.
\nLONG OUTPUT IF ANY\n\0

Make sure to terminate the string with \0
You can take a look at my PHP code I use for this: https://github.com/statusengine/worker/blob/master/src/QueryHandler.php#L99-L103

systemdlete · 2024-06-28T18:46:56Z

systemdlete
Jun 28, 2024
Author

Turns out that one of the two programs that generates the passive checks was putting the wrong time for the start_time parameter. I didn't realize that it was doing things differently than the other program. Once I fixed that, everything settled down. Thanks for your help, Daniel.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SOLVED] can't specify service_freshness_check_interval for an individual service #468

{{title}}

Replies: 4 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[SOLVED] can't specify service_freshness_check_interval for an individual service #468

systemdlete Jun 16, 2024

Replies: 4 comments · 2 replies

systemdlete Jun 22, 2024 Author

systemdlete Jun 24, 2024 Author

nook24 Jun 24, 2024 Collaborator

systemdlete Jun 24, 2024 Author

nook24 Jun 24, 2024 Collaborator

systemdlete Jun 28, 2024 Author

systemdlete
Jun 16, 2024

Replies: 4 comments 2 replies

systemdlete
Jun 22, 2024
Author

systemdlete
Jun 24, 2024
Author

nook24
Jun 24, 2024
Collaborator

systemdlete Jun 24, 2024
Author

nook24 Jun 24, 2024
Collaborator

systemdlete
Jun 28, 2024
Author