
error log level on nominal use case ? #1634

Closed
cyril2maq opened this issue Aug 1, 2024 · 8 comments

@cyril2maq

In production we encountered this log:

LOG.error("Unexpected error: Unable to find observation with token {} for registration {}",

It seems that this can be a nominal use case, when the LWM2M client uses a wrong token or in case of a race condition on removing the Observation.

The ERROR log level is usually a trigger for operators, signalling a service malfunction, which is not necessarily the case here; it should not be used for a potentially nominal use case.

I think it should be decreased to INFO level.

What do you think?

@sbernard31

The ERROR log level is usually a trigger for operators, signalling a service malfunction, which is not necessarily the case here; it should not be used for a potentially nominal use case.

I agree with that.

It seems that this can be a nominal use case, when the LWM2M client uses a wrong token or in case of a race condition on removing the Observation.

I don't know if this is really a "nominal use case".

This is a case where we get a notification that was accepted by the CoAP layer, but when we search for the corresponding observation at the LWM2M level we don't find it.

So theoretically, this is more about a race condition than a wrong token being used, but this should not be so easy for a device to trigger; or maybe there is a bug in how we handle this?
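
To make that concrete, here is a minimal sketch of the code path being described; every class, method, and field name below is a placeholder for illustration, not Leshan's actual internals:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Placeholder types, only here to make the sketch self-contained.
interface Observation {}

interface ObservationStore {
    Observation get(String registrationId, byte[] token);
}

class NotificationReceiver {
    private static final Logger LOG = LoggerFactory.getLogger(NotificationReceiver.class);

    private final ObservationStore observationStore;

    NotificationReceiver(ObservationStore observationStore) {
        this.observationStore = observationStore;
    }

    void onNotification(String registrationId, byte[] token) {
        // The CoAP layer has already matched the token to an active exchange, so a missing
        // observation here usually means it was removed concurrently (expired or replaced
        // registration), not that the client sent an arbitrary bad token.
        Observation observation = observationStore.get(registrationId, token);
        if (observation == null) {
            // Race condition rather than service malfunction: this is the line whose level
            // (ERROR vs WARN vs INFO) is being discussed.
            LOG.warn("Unable to find observation with token {} for registration {}", token, registrationId);
            return;
        }
        // ... normal notification handling ...
    }
}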

Is it something you face often? Did you clearly identify what triggers this?

I think it should be decreased to INFO level.

I understand your point.
But isn't INFO too low for something that should not happen?
I mean, when it happens there is something wrong on the client or server side (I hope it's on the client side).
Maybe a WARN?

@cyril2maq

This is a case where we get a notification that was accepted by the CoAP layer, but when we search for the corresponding observation at the LWM2M level we don't find it.

You're right, I missed that the CoAP layer already validates the token beforehand.

Is it something you face often?

Not often, a few times last month.

Did you clearly identify what triggers this?

In our production this log happened in 2 different race-condition use cases:

  • the notification is sent a few seconds AFTER the registration lifetime has elapsed, and ExpirationListener.registrationExpired is raised around the same time, a few milliseconds afterward.
  • the client sends the NOTIFY (which will trigger the log afterward) right before a new REGISTRATION. Therefore, by the time the notify is handled, the previous registration on which it was based has been removed.

Maybe a WARN?

In our log policy, we do not raise WARN on expected race conditions, but it can clearly be discussed. Anyway, it seems better than ERROR. In that case we would just change the log level trigger for this class.
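
For illustration, "change the log level trigger for this class" could look like the following Logback configuration sketch (assuming Logback as the SLF4J backend; the logger name is a placeholder for whatever class emits this message):

<configuration>
  <!-- existing appenders and root logger kept as-is -->

  <!-- Placeholder logger name: override the level only for the class that emits
       "Unable to find observation with token ...", so this single message no longer
       pages operators. Alternatively, route it to a non-alerting appender with additivity="false". -->
  <logger name="org.eclipse.leshan.server.SomeObservationHandlingClass" level="OFF"/>
</configuration>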

@sbernard31

In our production this log happened in 2 different race-condition use cases

I think in both cases there is something not quite right on the client side. Don't you think?

First, the client doesn't do the registration update in time (it should do that several minutes before the lifetime expires).
Second, this is less clear, but maybe the client should not send a notification just before a new registration? Or at least send a CON message for this notification, wait for the ACK, then send the new registration?

In our log policy, we do not raise WARN on expected race conditions, but it can clearly be discussed. Anyway, it seems better than ERROR. In that case we would just change the log level trigger for this class.

ERROR is clearly too high.
My "fear" is that maybe a INFO will be too low and no nobody care about it and so will not fix the device behavior.
But maybe I'm wrong and if the message is clear enough, it could maybe do the job.

Currently we don't really have a clear guideline for log level usage, see https://github.com/eclipse-leshan/leshan/wiki/Code-&-design-guidelines#code-rules

Maybe the project needs something better than that. Could you (Orange) contribute/help to create a clear log level policy for Leshan together?
I don't know if a reference practice exists about it (https://stackoverflow.com/questions/5817738/how-to-use-log-levels-in-java).
Ideally this policy should define the log level to use for "client bad behavior".

@cyril2maq

I think in both cases there is something not quite right on the client side. Don't you think?

Yes. But race conditions are often tricky to pin down as implementation errors. The client can send its notify while its lifetime is still OK, but the network and SSL handshake can take some time, so the lifetime is over by then.

The Stack Overflow thread you mentioned highlights the ambiguity:

WARN: Any message that might warn us of potential problems, e.g. when a user tried to log in with wrong credentials - which might indicate an attack if that happens often or in short periods of time

This definition clearly goes with the WARN choice in our case.

WARN: the process might be continued, but take extra caution [...] The application can tolerate warning messages, but they should always be justified and examined.

This definition goes more with the INFO choice in our case. It is consistent with this blog post, for instance: https://betterstack.com/community/guides/logging/log-levels-explained/

Clearly, in our project we settled on this second definition: WARN logs and above will trigger some human analysis. And I don't want to check what happened every time this race condition occurs 😅.

But if you choose this definition, it means that most libraries should have very few WARN logs; it is usually up to the application level to know there is something unexpected that a human should oversee...
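
As a sketch of that second policy (again assuming Logback, with placeholder appender names), the appender wired to human alerting can be gated on WARN with a ThresholdFilter, so INFO-level race-condition messages stay in the logs without triggering analysis:

<appender name="ALERTS" class="ch.qos.logback.core.ConsoleAppender">
  <!-- Only WARN and above pass through to whatever destination is wired to human alerting. -->
  <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
    <level>WARN</level>
  </filter>
  <encoder>
    <pattern>%d{ISO8601} %-5level %logger{36} - %msg%n</pattern>
  </encoder>
</appender>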

@sbernard31

Yes. But race conditions are often tricky to pin down as implementation errors. The client can send its notify while its lifetime is still OK, but the network and SSL handshake can take some time, so the lifetime is over by then.

I agree, but a client should take this into account, so this should happen very, very rarely.
See:

/**
* Return the longest possible communication period based on given lifetime for the given server.
* <p>
* EndpointsManager should be aware about the kind of transport layer used to contact this server. So it should be
* able to calculate the communication period to avoid that registration expires taking into account specific
* transport layer latency (like DTLS Handshake, CoAP retransmission ...)
*
* @return the longest possible communication period based on given lifetime for the given server.
*/
long getMaxCommunicationPeriodFor(LwM2mServer server, long lifetimeInSeconds);
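
To illustrate the intent, here is a sketch of how a client-side registration engine could use that period; everything except the getMaxCommunicationPeriodFor signature quoted above is a placeholder, and the returned value is assumed to be in milliseconds:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Placeholder types so the sketch is self-contained.
interface LwM2mServer {}

interface EndpointsManager {
    // Signature taken from the Javadoc quoted above.
    long getMaxCommunicationPeriodFor(LwM2mServer server, long lifetimeInSeconds);
}

class RegistrationUpdateScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    void scheduleUpdate(EndpointsManager endpointsManager, LwM2mServer server, long lifetimeInSeconds,
            Runnable sendUpdate) {
        // Schedule the registration update using the transport-aware period instead of the raw
        // lifetime, leaving room for DTLS handshakes and CoAP retransmissions before expiry.
        long periodMs = endpointsManager.getMaxCommunicationPeriodFor(server, lifetimeInSeconds);
        executor.schedule(sendUpdate, periodMs, TimeUnit.MILLISECONDS);
    }
}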

@sbernard31

Once I finish what I'm working on I will move this log to WARN.

Then, if wanted, we can try to create a log level usage policy together and then adapt this log again (maybe to INFO if needed).

@cyril2maq

Thanks, LGTM.

@sbernard31

Once I finish what I'm working on I will move this log to WARN.

Done in 3d03a8e

Then, if wanted, we can try to create a log level usage policy together and then adapt this log again (maybe to INFO if needed).

I created a dedicated issue about that: #1641

I'm closing this issue, but feel free to reopen or comment if needed.
