Current Behavior
When Kong loses connectivity to the database, the first couple of requests to /status and /metrics take too long to return. Meanwhile, Kubernetes uses /status to probe the readiness/liveness status, and since the probe timed out, it assumed the pod was unhealthy.
Until Kubernetes decided to restart the container, Kong was proxying requests without a problem.
After looking at kong.api.routes.health.lua we realized that the kong.db:connect() call used to test connectivity is executed both synchronously and without an explicit timeout, leading to 5+ second latency on the first requests.
The sequence of events was:
Several "harmless" entries of:
[error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802"
[error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing /metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"
Followed by a surprise SIGQUIT:
2024/09/16 07:03:42 [notice] 1#0: exit
2024/09/16 07:03:42 [notice] 1#0: worker process 1322 exited with code 0
2024/09/16 07:03:42 [notice] 1#0: signal 17 (SIGCHLD) received from 1322
2024/09/16 07:03:41 [notice] 1322#0: exit
2024/09/16 07:03:41 [notice] 1322#0: exiting
2024/09/16 07:03:38 [notice] 1#0: signal 29 (SIGIO) received
2024/09/16 07:03:38 [notice] 1#0: worker process 1323 exited with code 0
2024/09/16 07:03:38 [notice] 1#0: signal 17 (SIGCHLD) received from 1323
2024/09/16 07:03:37 [notice] 1323#0: exit
2024/09/16 07:03:37 [notice] 1323#0: exiting
2024/09/16 07:03:37 [notice] 1323#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1322#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1#0: signal 3 (SIGQUIT) received from 1990, shutting down
Following the apparent shutdown process, please note that the pod was not able to die even though it received a SIGQUIT, and it remained in that state even after it regained DB connectivity. The likely reason is that the shutdown hung (eventually ending in a CrashLoopBackOff), as Kong tried to save some state to a DB it had no connectivity to.
Expected Behavior
Kong had no problem routing requests without DB connectivity until it was taken out of the cluster due to the /status timeout.
As the DB is not in the hot path, we did not expect an outage on a DB restart.
In our opinion, the kong.db:connect() connectivity test should either be done asynchronously or be given a timeout.
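As an illustration of the asynchronous option, here is a rough sketch (not Kong's actual code; the cached flag and handler shape are our own assumptions) of keeping the connectivity check off the request path with a recurring background timer:

```lua
-- Hypothetical sketch: probe DB connectivity on a background timer and let
-- /status read the cached result, so the probe handler never blocks on
-- kong.db:connect() while the DB is unreachable.
local db_reachable = true  -- cached state, updated outside the request path

local function check_db(premature)
  if premature then
    return  -- timer fired during shutdown; do nothing
  end
  local ok = kong.db:connect()
  db_reachable = ok ~= nil
  if ok then
    kong.db:close()
  end
end

-- Re-check connectivity every 5 seconds in a background timer
ngx.timer.every(5, check_db)

-- /status handler sketch: answers immediately from the cached flag
local function status_handler()
  if db_reachable then
    return kong.response.exit(200, { database = { reachable = true } })
  end
  return kong.response.exit(503, { database = { reachable = false } })
end
```

With this shape, a probe hitting /status during a DB outage would get an immediate 503 instead of stalling for the connect timeout.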
Steps To Reproduce
Set up Kong in traditional mode
Disable the DB
Make a request to any route (should work)
Make a request to /status (should show a latency spike for the first attempts)
Anything else?
No response
> After looking at kong.api.routes.health.lua we realized that the kong.db:connect() call used to test connectivity is executed both synchronously and without an explicit timeout, leading to 5+ second latency on the first requests.
> In our opinion, the kong.db:connect() connectivity test should either be done asynchronously or be given a timeout.
I think this behaviour is expected; the timeout is configured through the pg_timeout option in the kong.conf file. See the default value as follows:
#pg_timeout = 5000 # Defines the timeout (in ms), for connecting,
# reading and writing.
Thank you for the reply @chobits
I tried to follow the code, and connect/reconnect appeared to receive a config that was not being passed in kong.db:connect(). But I now realize that it is passed implicitly, as self, the first argument.
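(For anyone following along: in Lua, the colon call syntax passes the object itself as a hidden first argument, which is why the config looks "missing" at the call site. A minimal standalone example, unrelated to Kong's actual internals:)

```lua
-- Minimal illustration of Lua's colon-call sugar
local db = { config = { timeout = 5000 } }

function db.connect(self)
  -- `self` carries the config, so no explicit argument is needed at the call site
  return self.config.timeout
end

-- db:connect() is syntactic sugar for db.connect(db)
assert(db:connect() == db.connect(db))
assert(db:connect() == 5000)
```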
In your opinion, might it be problematic that the Kong chart's liveness/readiness probe timeout has the same default value as pg_timeout?
I am not sure about touching pg_timeout, as I don't know exactly what Kong is doing under the hood or what its needs are; but since it is bounded to 5 seconds, I can raise the probe timeout instead.
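Raising the probe timeout above pg_timeout (5000 ms) in the pod spec could look like the following sketch. These are standard Kubernetes probe fields; the status port 8802 is taken from the logs above, and the exact Helm chart keys for overriding probes may differ:

```yaml
readinessProbe:
  httpGet:
    path: /status
    port: 8802        # Kong status listener (from the logs above)
  timeoutSeconds: 6   # > pg_timeout (5000 ms), so a slow DB check cannot fail the probe
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /status
    port: 8802
  timeoutSeconds: 6
  periodSeconds: 10
  failureThreshold: 3
```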
Is there an existing issue for this?
Kong version ($ kong version): Kong 3.6.1