On July 25, 2024, Rate Exceeded errors were observed from the production Hasura instance, and an investigation was carried out with the help of SRE. One of the actions taken was to raise the concurrent request limit of each Hasura instance from 80 to 200 and the maximum number of instances from 5 to 10. This increase was sufficient to stop the Rate Exceeded errors.

The following morning, it was discovered that the number of DB connections had spiked to 600 and was holding there, with roughly 400 of those connections stuck in the active state. As a result, Hasura did not have enough connections to maintain its metadata, which fell out of sync, and QueryApi once again experienced issues. After QueryApi was shut down in prod and the database was restarted, the connection count fell.

However, when QueryApi was restarted, it immediately began to deprovision many indexers without cause. QueryApi was shut down again and the deprovisioning was investigated. After the impacted indexers were documented, QueryApi was restarted with a custom commit that increased the timeout between restart attempts for stalled streams and executors, and disabled deprovisioning. The deprovisioned indexers were then all brought back and backfilled on Jul 26, 2024.
TLDR:
Hasura rejected all requests because of accumulated timed-out queries, originating either from a block stream that was being restarted repeatedly or from KitWallet, which tried to access Hasura after Postgres connections had reached the limit.
Postgres connections rose rapidly to the maximum because the timed-out queries from QueryApi described above left connections permanently active.
Indexers were suddenly deprovisioned even though they had not been deleted from the contract.
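
For context on the limit change, the arithmetic below is a rough sketch of the worst-case number of requests Hasura could accept concurrently before and after the change, compared with the 600 DB connections that were observed. It is illustrative only: it assumes every instance is fully loaded, the function and constant names are hypothetical, and it does not imply that the limit change caused the connection spike (the TLDR attributes that to timed-out queries).

```python
# Hypothetical helper: upper bound on requests Hasura can accept at once,
# assuming every instance is running at its concurrent request limit.
def max_concurrent_requests(instances: int, per_instance_limit: int) -> int:
    return instances * per_instance_limit


# Values taken from the incident summary.
before = max_concurrent_requests(instances=5, per_instance_limit=80)
after = max_concurrent_requests(instances=10, per_instance_limit=200)

# DB connection count observed the morning after the change.
OBSERVED_DB_CONNECTIONS = 600

print(f"Request ceiling before the change: {before}")
print(f"Request ceiling after the change:  {after}")
print(f"Ceiling exceeds observed DB connections: {after > OBSERVED_DB_CONNECTIONS}")
```

The comparison is only a ceiling; actual load depends on traffic and on how Hasura pools its Postgres connections.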
More details are in the Incident Document.
I've separated the task list into two, as the two incidents are unrelated.
Hasura and Postgres Incident
Deprovisioning Incident