Prod Incident 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning #930

darunrs · 2024-07-26T18:57:05Z

On July 25, 2024, the production Hasura instance began returning Rate Exceeded errors, and an investigation was carried out with the help of SRE. One of the actions taken was to raise the concurrent request limit of each Hasura instance from 80 to 200 and the maximum number of instances from 5 to 10. This increase was sufficient to stop the Rate Exceeded errors.

The following morning, it was discovered that the number of DB connections had spiked to 600 and was holding at that level, with the number of active connections stuck at roughly 400. As a result, Hasura did not have enough connections to maintain its metadata, which fell out of sync, and QueryApi once again experienced issues. After QueryApi was shut down in prod and the database was restarted, the connection count fell.

However, when QueryApi was restarted, it immediately began deprovisioning many indexers without cause. QueryApi was shut down again and the deprovisioning was investigated. Once the impacted indexers had been documented, QueryApi was restarted with a custom commit that increased the timeout between stalled stream/executor restart attempts and disabled deprovisioning. The deprovisioned indexers were then all brought back and backfilled on July 26, 2024.
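To make the scale of the limit change concrete, the arithmetic can be sketched as below. The instance and concurrency figures and the 600-connection plateau come from the summary above. The comparison assumes, as a worst case only, that every in-flight request could end up holding a Postgres connection. The real mapping depends on Hasura's connection pool settings, which this report does not state, so treat this as a rough illustration rather than a model of the actual system.

```python
# Figures taken from the incident summary.
OLD_MAX_INSTANCES, OLD_CONCURRENCY = 5, 80
NEW_MAX_INSTANCES, NEW_CONCURRENCY = 10, 200
OBSERVED_CONNECTION_CEILING = 600  # plateau in DB connections seen the next morning


def max_in_flight(instances: int, concurrency: int) -> int:
    """Upper bound on requests Hasura can hold open at full scale-out."""
    return instances * concurrency


old_capacity = max_in_flight(OLD_MAX_INSTANCES, OLD_CONCURRENCY)
new_capacity = max_in_flight(NEW_MAX_INSTANCES, NEW_CONCURRENCY)

# ASSUMPTION: worst case of one Postgres connection per in-flight request.
old_headroom = OBSERVED_CONNECTION_CEILING - old_capacity
new_headroom = OBSERVED_CONNECTION_CEILING - new_capacity

print(f"before: {old_capacity} in-flight requests, headroom {old_headroom}")  # 400, 200
print(f"after:  {new_capacity} in-flight requests, headroom {new_headroom}")  # 2000, -1400
```

Under that assumption, the old limits could not exhaust 600 connections on their own, while the new limits allow five times as many concurrent requests, so slow or timed-out queries have far more room to pile up.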

TLDR:

- Hasura rejected all requests because of accumulated timed-out queries, coming either from a block stream that was being repeatedly restarted or from KitWallet, which tried to access Hasura after Postgres connections had reached the limit.
- Postgres connections rapidly rose to the maximum because the timed-out queries from QueryApi described above left connections permanently active.
- Indexers were suddenly deprovisioned even though they had not been deleted from the contract.
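The summary notes that a repeatedly restarted block stream contributed to the query pile-up, and that the custom commit increased the timeout between restart attempts. One common way to space out restarts is a capped exponential backoff, sketched below. This technique is swapped in here for illustration only: the report says the timeout was increased, not that backoff was used, and the function name, base delay, and cap are all hypothetical.

```python
def restart_delay_seconds(attempt: int, base: float = 5.0, cap: float = 600.0) -> float:
    """Delay before restart attempt `attempt` (1-based), doubling each time up to `cap`.

    ASSUMPTION: `base` and `cap` are illustrative values, not QueryApi settings.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(cap, base * 2 ** (attempt - 1))


delays = [restart_delay_seconds(n) for n in range(1, 9)]
print(delays)  # [5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 600.0]
```

Spacing restarts this way bounds how many new queries a permanently failing stream can issue per hour, which limits how fast timed-out queries can accumulate against Hasura.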

Hasura and Postgres Incident

Deprovisioning Incident

- Update contract to return specific data when fetching deleted indexer #940 (component: Registry)
- Add Status Code and content check for indexer config fetch #941 (component: Coordinator)
- Restart all broken indexers #979 (component: Coordinator)
- Add logs for all state transitions #942
- https://github.com/near/queryapi/issues/965
- Refactor Provisioning into Multi-Step Process #915 (component: Runner)
- Update contract to use enums for config instead of struct #964
- Convert deprovisioning to flagging #953
- Add ability to trigger backfill manually #943
- Add guard rail for deprovisioning to prevent deprovisioning core indexers #944
- Prevent automatic deprovision of indexers with active hasura usage #945
- Create scheduled job to deprovision indexers #954
- Handle indexers deleted during Coordinator downtime #968 (Ungroomed, component: Coordinator)
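The intent behind #940 and #941 can be illustrated with a small sketch: treat a failed or ambiguous config fetch as "unknown" rather than "deleted", and only deprovision on an explicit deletion signal. This is a minimal sketch under stated assumptions, not the Coordinator's actual code. The function names, the `deleted` marker, and the `config` key are hypothetical, and the detailed root cause of the deprovisioning is not included in this summary.

```python
from enum import Enum
from typing import Optional


class FetchResult(Enum):
    EXISTS = "exists"
    DELETED = "deleted"
    UNKNOWN = "unknown"


def classify_config_fetch(status_code: int, body: Optional[dict]) -> FetchResult:
    """Classify a registry config fetch (hypothetical response shape)."""
    # A non-200 response (for example 429 or 504) says nothing about the indexer.
    if status_code != 200 or body is None:
        return FetchResult.UNKNOWN
    # ASSUMPTION: the contract returns an explicit marker for deleted indexers.
    if body.get("deleted") is True:
        return FetchResult.DELETED
    if "config" in body:
        return FetchResult.EXISTS
    # A 200 with unexpected content is also treated as unknown.
    return FetchResult.UNKNOWN


def should_deprovision(status_code: int, body: Optional[dict]) -> bool:
    """Deprovision only when deletion is explicitly confirmed."""
    return classify_config_fetch(status_code, body) is FetchResult.DELETED


print(should_deprovision(504, None))               # False
print(should_deprovision(200, {}))                 # False
print(should_deprovision(200, {"deleted": True}))  # True
```

The design choice is to make deprovisioning fail closed: any error, timeout, or empty response keeps the indexer in place until the registry gives a definite answer.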

darunrs added the bug, component: Runner, component: Coordinator, and component: Streamer labels Jul 26, 2024

darunrs self-assigned this Jul 26, 2024

darunrs changed the title ~~Prod Issue 25/07/24 - Hasura Rate Exceeded and Postgres Connection Limit Reached~~ Prod Issue 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning Jul 29, 2024

darunrs changed the title ~~Prod Issue 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning~~ Prod Incident 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning Aug 1, 2024
