Errors with large number of multiple workflows on GitHub (rate limits and timeouts) #4052

Open
3 tasks done
fernandrone opened this issue Aug 21, 2024 · 3 comments
Labels
enhancement improve existing features
Milestone

Comments

@fernandrone
Contributor

fernandrone commented Aug 21, 2024

Component

server

Describe the bug

We've been experiencing two issues with repositories that have a "large" number of workflows: I have a repository with 41 workflows and another with 61. They're certainly outliers... but they exist.

  1. Secondary rate limits. They show up in the server logs as:
{"level":"error","error":"POST https://api.github.com/repos/<org>/<repo>/statuses/<hash>: 403 You have exceeded a secondary rate limit. Please wait a few minutes before you try again. If you reach out to GitHub Support for help, please include the request ID <id>.","time":"2024-08-21T13:31:15Z","message":"error setting commit status for <org>/<repo>/11082"}

There's also a variation: "could not get folder from forge".

If the failure surfaces in a timely manner (in less than 10 seconds), the server replies to the webhook with a 400 and a "failure to parse hook" message. We have observed that this is relatively common: repositories with 20 or so workflows receive "a few" of those daily.

  2. A more general issue is timeouts. GitHub has a 10-second timeout on webhooks. If the server takes too long to request all the workflows (whether or not it is rate limited) and doesn't reply within 10 seconds, the webhook delivery will show as a "504".

From what I've seen, both issues are often correlated. A secondary rate limit will likely cause the webhook to time out. That said, a timeout alone is possible and is not itself destructive. If a webhook timeout happens but the Woodpecker server does manage to parse all workflows (even if it takes more than 10 seconds) and process them, the job is created, picked up by an agent, and even updates the GitHub UI.

However, when secondary rate limits happen, it's worse. The server code doesn't handle rate limits. If it's a "could not get folder from forge" error, the pipeline job is not created. From the user's perspective this is a silent error: they push but no pipeline is started, so it can be very hard to debug without access to the server logs. There's also "error setting commit status"; in this case it seems to be implied (I didn't get to track one down) that the success/failure/running status wouldn't be updated on the commit, which could cause issues with pull request validation.

FWIW I'm testing quintoandar#20 internally, which uses a GitHub secondary rate limit library to try to fix the first issue. Looks promising.

Fixing timeouts seems more complex. Arguably it might just be a hard limit on Woodpecker; maybe it makes no sense to support 40+ workflows. Or it'd need to process them asynchronously through an internal queue (but then there's the scenario where Woodpecker replies 200 to GitHub and later asynchronously finds out some of the workflows are invalid).

Steps to reproduce

  1. Install Woodpecker and configure GitHub as the forge
  2. To force the secondary rate limit and/or timeout, create a test repository with 100+ workflows (it doesn't matter what they do)

Expected behavior

No response

System Info

Woodpecker 2.7.0, installed on Kubernetes, GitHub forge

Additional context

No response

Validations

  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
@fernandrone fernandrone added the bug Something isn't working label Aug 21, 2024
@fernandrone fernandrone changed the title Failure to handle multiple workflows (secondary rate limits and timeouts) Failure to handle large number of multiple workflows on GitHub (secondary rate limits and timeouts) Aug 21, 2024
@fernandrone fernandrone changed the title Failure to handle large number of multiple workflows on GitHub (secondary rate limits and timeouts) Errors with large number of multiple workflows on GitHub (rate limits and timeouts) Aug 21, 2024
@fernandrone
Contributor Author

So I'm thinking about this. This https://github.com/quintoandar/woodpecker/pull/20/files seems to fix the secondary rate limit issue (tested this in our test servers; should have it rolled out and validated on production by Monday).

However the timeout issue is indeed trickier. Looking at https://github.com/woodpecker-ci/woodpecker/blob/main/server/api/hook.go#L244-L246 I see that Woodpecker creates the pipeline synchronously and then returns the result to the forge. It also performs error handling https://github.com/woodpecker-ci/woodpecker/blob/main/server/api/helper.go#L32-L45.

The only way I can see to bypass GitHub timeouts (which we are sure to hit sporadically when fetching 20+ workflows) is making this asynchronous, but that would be a breaking change.

Another issue with timeouts and rate limits is that so far I have considered only one repository in isolation (i.e. the secondary rate limit is triggered because one repo has dozens of workflows that are downloaded instantly). But the secondary rate limit could also happen if dozens of builds from different repos are triggered in a short interval. So in a large enough installation we can have rate limits causing timeouts even if not using multiple workflow files.

@zc-devs
Contributor

zc-devs commented Aug 21, 2024

> Woodpecker replies 200 to GitHub and later asynchronously

Should be 202.

> finds out some of the workflows are invalid

That's totally fine for 202.

@fernandrone
Contributor Author

Good, I may try to make webhook processing async next week then 👍🏻

@qwerty287 qwerty287 added this to the 3.x.x milestone Sep 7, 2024
@qwerty287 qwerty287 added enhancement improve existing features and removed bug Something isn't working labels Sep 7, 2024