Errors with large number of multiple workflows on GitHub (rate limits and timeouts) #4052
Open
3 tasks done
Labels
enhancement
improve existing features
Milestone
Component
server
Describe the bug
We've been experiencing two issues (secondary rate limits and webhook timeouts) with repositories that have a "large" number of workflows. I have one repository with 41 workflows and another with 61. They're certainly outliers, but they exist.
There's also a variation, "could not get folder from forge". If this is resolved in a timely manner (in less than 10 seconds), then the server replies with a 400 and a "failure to parse hook" message. We have observed that this is a relatively common occurrence: repositories with 20 or so workflows will receive "a few" of those daily.
From what I've seen, both issues are often correlated. A secondary rate limit will likely cause the webhook to time out. That said, it seems that a timeout alone is possible and not destructive in itself. If a webhook timeout happens but the Woodpecker server does manage to parse and process all workflows (even if it takes more than 10 seconds), the job is created, picked up by an agent, and even updates the GitHub UI.
However, when secondary rate limits happen, it's worse. The server code doesn't handle rate limits. If it's a "could not get folder from forge" error, then the pipeline job is not created. From the user's perspective this is a silent error: they push but no pipeline is started, so it can be very hard to debug without access to the server logs. There's also "error setting commit status"; in this case it seems to be implied (I didn't get to track one down) that the success/failure/running status wouldn't be updated on the commit, which could cause issues with pull request validation.
FWIW, I'm testing quintoandar#20 internally, which uses a GitHub secondary rate limit library to try to fix the first issue. Looks promising.
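
To make the rate limit handling concrete, here is a minimal sketch of retrying a forge call when GitHub rejects it with a rate limit response. It is not the library used in quintoandar#20 and not existing Woodpecker code: `retryDelay`, `doWithRetry`, and `errStillLimited` are hypothetical names, and treating a 403 as a rate limit only when a `Retry-After` header is present is a simplifying assumption.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"strconv"
	"time"
)

// errStillLimited is a hypothetical sentinel error for this sketch.
var errStillLimited = errors.New("forge: still rate limited after retries")

// retryDelay reports whether a response looks like a rate limit rejection
// and how long to wait before retrying. Assumption: 429 always counts; 403
// counts only when Retry-After is set, so permission errors are not retried.
func retryDelay(status int, retryAfter string, fallback time.Duration) (time.Duration, bool) {
	limited := status == http.StatusTooManyRequests ||
		(status == http.StatusForbidden && retryAfter != "")
	if !limited {
		return 0, false
	}
	// Retry-After may also be an HTTP date; this sketch only parses seconds
	// and uses the fallback delay for anything else.
	secs, err := strconv.Atoi(retryAfter)
	if err != nil || secs < 0 {
		return fallback, true
	}
	return time.Duration(secs) * time.Second, true
}

// doWithRetry runs call up to attempts times, sleeping between attempts
// while the response is rate limited. It gives up early if ctx is cancelled.
func doWithRetry(ctx context.Context, attempts int, fallback time.Duration,
	call func() (*http.Response, error)) (*http.Response, error) {
	for i := 0; i < attempts; i++ {
		resp, err := call()
		if err != nil {
			return nil, err
		}
		delay, limited := retryDelay(resp.StatusCode, resp.Header.Get("Retry-After"), fallback)
		if !limited {
			return resp, nil
		}
		if resp.Body != nil {
			resp.Body.Close() // discard the rejected response before retrying
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errStillLimited
}
```

With something like this around the call that fetches the workflow folder, a rate limited request would be delayed rather than dropped, and a final failure would surface as an explicit error instead of a silently missing pipeline. The waiting still happens inside the webhook request, so it does not help with the 10 second timeout.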
Fixing timeouts seems more complex. Arguably it might just be a hard limit on Woodpecker; maybe it makes no sense to support 40+ workflows. Or it'd need to process them asynchronously through an internal queue (but then there's the scenario where Woodpecker replies 200 to GitHub and later asynchronously finds out some of the workflows are invalid).
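
The queue idea could look roughly like the sketch below: the webhook handler only enqueues the event and replies right away, while workers parse the workflows in the background. Everything here (`hookEvent`, `hookQueue`, `newHookQueue`, `enqueue`, `close`, `errQueueFull`) is a hypothetical name for illustration, not existing Woodpecker code. The `report` callback is one possible answer to the "replied 200 but workflows turned out invalid" scenario: the failure is surfaced afterwards, for example as a failed commit status.

```go
package main

import (
	"errors"
	"sync"
)

// hookEvent is a hypothetical, trimmed-down webhook payload.
type hookEvent struct {
	Repo   string
	Commit string
}

// errQueueFull tells the handler to reject the hook instead of blocking.
var errQueueFull = errors.New("hook queue full")

// hookQueue decouples accepting a webhook from processing its workflows.
type hookQueue struct {
	events chan hookEvent
	wg     sync.WaitGroup
}

// newHookQueue starts workers that call process for each event and hand
// any processing error to report (e.g. to set a failed commit status).
func newHookQueue(size, workers int, process func(hookEvent) error,
	report func(hookEvent, error)) *hookQueue {
	q := &hookQueue{events: make(chan hookEvent, size)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for ev := range q.events {
				if err := process(ev); err != nil {
					report(ev, err)
				}
			}
		}()
	}
	return q
}

// enqueue never blocks, so the webhook handler can reply well within
// the forge's timeout even when the workers are busy.
func (q *hookQueue) enqueue(ev hookEvent) error {
	select {
	case q.events <- ev:
		return nil
	default:
		return errQueueFull
	}
}

// close stops accepting events and waits for queued ones to finish.
func (q *hookQueue) close() {
	close(q.events)
	q.wg.Wait()
}
```

An in-memory channel like this loses queued hooks on a server restart, so a real implementation would probably need persistence. The sketch only illustrates the shape of the trade-off described above.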
Steps to reproduce
Expected behavior
No response
System Info
Additional context
No response
Validations
next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]