Add a design doc for distributed execution and test suites #3217
Conversation
3. Similarly, in the "Instance Synchronization API" section I showed how the `SignalAndWait()` pseudo-API can be used to ensure that all instances have finished initializing their VUs and executed `setup()` before they start the actual test execution simultaneously (a rough sketch of this pseudo-API follows after this list).
4. Most failures will be handled by the coordinator and propagated to waiting instances. For the best UX, instances might subscribe to a special error channel so they can be interrupted even during the test execution.
5. Metric collection was described in the "Metrics Handling" section, and the coordinator will be able to stop the test if a threshold is crossed.
6. We can probably handle `test.abort()` calls from an instance with a `GetOrCreateData()` call that uses a special `dataID` watched by the coordinator, but it might be simpler to just add a custom API for that... :thinking:
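For illustration only, here is a rough Go sketch of the instance synchronization pseudo-API these steps rely on. The method names come from the surrounding discussion, but the package name and exact signatures are assumptions, not the actual k6 interface:

```go
// Hypothetical package and interface, sketched only to illustrate the
// pseudo-API mentioned in the steps above; not the real k6 code.
package distributed

// Controller lets every instance synchronize with its peers, either through
// a central coordinator or through a no-op local implementation.
type Controller interface {
	// GetOrCreateData runs callback on exactly one instance and distributes
	// its result (e.g. the setup() data) to all other instances.
	GetOrCreateData(dataID string, callback func() ([]byte, error)) ([]byte, error)

	// SignalAndWait marks this instance as having reached an event (e.g.
	// "vus-initialized") and blocks until every instance has reached it.
	SignalAndWait(eventID string) error

	// Signal broadcasts an event, optionally carrying an error, so waiting
	// instances can be interrupted even during test execution (step 4).
	Signal(eventID string, err error) error

	// Subscribe returns a channel that fires when the given event is
	// signalled, e.g. a coordinator-propagated error or abort.
	Subscribe(eventID string) <-chan error
}
```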
I'm not a fan of reusing API calls with "special" parameters to make requests that require specialized handling, such as `test.abort()`. I would prefer a method in the Controller API for signaling this kind of event. For example, the `Signal(event, error)` API call described above with an `abort` event.
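Purely as an illustration of that suggestion (reusing the hypothetical `Controller` sketched above), `test.abort()` could then map onto a dedicated event instead of a "special" `dataID`:

```go
// Sketch: propagate a test.abort() call as a dedicated "abort" event rather
// than a special GetOrCreateData() ID. Names are illustrative assumptions.
func handleTestAbort(c Controller, reason error) error {
	// The coordinator and all other instances can watch for this event
	// (e.g. via Subscribe("abort")) and interrupt their local execution.
	return c.Signal("abort", reason)
}
```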
https://github.com/grafana/k6/issues/1342 is far behind https://github.com/grafana/k6/issues/140 when it comes to upvotes, but it is still one of our most requested features after it.
## Limitations of `scenarios`
Reading this list of limitations followed by the proposal of test suites, it sounds like we are not addressing some of the limitations of scenarios but suggesting the use of suites instead. Is this the intention?
I think that even when some complex requirements such as "CI-like" capabilities make more sense in the context of suites, scenarios are still useful and would benefit from features such as "start after" chaining (to avoid the error-prone `startTime`) and also per-scenario `setup()` and `teardown()` functions. This last one is particularly important from the perspective of the disruptor, to facilitate its setup.
> it sounds like we are not addressing some of the limitations of scenarios but suggesting using suites

What limitations are not going to be addressed by test suites? I just re-read the list, and all of the limitations are either directly solved by having test suites, or could be solved much more easily by building something on top of test suites than on top of the current scenarios (or anything else that currently exists in k6 that I can think of).
@na-- That is not the problem. It is clear that test suites address all the limitations of scenarios. And it is also clear that some limitations, such as the dependencies between scenarios, are easier to implement in test suites.
The problem I'm mentioning is that scenarios are still valuable in some simple cases, and it would be interesting to address some of their limitations instead of somehow "forcing" the use of test suites in all but the simplest cases.
In particular, the per-scenario setup/teardown functions. Having those would make using the disruptor a lot simpler than it is now.
Ah, I see! And yeah, that makes sense, I am not against having `setup()`/`teardown()` functions (or something like them) per scenario too, though that is a bit out of scope for this document. The nice thing about `setup()`/`teardown()` functions per test in a test suite is that... they already exist 😅 The current logic for these life-cycle functions just works as you'd expect in the test suite PoC, so you essentially get that whole feature for free.
Even so, the `execution.Controller` architecture should also help with future efforts to implement per-scenario `setup()` and `teardown()`! 🎉
The big problem before was the requirement that such functions need to be called only on a single instance, but their results need to be distributed across all instances. Similarly, this was also the big problem with implementing a `SharedArray` (or other similar things) that could be created during the test execution phase (i.e. outside of the script init context). Both of these problems are almost the same, see #1638 (comment). And both of them should now be trivial to solve with the `GetOrCreateData()` API/helper that will be used for the test `setup()` 🎉
That is, because `setup()` will no longer be all that special, and because the control flow of this architecture is not top-down (the k6 coordinator controlling the agents) but bottom-up (the main actors are the agents; they just use the coordinator as a central hub to synchronize with each other and exchange data), it should be trivial to build such features on top of it and have them work in both local and distributed tests! 🤞
Hello 👋 I've finally gotten around to actually grokking this document properly, so here are a couple of additional comments.
I'm still thinking through the implications of such a distributed k6 solution in Kubernetes, but I'd like to upvote this issue again. It'd be great if it were possible to move it forward.
7. Detecting when the test has finished is trivial with `SignalAndWait()`.
8. `teardown()` can also be executed the same way we do `setup()`. Though, since it doesn't return any data, it is one of the reasons I think it makes sense to split `GetOrCreateData(id, callback)` into separate `Once(callback)`, `SetData(id)` and `GetData(id)` calls (see the sketch after this list)...
9. The `k6 coordinator` node can run `handleSummary()` after the test is done, no problems there :tada:
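To make point 8 concrete, here is a hedged sketch of how such a split might look and how `GetOrCreateData()` could be rebuilt from the smaller primitives; the exact signatures are assumptions for illustration only:

```go
// Hypothetical split of GetOrCreateData() into smaller primitives. Once() by
// itself is enough for teardown(), which returns no data.
type SplitController interface {
	// Once runs callback on exactly one instance; all others wait for it.
	Once(eventID string, callback func() error) error
	// SetData stores a blob under an ID so other instances can fetch it.
	SetData(dataID string, data []byte) error
	// GetData blocks until the blob with the given ID is available.
	GetData(dataID string) ([]byte, error)
}

// getOrCreateData shows how the original helper could be composed out of
// the smaller primitives above.
func getOrCreateData(c SplitController, id string, callback func() ([]byte, error)) ([]byte, error) {
	err := c.Once(id, func() error {
		data, err := callback()
		if err != nil {
			return err
		}
		return c.SetData(id, data)
	})
	if err != nil {
		return nil, err
	}
	return c.GetData(id)
}
```

With a split like this, `teardown()` could simply use `Once()` on its own, without an artificial empty-data round-trip.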
One more thing that would be good to include in this list:
- termination of k6 processes by the coordinator

Currently, k6-cloud relies on `--linger` and on sending SIGTERM to the k6 processes in order to have consistent logic around `teardown()` and the end of the test. It is another piece of low-level logic that has to be handled "externally".
As an illustration, this is actually what became the main stumbling block in my initial attempt to implement run-once `setup()` and `teardown()` in k6-operator. Once `--linger` is set, there is no interface to stop the k6 process; it must be killed. There is a stop-test call in the REST API, but with `--linger` it doesn't actually terminate the process: it only returns the status. So one has to remove the k6 process "manually". That leads to almost identical logic for aborting a k6 process in response to a user abort vs. a regular end of the test run, and these different cases must additionally be tracked separately, complicating the logic even more.
Additionally, the Kubernetes space in general "expects" the main processes in pods to exit on their own, and these manual interventions feel almost like an interference with a healthy Kubernetes flow. From both a design and a reliability perspective, it'd be beneficial for k6-operator (and perhaps k6-cloud) to rely on a straightforward interface of a "k6 coordinator" for these things, e.g. the coordinator executes `teardown()` and then sends a termination signal to all k6 runners. After that, the coordinator exits itself, and its exit is an indicator that the test run is de facto finished.
On a related note, the coordinator should probably take care of abort terminations as well? I.e. if one wants to abort the test run, it should be enough to "ping" the coordinator. So there should be a user interface for that as well.
@yorugac I'm not sure I'm getting the point regarding `--linger` in the Cloud. At the moment, it should be used as a workaround for getting the current architecture working in a distributed mode. But with the new architecture proposed here, `--linger` is no longer required to distribute the test, since that is already covered by the Sync API. `--linger` remains for the specific case where the user wants to force k6 to wait even if the test is finished.

> Kubernetes space in general "expects" the main processes in pods to exit on their own,

Regarding this specific point, I think it is a valid case to support in the k6 HTTP API. It is something mentioned at point 5 for `--linger` here #2804 (comment). Feel free to open an issue for it.

> then sends termination signal to all k6 runners

We are reversing the flow, so the coordinator shouldn't be able to contact the agents directly. The only way should be to emit a specific event through the Sync API to terminate/abort, but that means there is no guarantee that the agent will be killed. Does that work for the Operator/k8s life-cycle?

> On a related note, the coordinator should probably take care of abort terminations as well? I.e. if one wants to abort the test run, it should be enough to "ping" the coordinator. So there should be a user interface for that as well.

Yep, it sounds like a good idea to have a single entrypoint for forcing the abort. It can be done as described via the Sync API.
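As a sketch of how that could look on the agent side (assuming the hypothetical `Controller` with a `Subscribe()` method from earlier), the agent only listens for the event and stops itself, which preserves the reversed control flow:

```go
// Illustrative only: the agent watches for a coordinator-relayed "abort"
// event and shuts down its own test run, so the coordinator never needs to
// contact the agent directly.
func watchForAbort(c Controller, testDone <-chan struct{}, stopTest func(error)) {
	abortCh := c.Subscribe("abort")
	go func() {
		select {
		case err := <-abortCh:
			// Graceful local shutdown; the k6 process then exits on its
			// own, which fits the Kubernetes expectations mentioned above.
			stopTest(err)
		case <-testDone:
			// The test finished normally, nothing to abort.
		}
	}()
}
```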
If `--linger` won't be required with native distribution, all the better. I was trying to illustrate precisely that point: the existing workarounds with `--linger` in k6 Cloud are neither easy to implement nor natural in a setting like Kubernetes. IOW, if native distribution will work and terminate without `--linger` and manual interventions, that's basically ideal.

> I think it is a valid case to support in the k6 HTTP API. It is something mentioned at point 5 for `--linger` here #2804 (comment). Feel free to open an issue for it.

Done, here.
From the test-distribution viewpoint, an HTTP call is still an intervention (something must make that call) in comparison to native distribution. But having that call would definitely be better than the current state 👍

> The only way should be to emit a specific event through the Sync API to terminate/abort, but that means there is no guarantee that the agent will be killed. Does that work for the Operator/k8s life-cycle?

🤔 What do you think is the worst-case behaviour on the agent's side? I.e. if a "terminate" event is emitted and something breaks, will the agent's process be stuck, or will it exit with a non-zero error code after some timeout? Additionally, if the agent can get stuck, when can that happen?
While there were some details in the original proposal that are no longer applicable because of the big architectural changes in k6 in the meantime, the overall proposal is probably still valid. I am pretty sure that the architecture I proposed above can be built on top of these technologies and in a way where instances don't need to hold a persistent connection to a central `k6 coordinator` node!
However, I think that the benefits of such an approach would be outweighed by the drawbacks in terms of implementation and deployment complexity. In short, it can probably work, but IMO the UX will be worse. Even if we ignore the complications of deploying multiple software stacks (vs. just the k6 binary), it'd be difficult to implement great error handling and timely handling of things like `abortOnFail` thresholds and `test.abort()` API calls. It can certainly be done, but it'd be a ton of effort to not get a flaky experience...
👍
Additionally, external dependencies will make it much harder to incorporate this solution into k6-operator or k6-cloud, if it is possible at all. So it'd be nice if the first, "native" implementation of the interface were as "light" and as transparent from the user's perspective as possible.
Continuing the Kubernetes-related musings 🙂 In the context of this design doc, I'd like to suggest a three-level approach to load distribution, which should allow a more flexible setup for distributed test runs. Currently it is implied that there will be a two-level system:
Another way to approach this would be with three levels:
IOW, allow a coordinator to act as a sub-coordinator:
Main benefits of this approach:
Context: this came up while pondering questions accumulated from multiple sources, which include inquiries around PLZ and k6-operator in general (e.g. here in OSS). It seems to me that this kind of approach would be relatively straightforward to add, assuming two-level coordination is implemented or planned for implementation.
@yorugac Sounds like it will also scale better, if a large number of agents can be served by a reasonable number of sub-coordinators that then aggregate results and pass control data up to the "root" coordinator. If it does scale better without making the UX worse due to increased complexity, I think we may have a USP versus practically all load-testing tools that support distributed execution. AFAIK most (maybe all) other tools that can do distributed execution have a bottleneck in a single master controller/coordinator process, which has to aggregate data from potentially many agents.
I think this should be fairly easy to do on top of the architecture I proposed in this PR, though I'd encourage someone to try to make a PoC on top of #3205 or even #3213 to validate that gut feeling before fully embracing the idea 😅 I can see the benefits when it comes to multiple zones and mixed deployments, though a single coordinator will probably be sufficient in the vast majority of cases 🤔 Given the simplicity of the control interface, I'd expect it to comfortably support hundreds of instances, though I haven't actually measured that.
Codecov Report

```
@@           Coverage Diff            @@
##           master    #3217    +/-   ##
=========================================
+ Coverage   72.68%   73.57%   +0.88%
=========================================
  Files         259      277      +18
  Lines       19864    20228     +364
=========================================
+ Hits        14439    14882     +443
+ Misses       4521     4398     -123
- Partials      904      948      +44
```

☔ View full report in Codecov by Sentry.
LGTM!
I don't have any feedback that isn't implementation detail, and that will likely be wrong anyway once we implement it with all the error handling, timeouts and configuration.
Let's merge this and try to start making it work. The design is good and simple enough, IMO, to be bendable to fix any problem we hit, especially any problem that I can envision.
@yorugac @ragnarlonn Regarding the sub-coordinators, and in particular this comment:
I don't see how sub-coordinators will alleviate the data-aggregation load on the coordinator if we want to have execution-wide thresholds.
@pablochacin @yorugac My spontaneous thought was that aggregation could happen lower down in the tree, but I might be wrong. I'm not very much up to date on how the k6 internals work these days. In my mind, it just means that thresholds have to be communicated to the sub-coordinators and agents, so they can aggregate things and send them up the tree. I'm probably missing something though :)
@pablochacin @ragnarlonn I've got the impression that Ned already answered this question indirectly over here 👀
Yep, though keep in mind that in the PoC version of distributed support for metrics and thresholds (#3213), the … For that, as I wrote in this design doc, you need HDR histograms. The PoC for that is in the next PR in the chain, #2816, specifically this commit: 1ad356b

As you can see, it's a relatively tiny change (most of the diff size is due to the used HDR library dependency). This commit actually used to be at the start of the PoC PR chain, but I moved it to the very end, so all of the rest of them could be merged without breaking changes. After all, …

After having HDR histograms, sub-coordinators would be able to safely merge the HDR histograms from their agents and then pass the aggregated value to the central coordinator without a loss in precision. That said, it shouldn't make that much of a performance improvement; merging the HDR histograms is pretty efficient and I expect the central coordinator to be able to handle dozens if not hundreds of instances without much of a problem (though, admittedly, I haven't tested that assumption 😅).
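For a rough feel of what the sub-coordinator's merging step could look like, here is a sketch using the github.com/HdrHistogram/hdrhistogram-go library as an example HDR implementation; the actual PoC may use a different library, and the value bounds below are made up:

```go
package distributed

import hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"

// mergeAgentHistograms is an illustrative sketch of what a sub-coordinator
// could do: merge the per-agent HDR histograms of one Trend metric into a
// single histogram that is then forwarded to the central coordinator.
func mergeAgentHistograms(agentHists []*hdrhistogram.Histogram) *hdrhistogram.Histogram {
	// Example bounds: track values from 1µs up to 1h (in µs) with 3 significant digits.
	merged := hdrhistogram.New(1, 3_600_000_000, 3)
	for _, h := range agentHists {
		// Merging histograms is cheap and only loses precision beyond the
		// configured resolution, so percentiles stay accurate.
		merged.Merge(h)
	}
	return merged
}
```

The central coordinator could then compute percentiles for execution-wide thresholds from the merged histogram, e.g. with `merged.ValueAtQuantile(95)`.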
Thanks for the detailed explanation @na--. For me, it is still unclear whether we should consider partial aggregation using HDR and sub-coordinators as part of the "base" of distributed execution, or whether we should consider them in a second phase, once we have a stable version of the coordinator/agent model.
@pablochacin The first experimental version should validate the main assumptions:
As @na-- mentioned in the previous comment, sub-coordinators should not be intended as a primitive. We can build them as an additional feature on top of a more simplistic version. In addition, considering the complexity they introduce, we should have an idea of the real demand for this specific feature. I don't think we will be able to predict it with good accuracy before the first version is available to the community.
What?
This is a design doc for my proposal on how native distributed execution (#140) can be added to k6 🎉 It briefly touches on test suites (#1342) as well, because my distributed execution offers a way to implement test suites (both for local and distributed tests!), but doesn't explore that topic to the same depth.
It is essentially a brain dump of most of the thoughts that exist in my head on the topic, after first making a distributed execution PoC in #2438, and then polishing it and making a PoC of test suites in #2816.
Why?
I will be taking a long vacation/sabbatical for the next few months and then gradually moving to another team. Because of that (and somewhat prompted by @ragnarlonn's comments in #140 😅), I thought it made sense to try and write down my thoughts on the topic. It's very unlikely that I will be able to completely finish the distributed execution work myself, so I've tried to make what exists as code and ideas in my head as easy to adopt and build upon as it's practical to do before I go away for a while.
That's also why I refactored #2816 to make big parts of it completely backwards compatible. It hopefully means that they are merge-able even when they are somewhat incomplete, untested and unpolished... 🤞 At this point, I've (lightly) polished the original distributed execution PoC and split it into a few PRs (#3204, #3205, #3213) that should be safe to merge as they are, even if the whole feature needs more work to be considered stable and not experimental.
That's also why I have written this document, so hopefully someone else can pick up the task and implement the missing pieces to make what exists stable and even more useful. Still, this design doc and the connected PRs are just my suggestion for how things can be done. If someone comes up with a better idea, feel free to throw all of this away. Even if some of the linked PRs were already merged, we need to be very clear that they are super-experimental and liable to change (or go away) at any point in time before they are officially declared stable by the team.
Related PR(s)/Issue(s)
- `execution.Controller` with a `local` no-op implementation #3204