Caller provided buffers question #369
Is it possible to generalize the feature to any dynamically sized data structure, as opposed to only allowing it in places explicitly annotated as such? If I understand the ABI correctly, a record like:

```wit
record IncomingDatagram {
    data: list<u8>,
    remote_address: u32,
}

record ReceiveResult {
    datagrams: list<IncomingDatagram>,
}

receive: func() -> ReceiveResult;
```

.. is roughly lowered into:

```
struct IncomingDatagram {
    data_ptr: void*,
    data_len: u32,
    remote_address: u32,
}

struct ReceiveResult {
    datagrams_ptr: void*,
    datagrams_len: u32,
}

fn receive(out: ReceiveResult*);
```

At the moment, the memory referenced by these pointers is allocated by the callee (via `cabi_realloc`). My suggestion (if possible) is to allow callers to optionally prefill any of the pointer/length pairs. At all times, the caller must be prepared to find the pointers replaced by callee-allocated memory when its own buffer was absent or too small. Example:

```
fn receive(out: ReceiveResult*) {
    let max_recv = if out.datagrams_ptr != NULL { out.datagrams_len } else { 16 };
    let native_dgrams = receive_the_native_datagrams(max_recv);
    let native_dgrams_len = native_dgrams.len();

    // Only use dynamic allocation if the caller did not provide a buffer
    // or the provided buffer was too small.
    if out.datagrams_ptr == NULL || out.datagrams_len < native_dgrams_len {
        out.datagrams_ptr = cabi_realloc(native_dgrams_len * sizeof(IncomingDatagram));
    }
    out.datagrams_len = native_dgrams_len;

    for i in 0..native_dgrams_len {
        let native_dgram = native_dgrams[i];
        let out_dgram = &mut out.datagrams_ptr[i];

        let data_len = native_dgram.data.len();
        // Only use dynamic allocation if the caller did not provide a buffer
        // or the provided buffer was too small.
        if out_dgram.data_ptr == NULL || out_dgram.data_len < data_len {
            out_dgram.data_ptr = cabi_realloc(data_len);
        }
        out_dgram.data_len = data_len;
        memcpy(out_dgram.data_ptr, native_dgram.data, data_len);
        out_dgram.remote_address = native_dgram.remote_address;
    }
}
```

Expected benefits:
|
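As a concrete (if simplified) illustration of the fallback rule in the sketch above, here is a small self-contained Rust model. `receive_into` is a hypothetical stand-in for the callee; the real ABI operates on raw pointers, but the reuse-or-reallocate decision is the same.

```rust
// Simplified model of the proposal above: the callee reuses the caller's
// buffer when one was provided and is large enough, and otherwise falls
// back to a fresh allocation (the role cabi_realloc plays in the ABI).
// `receive_into` and its shape are illustrative, not a real binding.
fn receive_into(caller_buf: Option<Vec<u8>>, native_data: &[u8]) -> (Vec<u8>, bool) {
    match caller_buf {
        // Caller provided a buffer with enough capacity: fill it in place.
        Some(mut buf) if buf.capacity() >= native_data.len() => {
            buf.clear();
            buf.extend_from_slice(native_data);
            (buf, true) // true = no new allocation was needed
        }
        // No buffer, or too small: allocate, as the callee does today.
        _ => (native_data.to_vec(), false),
    }
}
```

Note that the caller must always read the returned buffer rather than assume its own was used, mirroring the "be prepared to find the pointers replaced" requirement above.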
This is a good question. If we represent the length bound directly in the type:

```wit
record r {
    bytes: list<u8; ..n>;
    other: string;
};

f: func(n: u32) -> r;
```

but if you renamed the parameter `n`, the type would have to change to match.
For a function type such as
This is a great question and was the initial direction I was exploring too. The challenge here is with what happens when there is a caller-supplied buffer but it isn't big enough (esp. in partial+nested situations), and how this plays out in the bindings generators: reliably achieving the caller-supplied-buffer optimization while robustly handling the arbitrary-list-size case. By having a distinction in the type that rules out this awkward case, we can reliably generate simpler, more predictable bindings. |
Your proposal definitely has a leg up on my suggestion in that regard. At the same time, I wonder how much this matters in practice. Any self-respecting wasm engine would support caller-supplied buffers at the host side, so really it is only up to whether the guest supports it or not. If they do, they can be pretty confident the optimization will be used.
In the example above, I chose to fall back on dynamic allocation. The existing
I personally would expect
(Yay, even more syntax... 🥲)
One concern I have is that having distinct types could bifurcate/"color" the APIs. One variant that allocates, and another that doesn't. Or: one variant for "input" lists, another for "output" lists.
The component-model is already on track to eradicate sync/async coloring (IMO, a much, much tougher problem). I hope we can do the same for user- vs. host-allocated buffers, and forgo the additional WIT complications entirely. |
The challenge is effectively representing this in the guest bindings. E.g., for
That's a good general instinct and goal, but I think in this case we don't have the anti-composability effects caused by, e.g., function coloring. I think that's because … Another key ingredient to the overall story, which speaks to some of the non-parameter-bounded-length use cases I expect you're thinking about, is the forthcoming … |
Hmm, ok. Fair points. Different question then: is there a reason why the list length needs to be encoded in the WIT list type? From what you've described so far, it seems that, from a technical point of view, all we actually need to know from the WIT signature is whether the output list will be user-allocated or not. The actual allocated length can be passed at runtime. I.e., if we were to define just one type, let's say … Avoids the dependent-type parameter hassle. |
Also good question! At the Canonical ABI level, we ideally want to guarantee that, if the caller passes a buffer correctly-sized for |
The exact same validation can still be done if the length was passed at runtime, right? The only change is that in that case there is effectively an implicit |
Yeah, that's worth considering too. If we bring it back to what we write in WIT, if the WIT signature is just:

```wit
foo: func() -> list<u8, user-allocated>
```

then it seems a bit magical/surprising that the source-level signature contains an extra … Also, over the longer term, explicitly referencing parameters provides some extra expressivity, e.g.:

```wit
foo1: func(n: u32) -> tuple<list<u8; ..n>, list<u8; ..n>>;
foo2: func(m: u32, n: u32) -> tuple<list<u8; ..m>, list<u8; ..n>>;
foo3: func(m: u32, n: u32) -> list<list<u8; ..m>; ..n>;
```
|
I've given it some time, but I'm still not quite happy with what has been proposed so far (including my own proposals). Overall, I can't help feeling uneasy about the dependent typing stuff:

…

but obviously this won't work. I've looked at how existing programming languages handle this. Some observations:

…

Fair point. I'd argue the same argument equally applies to the buffer's …

After seeing how the other languages deal with it and what the WIT eventually will desugar into, I came up with another idea. To be clear, this is mostly the same as what has been discussed before, but interpreted from another angle: what if we allow value types to be borrowed? And first and foremost: allow …

Conceptually, a caller-provided buffer is a special kind of temporary … E.g.: regular lists are encoded as a (ptr, len) pair …

This has the same benefits as before:
as well as: all inputs now being explicitly captured in the parameter list. Binding generators for high-level languages that don't care about allocation can replace any …

**Whoops**

If you're with me so far, we can step it up a notch by keeping the borrowed outparam semantics _and also_ syntactically flipping the just-introduced `dest: borrow<...>` parameter back to the other side of the arrow, by realizing that _if_ we can borrow value types, _every_ return type can be seen as just a specialized form of a borrowed outparam (it already is at the ABI level).

```wit
// I.e. this:
local-address: func() -> result<ip-address, error-code>;
// is conceptually the same as this:
local-address: func(result: borrow<result<ip-address, error-code>>);
```

Moving stuff the other way around should work equally well, so: … can be swapped back into the signature we want: …

In low-level languages, the bindings should generate an additional 'output' parameter for these borrows. In dynamic languages that don't care about allocation, the signature can stay as it is. I realize this is a bit of a mental 360 only to end up almost back where I started. Hope it makes sense.

Edit: scratch that for now |
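To illustrate that last point, here is a hedged Rust sketch of what bindings for a low-level language might emit for the borrowed-outparam form. All names are hypothetical; this only models the "extra output parameter" shape, not real generated code.

```rust
use std::mem::MaybeUninit;

#[derive(Debug, PartialEq)]
enum ErrorCode { Unreachable }

// Sketch of the outparam form of
//   local-address: func(result: borrow<result<ip-address, error-code>>);
// The caller owns the result slot (e.g. on its stack) and the callee
// initializes it, so no callee-side allocation is needed.
fn local_address_into(out: &mut MaybeUninit<Result<u32, ErrorCode>>) {
    out.write(Ok(0x7f00_0001)); // pretend the local address is 127.0.0.1
}

// The value-returning form is a thin wrapper over the outparam form,
// matching the claim that every return type is a specialized outparam.
fn local_address() -> Result<u32, ErrorCode> {
    let mut slot = MaybeUninit::uninit();
    local_address_into(&mut slot);
    unsafe { slot.assume_init() } // sound: the callee always writes the slot
}
```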
Thanks for all the consideration exploring the design space here. I think you're right that the common idiom for expressing caller-supplied buffers is some sort of outparam that pairs the requested length with the buffer, so it's definitely worth exploring that area of the design space.

This is only partially a nit, but I think …

The next issue I see is that it's unclear why the combination of …

Now that starts to look pretty attractive, so let's compare:

…

with

…

The stumbling point for me when I was considering this sort of option earlier concerns the various modes of failure. In these baby examples it's obvious what "should" happen, but that depends on special-casing the meaning of the single top-level …
When considering various solutions to this, I found that they all ended up very special-case-y compared to putting the bounded-size …

Also, here's a more hand-wavy reason to prefer the former. If you think of 2 components as 2 separate processes and diagram the actual data flow between them, it looks like this (it's Mermaid time!):

```mermaid
sequenceDiagram
    Caller ->>+ Callee: n
    Callee -->>- Caller: bytes
```

In particular, in a component (as opposed to a raw .dll), Caller never actually passes the buffer to Callee, only the number `n`. Thus, for both the practical error-handling issues and the high-level "what's happening at the component level" reasons, I still prefer the former.

As for the concerns you mentioned:
|
Thinking some more about the problems I mentioned above with " |
In Go, the …

```wit
read: func(dest: buffer<u8>) -> result<u32, error>
```

It is the caller's responsibility to create a sub-slice of … |
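For comparison, Rust's standard `Read::read(&mut [u8]) -> io::Result<usize>` follows the same convention as Go. A minimal sketch of the idiom (with a hypothetical `read_into` standing in for the callee): the caller controls the destination size by sub-slicing a buffer it owns, and the callee reports how many bytes it actually wrote.

```rust
// The caller sizes the destination by sub-slicing a buffer it owns;
// the callee writes at most `dest.len()` bytes and returns the count,
// mirroring Go's `n, err := r.Read(p)` and Rust's `Read::read`.
fn read_into(source: &[u8], dest: &mut [u8]) -> usize {
    let n = source.len().min(dest.len());
    dest[..n].copy_from_slice(&source[..n]);
    n
}
```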
I would like to mention the statically sized buffer placed in static memory by the callee once the number of results exceeds the limit. From a re-entrancy/multi-threading perspective (think 0.3) this is a blocker, and it feels related to caller-provided buffers, as the caller knows exactly how much memory to reserve and would be able to do so on the stack, then pass the pointer to the callee.

PS: I also ran into the problem of calling such a function twice before calling the cabi_post on the first one and double-freeing, as both results used the same physical buffer and the second return value overwrote the first. |
Is there any publicly available information about the solution you came up with? 🙏 Because I know that several people are currently investigating independently into this problem, including me. |
Just to check: are you saying there is currently a problem we need to fix or are you wanting to remind us not to regress things in a reentrant future by baking in static memory addresses? If the latter: totally agreed.
(Sorry for the slowness here; now that #378 is up, I'm keen to focus on this more.)

One thing @sunfishcode pointed out to me was this organic use case in wasi-i2c which really does seem to want a … Thinking more about the details of how a …

Note that the caller's buffer (pointed to by …) …

I was also thinking about the language bindings for the caller of a function taking a … Now, in JS, if an …:

```js
let bufferArg = { view: new Uint8Array(arrayBuffer, begin, end), written: undefined };
read(bufferArg);
console.log(`written: ${ bufferArg.written }`);
```

That's not super-elegant, but it works and maybe, given that …:

```js
let view = new Uint8Array(arrayBuffer, begin, end, { resizable: true });
read(view);
console.log(`written: ${ view.length }`);
```

Anyhow, the point is that every language bindings generator will need to figure out its own answer to this question. I think there are ok options, so this isn't a major problem, but I'd be curious if there is anything more idiomatic in any language that handles the general case (…). |
Another realization: it seems like the key capability we want to add here isn't simply "avoiding cabi_realloc" but rather "providing one or many guest wasm memory pointers up-front so that the host can make syscalls that write directly into this wasm memory", particularly for cases where the exact amount isn't fixed, just upper-bounded, as with …

As one concrete example: if we had had …

(Or at least that's where I'm currently at, continuing to think about this.) |
Currently, calling an exported function with more than one result "register" has no way to provide a memory area for this known-size temporary buffer (which is later freed by the cabi_post). The C calling convention often optimizes returning a constant-size structure by passing a pointer to a caller-stack-located buffer. Right now wit-bindgen uses a static buffer for these return values, making the code non-re-entrant and non-thread-safe. To me this resembled the problem statement of caller-provided buffers, which is why I brought it up here.

PS: In my symmetric ABI experiment I use this return-value optimization for larger constant-size returns, but I feel that a more generic and elegant solution would be preferred as we design the ABI for caller-provided buffers. (I ran into exactly the ownership ambiguity problems for CPB described by you in #175, e.g. …) I realize that CPB is mostly about imported functions, but my native-plugin brain keeps insisting on taking both directions into view. |
Yesterday evening I realized that another option might solve the ownership/initialization ambiguity elegantly. If we could pass a pointer from the caller all the way through to a specialized (per-function) realloc function, the realloc could use caller-provided storage and return it once the conditions are considered right (e.g. the buffer is large enough). This makes it very flexible and fully customizable from the caller side, and it can also enable CPB for …

This way the caller can provide either temporary caller-stack-located storage (return-value optimization), storage from a pre-allocated pool, or simply the heap, at its own choice. E.g. in C notation (the first argument …) for … |
A matching WIT indicating the CPB convention could be (making memory-pool a built-in type): …
|
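As a hedged model of that scheme (not a proposed ABI; all names here are made up), here is a Rust sketch where the callee asks a caller-supplied hook for storage instead of a global `cabi_realloc`, so the bytes can come from whatever pool the caller reserved:

```rust
// Illustrative stand-in for a per-function realloc that consults
// caller-provided context: a request that fits is served from the
// caller's reserved pool; anything else falls back to the heap.
enum Backing { CallerPool, Heap }

struct PoolCtx {
    pool: Vec<u8>,   // storage the caller reserved up front
    handed_out: bool,
}

fn pool_realloc(ctx: &mut PoolCtx, size: usize) -> (Vec<u8>, Backing) {
    if !ctx.handed_out && ctx.pool.capacity() >= size {
        ctx.handed_out = true;
        (std::mem::take(&mut ctx.pool), Backing::CallerPool)
    } else {
        (vec![0u8; size], Backing::Heap) // ordinary heap allocation
    }
}
```

The caller decides what `pool` is backed by (its stack frame, a pre-allocated arena, ...), which is exactly the flexibility the comment above is after.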
Extending on this idea: instead of encoding the i31 triple directly in the call, there may be benefit in letting the caller construct the buffer from the same i31 triple upfront and pass a single buffer id/handle in the actual call instead. Consider a component composition like: … It would be great if … |
@badeend That sounds like a great idea. So perhaps the built-ins are:
(and symmetrically for |
@cpetig Oops, replying out-of-order.
Ah yes, I think that's just an optimization that would need to be updated in the async world to use (shadow-)stack-allocated storage instead.

Speaking to your other idea: this relates to a different idea that achieves the same effect, which I was coincidentally just talking with @alexcrichton about yesterday, and which I was calling "lazy lowering". The basic idea is that, in any place where we would have to call … There's a lot more to say about lazy value lowering, but I think it's super-promising and I'd like to file a new issue to discuss it. What's great about it is that it avoids … |
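One possible (hedged, invented-names) reading of lazy lowering as a Rust sketch: the value crosses the boundary as a handle, and the copy into linear memory happens only when and where the consumer asks for it.

```rust
// Hypothetical model of "lazy lowering": instead of the callee eagerly
// calling cabi_realloc and copying, the consumer gets a handle and decides
// later where the bytes should land (e.g. a stack buffer it already owns).
struct LazyList {
    host_bytes: Vec<u8>, // stands in for the not-yet-lowered value
}

impl LazyList {
    fn len(&self) -> usize {
        self.host_bytes.len()
    }

    // The copy happens here, into a destination the consumer chose,
    // rather than at the original call boundary.
    fn lower_into(self, dest: &mut [u8]) -> usize {
        let n = self.host_bytes.len().min(dest.len());
        dest[..n].copy_from_slice(&self.host_bytes[..n]);
        n
    }
}
```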
Does lazy-lowering also enable the guest to call into the host to receive temporary access to a slice of shared memory (mapped into the guest address space by separate means)? Basically this boils down to returning People on the embedded group look for a way to do zero-copy image processing in wasm, preferably using the component model. And I look for enabling zero-copy publisher-subscriber mechanisms like the iceoryx2 API (you receive temporary access to a slice for reading or writing, the lifetime of the slice is tied to a lending resource object). |
No, the idea (which I'll try to write up in a proper issue tomorrow; sorry for the hand-waving) is just to give the guest code more control over when the copy happens and into what destination address in the guest's linear memory. Now, when you say "mapped into the guest address space by separate means": any copy into guest memory (even in 0.2 today) can be implemented without actually copying (e.g., via |
For readable buffers specifically: If we let go of the "sequence of scalar numbers" requirement and broaden the definition of buffers to mean "any region of remote memory", wouldn't it be fair to say that:
This can then be thought of as a natural consequence of having a buffer in the parameter position vs. the return position. |
Great question! At a high-level, I think there is a meaningful semantic difference: if, e.g., a function has a |
Hmm, fair point. If the "number of bytes read" outparam is the only difference between readable buffers and regular lists, I wonder whether it is worth the effort of creating a dedicated type for it. Now that we have a solution for fixing the general case (all lists), I don't mind moving the "number of bytes read" back into the return type. I.e.:

…

is good enough for me ¯\\_(ツ)_/¯ |
Yes, in most situations, I think lazy-lowering is all you need (although there are some interesting per-language bindings questions as to how to expose this lazy-lowering in the source-level API). The one non-trivial performance advantage of the buffer types is that they provide the pointer to wasm memory before "the syscall", allowing "the syscall" to read/write directly from linear memory. Now many APIs don't have "the syscall" or will be subsumed by 0.3 |
This clearly isn't meant to discourage lazy-lowering design, but I just found that caller provided buffers map to a good and predictable enough API if you mix it with ideas from shared memory types (e.g. boost interprocess), see #398 for details. For now that solution has become my favorite as it also enables a lot of new uses of WIT. |
Just as an update on my end, working on streams in the add-stream branch, spec-internal |
Last week, an idea was presented by which callers could prevent unnecessary intermediate copies using the syntax: …

Is there any provision to make this work when the `list` is not directly present in the function result signature? For example, when it is nested within a record type (along with other properties). And even more challenging: what if the number of output lists is variable? Looking specifically at `wasi:sockets.udp.incoming-datagram-stream::receive` 😇