Caller provided buffers question #369

Open

badeend opened this issue Jun 19, 2024 · 30 comments

Comments

@badeend
Contributor

badeend commented Jun 19, 2024

Last week, an idea was presented that would let callers prevent unnecessary intermediate copies, using the syntax:

read: func(length: u64) -> result<list<u8; ..length>,error-code>;

Is there any provision to make this work when the list is not directly present in the function result signature? For example when it is nested within a record type (along with other properties)

And even more challenging: what if the number of output lists is variable? Looking specifically at wasi:sockets.udp.incoming-datagram-stream::receive 😇

@badeend
Contributor Author

badeend commented Jun 19, 2024

Is it possible to generalize the feature to any dynamically sized data structure, as opposed to only allowing it in places explicitly annotated as such (using list<T; ..size>)?

If I understand the ABI correctly, a list is represented as a pointer+length tuple. So:

record IncomingDatagram {
	data: list<u8>,
	remote_address: u32,
}

record ReceiveResult {
	datagrams: list<IncomingDatagram>,
}

receive: func() -> ReceiveResult;

.. is roughly lowered into:

struct IncomingDatagram {
	data_ptr: void*,
	data_len: u32,
	remote_address: u32,
}

struct ReceiveResult {
	datagrams_ptr: void*,
	datagrams_len: u32,
}

fn receive(out: ReceiveResult*);

At the moment, the memory referenced by these _ptrs is always allocated using cabi_realloc.


My suggestion (if possible) is to allow callers to optionally prefill any of the _ptr & _len fields before making the call. The function implementation may take advantage of this by writing directly into those buffers and/or using the provided buffer size as an optimization hint. Even if the implementation wasn't written to be aware of this, no harm, no foul: it will simply keep using dynamic allocation.

At all times, the caller must be prepared to find that the _ptr field has been overwritten to point at a dynamic allocation. Only if the _ptr was prefilled and hasn't been changed is the _len guaranteed to be less than or equal to the input value.

Example:

fn receive(out: ReceiveResult*) {

	let max_recv = if out.datagrams_ptr != NULL { out.datagrams_len } else { 16 };

	let native_dgrams = receive_the_native_datagrams(max_recv);
	let native_dgrams_len = native_dgrams.len();


	// Only use dynamic allocation if the caller did not provide a buffer or the provided buffer was too small.
	if out.datagrams_ptr == NULL || out.datagrams_len < native_dgrams_len {
		out.datagrams_ptr = cabi_realloc(native_dgrams_len * sizeof(IncomingDatagram));
	}
	out.datagrams_len = native_dgrams_len;


	for i in 0..native_dgrams_len {
		let native_dgram = native_dgrams[i];
		let out_dgram = &mut out.datagrams_ptr[i];

		let data_len = native_dgram.data.len();


		// Only use dynamic allocation if the caller did not provide a buffer or the provided buffer was too small.
		if out_dgram.data_ptr == NULL || out_dgram.data_len < data_len {
			out_dgram.data_ptr = cabi_realloc(data_len);
		}
		out_dgram.data_len = data_len;


		memcpy(out_dgram.data_ptr, native_dgram.data);
		out_dgram.remote_address = native_dgram.remote_address;
	}
}
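For illustration, the caller's half of that contract might look roughly like this (a hypothetical Rust sketch of the same lowered shapes; receive_into and the extern receive import are invented for this example, not an existing API):

#[repr(C)]
struct IncomingDatagram {
    data_ptr: *mut u8,
    data_len: u32,
    remote_address: u32,
}

#[repr(C)]
struct ReceiveResult {
    datagrams_ptr: *mut IncomingDatagram,
    datagrams_len: u32,
}

extern "C" {
    // Hypothetical lowered import from the example above.
    fn receive(out: *mut ReceiveResult);
}

// Caller-side use of the prefill contract sketched above.
unsafe fn receive_into(buf: &mut [IncomingDatagram]) -> ReceiveResult {
    let mut out = ReceiveResult {
        datagrams_ptr: buf.as_mut_ptr(),
        datagrams_len: buf.len() as u32,
    };
    receive(&mut out);
    if out.datagrams_ptr == buf.as_mut_ptr() {
        // Callee kept our buffer: out.datagrams_len <= buf.len() is guaranteed.
    } else {
        // Callee fell back to cabi_realloc; the caller now owns that allocation.
    }
    out
}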

Expected benefits:

  • Can even be used for strings?
  • Supports arbitrarily nested datatypes.
  • Composes cleanly: both the caller and the callee can independently support it or not.
  • Is "just" a ABI-level detail. Doesn't depend on the WIT author to have been clairvoyant enough to annotate this upfront.

@lukewagner
Member

Is there any provision to make this work when the list is not directly present in the function result signature?

This is a good question. If we represent the n in func(n: u32) -> list<u8; ..n> with the parameter index of n (which would be analogous to how we handle abstract resource types), then we'd need the list typedef to be nested in a scope created by a new compound form of the function type constructor (similar to module, component and instance type constructors). But this would be rather awkward to represent syntactically in WIT. So instead I was thinking of n just being a string literal "free variable" that is bound once it is used within the context of a function type (with a validation rule to check that there are no remaining free variables in the constructed function type). In that case, you could write:

record r {
  bytes: list<u8; ..n>;
  other: string;
};
f: func(n: u32) -> r;

but if you renamed the parameter n to something else, it would be a validation error (complaining n was not bound to a parameter name).

And even more challenging: what if the number of output lists is variable?

For a function type such as func(n: u32) -> list<list<u8; ..n>>;, I was thinking we could either (1) fall back to the normal ABI for list<u8>, or (2) reject it in validation. The fun case is func(m: u32, n: u32) -> list<list<u8; ..m>; ..n> (and in general any nesting where all containing lists have fixed or bounded size), where caller-supplied buffers also make sense (and basically capture the iovec pattern). To incrementalize the implementation effort, though, I was thinking we might start by temporarily rejecting all such nested cases.

Is it possible to generalize the feature for any dynamically sized data structure? [...]

This is a great question and was the initial direction I was exploring too. The challenge here is with what happens when there is a caller-supplied buffer but it isn't big enough (esp. in partial+nested situations), and how this plays out in the bindings generators: reliably achieving the caller-supplied-buffer optimization while robustly handling the arbitrary-list-size case. By having a distinction in the type that rules out this awkward case, we can more reliably generate simpler, predictable bindings.

@badeend
Contributor Author

badeend commented Jun 19, 2024

reliably achieving the caller-supplied-buffer optimization

Your proposal definitely has a leg up on my suggestion in that regard. At the same time, I wonder how much this matters in practice. Any self-respecting wasm engine would support caller-supplied buffers at the host side, so really it is only up to whether the guest supports it or not. If they do, they can be pretty confident the optimization will be used.

The challenge here is with what happens in the case where there is a caller-supplied buffer, but it isn't big-enough (esp. in partial+nested situations)

In the example above, I chose to fall back on dynamic allocation. The existing max-length-esque parameters could continue to exist to make sure this doesn't happen in practice.


I was thinking of n just being a string literal "free variable", [...] if you renamed the parameter n to something else, it would be a validation error

I personally would expect n to be defined in some kind of lexical scope. In this case, most naturally as a parameter of the type:

record r<n: u32> {
  bytes: list<u8; ..n>;
  other: string;
};
f: func(n: u32) -> r<n>;

(Yay, even more syntax... 🥲)

By having a distinction in the type that rules out this awkward case, we can more reliably generate more simple, predictable bindings.

One concern I have is that having distinct types could bifurcate/"color" the APIs. One variant that allocates, and another that doesn't. Or: one variant for "input" lists, another for "output" lists.

resource http-fields {
	get: func(name: string) -> list<u8>;
	// vs.
	get: func(name: string, maxlen: u32) -> list<u8, ..maxlen>;
}

The component-model is already on track to eradicate sync/async coloring (IMO, a much, much tougher problem). I hope we can do the same for user- vs. host-allocated buffers, and forgo the additional WIT complications entirely.

@lukewagner
Member

In the example above, I chose to fall back on dynamic allocation. The existing max-length-esque parameters could continue to exist to make sure this doesn't happen in practice.

The challenge is effectively representing this in the guest bindings. E.g., for func(n: u32) -> list<u8; ..n>, JS bindings could easily take a Uint8Array as an outparam, but if we have the "maybe reuse, maybe allocate" semantics, now we would need something much more complicated, or else we'd give up entirely and just return a new array by value. And I don't think this is JS-specific; I think the same tension arises in other languages too.
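To make that tension concrete outside of JS, here is a hypothetical Rust sketch of the two binding shapes (ErrorCode, read_bounded, read_maybe and ReadResult are all invented names for illustration): with a parameter-bounded list the binding can take a plain out-slice, whereas "maybe reuse, maybe allocate" forces a return type that is sometimes a borrow and sometimes a fresh allocation.

pub struct ErrorCode; // stand-in for the WIT error-code

// Shape A: read: func(n: u32) -> result<list<u8; ..n>, error-code>
// The bound lets the binding accept a caller-supplied slice and promise
// that at most buf.len() bytes come back (enforced by a trap during lifting).
pub fn read_bounded(buf: &mut [u8]) -> Result<usize, ErrorCode> {
    unimplemented!("canonical ABI call goes here")
}

// Shape B: "maybe reuse, maybe allocate" semantics for a plain list<u8>.
// The caller can no longer assume the data landed in its slice.
pub enum ReadResult<'a> {
    InBuffer(&'a mut [u8]), // callee honoured the caller-supplied buffer
    Allocated(Vec<u8>),     // callee fell back to cabi_realloc
}

pub fn read_maybe<'a>(buf: &'a mut [u8]) -> Result<ReadResult<'a>, ErrorCode> {
    unimplemented!("canonical ABI call goes here")
}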

One concern I have is that having distinct types could bifurcate/"color" the APIs.

That's a good general instinct and goal, but I think in this case, we don't have the anti-composability effects caused by, e.g., function coloring. I think that's because list<u8; ..n> is just a specialization of list<T> (just like list<u8; 4>, as proposed in #304), in that it's just a list with extra guarantees. It's true that, in the short term, we'd probably want to introduce list<u8; ..n>-returning versions of existing list<u8>-returning functions, but that's just to maintain semver compatibility; in a breaking future version, we could keep only the list<u8; ..n> version.

Another key ingredient to the overall story that speaks to some of the non-parameter-bounded-length use cases I expect you're thinking about is the forthcoming streams. If we do it right, the ABI for a function returning a stream would allow you to pass a caller-supplied buffer in the initial call that returns a stream, saying "use as much of this as you can, but if it's not enough (or the data isn't ready), that's fine; tell me how much you were able to write and then return a handle to a stream that lets me consume the rest later". With that, in the cases where you couldn't return a parameter-bounded list because the list was fully dynamically sized, you'd probably want to return a stream instead (with the nice property that streams handle the very-big cases gracefully w/o OOM).

@badeend
Contributor Author

badeend commented Jun 19, 2024

Hmm, ok. Fair points.

Different question then: is there a reason why the list length needs to be encoded in the WIT list type? From what you've described so far, it seems that from a technical point of view all we actually need to know from the WIT signature is whether the output list will be user-allocated or not. The actual allocated length can be passed at runtime.

I.e., if we were to define just one type, let's say list<T; user-allocated> (syntax TBD), where user-allocated is a keyword, not a reference to a parameter. user-allocated lists would continue being "just" a specialization of normal lists.

This avoids the dependent-type-parameter hassle.

@lukewagner
Member

Also good question! At the Canonical ABI level, we ideally want to guarantee that, if the caller passes a buffer correctly sized for n elements, the callee cannot overflow that buffer by returning more than n elements -- instead this case should reliably trap, blaming the callee. That means we need to know the runtime value of n so that when the callee's returned list is lifted (from an i32 ptr+length pair), there is a trap if the length is greater than n. This also allows the caller to statically assume that the length of the returned list is in fact no greater than n.
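In pseudocode (a Rust-flavoured sketch for illustration, not the actual spec text), the lifting-side check being described is roughly:

struct Trap; // lifting failure, blamed on the callee

// `n` is the runtime value of the bounding parameter supplied by the caller;
// (ret_ptr, ret_len) is the i32 pair the callee returned for the list.
fn lift_bounded_list(ret_ptr: u32, ret_len: u32, n: u32) -> Result<(u32, u32), Trap> {
    if ret_len > n {
        return Err(Trap); // callee tried to return more elements than the caller allowed
    }
    Ok((ret_ptr, ret_len)) // from here on the caller may assume ret_len <= n
}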

@badeend
Contributor Author

badeend commented Jun 19, 2024

The exact same validation can still be done if the length was passed at runtime, right? The only change is that in that case there is effectively an implicit length parameter per provided buffer, instead of one global one per function.

@lukewagner
Member

Yeah, that's worth considering too. If we bring it back to what we write in WIT, if the WIT signature is just:

foo: func() -> list<u8, user-allocated>

then it seems a bit magical/surprising that the source-level signature contains an extra length parameter not declared here. Also, that length parameter isn't just "the length of the returned list", it's also just a straight-up parameter that you'd want to name and document just like any other ordinary parameter.

Also, over the longer term, explicitly-referencing parameters provides some extra expressivity, e.g.:

foo1: func(n: u32) -> tuple<list<u8; ..n>,list<u8; ..n>>;
foo2: func(m: u32, n: u32) -> tuple<list<u8; ..m>, list<u8; ..n>>;
foo3: func(m: u32, n: u32) -> list<list<u8; ..m>; ..n>;

@badeend
Contributor Author

badeend commented Jun 27, 2024

I've given it some time, but I'm still not quite happy with what has been proposed so far (including my own proposals).

Overall, I can't help feeling uneasy about the dependent typing stuff:

  • it brings additional complexity to the WIT language,
  • dependent typing in general is completely foreign to most developers
  • none of the mainstream languages support these kinds of constraints, so this metadata will be lost in binding generation anyway.
  • I'm worried it will be an endless game of whack-a-mole regarding what is and isn't expressible in WIT. Take wasi-io's poll for example. Ideally, one would want to write this:
poll: func(in: list<borrow<pollable>>) -> list<u32; ..(in.length)>;

but obviously this won't work.


I've looked at how existing programming languages handle this.

// C:
ssize_t read(uint8_t* dest_ptr, size_t dest_cap);

// Rust:
fn read(dest: &mut [u8]) -> Result<usize>;

// JavaScript:
function read(dest: Uint8Array): number;

// Java:
int read(byte[] dest);

// C#:
int Read(Span<byte> dest);

// Go:
func Read(dest []byte) (len int, error);

Some observations:

  • All languages except C have their own idiomatic way of representing mutable buffers as a single type containing both the ptr and capacity.
  • The destination buffer is present in the source language and is always in the parameter-position.
  • The actual number of bytes written (len) is always in the result-position.
  • The constraint that len must be less than or equal to cap is only "enforced" through documentation.

it seems a bit magical/surprising that the source-level signature contains an extra length parameter not declared here. Also, that length parameter isn't just "the length of the returned list", it's also just a straight-up parameter that you'd want to name and document just like any other ordinary parameter.

Fair point. I'd argue the same reasoning applies equally to the buffer's ptr (?). The function implementation needs both in order to write into the destination buffer.


After seeing how the other languages deal with it and what the WIT eventually will desugar into, I came up with another idea. To be clear this is mostly the same as what has been discussed before, but interpreted from another angle:

What if we allow value types to be borrowed? And first and foremost: allow lists to be borrowed.

Conceptually, a caller-provided buffer is a special kind of temporary resource. The memory backing the buffer is only valid and accessible to the callee during the function invocation, just like a regular resource borrow.

E.g.:

read: func(dest: borrow<list<u8>>) -> result<_, error-code>;

Regular lists are encoded as a (ptr+len) pair. Borrowed mutable lists would then be a (ptr+len+capacity) triple. Before calling, the caller must set the ptr & capacity to the proper values and len to 0.

This has the same benefits as before:

  • borrows can be arbitrarily nested,
  • could work equally well for strings,

as well as: all inputs now being explicitly captured in the parameter list.

Binding generators for high-level languages that don't care about allocation can replace any borrow<list<...>> in the signature with just an u32 that represents the capacity, and heap-allocate the buffers transparently.


If you're with me so far, we can step it up a notch by keeping the borrowed-outparam semantics and also syntactically flipping the just-introduced dest: borrow<list<u8>> parameter back to the other side of the arrow: by realizing that if we can borrow value types, every return type can be seen as just a specialized form of a borrowed outparam (it already is at the ABI level).
// I.e. this:

local-address: func() -> result<ip-address, error-code>;

// is conceptually the same as this:

local-address: func(result: borrow<result<ip-address, error-code>>);

Moving stuff the other way around should work equally well, so:

read: func(dest: borrow<list<u8>>);

can be swapped back into the signature we want:

read: func() -> borrow<list<u8>>;

In low-level languages, the bindings should generate an additional 'output' parameter for these borrows. In dynamic languages that don't care about allocation, the signature can stay as it is.

I realize this is a bit of a mental 360 only to end up almost back to where I started. Hope it makes sense.

Edit: scratch that for now

@lukewagner
Member

Thanks for all the consideration exploring the design space here. I think you're right that the common idiom for expressing caller-supplied buffers is some sort of outparam that pairs the requested length with the buffer, so it's definitely worth exploring that area of the design space.

This is only partially a nit, but I think borrow and resources aren't quite the right starting point to build on since, with a borrowed handle, I'm passing in a valid resource that the callee receives and can use during the call, whereas in our case here, we're not passing in a list<u8> that the callee can see (before writing into it). So let's say that instead we use a different word altogether, like out (e.g., so I could write read: func(dest: out<list<u8>>) -> result).

The next issue I see is that it's unclear why the combination of out<...> with list<u8> means that I pass in a length and get back a list of exactly that length. E.g., let's say I wrote read: func(dest: out<tuple<u32, u32>>), there's no length there. Thus, the incoming length is due to some special combination of out<...> when applied to list<...>, which suggests that you don't want out<...> as a general type constructor but, rather, something fused with lists, e.g. buffer (so I could write read: func(dest: buffer<u8>) -> result).

Now that starts to look pretty attractive, so let's compare:

read: func(n: u32) -> result<list<u8; ..n>>;

with

read: func(dest: buffer<u8>) -> result;

The stumbling point for me when I was considering this sort of option earlier concerns the various modes of failure. In these baby examples it's obvious what "should" happen, but that depends on special-casing the meaning of the single top-level result return type's error case. But what about:

  • other ways to signal failure than literally a single top-level result (consider all the cases in Preview 2 where we have nested results with different kinds of failure being signaled by each one, with some options thrown in too).
  • if there are multiple buffers in the parameter list (as you were excited about above) -- does failure have to be atomic and if not, how do you signal which ones succeeded or failed?

When I considered various solutions to this, I found that they all ended up very special-case-y compared to putting the bounded-size list in the return type, where the list works just like a normal value and the fact that I'm passing a caller-supplied buffer is merely an ABI optimization to avoid a realloc call.

Also, here's a more hand-wavy reason to prefer the former. If you think of 2 components as 2 separate processes and diagram the actual data flow between them, it looks like this (it's Mermaid time!):

sequenceDiagram
    Caller ->>+ Callee: n
    Callee -->>- Caller: bytes

In particular, in a component (as opposed to a raw .dll), Caller never actually passes the buffer to Callee, only the number n. Thus, the latter function signature paints a sort of inaccurate picture of what's happening from a cross-component perspective. It's even clearer if you imagine Caller and Callee are on different machines (which isn't a thing that we should ever be doing automatically (like CORBA did), but certainly there are some cases where it's appropriate): in this case, the caller-supplied buffer never leaves Caller's machine; it's simply an ABI detail for "where to put the bytes when they come back".

Thus, for both the practical error-handling issues and the high-level "what's happening at the component level" reasons, I still prefer the former.

As for the concerns you mentioned:

  • Programmers don't have to learn about dependent typing, the WIT type just documents a dynamic fact that is dynamically enforced (it's a "post-condition").
  • Languages don't need any dependent typing: they can either ignore the size-bound (and just return the same thing as a normal list) or use the same outparam signatures that you wrote above.
  • WIT resolution/validation is also not doing anything fancy because it's only validating that the types are well-formed (no free variables, and the referenced parameters have integral types).
  • We may or may not extend the expressivity of what can go in a list bound, but there's nothing forcing us to: we can motivate each extension balancing complexity and benefit. We can "pay as we go".

@lukewagner
Member

Thinking some more about the problems I mentioned above with "buffer" and exploring the design space a bit with @sunfishcode, there may actually be some good solutions. What I like about this whole approach is, as you said, it tends to map directly to what you want to see in the source bindings, and it's also somewhat more expressive (particularly in lists-of-buffers cases like writev/readv). More thought is necessary to explore the design, but I just wanted to point out that it's seeming more feasible now than I initially thought in my last comment.

@ydnar

ydnar commented Jul 8, 2024

In Go, the io.Reader interface reads into a caller-supplied buffer, returning the number of bytes read, or an error. The equivalent here would be:

read: func(dest: buffer<u8>) -> result<u32, error>

It is the caller’s responsibility to create a sub-slice of buffer or only read the N items returned from read.
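In Rust terms the same discipline might look like this (illustrative only; read and Error here stand in for whatever a bindings generator would emit for the WIT above):

pub struct Error; // stand-in for the WIT error type

// Hypothetical binding for: read: func(dest: buffer<u8>) -> result<u32, error>
pub fn read(dest: &mut [u8]) -> Result<u32, Error> {
    unimplemented!("canonical ABI call goes here")
}

fn example(buf: &mut [u8; 1024]) -> Result<(), Error> {
    let n = read(&mut buf[..])? as usize;
    let valid = &buf[..n]; // only the first n items are meaningful
    let _ = valid;
    Ok(())
}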

@cpetig

cpetig commented Jul 18, 2024

I would like to mention the statically sized buffer placed in static memory by the callee once the number of results exceeds the limit (MAX_FLAT_RESULTS in wit-bindgen).

From a re-entrancy/multi-threading perspective (think 0.3) this is a blocker and feels related to caller provided buffers, as the caller knows exactly how much memory to reserve and would be able to do so on the stack, then pass the pointer to the callee.

PS: I also ran into the problem of calling such a function twice before calling the cabi_post on the "first one" and double-freeing as both results used the same physical buffer and the second return value overwrote the first.

@cpetig

cpetig commented Jul 18, 2024

Thinking some more about the problems I mentioned above with "buffer" and exploring the design space a bit with @sunfishcode, there may actually be some good solutions. What I like about this whole approach is, as you said, it tends to map directly to what you want to see in the source bindings, and it's also somewhat more expressive (particularly in lists-of-buffers cases like writev/readv). More thought is necessary to explore the design, but I just wanted to point out that it's seeming more feasible now than I initially thought in my last comment.

Is there any publicly available information about the solution you came up with? 🙏 Because I know that several people are currently investigating independently into this problem, including me.

@lukewagner
Member

I would like to mention the statically sized buffer placed in static memory by the callee once the number of results exceeds the limit (MAX_FLAT_RESULTS in wit-bindgen).

Just to check: are you saying there is currently a problem we need to fix or are you wanting to remind us not to regress things in a reentrant future by baking in static memory addresses? If the latter: totally agreed.

Is there any publicly available information about the solution you came up with? 🙏 Because I know that several people are currently investigating independently into this problem, including me.

(Sorry for the slowness here; now that #378 is up, I'm keen to focus on this more.) One thing @sunfishcode pointed out to me was this organic use case in wasi-i2c which really does seem to want a buffer<u8> that can be used as the payload of the read case.

Thinking more about the details of how a buffer<T> might work:

  • At least for starters, I think we'd want to limit T to just the scalar number types.
  • On the caller's side, it seems like a buffer parameter would be lifted from 3 i32s: a ptr, a capacity, and a len outparam that is set by the callee.
  • On the callee's side, it seems like a buffer<T> would be lowered into 2 fields: capacity and bufidx, and there would be a new built-in canon buffer.write: [bufidx:i32 ptr:i32 len:i32] -> [] (which could be called multiple times to copy memory from the callee's linear memory into the caller's and would trap if capacity would be exceeded).

Note that the caller's buffer (pointed to by ptr) would be directly mutated by buffer.write during the call, not just at the end, so in async/reentrant situations, buffer.writes would be visible to the caller before the call finished (which seems fine and par for the course in async, but worth noting).
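A callee-side sketch of that lowering, assuming a buffer_write intrinsic standing in for the proposed canon buffer.write built-in (hypothetical Rust; neither the intrinsic nor this export signature exists today):

extern "C" {
    // Stand-in for `canon buffer.write`: copies `len` bytes from the
    // callee's linear memory at `ptr` into the caller's buffer identified
    // by `bufidx`, trapping if the buffer's capacity would be exceeded.
    fn buffer_write(bufidx: i32, ptr: *const u8, len: i32);
}

// Hypothetical lowered export for: read: func(dest: buffer<u8>) -> ...
#[no_mangle]
pub unsafe extern "C" fn read(capacity: i32, bufidx: i32) {
    let mut scratch = [0u8; 512];
    let n = scratch.len().min(capacity as usize);
    for b in scratch.iter_mut().take(n) {
        *b = 0xAB; // placeholder for filling from the underlying source
    }
    // Copy from the callee's linear memory into the caller's buffer.
    buffer_write(bufidx, scratch.as_ptr(), n as i32);
}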

I was also thinking about the language bindings for the caller of a function taking a buffer, because the tricky question is how to communicate the len outparam back to the caller in the source bindings. The tricky thing with all the idioms listed above is that they can't mutate the length of the incoming slice/span. The bindings could maybe automatically add len as a return value in some cases, but in the general case (e.g., the wasi-i2c one I linked to above), it's less clear how this should work, and thus it seems better to somehow reflect len as an outparam of each individual buffer argument.

Now, in JS, if an ArrayBuffer is passed, it could be .resize()ed based on len. And similarly C++/Rust could pass a mutable reference to a std::vector/Vec, and so on for other languages. However, this doesn't work if the caller wants to write into just a slice/view/span of a larger buffer/vector/array (noting that the length of a slice/view/span is immutable). Thus, it seems like you'd need to pass a mutable reference to a structure that contains both the slice/view/span and a mutable len outparam. E.g., in JS, to call read: func(buf: buffer<u8>), you might write something vaguely like:

let bufferArg = { view: new Uint8Array(arrayBuffer, begin, end), written: undefined };
read(bufferArg);
console.log(`written: ${ bufferArg.written }`);

That's not super-elegant, but it works and maybe, given that ArrayBuffer.resize() was added recently, resizable typed array views could be standardized in the future and so I'd be able to simply write:

let view = new Uint8Array(arrayBuffer, begin, end, { resizable: true });
read(view);
console.log(`written: ${ view.length }`);

Anyhow, the point is that every language bindings generator will need to figure out its own answer to this question. I think there are OK options, so this isn't a major problem, but I'd be curious if there is anything more idiomatic in any language that handles the general case (buffers nested inside records and variants and lists)?
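For what it's worth, one possible Rust analogue of the JS shape above (hypothetical; not what any existing generator emits) would be a small wrapper that carries the span together with a written outparam:

// Hypothetical caller-side wrapper: the view the callee may write into,
// plus an outparam the bindings fill in with how much was written.
pub struct BufferArg<'a> {
    pub view: &'a mut [u8],
    pub written: usize,
}

// Hypothetical generated binding for: read: func(buf: buffer<u8>)
pub fn read(buf: &mut BufferArg<'_>) {
    // ... lower (ptr, capacity), make the call, then store the len outparam ...
    unimplemented!()
}

fn example(backing: &mut [u8; 128]) {
    let mut arg = BufferArg { view: &mut backing[16..64], written: 0 };
    read(&mut arg);
    let _filled = &arg.view[..arg.written]; // the bytes the callee wrote
}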

@lukewagner
Member

Another realization: it seems like the key capability we want to add here isn't simply "avoiding cabi_realloc" but rather "providing one or many guest wasm memory pointers up-front so that the host can make syscalls that write directly into this wasm memory", particularly for cases where the exact amount isn't fixed, just upper-bounded, as with read. But this problem statement applies just as well to write (when partial writes are allowed, e.g., due to backpressure). Thus, it seems like we'd want both a readable-buffer<T> and a writable-buffer<T> (names to be bikeshedded). Everything I sketched above I think would work symmetrically (e.g., replacing buffer.write with buffer.read).

As one concrete example: if we had had readable-buffer, we wouldn't have needed to add wasi:io/output-stream|check-write; write would just take a readable-buffer and thereby allow partial writes in cases of backpressure. Of course I'd like the new stream type in Preview 3 to obviate wasi:io streams entirely, but as the wasi-i2c example shows, there will still be more specialized cases that want buffer-style ABIs that aren't simply streams.

(Or at least that's where I'm currently at, continuing to think about this.)

@cpetig

cpetig commented Jul 24, 2024

I would like to mention the statically sized buffer placed in static memory by the callee once the number of results exceeds the limit (MAX_FLAT_RESULTS in wit-bindgen).

Just to check: are you saying there is currently a problem we need to fix or are you wanting to remind us not to regress things in a reentrant future by baking in static memory addresses? If the latter: totally agreed.

Currently, when calling an exported function with more than one result "register", there is no way to provide a memory area for this known-size temporary buffer (which is later freed by the cabi_post call).

The C calling convention often optimizes returning a constant-size structure by passing a pointer to a buffer located on the caller's stack.

Right now wit-bindgen uses a static buffer for these return values, making the code non-re-entrant and non-thread-safe. To me this resembles the problem statement of caller-provided buffers, which is why I brought it up here.

PS: In my symmetric ABI experiment I use this return-value optimization for larger constant-size returns - but I feel that a more generic and elegant solution would be preferred as we design the ABI for caller-provided buffers. (I ran into exactly the ownership-ambiguity problems for CPB you described in #175, e.g. out<list<resource>> feels inconsistent, as the resources are transferred to the caller but the list isn't a normal heap-based vector.)

I realize that CPB is mostly about imported functions, but my native plugin brain keeps insisting on taking both directions into view.

@cpetig

cpetig commented Jul 24, 2024

Yesterday evening I realized that another option might solve the ownership/initialization ambiguity elegantly. If we could pass a pointer from the caller all the way through to a specialized (per-function) realloc function, that realloc could use caller-provided storage and return it whenever the conditions are right (e.g. the buffer is large enough).

This makes it very flexible and fully customizable from the caller side, and it can also enable CPB for list<string>, at the unchanged cost of making all the realloc calls.

This way the caller can provide either temporary caller-stack-located storage (return-value optimization), storage from a pre-allocated pool, or simply heap memory, at its own choice.

E.g. in C notation (the first argument, pool, is new; the pointer to buffer is passed through from the file_read call):

void* file_read_realloc(void* pool, void* oldptr, size_t oldsize, size_t align, size_t size) {
   assert(size <= 128);  // the caller-provided pool is known to be 128 bytes
   return pool;          // hand back caller-provided storage instead of heap memory
}
...  
  char buffer[128];
  pointer_and_size_t result = file_read(buffer, sizeof(buffer));

For list<string> you would need to design a more complex allocator and store the remaining free bytes inside the pool structure.

@cpetig

cpetig commented Jul 24, 2024

A matching WIT indicating the CPB convention could be (making memory-pool a built-in type)

   file-read: func(buffer: memory-pool, size: u32) -> list<u8>;

@badeend
Contributor Author

badeend commented Jul 24, 2024

Thinking more about the details of how a buffer<T> might work:

  • At least for starters, I think we'd want to limit T to just the scalar number types.
  • On the caller's side, it seems like a buffer parameter would be lifted from 3 i32s: a ptr, a capacity, and a len outparam that is set by the callee.
  • On the callee's side, it seems like a buffer<T> would be lowered into 2 fields: capacity and bufidx, and there would be a new built-in canon buffer.write: [bufidx:i32 ptr:i32 len:i32] -> [] (which could be called multiple times to copy memory from the callee's linear memory into the caller's and would trap if capacity would be exceeded).

Extending on this idea: instead of encoding the i32 triple directly in the call, there may be benefit in letting the caller construct the buffer from the same i32 triple upfront, and pass a single buffer id/handle in the actual call instead.

Consider a component composition like: A -> B -> C where B forwards calls from C to A.

It would be great if B could simply forward the buffer id received from C directly into the call to A as-is. This would let A read/write directly into C's memory, despite the fact that the call went through B, saving unnecessary copies.

@lukewagner
Member

@badeend That sounds like a great idea. So perhaps the built-ins are:

  • canon writable-buffer.new $t: [ptr:i32 capacity:i32] -> [bufidx:i32]
  • canon writable-buffer.remain $t: [bufidx:i32] -> [n:i32]
  • canon writable-buffer.write $t: [bufidx:i32 ptr:i32 n:i32] -> [] (traps if you exceed remain)
  • canon writable-buffer.written $t: [bufidx:i32] -> [n:i32]
  • canon writable-buffer.drop $t: [bufidx:i32] -> []

(and symmetrically for readable-buffer) and always borrow semantics for calls (since there's no destructor to call on drop in any case), like you were saying above. WDYT?
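Caller-side, those built-ins might be used roughly like this (a hypothetical Rust sketch; the extern declarations stand in for the canon built-ins above and a lowered read import, none of which exist yet):

extern "C" {
    // Stand-ins for the proposed canon built-ins.
    fn writable_buffer_new(ptr: *mut u8, capacity: i32) -> i32;
    fn writable_buffer_written(bufidx: i32) -> i32;
    fn writable_buffer_drop(bufidx: i32);
    // Stand-in for a lowered import taking a writable-buffer<u8>.
    fn read(bufidx: i32);
}

unsafe fn call_read(dest: &mut [u8]) -> usize {
    let bufidx = writable_buffer_new(dest.as_mut_ptr(), dest.len() as i32);
    read(bufidx); // callee may call writable-buffer.write repeatedly, up to capacity
    let written = writable_buffer_written(bufidx) as usize;
    writable_buffer_drop(bufidx);
    written // dest[..written] was filled directly by the callee
}

This is also where the forwarding case mentioned above falls out naturally: a middle component could pass the bufidx it received straight through to its own import instead of creating a new buffer.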

@lukewagner
Member

@cpetig Oops, replying out-of-order.

Right now wit-bindgen uses a static buffer for these return values, making the code non-re-entrant and non-thread-safe.

Ah yes, I think that's just an optimization that would need to be updated in the async world to use (shadow-)stack allocated storage instead.

Speaking to your other idea: it relates to a different idea that achieves the same effect, which I was coincidentally just talking about with @alexcrichton yesterday and which I was calling "lazy lowering". The basic idea is that, in any place where we would have to call cabi_realloc to get an i32 pointer to copy into, the CABI instead inserts, in that same i32, an index that represents the "value to be lowered", and we provide canon built-ins for (lazily) lowering this value, given the lazy-value index. This ends up inverting the control flow in a way that gives the guest code full flexibility for controlling allocation. Also coincidentally, the idea of zero-copy forwarding via a "lazy value index" came up in a way that mirrors @badeend's most recent idea above.

There's a lot more to say about lazy-value-lowering, but I think it's super-promising and I'd like to file a new issue to discuss it. What's great about it is that it avoids cabi_realloc for arbitrary types (of unknown-up-front size, unlike the buffers we're considering here) and it also integrates pretty naturally with stream. This led me to ask: "can we do lazy-lowering instead of the buffers?" (as the more general solution for avoiding cabi_realloc). The one key use case not addressed by lazy-lowering, though, is that the buffers give you the pointer to memory up-front before the underlying syscall, whereas lazy-lowering inherently happens "later" (as part of the process of returning results), and thus I think the buffer types we're discussing here provide distinct additional value for these more-specialized use cases like i2c's transaction. Thus, I think we want both, with the net result being no cabi_realloc ever (if you don't want it), and up-front memory buffers for the special/advanced cases when you want them as part of the WIT interface.
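To give a rough feel for the inverted control flow, here is a very tentative Rust sketch of how lazy lowering might look from the guest (the lazy_* intrinsics and the get_bytes import are invented names purely for illustration, not actual proposed signatures):

extern "C" {
    // Invented names: given a lazy-value index for a list<u8>, query its
    // length and lower it into memory the guest chose itself.
    fn lazy_list_length(lazyidx: i32) -> i32;
    fn lazy_list_lower(lazyidx: i32, dest: *mut u8);
    // Hypothetical import whose flat result carries a lazy-value index in
    // the i32 slot where a cabi_realloc'd pointer would otherwise go.
    fn get_bytes() -> i32;
}

unsafe fn fetch() -> Vec<u8> {
    let lazyidx = get_bytes();
    let len = lazy_list_length(lazyidx) as usize;
    // The guest, not cabi_realloc, decides where (and whether) to allocate.
    let mut dest = vec![0u8; len];
    lazy_list_lower(lazyidx, dest.as_mut_ptr());
    dest
}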

@cpetig

cpetig commented Jul 24, 2024

Does lazy-lowering also enable the guest to call into the host to receive temporary access to a slice of shared memory (mapped into the guest address space by separate means)? Basically this boils down to returning &mut 'self [u8] from a resource (WIT syntax could be -> borrow<list<u8>>).

People on the embedded group look for a way to do zero-copy image processing in wasm, preferably using the component model.

And I'm looking to enable zero-copy publisher-subscriber mechanisms like the iceoryx2 API (you receive temporary access to a slice for reading or writing; the lifetime of the slice is tied to a lending resource object).

@lukewagner
Member

Does lazy-lowering also enable the guest to call into the host to receive temporary access to a slice of shared memory (mapped into the guest address space by separate means)?

No, the idea (which I'll try to write up in a proper issue tomorrow; sorry for the hand-waving) is just to give the guest code more control over when the copy happens and into what destination address in the guest's linear memory.

Now, when you say "mapped into the guest address space by separate means": any copy into guest memory (even in 0.2 today) can be implemented without actually copying (e.g., via mmap() of a file with MAP_FIXED|MAP_PRIVATE) with the appropriate alignment/size restrictions, but I get the impression that many of these same embedded systems also don't have MMUs. I'd love to learn more about what mapping techniques are possible in common embedded systems and what their rules/restrictions are (maybe not in this issue, but offline or elsewhere).

@badeend
Contributor Author

badeend commented Jul 25, 2024

The basic idea is that: in any place where we would have to call cabi_realloc to get an i32 pointer to copy into, the CABI instead inserts, in that same i32, an index that represents the "value to be lowered". (...) This led me to ask: "can we do lazy-lowering instead of the buffers?"

For readable buffers specifically: If we let go of the "sequence of scalar numbers" requirement and broaden the definition of buffers to mean "any region of remote memory", wouldn't it be fair to say that:
"lazy-value indexes" are readable-buffer handles?

The one key use case not addressed by lazy-lowering, though, is that the buffers give you the pointer to memory up-front before the underlying syscall, whereas lazy-lowering inherently happens "later".

This can then be thought of as a natural consequence of having a buffer in the parameter position vs. the return position.

@lukewagner
Member

For readable buffers specifically [...] wouldn't it be fair to say that: "lazy-value indexes" are readable-buffer handles?

Great question! At a high-level, I think there is a meaningful semantic difference: if, e.g., a function has a list parameter, the caller can assume the whole list is given to the callee, end of story. Maybe the callee ignores all or half of the received list, but the caller doesn't know or care. With a readable-buffer OTOH, there is this "outparam" of "how much did the callee actually consume" which is semantically part of the call contract, making it an altogether more-complex contract. Thus, you'd want separate WIT/CM-level types so that you could say what you wanted the contract to be. However, it does seem semantically meaningful to pass a lazy-list (e.g., received as an export param) to an import with a readable-buffer param and vice versa, so perhaps they both go into the same dynamic index space and share the same canon built-ins.

@badeend
Contributor Author

badeend commented Jul 25, 2024

Hmm, fair point.

If the "number of bytes read" outparam is the only difference between readable buffers and regular lists, I wonder whether it then is worth the effort of creating a dedicated type for it. Now that we have a solution for fixing the general case (all lists), I don't mind moving the "number of bytes read" back to into the return type.

I.e.:

write: func(data: list<u8>) -> u32;

is good enough for me ¯\_(ツ)_/¯

@lukewagner
Member

Yes, in most situations, I think lazy-lowering is all you need (although there are some interesting per-language bindings questions as to how to expose this lazy-lowering in the source-level API). The one non-trivial performance advantage of the buffer types is that they provide the pointer to wasm memory before "the syscall", allowing "the syscall" to read/write directly from linear memory. Now many APIs don't have "the syscall" or will be subsumed by 0.3 streams, but as the i2c transaction use case demonstrates, there will be particular cases that slip through these cracks and so I think it's useful to have buffers as the backstop.

@cpetig

cpetig commented Sep 21, 2024

This clearly isn't meant to discourage the lazy-lowering design, but I just found that caller-provided buffers map to a good and predictable enough API if you mix them with ideas from shared-memory types (e.g. Boost.Interprocess); see #398 for details.

For now that solution has become my favorite as it also enables a lot of new uses of WIT.

@lukewagner
Member

Just as an update on my end: working on streams in the add-stream branch, spec-internal ReadableBuffer and WritableBuffer concepts have naturally appeared in the Python ABI code that have the same "(pointer, capacity, remain) tuple" semantics as what we're describing here. A ReadableBuffer is implicitly created by a call to stream.write (given the i32 (ptr, length) parameters) and vice versa for stream.read. This suggests a future direct integration with readable-buffer<T>/writable-buffer<T> in which a stream<T> could be directly stream.read into a writable-buffer<T> or stream.written into a readable-buffer<T> (avoiding what would otherwise need to be an intermediate copy through linear memory). More generally, this suggests that one can think of a stream as just a series of rendezvous between two sides supplying their respective readable and writable buffers.
