Multibyte access to (array i8) #395

Open
wingo opened this issue Jul 7, 2023 · 9 comments

Labels
Post-MVP Ideas for Post-MVP extensions

Comments

@wingo
Contributor

wingo commented Jul 7, 2023

Use case: Your language allows users to define new packed struct types at runtime. Your language toolchain targets wasm/gc. You use an (array i8) to represent the backing store for those packed structs.

Problem: For multi-byte loads you have to emit multiple array.get_s or array.get_u calls and then combine the bytes appropriately. This is inefficient, which is enough of a motivation to add multibyte accessors. There should be something to load and store 16-bit values (with both signed and unsigned loads) at arbitrary offsets in an (array i8), as well as i32, i64, f32, f64, and v128.
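
For concreteness, here is a minimal sketch of what a single little-endian i32 load from an (array i8) looks like with MVP GC instructions alone; the type and function names are illustrative:

(type $bytes (array (mut i8)))

;; Little-endian i32 load composed from four byte reads.
(func $i32_load_le (param $a (ref $bytes)) (param $i i32) (result i32)
  (i32.or
    (i32.or
      (array.get_u $bytes (local.get $a) (local.get $i))
      (i32.shl
        (array.get_u $bytes (local.get $a) (i32.add (local.get $i) (i32.const 1)))
        (i32.const 8)))
    (i32.or
      (i32.shl
        (array.get_u $bytes (local.get $a) (i32.add (local.get $i) (i32.const 2)))
        (i32.const 16))
      (i32.shl
        (array.get_u $bytes (local.get $a) (i32.add (local.get $i) (i32.const 3)))
        (i32.const 24)))))

A 64-bit load needs eight such reads, and each array.get_u carries its own null and bounds check unless the engine can eliminate them.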

However, I just realized another motivation for multibyte accessors: byte-by-byte access is potentially incorrect in the presence of threads and mutation. Unlike naturally-aligned access to memory in the MVP, access to (array i8) contents with MVP GC ops will tear. Not sure what to do about that: whether to specify that naturally-aligned multibyte access does not tear (perhaps with an exception for v128), or whether to guarantee atomicity only via specific atomic operations. In any case there is some design work to do here.

@titzer
Contributor

titzer commented Jul 7, 2023

I've been ruminating on this in the background, and I've started to think that we should consider a post-MVP feature that allows regular load/store instructions (of which we now have several dozen) to apply to on-heap GC arrays, or vice versa: to allow slices of linear memory to be viewable through (i.e. aliased by) GC arrays. We have bits available in the memarg immediate of loads and stores. Would this fit your use case?
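
For illustration only, one could imagine surface syntax along these lines; nothing here is proposed text, and the flag name is made up:

(type $bytes (array (mut i8)))

;; Purely hypothetical: a spare memarg flag bit ("gc_array" below, an
;; invented name) makes an ordinary load take a (ref (array i8)) operand
;; instead of addressing a linear memory.
(func $get_f64 (param $a (ref $bytes)) (param $off i32) (result f64)
  (f64.load gc_array (local.get $a) (local.get $off)))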

@vouillon
Contributor

vouillon commented Jul 7, 2023

Another post-MVP feature I would be interested in is to be able to access JavaScript typed arrays (or ArrayBuffer objects) directly from WebAssembly. (Sure, one can access the linear memory as an ArrayBuffer, but it is not as convenient as first-class garbage collected objects.) I was concerned that this would involve adding a lot of new instructions. But using regular load/store instructions could be a solution.

@wingo
Contributor Author

wingo commented Jul 10, 2023

> I've been ruminating on this in the background, and I've started to think that we should consider a post-MVP feature that allows regular load/store instructions (of which we now have several dozen) to apply to on-heap GC arrays, or vice versa: to allow slices of linear memory to be viewable through (i.e. aliased by) GC arrays. We have bits available in the memarg immediate of loads and stores. Would this fit your use case?

Oh I like this! A couple thoughts:

  • This could be an alternate way to get to 64-bit array access offsets; but then array.len is 32-bit. Dunno, probably not worth exploring, but it needs to be kept in mind.
  • There is overlap with the WebGPU use case; perhaps in LLVM toolchains one could associate an (array i8) with an address space for some code segment, and then memory operations on that address space would go through the arrayref. See "Proposal: Fine grained control of memory" (design#1439).
  • This probably needs to be restricted to array i8; if you include (array i32) you would then depend on byte order. I suppose you could extend it to array-of-struct-with-only-numeric-fields, but then we are in the realm of value types rather than reference types.

@osa1
Contributor

osa1 commented Jul 11, 2023

We have a similar use case in dart2wasm. The Dart standard library has typed array types like Float32List and Int64List that store their elements unboxed. These types use a byte buffer type (ByteBuffer) as the backing store; a single ByteBuffer can be shared by lists with different element types, and the lists sharing it can even start at different byte offsets (for example, x[0] reads a 32-bit int at 0x1000, while y[0], which shares the same byte buffer, reads a 64-bit float at 0x1001).

Currently we use an array i8 for ByteBuffer, which leads to extremely slow code even in the common case where a list is used directly (not via a view), because of the single-byte reads and writes. One benchmark of typed array performance runs at 23% of the speed of the same program compiled to JS using the JS typed array API.

Multi-byte reads and writes to array i8s would solve the problem with views, but we also want to be able to share these arrays with JS. A post-MVP JS API for GC arrays may help with this, but @titzer's idea of linear-memory-backed arrays would also solve it nicely and as efficiently as possible. An additional benefit is that it would make it possible to share these arrays with other linear-memory programs (e.g. C++ compiled to Wasm with Emscripten, sharing the same linear memory and malloc/free implementation with the dart2wasm-generated program).

(Sharing GC references with a linear-memory application is also possible, but the language-level types for these references need to be more restrictive than a type representing a linear memory address, which can just be a uint8_t*.)

Currently, the best we can do for the common case of using a list directly (not via a view), while still implementing the Dart typed data API (e.g. the ability to take the byte buffer of a list and use it as storage for another list with a different element type, possibly with an offset into the buffer), is to implement multiple ByteBuffer subclasses, each with a differently typed array:

class _I32ByteBuffer implements ByteBuffer {
  final WasmIntArray<WasmI32> _data; // array i32
  ...
}

class _F64ByteBuffer implements ByteBuffer {
  final WasmFloatArray<WasmF64> _data; // array f64
  ...
}

We need one such class for each of i8, i16, i32, i64, f32, and f64, and for the SIMD shapes f64x2, f32x4, i32x4, and i64x2.

A ByteBuffer implementation for e.g. f64 will have efficient read_f64 and write_f64 iff the offset into the Wasm GC array is a multiple of the element size. All the other read/write methods (and all accesses when the offset is not a multiple of the element size) need to read or write one byte at a time (which is better for code size, as we can inherit these methods from a base class). Alternatively, we can special-case combinations such as reading an i32 when the array type is i64, or doing two i32 reads when the array type is i32 and read_i64 is called (sketched below), etc.
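
For instance, the two-i32-read fallback looks roughly like this in Wasm (valid with MVP GC instructions; the type and function names are illustrative, and little-endian element order within the buffer is assumed):

(type $i32s (array (mut i32)))

;; read_i64 over an (array i32): combine two 32-bit elements.
(func $read_i64 (param $a (ref $i32s)) (param $i i32) (result i64)
  (i64.or
    (i64.extend_i32_u (array.get $i32s (local.get $a) (local.get $i)))
    (i64.shl
      (i64.extend_i32_u
        (array.get $i32s (local.get $a) (i32.add (local.get $i) (i32.const 1))))
      (i64.const 32))))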

However, (1) there will be a lot of code, (2) ByteBuffer accesses will have to be virtualized, and (3) this doesn't solve the problem of sharing these arrays with JS (and possibly with linear-memory applications).

@rossberg
Member

Just to mention it: an alternative that we have thrown around in the past was to have a form of reinterpret cast on (transparent) array types, such that array(i8) can be viewed as array(f64) etc. That may be useful for other purposes. But it might add some extra complexity around unaligned array sizes.
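
Hypothetically, such a cast might be spelled like this; no such instruction exists, and the name is made up:

(type $bytes (array (mut i8)))
(type $f64s  (array (mut f64)))

;; Hypothetical: view the same heap object at a different array type.
;; A 13-byte (array i8) holds no whole number of f64 elements, hence the
;; unaligned-size question.
(func $as_f64s (param $a (ref $bytes)) (result (ref $f64s))
  (array.reinterpret $bytes $f64s (local.get $a)))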

@osa1
Contributor

osa1 commented Jul 11, 2023

Would multi-byte read and write instructions for array i8 cause any redundancy in the current MVP GC spec? At least for aligned reads, I think a 4-byte read from an array i8 will have the same runtime performance as a read from an array i32, so other array types with unboxed elements would become less useful.

@titzer
Contributor

titzer commented Dec 18, 2023

> Just to mention it: an alternative that we have thrown around in the past was to have a form of reinterpret cast on (transparent) array types, such that array(i8) can be viewed as array(f64) etc. That may be useful for other purposes. But it might add some extra complexity around unaligned array sizes.

I think we want to avoid exposing the byte order of array elements and struct fields, so I'd be fine restricting the scope of this feature to array i8.

Another possibility that I've ruminated on is to allow "pinning" all or part of an array i8 by temporarily binding a memory declaration to it. For example, suppose a module declares a separate, empty, but "pinnable" memory, and we introduce instructions memory.pin_array(a: array i8, offset: i32/i64, length: i32/i64) and memory.unpin. The semantics of memory.pin_array would be to update mutable state in the instance to refer to the specified inner portion of the array. Then all load and store instructions that target that memory would actually work on the raw storage of the array. This has the advantage of not introducing new immediate flags for memory-related instructions. (Though these memories would be byte-sized, rather than page-sized, and could not be grown.)
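
A sketch of how that could look in the text format, using the instruction names above (entirely hypothetical; the pinnable-memory declaration syntax in particular is invented):

(type $bytes (array (mut i8)))
(memory $pin 0 0)   ;; hypothetical: a declared, empty, "pinnable" memory

;; Sum the first two i64s of an (array i8) through the pinned view.
(func $sum2 (param $a (ref $bytes)) (result i64)
  (local $r i64)
  (memory.pin_array $pin (local.get $a) (i32.const 0) (i32.const 16))
  (local.set $r
    (i64.add (i64.load $pin (i32.const 0))
             (i64.load $pin (i32.const 8))))
  (memory.unpin $pin)
  (local.get $r))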

@yuri91

yuri91 commented Jul 29, 2024

We have a similar use case in CheerpJ (JVM running in the browser):

C++ JNI code that accesses Java byte arrays expects to be able to do arbitrarily sized loads and stores on them.

Currently we compile Java bytecode to JavaScript, but we would like to eventually target WasmGC.

This issue and the inability to expose GC arrays as JS typed arrays (or to pass them directly to Web APIs) are the main roadblocks.

@mkustermann

mkustermann commented Oct 3, 2024

Another use case where this would be very helpful:

When compiling to WasmGC instead of JavaScript, one may use WasmGC arrays instead of JS typed data, as WasmGC arrays are faster to allocate and operate on inside wasm code. But on some occasions one has to pass data across the boundary to JavaScript.

=> Right now there's no efficient way to memcpy from WasmGC arrays to JS typed arrays (nor to linear memory). One has to do that byte by byte (see the sketch below).
=> If we had multi-byte access, we could have much faster bulk copying of data: load and store 8 bytes at a time instead of 1. It may not be the fastest possible way to copy memory (which might use vectorized instructions), but it would get much closer to it.
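
For reference, here is roughly what the byte-by-byte copy into linear memory looks like today (valid MVP GC code; the names and the declared memory are illustrative):

(type $bytes (array (mut i8)))
(memory 1)

;; Copy $len bytes from $src into linear memory at $dst, one byte at a time.
(func $copy_out (param $src (ref $bytes)) (param $dst i32) (param $len i32)
  (local $i i32)
  (block $done
    (loop $next
      (br_if $done (i32.eq (local.get $i) (local.get $len)))
      (i32.store8
        (i32.add (local.get $dst) (local.get $i))
        (array.get_u $bytes (local.get $src) (local.get $i)))
      (local.set $i (i32.add (local.get $i) (i32.const 1)))
      (br $next))))

With multi-byte access, the loop body could instead move 8 bytes per iteration with a single load/store pair.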
