Introduce basic support for SVE #802

murzinv · 2024-02-28T10:01:24Z

Hi All,

This PR introduces basic support for SVE instructions in litmus7, diy7 and herd7.

Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA):

Vector Length (VL) is a hardware implementation choice from 128 up to 2048 bits.
New programming model allows software to scale dynamically to available vector length.
No need to define a new ISA, rewrite or recompile for new vector lengths.

Scalable vector registers:

Z0-Z31 extending NEON’s 128-bit V0-V31
Packed DP, SP & HP floating-point elements
Packed 64, 32, 16 & 8-bit integer elements

Scalable predicate registers:

P0-P7 governing predicates for load/store/arithmetic
P8-P15 additional predicates for loop management
FFR first fault register for software speculation

I personally find tutorial at [1] really great introduction to the SVE programming model.

Implementation choices and known limitations:
- SVE memory model is not addressed
- herd7 implements 128 bits vector length (on top of existing Neon infrastructure)

[1] https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf

relokin

Some very quick comments, I'll try and do another more thorough pass tomorrow.

I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

relokin · 2024-03-19T17:15:14Z

herd/tests/instructions/AArch64.sve/V02.litmus

+ DUP Z0.H, W0      ;
+ DUP Z1.S, W0      ;
+
+forall 0:V0.4S={262148,262148,262148,262148} /\ 0:V1.4S={4,4,4,4}


I think these are a little more obvious in hex 0:V0.4S={0x40004,0x40004,0x40004,0x40004};

Ack. I'll have a look in other tests and change to hex representation where appropriate

relokin · 2024-03-19T17:23:22Z

herd/AArch64Sem.ml

+        let psize = predicate_psize src in
+        let esize = scalable_esize dst in
+        let orig = match pg with
+        | AArch64Base.PMreg (_,AArch64Base.Zero) -> M.add V.zero V.zero


This could be M.zero as well

M.zero doesn't work, yet mzero does the trick. I'll also change the places where I need to wrap constant to use M.unitT

relokin · 2024-03-19T17:24:48Z

herd/AArch64Sem.ml

+        | AArch64Base.PMreg (_,AArch64Base.Merge) -> read_reg_scalable false dst ii
+        | _ -> assert false
+        in
+        orig >>|


the semantics here are rmw, which makes sense for \Z. But for \M, do you know if you want rmw semantics or just setting individual lanes?

Looking into psudocode suggest that rmw is expected, yet I missed the case of all predicate being inactive. I'll change here and update other places where appropriate.

murzinv · 2024-03-20T10:07:10Z

Some very quick comments, I'll try and do another more thorough pass tomorrow.

I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

At the first glance, IIUC, we'd need to support arithmetic on larger register size. Another problem to solve would relate to the way to observe bits of register above 128, now we (ab)use Neon register for that.

relokin · 2024-03-20T17:15:17Z

Some very quick comments, I'll try and do another more thorough pass tomorrow.
I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

At the first glance, IIUC, we'd need to support arithmetic on larger register size. Another problem to solve would relate to the way to observe bits of register above 128, now we (ab)use Neon register for that.

Thanks! If it was trivial to make it scalable then it would be nice. OTOH, I am not sure that the questions on the memory model change so maybe having one implementation where the register size is 128 is sufficient.

relokin

Some more comments

Makefile

herd/AArch64Arch_herd.ml

relokin · 2024-03-25T22:28:33Z

herd/AArch64Arch_herd.ml

+    | Zreg (_,(8)) -> MachSize.Byte
+    | Zreg (_,(16)) -> MachSize.Short
+    | Zreg (_,(32)) -> MachSize.Word
+    | Zreg (_,(64)) -> MachSize.Quad


But also this function seems to be unused?

Right, it is not used. I think I introduced that while working on supporting different access sizes for SVE load/store instructions but realized that access size is driven by instruction mnemonic rather than arguments (in contrast to the Neon case). I'll drop this unused function.

lib/AArch64Base.ml

relokin

Some comments, but I haven't finished my review yet.

relokin · 2024-04-08T13:50:26Z

lib/AArch64Base.ml

+let pp_preg_simple r = match r with
+| Preg (r',_) ->
+  (try List.assoc r' pvrs with Not_found -> assert false)
+| PMreg (r',_) ->
+  (try List.assoc r' pvrs with Not_found -> assert false)
+| _ -> assert false


You could also do something like:

Suggested change

let pp_preg_simple r = match r with

| Preg (r',_) ->

(try List.assoc r' pvrs with Not_found -> assert false)

| PMreg (r',_) ->

(try List.assoc r' pvrs with Not_found -> assert false)

| _ -> assert false

let pp_preg_simple = function

| Preg (r,_)

| PMreg (r,_) ->

(try List.assoc r pvrs with Not_found -> assert false)

| _ -> assert false

Also not sure why we're catching the exception here. It's not like assert false is much more helpful.

Hi @relokin, assert false is slightly more informative as it it leads to a message with line number included. That said, assert false should never be executed...

You could also do something like:

Ack

lib/AArch64Base.ml

herd/tests/instructions/AArch64.sve/V01.litmus

relokin · 2024-04-08T22:19:14Z

herd/tests/instructions/AArch64.sve/V05.litmus

+AArch64 V05
+(* Contiguous load two-word structures to two vectors (scalar index) *)
+{
+ uint16_t x[8] = {1,2,1,2,1,2,1,2};


Would it make sense to initialise x with different values instead of an alternating 1,2 pattern? IIUC, something like:

Suggested change

uint16_t x[8] = {1,2,1,2,1,2,1,2};

uint16_t x[8] = {0,0,1,2,0,0,0,0};

I've changed it to uint16_t x[8] = {1,2,0,0,0,0,0,0};

relokin · 2024-04-08T22:40:34Z

litmus/AArch64Arch_litmus.ml

+  | Preg (pr, _) ->
+      (try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false)
+  | PMreg (pr, _) ->
+      (try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false)
+  | Zreg (zr, _) ->
+      (try Misc.lowercase (List.assoc zr vvrs) with Not_found -> assert false) (*Intentionally 'vvrs' instead of 'zvrs'*)


Again here I don't know that catch Not_found and asserting is any more useful than just letting it throw an unhandled Not_found exception.

But then again I guess you're doing what the code before this PR was doing...

litmus/template.ml

Makefile

relokin

A couple of more comments but if you fixed some glitches we could merge this soon. Also, it might be worth adding something like this as it allowed me to run the sve tests with litmus7:

$> cat litmus/libdir/armv8+sve-qemu.cfg
size_of_test = 10k
number_of_run = 100
avail = 4
limit = true
memory = direct
stride = 1
carch = AArch64
barrier = userfence
smt = 4
smt_mode = seq
affinity = incr0
ccopts = -O2 -march=armv8-a+sve
crossrun = qemu:qemu-aarch64

relokin · 2024-04-19T16:40:12Z

herd/tests/instructions/AArch64.sve/V10.litmus

+ SXTW X3,W1                           ;
+ ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1]  ;


I think this is missing a WHILELT and it doesn't really need the SXTW:

Suggested change

SXTW X3,W1 ;

ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1] ;

WHILELT P0.H,W1,W2 ;

ST2H {Z1.H,Z2.H},P0,[X0] ;

Good catch! I've added setup for P0 register. As for SXTW if I remove that and swap X3 with X1 in ST2H instruction I'd get Register x1 has different types: <int> and <int64_t> when it is consumed by litmus7. Anyway, since I set P0 with PTRUE only user of X1 left is ST2H.

relokin · 2024-04-19T16:40:30Z

herd/tests/instructions/AArch64.sve/V10.litmus.expected

@@ -0,0 +1,11 @@
+Test V10 Required
+States 1
+x={0,0,0,0,0,0,0,0};


This is not the intended result.

relokin · 2024-04-19T16:41:30Z

herd/tests/instructions/AArch64.sve/V11.litmus

+ MOV W2,#4                               ;
+ SXTW X3,W1                              ;
+ WHILELT P0.S,W1,W2                      ;
+ ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2];


This is really the same as this, or is there something intentional about using a zero offset?

Suggested change

ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2];

ST3W {Z1.S,Z2.S,Z3.S},P0,[X0];

Offset register here is to test parser side

relokin · 2024-04-19T16:41:58Z

herd/tests/instructions/AArch64.sve/V12.litmus

+ MOV W2,#2                                      ;
+ SXTW X3,W1                                     ;
+ WHILELT P0.D,W1,W2                             ;
+ ST4D {Z1.D,Z2.D,Z3.D,Z4.D},P0,[X0,X3, LSL #3]  ;


litmus/template.ml

herd/tests/instructions/AArch64.sve/V19.litmus

herd/tests/instructions/AArch64.sve/V20.litmus

relokin · 2024-04-24T12:24:27Z

herd/AArch64Sem.ml

@@ -1381,6 +1523,7 @@ module Make
                   read_mem_postindexed
                     a_virt op sz Annot.N aexp ac rd rs k a ii)
                 ma ii)
+          | _ -> assert false


nit: the indentation is off here.

relokin · 2024-04-24T12:35:23Z

herd/AArch64Sem.ml

+
+      let rec reduce_ord l =
+        match l with
+        | [] -> M.unitT ()
+        | h::t -> h >>= fun () -> reduce_ord t
+


there is no need to move this, is there?

Yup, unnecessary move.

relokin · 2024-04-24T13:15:44Z

herd/AArch64Sem.ml

+    let check_neon inst =
+      if not (neon || sve) then
+        Warn.user_error
+          "Neon instruction %s require -variant neon"


Suggested change

"Neon instruction %s require -variant neon"

"Neon instruction %s requires -variant neon"

relokin · 2024-04-24T13:15:54Z

herd/AArch64Sem.ml

+    let check_sve inst =
+      if not sve then
+        Warn.user_error
+          "SVE instruction %s require -variant sve"


Suggested change

"SVE instruction %s require -variant sve"

"SVE instruction %s requires -variant sve"

herd/AArch64Sem.ml

relokin · 2024-04-29T15:02:32Z

I am happy with this PR, so we could merge soon! Is anyone else planning to have to review this? Otherwise, we'll aim to merge this by the end of the week.

Makefile

maranget · 2024-05-01T16:24:15Z

Hi @murzinv, with the following patch, predicated sve instructions may run much faster. As an example, runing make test.sve is now instantaneous.

litmus/libdir/armv8+sve-qemu.cfg

litmus/AArch64Arch_litmus.ml

litmus/AArch64Compile_litmus.ml

maranget · 2024-05-01T18:04:40Z

Hi @murzinv, nice work, LGTM. I have nevertheless suggested a patch and could not resist making a few cosmetic remarks. You probably find the patch to be worth considering.

murzinv · 2024-05-02T09:44:09Z

Hi @murzinv, with the following patch, predicated sve instructions may run much faster. As an example, runing make test.sve is now instantaneous.

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

murzinv · 2024-05-02T09:46:24Z

Hi @murzinv, nice work, LGTM. I have nevertheless suggested a patch and could not resist making a few cosmetic remarks. You probably find the patch to be worth considering.

Thanks for your review comments! I could not resist from adding your patch in the series 😉

maranget · 2024-05-02T13:50:17Z

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

Here is the same optimisation applied to WHILE<OP> instructions.

I also notice that all those WHILE<OP> instructions could be implemented by one single constructor with an extra argument that describes the operation : that is, WHILE(<OP>,....) in place of WHILELT(...), WHILEEQ(...) etc. However I would not like this to prevent merging. I am in favor of merging now, once the patch is applied.

murzinv · 2024-05-02T14:09:21Z

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

Here is the same optimisation applied to WHILE<OP> instructions.

Great thanks!

I also notice that all those WHILE<OP> instructions could be implemented by one single constructor with an extra argument that describes the operation : that is, WHILE(<OP>,....) in place of WHILELT(...), WHILEEQ(...) etc. However I would not like this to prevent merging. I am in favor of merging now, once the patch is applied.

I see. Patch now part of the series 😉

maranget · 2024-05-02T14:22:58Z

Fine, if you are happy with the current state, would you please rebase on master. You may also squash all commits into one, or at least squash the last two commit into one. Once this is done, I'll merge.

Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA): - Vector Length (VL) is a hardware implementation choice from 128 up to 2048 bits. - New programming model allows software to scale dynamically to available vector length. - No need to define a new ISA, rewrite or recompile for new vector lengths. Scalable vector registers: - Z0-Z31 extending NEON’s 128-bit V0-V31 - Packed DP, SP & HP floating-point elements - Packed 64, 32, 16 & 8-bit integer elements Scalable predicate registers: - P0-P7 governing predicates for load/store/arithmetic - P8-P15 additional predicates for loop management - FFR first fault register for software speculation Implementation choices and known limitations: - SVE memory model is not addressed - `herd7` implements 128 bits vector length (on top of existing Neon infrastructure) - `litmus7` uses ARM C Language Extensions (ACLE) for SVE + Building SVE test require `-ccopts "-march=armv8-a+sve -O2"` + Although `Z` registers overlap with `V` registers mixing them in litmus test would likely lead to compilation failure (due to difference in the ACLE types) + However, `V` register overlapped with `Z` register in `final` clause is supported (this way we can inspect content of `Z` register) - Following SVE instructions are implemented: + PTRUE (predicate) + MOV (immediate, unpredicated) + DUP (scalar) + ADD (vectors, unpredicated) + INDEX (immediate, scalar) + INDEX (immediates) + INDEX (scalar, immediate) + INDEX (scalars) + WHILELE (predicate) + WHILELT (predicate) + WHILELE (predicate) + WHILELO (predicate) + WHILELS (predicate) + LD1B (scalar plus immediate, single register) + LD1H (scalar plus immediate, single register) + LD1W (scalar plus immediate, single register) + LD1D (scalar plus immediate, single register) + LD1D (scalar plus scalar, single register) + LD1B (scalar plus scalar, single register) + LD1H (scalar plus scalar, single register) + LD1W (scalar plus scalar, single register) + LD1B (scalar plus vector) + LD1H (scalar plus vector) + LD1W (scalar plus vector) + LD1D (scalar plus vector) + LD2B (scalar plus immediate) + LD2H (scalar plus immediate) + LD2W (scalar plus immediate) + LD2D (scalar plus immediate) + LD2B (scalar plus scalar) + LD2H (scalar plus scalar) + LD2W (scalar plus scalar) + LD2D (scalar plus scalar) + LD3B (scalar plus immediate) + LD3H (scalar plus immediate) + LD3W (scalar plus immediate) + LD3D (scalar plus immediate) + LD3B (scalar plus scalar) + LD3H (scalar plus scalar) + LD3W (scalar plus scalar) + LD3D (scalar plus scalar) + LD4B (scalar plus immediate) + LD4H (scalar plus immediate) + LD4W (scalar plus immediate) + LD4D (scalar plus immediate) + LD4B (scalar plus scalar) + LD4H (scalar plus scalar) + LD4W (scalar plus scalar) + LD4D (scalar plus scalar) + ST1B (scalar plus immediate, single register) + ST1H (scalar plus immediate, single register) + ST1W (scalar plus immediate, single register) + ST1D (scalar plus immediate, single register) + ST1B (scalar plus scalar, single register) + ST1H (scalar plus scalar, single register) + ST1W (scalar plus scalar, single register) + ST1D (scalar plus scalar, single register) + ST1B (scalar plus vector) + ST1H (scalar plus vector) + ST1W (scalar plus vector) + ST1D (scalar plus vector) + ST2B (scalar plus immediate) + ST2H (scalar plus immediate) + ST2W (scalar plus immediate) + ST2D (scalar plus immediate) + ST2B (scalar plus scalar) + ST2H (scalar plus scalar) + ST2W (scalar plus scalar) + ST2D (scalar plus scalar) + ST3B (scalar plus immediate) + ST3H (scalar plus immediate) + ST3W (scalar plus immediate) + ST3D (scalar plus immediate) + ST3B (scalar plus scalar) + ST3H (scalar plus scalar) + ST3W (scalar plus scalar) + ST3D (scalar plus scalar) + ST4B (scalar plus immediate) + ST4H (scalar plus immediate) + ST4W (scalar plus immediate) + ST4D (scalar plus immediate) + ST4B (scalar plus scalar) + ST4H (scalar plus scalar) + ST4W (scalar plus scalar) + ST4D (scalar plus scalar) + SVE aliases for condition codes Signed-off-by: Vladimir Murzin <[email protected]>

SVE pseudo-relaxation are grouped under `Sv` suffix: - `SvV` command a SVE gather load and scatter store - `Sv{1,2,3,4}` command a SVE interleave loads and stores' Generated tests set predicates in such way that single access won't exceed 128 bit (which is minimal allowed vector length) which allow them tolerate difference in implemented vector length when run on the target. Also it aligns with herd's implementation choice of 128 bit vector length. Please note that predicate register require another allocator which implemented via newly introduced `special2` allocator Signed-off-by: Vladimir Murzin <[email protected]>

Note that both instructions accept P/M (aka merge predicate) which means that destination register can be updated partially thus we mark destination register both input and output in `litmus` Signed-off-by: Vladimir Murzin <[email protected]>

The `PTRUE` and ` WHILE<OP>` instructions now transmit stored value to semantic-time environement. As a result, predicated instructions may sometimes decide whether to access or not at semantic-time, leading to a dramatic decrease of execution candidates number.

murzinv · 2024-05-02T15:25:21Z

Fine, if you are happy with the current state, would you please rebase on master. You may also squash all commits into one, or at least squash the last two commit into one. Once this is done, I'll merge.

Done 😸

maranget · 2024-05-02T15:29:27Z

Merged, thanks @murzinv and @relokin.

murzinv force-pushed the sve-basics branch from 63327e7 to f467d40 Compare March 8, 2024 11:01

murzinv marked this pull request as ready for review March 8, 2024 11:10

murzinv force-pushed the sve-basics branch 2 times, most recently from c8ab54f to 67d9bc3 Compare March 12, 2024 12:16

murzinv force-pushed the sve-basics branch 2 times, most recently from 0d86ee9 to 83c3fcc Compare March 19, 2024 11:14

relokin reviewed Mar 19, 2024

View reviewed changes

murzinv force-pushed the sve-basics branch 2 times, most recently from 0c1de28 to 46b13c5 Compare March 22, 2024 09:50

relokin reviewed Mar 25, 2024

View reviewed changes

murzinv force-pushed the sve-basics branch from 46b13c5 to 1ab3599 Compare March 26, 2024 10:15

relokin reviewed Apr 11, 2024

View reviewed changes

murzinv force-pushed the sve-basics branch 2 times, most recently from 374ef9c to 3f67773 Compare April 16, 2024 16:03

relokin reviewed Apr 19, 2024

View reviewed changes

murzinv force-pushed the sve-basics branch from 3f67773 to 733f2d2 Compare April 22, 2024 14:56

relokin reviewed Apr 24, 2024

View reviewed changes

murzinv force-pushed the sve-basics branch 2 times, most recently from 9d51420 to acbeb83 Compare April 25, 2024 13:31

maranget reviewed Apr 30, 2024

View reviewed changes

Makefile Show resolved Hide resolved

murzinv force-pushed the sve-basics branch from acbeb83 to 5092711 Compare May 1, 2024 07:32

maranget reviewed May 1, 2024

View reviewed changes

litmus/libdir/armv8+sve-qemu.cfg Show resolved Hide resolved

maranget reviewed May 1, 2024

View reviewed changes

litmus/AArch64Arch_litmus.ml Outdated Show resolved Hide resolved

maranget reviewed May 1, 2024

View reviewed changes

litmus/AArch64Compile_litmus.ml Outdated Show resolved Hide resolved

maranget reviewed May 1, 2024

View reviewed changes

litmus/AArch64Compile_litmus.ml Outdated Show resolved Hide resolved

murzinv force-pushed the sve-basics branch from 5092711 to 7b20aeb Compare May 2, 2024 09:42

Vladimir Murzin and others added 4 commits May 2, 2024 15:28

murzinv force-pushed the sve-basics branch from 7071f3e to a97200c Compare May 2, 2024 14:30

maranget merged commit 4fbccd1 into herd:master May 2, 2024
3 checks passed

murzinv deleted the sve-basics branch May 2, 2024 16:15

	uint16_t x[8] = {1,2,1,2,1,2,1,2};
	uint16_t x[8] = {0,0,1,2,0,0,0,0};

	ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2];
	ST3W {Z1.S,Z2.S,Z3.S},P0,[X0];

	"Neon instruction %s require -variant neon"
	"Neon instruction %s requires -variant neon"

	"SVE instruction %s require -variant sve"
	"SVE instruction %s requires -variant sve"

Introduce basic support for SVE #802

Introduce basic support for SVE #802

Conversation

murzinv commented Feb 28, 2024

relokin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

murzinv commented Mar 20, 2024

relokin commented Mar 20, 2024

relokin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

relokin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

relokin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

relokin commented Apr 29, 2024

maranget commented May 1, 2024 • edited Loading

maranget commented May 1, 2024

murzinv commented May 2, 2024

murzinv commented May 2, 2024

maranget commented May 2, 2024

murzinv commented May 2, 2024

maranget commented May 2, 2024 • edited Loading

murzinv commented May 2, 2024

maranget commented May 2, 2024

maranget commented May 1, 2024 •

edited

Loading

maranget commented May 2, 2024 •

edited

Loading