Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce basic support for SVE #802

Merged
merged 4 commits into from
May 2, 2024
Merged

Introduce basic support for SVE #802

merged 4 commits into from
May 2, 2024

Conversation

murzinv
Copy link

@murzinv murzinv commented Feb 28, 2024

Hi All,

This PR introduces basic support for SVE instructions in litmus7, diy7 and herd7.

Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA):

  • Vector Length (VL) is a hardware implementation choice from 128 up to 2048 bits.
  • New programming model allows software to scale dynamically to available vector length.
  • No need to define a new ISA, rewrite or recompile for new vector lengths.

Scalable vector registers:

  • Z0-Z31 extending NEON’s 128-bit V0-V31
  • Packed DP, SP & HP floating-point elements
  • Packed 64, 32, 16 & 8-bit integer elements

Scalable predicate registers:

  • P0-P7 governing predicates for load/store/arithmetic
  • P8-P15 additional predicates for loop management
  • FFR first fault register for software speculation

I personally find tutorial at [1] really great introduction to the SVE programming model.

Implementation choices and known limitations:
- SVE memory model is not addressed
- herd7 implements 128 bits vector length (on top of existing Neon infrastructure)

[1] https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf

@murzinv murzinv marked this pull request as ready for review March 8, 2024 11:10
@murzinv murzinv force-pushed the sve-basics branch 2 times, most recently from c8ab54f to 67d9bc3 Compare March 12, 2024 12:16
@murzinv murzinv force-pushed the sve-basics branch 2 times, most recently from 0d86ee9 to 83c3fcc Compare March 19, 2024 11:14
Copy link
Member

@relokin relokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very quick comments, I'll try and do another more thorough pass tomorrow.

I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

DUP Z0.H, W0 ;
DUP Z1.S, W0 ;

forall 0:V0.4S={262148,262148,262148,262148} /\ 0:V1.4S={4,4,4,4}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are a little more obvious in hex 0:V0.4S={0x40004,0x40004,0x40004,0x40004};

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I'll have a look in other tests and change to hex representation where appropriate

let psize = predicate_psize src in
let esize = scalable_esize dst in
let orig = match pg with
| AArch64Base.PMreg (_,AArch64Base.Zero) -> M.add V.zero V.zero
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be M.zero as well

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

M.zero doesn't work, yet mzero does the trick. I'll also change the places where I need to wrap constant to use M.unitT

| AArch64Base.PMreg (_,AArch64Base.Merge) -> read_reg_scalable false dst ii
| _ -> assert false
in
orig >>|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the semantics here are rmw, which makes sense for \Z. But for \M, do you know if you want rmw semantics or just setting individual lanes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into psudocode suggest that rmw is expected, yet I missed the case of all predicate being inactive. I'll change here and update other places where appropriate.

@murzinv
Copy link
Author

murzinv commented Mar 20, 2024

Some very quick comments, I'll try and do another more thorough pass tomorrow.

I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

At the first glance, IIUC, we'd need to support arithmetic on larger register size. Another problem to solve would relate to the way to observe bits of register above 128, now we (ab)use Neon register for that.

@relokin
Copy link
Member

relokin commented Mar 20, 2024

Some very quick comments, I'll try and do another more thorough pass tomorrow.
I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?

At the first glance, IIUC, we'd need to support arithmetic on larger register size. Another problem to solve would relate to the way to observe bits of register above 128, now we (ab)use Neon register for that.

Thanks! If it was trivial to make it scalable then it would be nice. OTOH, I am not sure that the questions on the memory model change so maybe having one implementation where the register size is 128 is sufficient.

@murzinv murzinv force-pushed the sve-basics branch 2 times, most recently from 0c1de28 to 46b13c5 Compare March 22, 2024 09:50
Copy link
Member

@relokin relokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments

Makefile Outdated Show resolved Hide resolved
herd/AArch64Arch_herd.ml Show resolved Hide resolved
Comment on lines 241 to 244
| Zreg (_,(8)) -> MachSize.Byte
| Zreg (_,(16)) -> MachSize.Short
| Zreg (_,(32)) -> MachSize.Word
| Zreg (_,(64)) -> MachSize.Quad
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Zreg (_,(8)) -> MachSize.Byte
| Zreg (_,(16)) -> MachSize.Short
| Zreg (_,(32)) -> MachSize.Word
| Zreg (_,(64)) -> MachSize.Quad
| Zreg (_,8) -> MachSize.Byte
| Zreg (_,16) -> MachSize.Short
| Zreg (_,32) -> MachSize.Word
| Zreg (_,64) -> MachSize.Quad

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also this function seems to be unused?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it is not used. I think I introduced that while working on supporting different access sizes for SVE load/store instructions but realized that access size is driven by instruction mnemonic rather than arguments (in contrast to the Neon case). I'll drop this unused function.

lib/AArch64Base.ml Show resolved Hide resolved
Copy link
Member

@relokin relokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, but I haven't finished my review yet.

Comment on lines 464 to 471
let pp_preg_simple r = match r with
| Preg (r',_) ->
(try List.assoc r' pvrs with Not_found -> assert false)
| PMreg (r',_) ->
(try List.assoc r' pvrs with Not_found -> assert false)
| _ -> assert false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also do something like:

Suggested change
let pp_preg_simple r = match r with
| Preg (r',_) ->
(try List.assoc r' pvrs with Not_found -> assert false)
| PMreg (r',_) ->
(try List.assoc r' pvrs with Not_found -> assert false)
| _ -> assert false
let pp_preg_simple = function
| Preg (r,_)
| PMreg (r,_) ->
(try List.assoc r pvrs with Not_found -> assert false)
| _ -> assert false

Also not sure why we're catching the exception here. It's not like assert false is much more helpful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @relokin, assert false is slightly more informative as it it leads to a message with line number included. That said, assert false should never be executed...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also do something like:

Ack

lib/AArch64Base.ml Show resolved Hide resolved
lib/AArch64Base.ml Show resolved Hide resolved
lib/AArch64Base.ml Show resolved Hide resolved
lib/AArch64Base.ml Show resolved Hide resolved
AArch64 V05
(* Contiguous load two-word structures to two vectors (scalar index) *)
{
uint16_t x[8] = {1,2,1,2,1,2,1,2};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to initialise x with different values instead of an alternating 1,2 pattern? IIUC, something like:

Suggested change
uint16_t x[8] = {1,2,1,2,1,2,1,2};
uint16_t x[8] = {0,0,1,2,0,0,0,0};

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to uint16_t x[8] = {1,2,0,0,0,0,0,0};

Comment on lines 39 to 42
| Preg (pr, _) ->
(try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false)
| PMreg (pr, _) ->
(try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false)
| Zreg (zr, _) ->
(try Misc.lowercase (List.assoc zr vvrs) with Not_found -> assert false) (*Intentionally 'vvrs' instead of 'zvrs'*)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again here I don't know that catch Not_found and asserting is any more useful than just letting it throw an unhandled Not_found exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then again I guess you're doing what the code before this PR was doing...

litmus/template.ml Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@murzinv murzinv force-pushed the sve-basics branch 2 times, most recently from 374ef9c to 3f67773 Compare April 16, 2024 16:03
Copy link
Member

@relokin relokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of more comments but if you fixed some glitches we could merge this soon. Also, it might be worth adding something like this as it allowed me to run the sve tests with litmus7:

$> cat litmus/libdir/armv8+sve-qemu.cfg
size_of_test = 10k
number_of_run = 100
avail = 4
limit = true
memory = direct
stride = 1
carch = AArch64
barrier = userfence
smt = 4
smt_mode = seq
affinity = incr0
ccopts = -O2 -march=armv8-a+sve
crossrun = qemu:qemu-aarch64

Comment on lines 13 to 14
SXTW X3,W1 ;
ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1] ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is missing a WHILELT and it doesn't really need the SXTW:

Suggested change
SXTW X3,W1 ;
ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1] ;
WHILELT P0.H,W1,W2 ;
ST2H {Z1.H,Z2.H},P0,[X0] ;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've added setup for P0 register. As for SXTW if I remove that and swap X3 with X1 in ST2H instruction I'd get Register x1 has different types: <int> and <int64_t> when it is consumed by litmus7. Anyway, since I set P0 with PTRUE only user of X1 left is ST2H.

@@ -0,0 +1,11 @@
Test V10 Required
States 1
x={0,0,0,0,0,0,0,0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the intended result.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

MOV W2,#4 ;
SXTW X3,W1 ;
WHILELT P0.S,W1,W2 ;
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really the same as this, or is there something intentional about using a zero offset?

Suggested change
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2];
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0];

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset register here is to test parser side

MOV W2,#2 ;
SXTW X3,W1 ;
WHILELT P0.D,W1,W2 ;
ST4D {Z1.D,Z2.D,Z3.D,Z4.D},P0,[X0,X3, LSL #3] ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

litmus/template.ml Show resolved Hide resolved
@@ -1381,6 +1523,7 @@ module Make
read_mem_postindexed
a_virt op sz Annot.N aexp ac rd rs k a ii)
ma ii)
| _ -> assert false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the indentation is off here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Comment on lines 2000 to 2005

let rec reduce_ord l =
match l with
| [] -> M.unitT ()
| h::t -> h >>= fun () -> reduce_ord t

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no need to move this, is there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, unnecessary move.

let check_neon inst =
if not (neon || sve) then
Warn.user_error
"Neon instruction %s require -variant neon"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Neon instruction %s require -variant neon"
"Neon instruction %s requires -variant neon"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

let check_sve inst =
if not sve then
Warn.user_error
"SVE instruction %s require -variant sve"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"SVE instruction %s require -variant sve"
"SVE instruction %s requires -variant sve"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

herd/AArch64Sem.ml Show resolved Hide resolved
@murzinv murzinv force-pushed the sve-basics branch 2 times, most recently from 9d51420 to acbeb83 Compare April 25, 2024 13:31
@relokin
Copy link
Member

relokin commented Apr 29, 2024

I am happy with this PR, so we could merge soon! Is anyone else planning to have to review this? Otherwise, we'll aim to merge this by the end of the week.

Makefile Show resolved Hide resolved
@maranget
Copy link
Member

maranget commented May 1, 2024

Hi @murzinv, with the following patch, predicated sve instructions may run much faster. As an example, runing make test.sve is now instantaneous.

@maranget
Copy link
Member

maranget commented May 1, 2024

Hi @murzinv, nice work, LGTM. I have nevertheless suggested a patch and could not resist making a few cosmetic remarks. You probably find the patch to be worth considering.

@murzinv
Copy link
Author

murzinv commented May 2, 2024

Hi @murzinv, with the following patch, predicated sve instructions may run much faster. As an example, runing make test.sve is now instantaneous.

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

@murzinv
Copy link
Author

murzinv commented May 2, 2024

Hi @murzinv, nice work, LGTM. I have nevertheless suggested a patch and could not resist making a few cosmetic remarks. You probably find the patch to be worth considering.

Thanks for your review comments! I could not resist from adding your patch in the series 😉

@maranget
Copy link
Member

maranget commented May 2, 2024

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

Here is the same optimisation applied to WHILE<OP> instructions.

I also notice that all those WHILE<OP> instructions could be implemented by one single constructor with an extra argument that describes the operation : that is, WHILE(<OP>,....) in place of WHILELT(...), WHILEEQ(...) etc. However I would not like this to prevent merging. I am in favor of merging now, once the patch is applied.

@murzinv
Copy link
Author

murzinv commented May 2, 2024

Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like WHILELT and friends?

Here is the same optimisation applied to WHILE<OP> instructions.

Great thanks!

I also notice that all those WHILE<OP> instructions could be implemented by one single constructor with an extra argument that describes the operation : that is, WHILE(<OP>,....) in place of WHILELT(...), WHILEEQ(...) etc. However I would not like this to prevent merging. I am in favor of merging now, once the patch is applied.

I see. Patch now part of the series 😉

@maranget
Copy link
Member

maranget commented May 2, 2024

Fine, if you are happy with the current state, would you please rebase on master. You may also squash all commits into one, or at least squash the last two commit into one. Once this is done, I'll merge.

Vladimir Murzin and others added 4 commits May 2, 2024 15:28
Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA):

 - Vector Length (VL) is a hardware implementation choice from 128 up
   to 2048 bits.
 - New programming model allows software to scale dynamically to
   available vector length.
 - No need to define a new ISA, rewrite or recompile for new vector
   lengths.

Scalable vector registers:

 - Z0-Z31 extending NEON’s 128-bit V0-V31
 - Packed DP, SP & HP floating-point elements
 - Packed 64, 32, 16 & 8-bit integer elements

Scalable predicate registers:

 - P0-P7 governing predicates for load/store/arithmetic
 - P8-P15 additional predicates for loop management
 - FFR first fault register for software speculation

Implementation choices and known limitations:

 - SVE memory model is not addressed

 - `herd7` implements 128 bits vector length (on top of existing Neon
   infrastructure)

 - `litmus7` uses ARM C Language Extensions (ACLE) for SVE
   + Building SVE test require `-ccopts "-march=armv8-a+sve -O2"`
   + Although `Z` registers overlap with `V` registers mixing them in litmus
     test would likely lead to compilation failure (due to difference
     in the ACLE types)
   + However, `V` register overlapped with `Z` register in `final` clause
     is supported (this way we can inspect content of `Z` register)

 - Following SVE instructions are implemented:
   + PTRUE (predicate)
   + MOV (immediate, unpredicated)
   + DUP (scalar)
   + ADD (vectors, unpredicated)
   + INDEX (immediate, scalar)
   + INDEX (immediates)
   + INDEX (scalar, immediate)
   + INDEX (scalars)
   + WHILELE (predicate)
   + WHILELT (predicate)
   + WHILELE (predicate)
   + WHILELO (predicate)
   + WHILELS  (predicate)
   + LD1B (scalar plus immediate, single register)
   + LD1H (scalar plus immediate, single register)
   + LD1W (scalar plus immediate, single register)
   + LD1D (scalar plus immediate, single register)
   + LD1D (scalar plus scalar, single register)
   + LD1B (scalar plus scalar, single register)
   + LD1H (scalar plus scalar, single register)
   + LD1W (scalar plus scalar, single register)
   + LD1B (scalar plus vector)
   + LD1H (scalar plus vector)
   + LD1W (scalar plus vector)
   + LD1D (scalar plus vector)
   + LD2B (scalar plus immediate)
   + LD2H (scalar plus immediate)
   + LD2W (scalar plus immediate)
   + LD2D (scalar plus immediate)
   + LD2B (scalar plus scalar)
   + LD2H (scalar plus scalar)
   + LD2W (scalar plus scalar)
   + LD2D (scalar plus scalar)
   + LD3B (scalar plus immediate)
   + LD3H (scalar plus immediate)
   + LD3W (scalar plus immediate)
   + LD3D (scalar plus immediate)
   + LD3B (scalar plus scalar)
   + LD3H (scalar plus scalar)
   + LD3W (scalar plus scalar)
   + LD3D (scalar plus scalar)
   + LD4B (scalar plus immediate)
   + LD4H (scalar plus immediate)
   + LD4W (scalar plus immediate)
   + LD4D (scalar plus immediate)
   + LD4B (scalar plus scalar)
   + LD4H (scalar plus scalar)
   + LD4W (scalar plus scalar)
   + LD4D (scalar plus scalar)
   + ST1B (scalar plus immediate, single register)
   + ST1H (scalar plus immediate, single register)
   + ST1W (scalar plus immediate, single register)
   + ST1D (scalar plus immediate, single register)
   + ST1B (scalar plus scalar, single register)
   + ST1H (scalar plus scalar, single register)
   + ST1W (scalar plus scalar, single register)
   + ST1D (scalar plus scalar, single register)
   + ST1B (scalar plus vector)
   + ST1H (scalar plus vector)
   + ST1W (scalar plus vector)
   + ST1D (scalar plus vector)
   + ST2B (scalar plus immediate)
   + ST2H (scalar plus immediate)
   + ST2W (scalar plus immediate)
   + ST2D (scalar plus immediate)
   + ST2B (scalar plus scalar)
   + ST2H (scalar plus scalar)
   + ST2W (scalar plus scalar)
   + ST2D (scalar plus scalar)
   + ST3B (scalar plus immediate)
   + ST3H (scalar plus immediate)
   + ST3W (scalar plus immediate)
   + ST3D (scalar plus immediate)
   + ST3B (scalar plus scalar)
   + ST3H (scalar plus scalar)
   + ST3W (scalar plus scalar)
   + ST3D (scalar plus scalar)
   + ST4B (scalar plus immediate)
   + ST4H (scalar plus immediate)
   + ST4W (scalar plus immediate)
   + ST4D (scalar plus immediate)
   + ST4B (scalar plus scalar)
   + ST4H (scalar plus scalar)
   + ST4W (scalar plus scalar)
   + ST4D (scalar plus scalar)
   + SVE aliases for condition codes

Signed-off-by: Vladimir Murzin <[email protected]>
SVE pseudo-relaxation are grouped under `Sv` suffix:

 - `SvV` command a SVE gather load and scatter store
 - `Sv{1,2,3,4}` command a SVE interleave loads and stores'

Generated tests set predicates in such way that single access won't
exceed 128 bit (which is minimal allowed vector length) which allow
them tolerate difference in implemented vector length when run on the
target. Also it aligns with herd's implementation choice of 128 bit
vector length.

Please note that predicate register require another allocator which
implemented via newly introduced `special2` allocator

Signed-off-by: Vladimir Murzin <[email protected]>
Note that both instructions accept P/M (aka merge predicate) which
means that destination register can be updated partially thus we mark
destination register both input and output in `litmus`

Signed-off-by: Vladimir Murzin <[email protected]>
The `PTRUE` and ` WHILE<OP>` instructions now transmit stored value
to semantic-time environement.

As a result, predicated instructions may sometimes
decide whether to access or not at semantic-time, leading to
a dramatic decrease of execution candidates number.
@murzinv
Copy link
Author

murzinv commented May 2, 2024

Fine, if you are happy with the current state, would you please rebase on master. You may also squash all commits into one, or at least squash the last two commit into one. Once this is done, I'll merge.

Done 😸

@maranget maranget merged commit 4fbccd1 into herd:master May 2, 2024
3 checks passed
@maranget
Copy link
Member

maranget commented May 2, 2024

Merged, thanks @murzinv and @relokin.

@murzinv murzinv deleted the sve-basics branch May 2, 2024 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants