-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce basic support for SVE #802
Conversation
c8ab54f
to
67d9bc3
Compare
0d86ee9
to
83c3fcc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very quick comments, I'll try and do another more thorough pass tomorrow.
I understand that the current implementation re-uses the Neon registers just renames them. How hard would it be to be able to change the register size?
DUP Z0.H, W0 ; | ||
DUP Z1.S, W0 ; | ||
|
||
forall 0:V0.4S={262148,262148,262148,262148} /\ 0:V1.4S={4,4,4,4} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are a little more obvious in hex 0:V0.4S={0x40004,0x40004,0x40004,0x40004};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I'll have a look in other tests and change to hex representation where appropriate
herd/AArch64Sem.ml
Outdated
let psize = predicate_psize src in | ||
let esize = scalable_esize dst in | ||
let orig = match pg with | ||
| AArch64Base.PMreg (_,AArch64Base.Zero) -> M.add V.zero V.zero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be M.zero
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
M.zero
doesn't work, yet mzero
does the trick. I'll also change the places where I need to wrap constant to use M.unitT
| AArch64Base.PMreg (_,AArch64Base.Merge) -> read_reg_scalable false dst ii | ||
| _ -> assert false | ||
in | ||
orig >>| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the semantics here are rmw, which makes sense for \Z. But for \M, do you know if you want rmw semantics or just setting individual lanes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into psudocode suggest that rmw is expected, yet I missed the case of all predicate being inactive. I'll change here and update other places where appropriate.
At the first glance, IIUC, we'd need to support arithmetic on larger register size. Another problem to solve would relate to the way to observe bits of register above 128, now we (ab)use Neon register for that. |
Thanks! If it was trivial to make it scalable then it would be nice. OTOH, I am not sure that the questions on the memory model change so maybe having one implementation where the register size is 128 is sufficient. |
0c1de28
to
46b13c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments
herd/AArch64Arch_herd.ml
Outdated
| Zreg (_,(8)) -> MachSize.Byte | ||
| Zreg (_,(16)) -> MachSize.Short | ||
| Zreg (_,(32)) -> MachSize.Word | ||
| Zreg (_,(64)) -> MachSize.Quad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Zreg (_,(8)) -> MachSize.Byte | |
| Zreg (_,(16)) -> MachSize.Short | |
| Zreg (_,(32)) -> MachSize.Word | |
| Zreg (_,(64)) -> MachSize.Quad | |
| Zreg (_,8) -> MachSize.Byte | |
| Zreg (_,16) -> MachSize.Short | |
| Zreg (_,32) -> MachSize.Word | |
| Zreg (_,64) -> MachSize.Quad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also this function seems to be unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it is not used. I think I introduced that while working on supporting different access sizes for SVE load/store instructions but realized that access size is driven by instruction mnemonic rather than arguments (in contrast to the Neon case). I'll drop this unused function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, but I haven't finished my review yet.
lib/AArch64Base.ml
Outdated
let pp_preg_simple r = match r with | ||
| Preg (r',_) -> | ||
(try List.assoc r' pvrs with Not_found -> assert false) | ||
| PMreg (r',_) -> | ||
(try List.assoc r' pvrs with Not_found -> assert false) | ||
| _ -> assert false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also do something like:
let pp_preg_simple r = match r with | |
| Preg (r',_) -> | |
(try List.assoc r' pvrs with Not_found -> assert false) | |
| PMreg (r',_) -> | |
(try List.assoc r' pvrs with Not_found -> assert false) | |
| _ -> assert false | |
let pp_preg_simple = function | |
| Preg (r,_) | |
| PMreg (r,_) -> | |
(try List.assoc r pvrs with Not_found -> assert false) | |
| _ -> assert false |
Also not sure why we're catching the exception here. It's not like assert false is much more helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @relokin, assert false
is slightly more informative as it it leads to a message with line number included. That said, assert false
should never be executed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also do something like:
Ack
AArch64 V05 | ||
(* Contiguous load two-word structures to two vectors (scalar index) *) | ||
{ | ||
uint16_t x[8] = {1,2,1,2,1,2,1,2}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to initialise x with different values instead of an alternating 1,2 pattern? IIUC, something like:
uint16_t x[8] = {1,2,1,2,1,2,1,2}; | |
uint16_t x[8] = {0,0,1,2,0,0,0,0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to uint16_t x[8] = {1,2,0,0,0,0,0,0};
litmus/AArch64Arch_litmus.ml
Outdated
| Preg (pr, _) -> | ||
(try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false) | ||
| PMreg (pr, _) -> | ||
(try Misc.lowercase (List.assoc pr pvrs) with Not_found -> assert false) | ||
| Zreg (zr, _) -> | ||
(try Misc.lowercase (List.assoc zr vvrs) with Not_found -> assert false) (*Intentionally 'vvrs' instead of 'zvrs'*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again here I don't know that catch Not_found and asserting is any more useful than just letting it throw an unhandled Not_found exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then again I guess you're doing what the code before this PR was doing...
374ef9c
to
3f67773
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of more comments but if you fixed some glitches we could merge this soon. Also, it might be worth adding something like this as it allowed me to run the sve tests with litmus7:
$> cat litmus/libdir/armv8+sve-qemu.cfg
size_of_test = 10k
number_of_run = 100
avail = 4
limit = true
memory = direct
stride = 1
carch = AArch64
barrier = userfence
smt = 4
smt_mode = seq
affinity = incr0
ccopts = -O2 -march=armv8-a+sve
crossrun = qemu:qemu-aarch64
SXTW X3,W1 ; | ||
ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1] ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is missing a WHILELT and it doesn't really need the SXTW:
SXTW X3,W1 ; | |
ST2H {Z1.H,Z2.H},P0,[X0,X3, LSL #1] ; | |
WHILELT P0.H,W1,W2 ; | |
ST2H {Z1.H,Z2.H},P0,[X0] ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I've added setup for P0
register. As for SXTW
if I remove that and swap X3
with X1
in ST2H
instruction I'd get Register x1 has different types: <int> and <int64_t>
when it is consumed by litmus7
. Anyway, since I set P0
with PTRUE
only user of X1
left is ST2H
.
@@ -0,0 +1,11 @@ | |||
Test V10 Required | |||
States 1 | |||
x={0,0,0,0,0,0,0,0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the intended result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
MOV W2,#4 ; | ||
SXTW X3,W1 ; | ||
WHILELT P0.S,W1,W2 ; | ||
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really the same as this, or is there something intentional about using a zero offset?
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0,X3, LSL #2]; | |
ST3W {Z1.S,Z2.S,Z3.S},P0,[X0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offset register here is to test parser side
MOV W2,#2 ; | ||
SXTW X3,W1 ; | ||
WHILELT P0.D,W1,W2 ; | ||
ST4D {Z1.D,Z2.D,Z3.D,Z4.D},P0,[X0,X3, LSL #3] ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
herd/AArch64Sem.ml
Outdated
@@ -1381,6 +1523,7 @@ module Make | |||
read_mem_postindexed | |||
a_virt op sz Annot.N aexp ac rd rs k a ii) | |||
ma ii) | |||
| _ -> assert false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the indentation is off here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
herd/AArch64Sem.ml
Outdated
|
||
let rec reduce_ord l = | ||
match l with | ||
| [] -> M.unitT () | ||
| h::t -> h >>= fun () -> reduce_ord t | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no need to move this, is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, unnecessary move.
herd/AArch64Sem.ml
Outdated
let check_neon inst = | ||
if not (neon || sve) then | ||
Warn.user_error | ||
"Neon instruction %s require -variant neon" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Neon instruction %s require -variant neon" | |
"Neon instruction %s requires -variant neon" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
herd/AArch64Sem.ml
Outdated
let check_sve inst = | ||
if not sve then | ||
Warn.user_error | ||
"SVE instruction %s require -variant sve" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SVE instruction %s require -variant sve" | |
"SVE instruction %s requires -variant sve" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
9d51420
to
acbeb83
Compare
I am happy with this PR, so we could merge soon! Is anyone else planning to have to review this? Otherwise, we'll aim to merge this by the end of the week. |
Wow! That runs at speed of light! I'm wondering if similar optimization can be applied to other instructions which set predicate like |
Here is the same optimisation applied to I also notice that all those |
Great thanks!
I see. Patch now part of the series 😉 |
Fine, if you are happy with the current state, would you please rebase on master. You may also squash all commits into one, or at least squash the last two commit into one. Once this is done, I'll merge. |
Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA): - Vector Length (VL) is a hardware implementation choice from 128 up to 2048 bits. - New programming model allows software to scale dynamically to available vector length. - No need to define a new ISA, rewrite or recompile for new vector lengths. Scalable vector registers: - Z0-Z31 extending NEON’s 128-bit V0-V31 - Packed DP, SP & HP floating-point elements - Packed 64, 32, 16 & 8-bit integer elements Scalable predicate registers: - P0-P7 governing predicates for load/store/arithmetic - P8-P15 additional predicates for loop management - FFR first fault register for software speculation Implementation choices and known limitations: - SVE memory model is not addressed - `herd7` implements 128 bits vector length (on top of existing Neon infrastructure) - `litmus7` uses ARM C Language Extensions (ACLE) for SVE + Building SVE test require `-ccopts "-march=armv8-a+sve -O2"` + Although `Z` registers overlap with `V` registers mixing them in litmus test would likely lead to compilation failure (due to difference in the ACLE types) + However, `V` register overlapped with `Z` register in `final` clause is supported (this way we can inspect content of `Z` register) - Following SVE instructions are implemented: + PTRUE (predicate) + MOV (immediate, unpredicated) + DUP (scalar) + ADD (vectors, unpredicated) + INDEX (immediate, scalar) + INDEX (immediates) + INDEX (scalar, immediate) + INDEX (scalars) + WHILELE (predicate) + WHILELT (predicate) + WHILELE (predicate) + WHILELO (predicate) + WHILELS (predicate) + LD1B (scalar plus immediate, single register) + LD1H (scalar plus immediate, single register) + LD1W (scalar plus immediate, single register) + LD1D (scalar plus immediate, single register) + LD1D (scalar plus scalar, single register) + LD1B (scalar plus scalar, single register) + LD1H (scalar plus scalar, single register) + LD1W (scalar plus scalar, single register) + LD1B (scalar plus vector) + LD1H (scalar plus vector) + LD1W (scalar plus vector) + LD1D (scalar plus vector) + LD2B (scalar plus immediate) + LD2H (scalar plus immediate) + LD2W (scalar plus immediate) + LD2D (scalar plus immediate) + LD2B (scalar plus scalar) + LD2H (scalar plus scalar) + LD2W (scalar plus scalar) + LD2D (scalar plus scalar) + LD3B (scalar plus immediate) + LD3H (scalar plus immediate) + LD3W (scalar plus immediate) + LD3D (scalar plus immediate) + LD3B (scalar plus scalar) + LD3H (scalar plus scalar) + LD3W (scalar plus scalar) + LD3D (scalar plus scalar) + LD4B (scalar plus immediate) + LD4H (scalar plus immediate) + LD4W (scalar plus immediate) + LD4D (scalar plus immediate) + LD4B (scalar plus scalar) + LD4H (scalar plus scalar) + LD4W (scalar plus scalar) + LD4D (scalar plus scalar) + ST1B (scalar plus immediate, single register) + ST1H (scalar plus immediate, single register) + ST1W (scalar plus immediate, single register) + ST1D (scalar plus immediate, single register) + ST1B (scalar plus scalar, single register) + ST1H (scalar plus scalar, single register) + ST1W (scalar plus scalar, single register) + ST1D (scalar plus scalar, single register) + ST1B (scalar plus vector) + ST1H (scalar plus vector) + ST1W (scalar plus vector) + ST1D (scalar plus vector) + ST2B (scalar plus immediate) + ST2H (scalar plus immediate) + ST2W (scalar plus immediate) + ST2D (scalar plus immediate) + ST2B (scalar plus scalar) + ST2H (scalar plus scalar) + ST2W (scalar plus scalar) + ST2D (scalar plus scalar) + ST3B (scalar plus immediate) + ST3H (scalar plus immediate) + ST3W (scalar plus immediate) + ST3D (scalar plus immediate) + ST3B (scalar plus scalar) + ST3H (scalar plus scalar) + ST3W (scalar plus scalar) + ST3D (scalar plus scalar) + ST4B (scalar plus immediate) + ST4H (scalar plus immediate) + ST4W (scalar plus immediate) + ST4D (scalar plus immediate) + ST4B (scalar plus scalar) + ST4H (scalar plus scalar) + ST4W (scalar plus scalar) + ST4D (scalar plus scalar) + SVE aliases for condition codes Signed-off-by: Vladimir Murzin <[email protected]>
SVE pseudo-relaxation are grouped under `Sv` suffix: - `SvV` command a SVE gather load and scatter store - `Sv{1,2,3,4}` command a SVE interleave loads and stores' Generated tests set predicates in such way that single access won't exceed 128 bit (which is minimal allowed vector length) which allow them tolerate difference in implemented vector length when run on the target. Also it aligns with herd's implementation choice of 128 bit vector length. Please note that predicate register require another allocator which implemented via newly introduced `special2` allocator Signed-off-by: Vladimir Murzin <[email protected]>
Note that both instructions accept P/M (aka merge predicate) which means that destination register can be updated partially thus we mark destination register both input and output in `litmus` Signed-off-by: Vladimir Murzin <[email protected]>
The `PTRUE` and ` WHILE<OP>` instructions now transmit stored value to semantic-time environement. As a result, predicated instructions may sometimes decide whether to access or not at semantic-time, leading to a dramatic decrease of execution candidates number.
Done 😸 |
Hi All,
This PR introduces basic support for SVE instructions in
litmus7
,diy7
andherd7
.Scalable Vector Extension (SVE) is Vector Length Agnostic (VLA):
Scalable vector registers:
Scalable predicate registers:
I personally find tutorial at [1] really great introduction to the SVE programming model.
Implementation choices and known limitations:
- SVE memory model is not addressed
-
herd7
implements 128 bits vector length (on top of existing Neon infrastructure)[1] https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf