Instruction | General theme | Writemask | Optional special features |
---|---|---|---|
fms64 (63=0), fms32 (63=0), fms16 (63=0) | z[j][i] -= x[i] * y[j] | 7 bit X, 7 bit Y | X/Y/Z input disable |
fms64 (63=1), fms32 (63=1), fms16 (63=1) | z[_][i] -= x[i] * y[i] | 7 bit | X/Y/Z input disable |
Bit | Width | Meaning | Notes |
---|---|---|---|
10 | 22 | A64 reserved instruction | Must be 0x201000 >> 10 |
5 | 5 | Instruction | 11 for fms64, 13 for fms32, 16 for fms16 |
0 | 5 | 5-bit GPR index | See below for the meaning of the 64 bits in the GPR |
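The three fields above can be packed into a 32-bit instruction word as sketched below. This is a minimal illustration only: the helper name `amx_encode_fms` and the enum constants are hypothetical and do not come from the repository.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode numbers taken from the instruction field in the table above. */
enum { AMX_OP_FMS64 = 11, AMX_OP_FMS32 = 13, AMX_OP_FMS16 = 16 };

/* Hypothetical helper: builds the instruction word from an opcode and the
   index (0..31) of the GPR that holds the 64-bit operand. */
static inline uint32_t amx_encode_fms(uint32_t op, uint32_t gpr) {
    assert(op < 32 && gpr < 32);
    /* Bits 10..31 must hold 0x201000 >> 10, so the word starts as 0x00201000. */
    return 0x00201000u | (op << 5) | gpr;
}
```

For example, `amx_encode_fms(AMX_OP_FMS32, 3)` describes an fms32 whose operand is read from x3. How the word is then emitted (for instance with a `.word` directive in inline assembly) is left to the caller.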
Bit | Width | Meaning | Notes |
---|---|---|---|
63 | 1 | Vector mode (1) or matrix mode (0) | |
62 | 1 | Z is f32 (1) or Z is instruction width (0) | Only used by fms16 in matrix mode, ignored otherwise |
61 | 1 | X is f16 (1) or X is instruction width (0) | Only used by fms32, ignored otherwise |
60 | 1 | Y is f16 (1) or Y is instruction width (0) | Only used by fms32, ignored otherwise |
48 | 12 | Ignored | |
46 | 2 | X enable mode | |
41 | 5 | X enable value | Meaning dependent upon associated mode |
39 | 2 | Ignored | |
37 | 2 | Y enable mode | Ignored in vector mode |
32 | 5 | Y enable value | Ignored in vector mode. Meaning dependent upon associated mode |
30 | 2 | Ignored | |
29 | 1 | Skip X input (1) or use X input (0) | |
28 | 1 | Skip Y input (1) or use Y input (0) | |
27 | 1 | Skip Z input (1) or use Z input (0) | |
26 | 1 | Ignored | |
20 | 6 | Z row | High bits ignored in matrix mode |
19 | 1 | Ignored | |
10 | 9 | X offset (in bytes) | |
9 | 1 | Ignored | |
0 | 9 | Y offset (in bytes) | |
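As an illustration of the operand layout, the following sketch unpacks the 64-bit GPR value into named fields. The struct `fms_operand` and the helpers `bits` and `fms_decode_operand` are hypothetical names introduced here; only the bit positions and widths come from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded operand fields. */
typedef struct {
    unsigned vector_mode;            /* bit 63 */
    unsigned z_is_f32;               /* bit 62 */
    unsigned x_is_f16;               /* bit 61 */
    unsigned y_is_f16;               /* bit 60 */
    unsigned x_enable_mode;          /* bits 46..47 */
    unsigned x_enable_value;         /* bits 41..45 */
    unsigned y_enable_mode;          /* bits 37..38 */
    unsigned y_enable_value;         /* bits 32..36 */
    unsigned skip_x, skip_y, skip_z; /* bits 29, 28, 27 */
    unsigned z_row;                  /* bits 20..25 */
    unsigned x_offset;               /* bits 10..18, in bytes */
    unsigned y_offset;               /* bits 0..8, in bytes */
} fms_operand;

/* Extracts `width` bits of `v`, starting at bit `lo`. */
static inline unsigned bits(uint64_t v, unsigned lo, unsigned width) {
    assert(width > 0 && width < 32 && lo + width <= 64);
    return (unsigned)((v >> lo) & ((1ull << width) - 1));
}

static inline fms_operand fms_decode_operand(uint64_t op) {
    fms_operand f;
    f.vector_mode    = bits(op, 63, 1);
    f.z_is_f32       = bits(op, 62, 1);
    f.x_is_f16       = bits(op, 61, 1);
    f.y_is_f16       = bits(op, 60, 1);
    f.x_enable_mode  = bits(op, 46, 2);
    f.x_enable_value = bits(op, 41, 5);
    f.y_enable_mode  = bits(op, 37, 2);
    f.y_enable_value = bits(op, 32, 5);
    f.skip_x         = bits(op, 29, 1);
    f.skip_y         = bits(op, 28, 1);
    f.skip_z         = bits(op, 27, 1);
    f.z_row          = bits(op, 20, 6);
    f.x_offset       = bits(op, 10, 9);
    f.y_offset       = bits(op, 0, 9);
    return f;
}
```

The decoder reports raw field values only; whether a field is honoured (for example bit 62 outside fms16 matrix mode) still depends on the notes column.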
Combinations of bits 27-29 result in various floating-point ALU operations:
Operation | 29 (X) | 28 (Y) | 27 (Z) |
---|---|---|---|
z-x*y | 0 | 0 | 0 |
-x*y | 0 | 0 | 1 |
z-x | 0 | 1 | 0 |
-x | 0 | 1 | 1 |
z-y | 1 | 0 | 0 |
-y | 1 | 0 | 1 |
z | 1 | 1 | 0 |
-0 | 1 | 1 | 1 |
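The eight rows follow a single pattern: the skip bits for X and Y decide what the "product" term is, and the skip bit for Z decides whether it is subtracted from z or simply negated. The sketch below expresses that pattern in portable C. The name `fms_alu_sketch` is hypothetical, and the code uses a separate multiply and subtract, so unlike the hardware it is not guaranteed to be fused and may round differently in the last bit.

```c
#include <assert.h>

/* `skip` is the 3-bit field at operand bits 27..29:
   bit 0 = skip Z (27), bit 1 = skip Y (28), bit 2 = skip X (29). */
static inline double fms_alu_sketch(double x, double y, double z, unsigned skip) {
    assert(skip < 8);
    int skip_z = skip & 1;
    int skip_y = (skip >> 1) & 1;
    int skip_x = (skip >> 2) & 1;

    /* A skipped multiplicand drops out; skipping both leaves a zero term. */
    double product;
    if (skip_x && skip_y) product = 0.0;
    else if (skip_x)      product = y;
    else if (skip_y)      product = x;
    else                  product = x * y; /* unfused: an approximation */

    /* With Z skipped the result is the negated term, e.g. -0 for skip == 7. */
    return skip_z ? -product : z - product;
}
```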
Combinations of the instruction and bits 60-63 result in various widths for X / Y / Z:
Mode | X | Y | Z | 63 (M) | 62 (Z) | 61 (X) | 60 (Y) | Op |
---|---|---|---|---|---|---|---|---|
Matrix | f16 | f16 | f16 (one row from each two) | 0 | 0 | | | fms16 |
Matrix | f16 | f16 | f32 (all rows, interleaved pairs) | 0 | 1 | | | fms16 |
Matrix | f32 | f32 | f32 (one row from each four) | 0 | | 0 | 0 | fms32 |
Matrix | f32 | f16 (even lanes) | f32 (one row from each four) | 0 | | 0 | 1 | fms32 |
Matrix | f16 (even lanes) | f32 | f32 (one row from each four) | 0 | | 1 | 0 | fms32 |
Matrix | f16 (even lanes) | f16 (even lanes) | f32 (one row from each four) | 0 | | 1 | 1 | fms32 |
Matrix | f64 | f64 | f64 (one row from each eight) | 0 | | | | fms64 |
Vector | f16 | f16 | f16 (one row) | 1 | | | | fms16 |
Vector | f32 | f32 | f32 (one row) | 1 | | 0 | 0 | fms32 |
Vector | f32 | f16 (even lanes) | f32 (one row) | 1 | | 0 | 1 | fms32 |
Vector | f16 (even lanes) | f32 | f32 (one row) | 1 | | 1 | 0 | fms32 |
Vector | f16 (even lanes) | f16 (even lanes) | f32 (one row) | 1 | | 1 | 1 | fms32 |
Vector | f64 | f64 | f64 (one row) | 1 | | | | fms64 |
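The width rules can be condensed into a small lookup, sketched below. It reads the table as follows: bit 62 matters only for fms16 in matrix mode, and bits 61 and 60 matter only for fms32. The type `fms_widths` and the function `fms_element_widths` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Element widths in bits for the X, Y and Z operands. */
typedef struct { unsigned x_bits, y_bits, z_bits; } fms_widths;

/* `op` is the instruction number: 11 (fms64), 13 (fms32) or 16 (fms16). */
static inline fms_widths fms_element_widths(unsigned op, uint64_t operand) {
    assert(op == 11 || op == 13 || op == 16);
    unsigned base = (op == 11) ? 64 : (op == 13) ? 32 : 16;
    fms_widths w = { base, base, base };
    int vector_mode = (int)((operand >> 63) & 1);

    /* fms16, matrix mode only: bit 62 widens Z to f32. */
    if (op == 16 && !vector_mode && ((operand >> 62) & 1)) w.z_bits = 32;

    /* fms32 only: bits 61 and 60 narrow X and Y to f16 (even lanes). */
    if (op == 13) {
        if ((operand >> 61) & 1) w.x_bits = 16;
        if ((operand >> 60) & 1) w.y_bits = 16;
    }
    return w;
}
```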
X/Y enable modes:
Mode | Meaning of value (N) |
---|---|
0 | Enable all lanes (0), or odd lanes only (1), or even lanes only (2), or no lanes (anything else) |
1 | Only enable lane #N |
2 | Only enable the first N lanes, or all lanes when N is zero |
3 | Only enable the last N lanes, or all lanes when N is zero |
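To make the four modes concrete, the sketch below turns a mode and value into a bitmask with one bit per lane. It is not the repository's `parse_writemask`: the function name `fms_lane_mask` is hypothetical, lanes are assumed to be numbered from zero (so "odd lanes" means lanes 1, 3, 5, ...), and the handling of N greater than or equal to the lane count is an assumption rather than documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a mask with bit i set when lane i is enabled.
   `lanes` is the lane count for the element width (e.g. 8 for f64). */
static inline uint64_t fms_lane_mask(unsigned mode, unsigned n, unsigned lanes) {
    assert(mode < 4 && n < 32 && lanes >= 1 && lanes <= 32);
    uint64_t all = (1ull << lanes) - 1;
    switch (mode) {
    case 0:
        if (n == 0) return all;
        if (n == 1) return all & 0xAAAAAAAAull; /* odd lanes */
        if (n == 2) return all & 0x55555555ull; /* even lanes */
        return 0;                               /* anything else: no lanes */
    case 1:
        /* Assumption: a lane index past the end enables nothing. */
        return (n < lanes) ? (1ull << n) : 0;
    case 2:
        /* Assumption: N >= lanes behaves like "all lanes". */
        if (n == 0 || n >= lanes) return all;
        return (1ull << n) - 1;                 /* first N lanes */
    default:
        /* Assumption: N >= lanes behaves like "all lanes". */
        if (n == 0 || n >= lanes) return all;
        return all & ~((1ull << (lanes - n)) - 1); /* last N lanes */
    }
}
```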
In vector mode, performs a pointwise fused-multiply-subtract (or simplification thereof) operation between an X vector, a Y vector, and a Z vector, accumulating onto the Z vector. All three vectors have the same element type, either f16 or f32 or f64. Alternatively, when Z has type f32, X or Y (or both) can have type f16, though only the even lanes are used.
In matrix mode, performs a fused-multiply-subtract (or simplification thereof) outer-product between an X vector, a Y vector, and a 2D grid of Z values, accumulating onto Z. X, Y, and Z all have the same element type, either f16 or f32 or f64. Alternatively, when Z has type f32, X or Y (or both) can have type f16, though only the even lanes are used. As a final alternative, when Z has type f32 and both X and Y have type f16, all lanes of X and Y can be used in combination with the entire 64x64 byte grid of Z, with even lanes of X going into even Z registers and odd lanes of X going into odd Z registers (see Mixed lane widths).
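The difference between the two modes can be sketched for the plain f32 case, with every lane enabled and no inputs skipped. The function names are hypothetical. The matrix-mode row index `j * 4 + (z_row & 3)` is an assumption, extrapolated from the "one row from each four" table entry and the `(j * 8) + (z_row & 7)` indexing of the f64 emulation. The arithmetic is written as a separate multiply and subtract, so it is not fused.

```c
#include <assert.h>

enum { F32_LANES = 16, Z_ROWS = 64 }; /* 64 Z rows of 64 bytes each */

/* Vector mode: pointwise update of a single Z row. */
static inline void fms32_vector_sketch(float z[Z_ROWS][F32_LANES],
                                       const float x[F32_LANES],
                                       const float y[F32_LANES],
                                       unsigned z_row) {
    assert(z_row < Z_ROWS);
    for (int i = 0; i < F32_LANES; i++)
        z[z_row][i] -= x[i] * y[i];
}

/* Matrix mode: outer product, touching one Z row out of every four. */
static inline void fms32_matrix_sketch(float z[Z_ROWS][F32_LANES],
                                       const float x[F32_LANES],
                                       const float y[F32_LANES],
                                       unsigned z_row) {
    assert(z_row < Z_ROWS);
    for (int j = 0; j < F32_LANES; j++) {
        float *row = z[j * 4 + (z_row & 3)]; /* assumed row selection */
        for (int i = 0; i < F32_LANES; i++)
            row[i] -= x[i] * y[j];
    }
}
```

With `z_row = 1`, the matrix sketch updates rows 1, 5, 9, ..., 61 and leaves the other 48 rows untouched, while the vector sketch updates row 1 only.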
See fms.c. Note the code in test.c to set the DN bit of fpcr.
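A representative sample is:

    /* Declared up front because emulate_AMX_FMS64 calls it before its definition. */
    double fms64_alu(double x, double y, double z, uint64_t operand);

    void emulate_AMX_FMS64(amx_state* state, uint64_t operand) {
        /* Unpack the operand fields. */
        uint64_t y_offset = operand & 0x1FF;
        uint64_t x_offset = (operand >> 10) & 0x1FF;
        uint64_t z_row = (operand >> 20) & 63;
        /* 7-bit writemasks (mode + value), expanded to one bit per byte. */
        uint64_t x_enable = parse_writemask(operand >> 41, 8, 7);
        uint64_t y_enable = parse_writemask(operand >> 32, 8, 7);

        double x[8];
        double y[8];
        load_xy_reg(x, state->x, x_offset);
        load_xy_reg(y, state->y, y_offset);

        for (int i = 0; i < 8; i++) {
            /* Each f64 lane spans 8 bytes, so lane i is tested at bit i * 8. */
            if (!((x_enable >> (i * 8)) & 1)) continue;
            if (operand & FMA_VECTOR_PRODUCT) {
                /* Vector mode (bit 63): pointwise update of a single Z row. */
                double* z = &state->z[z_row].f64[i];
                *z = fms64_alu(x[i], y[i], *z, operand);
            } else {
                /* Matrix mode: outer product into one Z row out of every eight. */
                for (int j = 0; j < 8; j++) {
                    if (!((y_enable >> (j * 8)) & 1)) continue;
                    double* z = &state->z[(j * 8) + (z_row & 7)].f64[i];
                    *z = fms64_alu(x[i], y[j], *z, operand);
                }
            }
        }
    }

    double fms64_alu(double x, double y, double z, uint64_t operand) {
        /* Bits 27-29 skip the Z, Y and X inputs respectively. */
        switch ((operand >> 27) & 7) {
        case 1: return -x * y;
        case 2: return z - x;
        case 3: return -x;
        case 4: return z - y;
        case 5: return -y;
        case 6: return z;
        case 7: return -0.;
        }
        /* No inputs skipped: fmsub computes z - x * y with a single rounding. */
        double out;
        __asm("fmsub %d0, %d1, %d2, %d3" : "=w"(out) : "w"(x), "w"(y), "w"(z));
        return out;
    }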
Identical to corresponding fma instruction.