Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV] Account for factor in interleave memory op costs #111511

Merged
merged 4 commits into from
Oct 30, 2024

Conversation

lukel97
Copy link
Contributor

@lukel97 lukel97 commented Oct 8, 2024

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type, but this was undercosting in all cases when compared to the measured performance of todays hardware.

On the x280 at NF=2 and spacemit-x60 at NF=2,3 and 4, a segmented load is carried out as a wide load and NF LMUL shuffle ops: https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput

All other NFs go through a slow path. On the spacemit-x60 this is proportional to VLMAX * NF, and on the x280 proportional to the number of segments.

This patch increases the cost by implementing a wide load + NF LMUL shuffle op cost for the lowest common denominator NF=2, and then a slower cost proportional to VL for the other NFs.

In a follow up patch we can add a tuning flag to use the faster cost model for NF=3 and 4 on the spacemit-x60.

Note that the FIXME about illegal vectors seems to have been fixed in #100436

@llvmbot
Copy link
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in #100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

void f(int8_t* a, int n) {
    for (int i = 0; i &lt; n; i++) {
        a[i * 2] += 1;
    }
}

vsetvli	t1, zero, e8, m2, ta, ma
vlseg2e8.v	v24, (t0)
vadd.vi	v24, v24, 1
vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s


Full diff: https://github.com/llvm/llvm-project/pull/111511.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+2-10)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+56-56)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61461681f79ed..608298903cace0 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -691,19 +691,11 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
       auto *SubVecTy =
           VectorType::get(VTy->getElementType(),
                           VTy->getElementCount().divideCoefficientBy(Factor));
-
       if (VTy->getElementCount().isKnownMultipleOf(Factor) &&
           TLI->isLegalInterleavedAccessType(SubVecTy, Factor, Alignment,
                                             AddressSpace, DL)) {
-        // FIXME: We use the memory op cost of the *legalized* type here,
-        // because it's getMemoryOpCost returns a really expensive cost for
-        // types like <6 x i8>, which show up when doing interleaves of
-        // Factor=3 etc. Should the memory op cost of these be cheaper?
-        auto *LegalVTy = VectorType::get(VTy->getElementType(),
-                                         LT.second.getVectorElementCount());
-        InstructionCost LegalMemCost = getMemoryOpCost(
-            Opcode, LegalVTy, Alignment, AddressSpace, CostKind);
-        return LT.first + LegalMemCost;
+        return Factor * getMemoryOpCost(Opcode, SubVecTy, Alignment,
+                                        AddressSpace, CostKind);
       }
     }
   }
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..34285b26c13875 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -12,20 +12,20 @@ entry:
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.7, ptr %data, i64 %i, i32 0
@@ -268,14 +268,14 @@ define void @i8_factor_8(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_8'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.8, ptr %data, i64 %i, i32 0

@llvmbot
Copy link
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in #100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

void f(int8_t* a, int n) {
    for (int i = 0; i &lt; n; i++) {
        a[i * 2] += 1;
    }
}

vsetvli	t1, zero, e8, m2, ta, ma
vlseg2e8.v	v24, (t0)
vadd.vi	v24, v24, 1
vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s


Full diff: https://github.com/llvm/llvm-project/pull/111511.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+2-10)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+56-56)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61461681f79ed..608298903cace0 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -691,19 +691,11 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
       auto *SubVecTy =
           VectorType::get(VTy->getElementType(),
                           VTy->getElementCount().divideCoefficientBy(Factor));
-
       if (VTy->getElementCount().isKnownMultipleOf(Factor) &&
           TLI->isLegalInterleavedAccessType(SubVecTy, Factor, Alignment,
                                             AddressSpace, DL)) {
-        // FIXME: We use the memory op cost of the *legalized* type here,
-        // because it's getMemoryOpCost returns a really expensive cost for
-        // types like <6 x i8>, which show up when doing interleaves of
-        // Factor=3 etc. Should the memory op cost of these be cheaper?
-        auto *LegalVTy = VectorType::get(VTy->getElementType(),
-                                         LT.second.getVectorElementCount());
-        InstructionCost LegalMemCost = getMemoryOpCost(
-            Opcode, LegalVTy, Alignment, AddressSpace, CostKind);
-        return LT.first + LegalMemCost;
+        return Factor * getMemoryOpCost(Opcode, SubVecTy, Alignment,
+                                        AddressSpace, CostKind);
       }
     }
   }
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..34285b26c13875 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -12,20 +12,20 @@ entry:
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.7, ptr %data, i64 %i, i32 0
@@ -268,14 +268,14 @@ define void @i8_factor_8(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_8'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.8, ptr %data, i64 %i, i32 0

@topperc
Copy link
Collaborator

topperc commented Oct 8, 2024

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s

Incomplete sentence

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 8, 2024

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s

Incomplete sentence

My bad, fixed now

@preames
Copy link
Collaborator

preames commented Oct 8, 2024

At a conceptual level, this doesn't seem like the right costing. I suspect a better model is performing a wide load and then performing some kind of additional shuffle uop.

Here's some preliminary throughput data from the banana pi3:

perf stat ./vlseg-nf4.out 
  ~23.188822 cycles-per-inst
  ~24061.448450 cycles-per-iteration
  ~1037.631350 insts-per-iteration

perf stat ./vlseg-nf2.out 
  ~11.796910 cycles-per-inst
  ~12026.720450 cycles-per-iteration
  ~1019.480600 insts-per-iteration

These are fairly expensive operations. Both above are LMUL=m1.

Some reference points (focusing on the nf2 case):

perf stat ./vnsrl-m1.out
  ~3.935553 cycles-per-inst
  ~20508.467900 cycles-per-iteration
  ~5211.076900 insts-per-iteration

$ perf stat ./vle-m2.out 
  ~3.972999 cycles-per-inst
  ~4020.006750 cycles-per-iteration
  ~1011.831750 insts-per-iteration

./vlseg-nf2-emulated.out
  ~9039.639900 cycles-per-iteration
  ~5019.285100 insts-per-iteration 

From the look of this, it looks like the right cost model for the segment load instruction might be load cost + NF*shuffle cost.

Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that BP3 had fast segmented loads and stores. It's possible I've got an error in my tests, anyone else have data on this question?

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 9, 2024

Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that BP3 had fast segmented loads and stores. It's possible I've got an error in my tests, anyone else have data on this question?

I tried this out: I think I can recreate your observation that the emulated vlseg2e8.v is faster, ~0.94B cycles vs ~1.26B cycles:

Sources

	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
loop:
	vsetvli t0, zero, e16, m2, ta, ma
	vle16.v v8, (a0)
	vsetvli t0, zero, e8, m1, ta, ma
	vnsrl.wi v10, v8, 0
	vnsrl.wi v11, v8, 8
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256	
	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
	vsetvli t0, zero, e8, m1, ta, ma
loop:
	vlseg2e8.v v8, (a0)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256

However were you storing the deinterleaved results back somewhere? Once I added that in the native vlseg2e8.v overtook the emulated version slightly, ~1.58B cycles vs 1.68B

Sources with stores

	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
loop:
	vsetvli t0, zero, e16, m2, ta, ma
	vle16.v v8, (a0)
	vsetvli t0, zero, e8, m1, ta, ma
	vnsrl.wi v10, v8, 0
	vnsrl.wi v11, v8, 8
	vse8.v v10, (a3)
	vse8.v v11, (a4)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256	
	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
	vsetvli t0, zero, e8, m1, ta, ma
loop:
	vlseg2e8.v v8, (a0)
	vse8.v v8, (a3)
	vse8.v v9, (a4)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 9, 2024

I suspect a better model is performing a wide load and then performing some kind of additional shuffle uop.

I think you're right, I did some more benchmarking on the banana pi and I think the throughput is proportional to something like Wide load + Factor * LMUL. These cycle counts are for various segmented loads without any storing:

vlseg2e8 M1: 1.26B      2 + 2 * 1 = 4 ops   (0.315 cycles/op)
vlseg2e8 M2: 2.52B      4 + 2 * 2 = 8 ops   (0.315 cycles/op)
vlseg2e8 M4: 5.04B      8 + 2 * 4 = 16 ops  (0.315 cycles/op)
					        
vlseg3e8 M1: 2.10B      4 + 3 * 1 = 7 ops   (0.30 cycles/op)
vlseg3e8 M2: 4.20B      8 + 3 * 2 = 14 ops  (0.30 cycles/op)
					        
vlseg4e8 M1: 2.52B      4 + 4 * 1 = 8 ops   (0.315 cycles/op)
vlseg4e8 M2: 5.04B      8 + 4 * 2 = 16 ops  (0.315 cycles/op)

I'm not sure what the exact formula is for the NF=3 loads, but it seems close enough. I'll update this PR anyway.

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 11, 2024

Something worth pointing out from the interleaved-accesses.ll test diff is that we now vectorize a factor 3 interleave pattern of i64s as three vlse64.v/vsse64.v (with +zvl128b) instead of a single vlseg3e64.v/vsseg3e64.v. This actually turns out to be slightly profitable on the banana pi F3, most likely due to how strided accesses scale with VLMAX, and at e64 this is smaller.

With vlseg3e64.v + vsseg3e64.v we get 65.59 cycles/iteration

        vsetvli t0, zero, e64, m2, ta, ma
loop:
        vlseg3e64.v v8, (a0)
        vadd.vi v8, v8, 1
        vadd.vi v10, v10, 2
        vadd.vi v12, v12, 3
        vsseg3e64.v v8, (a0)
        addi a1, a1, 1
        blt a1, a2, loop

With 3 x vlse64.v/vsse64.v we get 62.07 cycles/iteration

	li a3, 24
	addi a5, a0, 8
	addi a6, a0, 16
	vsetvli t0, zero, e64, m2, ta, ma
loop:
     	vlse64.v v8, (a0), a3
	vlse64.v v10, (a5), a3
	vlse64.v v12, (a5), a3
	vadd.vi v8, v8, 1
	vadd.vi v10, v10, 2
	vadd.vi v12, v12, 3
	vsse64.v v8, (a0), a3
	vsse64.v v10, (a5), a3
	vsse64.v v12, (a5), a3
	addi a1, a1, 1
	blt a1, a2, loop

@topperc
Copy link
Collaborator

topperc commented Oct 11, 2024

For x280, I think factor 2 is handled pretty much like a unit stride load. It reads/writes two register at once.

Other factors are handled one segment at a time. If the segment is more than DLEN bits then each segment will require multiple cycles. So the cost is something like VL*ceil(DLEN/(SEW*factor)). There's probably also some register file write port cost too, but I don't know exactly.

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 11, 2024

Other factors are handled one segment at a time. If the segment is more than DLEN bits then each segment will require multiple cycles. So the cost is something like VLceil(DLEN/(SEWfactor)).

At NF > 4 I'm seeing the same behaviour on the banana pi too, where it scales with VL. Do we need some sort of subtarget hook to specify which factors go down the wide-load fast path and which ones don't?

For x280, I think factor 2 is handled pretty much like a unit stride load. It reads/writes two register at once.

Does it also have some sort of shuffling overhead similar to the costing currently in this PR?

@preames
Copy link
Collaborator

preames commented Oct 11, 2024

I got curious, and went and did a full set of Factor vs SEW vs LMUL sweeps for the segmented loads on the BP3. You can find all the data here: https://github.com/preames/bp3-microarch/#vlseg_lmul_x_sew_throughput

Overall, my data confirms the snippets that Luke has posted above. Let me summarize what I think is going on here.

BP3 appears to have two implementations - one used for factors 2,3,4 and the other for factors 5,6,7,8. The first appears to be a wide load followed by some kind of shuffle operation, whereas the second appears to scale with the number of 128 b loads required + some kind of adjustment term. I'm having trouble fitting a good formula for either to be honest.

Given these results, and what Craig has noted above about the x280, I think we need to have distinct costing models for different factors. Annoyingly, it looks like that threshold may need to differ by processor as well.

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 15, 2024

BP3 appears to have two implementations - one used for factors 2,3,4 and the other for factors 5,6,7,8. The first appears to be a wide load followed by some kind of shuffle operation, whereas the second appears to scale with the number of 128 b loads required + some kind of adjustment term. I'm having trouble fitting a good formula for either to be honest.

For factors 2,3,4 the formula 1.5 * (VLEN/DLEN) * 2 * LMUL * NF seems to fit the results in https://github.com/preames/bp3-microarch/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5, which (wide load) + NF * LMUL shuffles would model.

Given these results, and what Craig has noted above about the x280, I think we need to have distinct costing models for different factors. Annoyingly, it looks like that threshold may need to differ by processor as well.

Agreed. But regardless of the uarch today, we're still under-costing all segmented accesses regardless of whether or not that specific NF is optimized as a wide load + shuffle.

Might I propose that we land this as an incremental improvement over what's in-tree today, where we use the (wide load) + NF * LMUL shuffles as a generic cost model. That way generically compiled code without -mcpu/-mtune will still benefit.

Then as a follow up we can add a tuning feature like TuneHasSlow{2,3,4,5,6,7,8}SegmentedLoadStore that further increases the cost to be proportional to VLMAX or VL*ceil(DLEN/(SEW*factor)), which we would add to the spacemit-x60 and x280 processor definitions

@preames
Copy link
Collaborator

preames commented Oct 18, 2024

Chatted w/Luke about this offline, and settled on an approach. Starting with review just so we're all on the same page:

  • Our existing known hardware differs in which segments are fast, but appear to have more or less the same behavior for fast and slow implementations respective. x60 appears to have fast NF=2,3,4, whereas x280 only has fast NF=2.
  • The current cost model is better than the best "fast" case. So we're currently under-costing all factors.

Tentative plan is to do the following:

  • Implement both a fast and slow variant of the cost model. Default NF2 to the fast one, and all others to the slow. This covers the intersection of fast cases in the wild today. This will be done in this review.
  • As a separate follow up change, consider adding tuning flags so that code compiled specifically for the x60 can benefit from fast NF=3,4.
  • As a separate follow up change, explore alternate lowerings for NF=2. Even with the "fast" segment load, some preliminary tests seem to indicate that a wide load and two vnrsl are still faster.

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in llvm#100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

	void f(int8_t* a, int n) {
	    for (int i = 0; i < n; i++) {
	        a[i * 2] += 1;
	    }
	}

	vsetvli	t1, zero, e8, m2, ta, ma
	vlseg2e8.v	v24, (t0)
	vadd.vi	v24, v24, 1
	vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s
@@ -481,7 +485,7 @@ loop:
%ext = zext i8 %l.1 to i32
%gep.dst = getelementptr i32, ptr %dst, i64 %iv
store i32 %ext, ptr %gep.dst, align 4
%iv.next = add nsw i64 %iv, 4
%iv.next = add nsw i64 %iv, 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the cost of non-NF2 interleaves has gone up I had to change the stride here so the loop vectorizer still decides to use an interleave group here.

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lukel97 lukel97 merged commit 14045de into llvm:main Oct 30, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants