
[RISCV] Introduce VLOptimizer pass #108640

Merged: 17 commits into llvm:main on Oct 11, 2024

Conversation

@michaelmaitland (Contributor) commented Sep 13, 2024

The purpose of this optimization is to make the VL argument of instructions that have one as small as possible. This is implemented by visiting each instruction in reverse order and checking, for each instruction with a VL argument, whether that VL can be reduced.

By putting this pass before VSETVLI insertion, we see three kinds of changes to generated code:

  1. Eliminate VSETVLI instructions
  2. Reduce the VL toggle on VSETVLI instructions that also change vtype
  3. Reduce the VL set by a VSETVLI instruction

For safety, the set of supported instructions is currently a whitelist. In the future, we could add more instructions to isSupportedInstr to enable even more VL optimization.
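
At a high level, the per-instruction walk looks roughly like the sketch below. This is a condensed paraphrase of the pass rather than its exact control flow; tryReduceVL and isSupportedInstr are the entry points named above, everything else here is simplified.

bool RISCVVLOptimizer::runOnMachineFunction(MachineFunction &MF) {
  MRI = &MF.getRegInfo();
  bool MadeChange = false;
  for (MachineBasicBlock &MBB : MF)
    // Visit in reverse so a user's demanded VL is seen before its producers.
    for (MachineInstr &MI : llvm::reverse(MBB))
      if (isSupportedInstr(MI))
        MadeChange |= tryReduceVL(MI);
  return MadeChange;
}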

We originally wrote this pass because vector GEP instructions do not take a VL, which leads us to emit code that uses VL=VLMAX to implement GEP in the RISC-V backend. As a result, some of the vector instructions will write to lanes, specifically between the intended VL and VLMAX, that will never be read. As an alternative to this pass, we considered adding a vector predicated GEP instruction, but this would not fit well into the intrinsic type system since GEP has a variable number of arguments, each with arbitrary types. The second approach we considered was to put this pass after VSETVLI insertion, but we found that it was more difficult to recognize optimization opportunities, especially across basic block boundaries -- the data flow analysis was also a bit more expensive and complex.

While this pass solves the GEP problem, we have expanded it to handle more cases of VL optimization, and there is opportunity for the analysis to be improved to enable even more optimization. We have a few follow up patches to post, but figured this would be a good start.

Co-authored-by: Craig Topper [email protected]
Co-authored-by: Kito Cheng [email protected]

@llvmbot (Collaborator) commented Sep 13, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes


In our downstream compiler, this was able to optimize > 5000 VLs on spec2006 and > 6000 VLs on spec2017. These measurements were taken with force-tail-folding-style=data-with-evl during vectorization. I don't have numbers for the upstream compiler.


Patch is 242.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108640.diff

29 Files Affected:

  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/RISCV.h (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+8-1)
  • (added) llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp (+1468)
  • (modified) llvm/test/CodeGen/RISCV/O3-pipeline.ll (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/narrow-shift-extend.ll (+27-27)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-int-vp.ll (+18-36)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vdiv-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vdivu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwmacc-vp.ll (+14-28)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwmsac-vp.ll (+12-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmacc-vp.ll (+15-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmsac-vp.ll (+15-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmax-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmaxu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmin-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vminu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmul-vp.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpgather-sdnode.ll (+174-209)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpscatter-sdnode.ll (+166-199)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vrem-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vremu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-insert-crossbb.ll (-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vshl-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsra-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsrl-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vssub-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vssubu-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vwsll-vp.ll (+30-60)
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 124bc239451aed..46ef94b4dd2931 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -59,6 +59,7 @@ add_llvm_target(RISCVCodeGen
   RISCVTargetObjectFile.cpp
   RISCVTargetTransformInfo.cpp
   RISCVVectorPeephole.cpp
+  RISCVVLOptimizer.cpp
   GISel/RISCVCallLowering.cpp
   GISel/RISCVInstructionSelector.cpp
   GISel/RISCVLegalizerInfo.cpp
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 5a94ada8f8dd46..651119d10d1119 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -99,6 +99,10 @@ void initializeRISCVO0PreLegalizerCombinerPass(PassRegistry &);
 
 FunctionPass *createRISCVPreLegalizerCombiner();
 void initializeRISCVPreLegalizerCombinerPass(PassRegistry &);
+
+FunctionPass *createRISCVVLOptimizerPass();
+void initializeRISCVVLOptimizerPass(PassRegistry &);
+
 } // namespace llvm
 
 #endif
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 794df2212dfa53..bfedc6b177332e 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -103,6 +103,10 @@ static cl::opt<bool> EnableVSETVLIAfterRVVRegAlloc(
     cl::desc("Insert vsetvls after vector register allocation"),
     cl::init(true));
 
+static cl::opt<bool> EnableVLOptimizer("riscv-enable-vloptimizer",
+                                       cl::desc("Enable the VL Optimizer pass"),
+                                       cl::init(true), cl::Hidden);
+
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeRISCVTarget() {
   RegisterTargetMachine<RISCVTargetMachine> X(getTheRISCV32Target());
   RegisterTargetMachine<RISCVTargetMachine> Y(getTheRISCV64Target());
@@ -550,8 +554,11 @@ void RISCVPassConfig::addMachineSSAOptimization() {
 
 void RISCVPassConfig::addPreRegAlloc() {
   addPass(createRISCVPreRAExpandPseudoPass());
-  if (TM->getOptLevel() != CodeGenOptLevel::None)
+  if (TM->getOptLevel() != CodeGenOptLevel::None) {
     addPass(createRISCVMergeBaseOffsetOptPass());
+    if (EnableVLOptimizer)
+      addPass(createRISCVVLOptimizerPass());
+  }
 
   addPass(createRISCVInsertReadWriteCSRPass());
   addPass(createRISCVInsertWriteVXRMPass());
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
new file mode 100644
index 00000000000000..d25db9b38bec44
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -0,0 +1,1468 @@
+//===-------------- RISCVVLOptimizer.cpp - VL Optimizer ------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// This pass reduces the VL where possible at the MI level, before VSETVLI
+// instructions are inserted.
+//
+// The purpose of this optimization is to make the VL argument, for instructions
+// that have a VL argument, as small as possible. This is implemented by
+// visiting each instruction in reverse order and checking that if it has a VL
+// argument, whether the VL can be reduced.
+//
+//===---------------------------------------------------------------------===//
+
+#include "RISCV.h"
+#include "RISCVMachineFunctionInfo.h"
+#include "RISCVSubtarget.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/InitializePasses.h"
+
+#include <algorithm>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-vl-optimizer"
+
+namespace {
+
+class RISCVVLOptimizer : public MachineFunctionPass {
+  const MachineRegisterInfo *MRI;
+  const MachineDominatorTree *MDT;
+
+public:
+  static char ID;
+
+  RISCVVLOptimizer() : MachineFunctionPass(ID) {
+    initializeRISCVVLOptimizerPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+    AU.addRequired<MachineDominatorTreeWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+
+  StringRef getPassName() const override { return "RISC-V VL Optimizer"; }
+
+private:
+  bool tryReduceVL(MachineInstr &MI);
+};
+
+} // end anonymous namespace
+
+char RISCVVLOptimizer::ID = 0;
+INITIALIZE_PASS_BEGIN(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer",
+                      false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
+INITIALIZE_PASS_END(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer", false,
+                    false)
+
+FunctionPass *llvm::createRISCVVLOptimizerPass() {
+  return new RISCVVLOptimizer();
+}
+
+/// Return true if R is a physical or virtual vector register, false otherwise.
+static bool isVectorRegClass(Register R, const MachineRegisterInfo *MRI) {
+  if (R.isPhysical())
+    return RISCV::VRRegClass.contains(R);
+  const TargetRegisterClass *RC = MRI->getRegClass(R);
+  return RISCV::VRRegClass.hasSubClassEq(RC) ||
+         RISCV::VRM2RegClass.hasSubClassEq(RC) ||
+         RISCV::VRM4RegClass.hasSubClassEq(RC) ||
+         RISCV::VRM8RegClass.hasSubClassEq(RC);
+}
+
+/// Represents the EMUL and EEW of a MachineOperand.
+struct OperandInfo {
+  enum class State {
+    Unknown,
+    Known,
+  } S;
+
+  // Represent as 1,2,4,8, ... and fractional indicator. This is because
+  // EMUL can take on values that don't map to RISCVII::VLMUL values exactly.
+  // For example, a mask operand can have an EMUL less than MF8.
+  std::pair<unsigned, bool> EMUL;
+
+  unsigned Log2EEW;
+
+  OperandInfo(RISCVII::VLMUL EMUL, unsigned Log2EEW)
+      : S(State::Known), EMUL(RISCVVType::decodeVLMUL(EMUL)), Log2EEW(Log2EEW) {
+  }
+
+  OperandInfo(std::pair<unsigned, bool> EMUL, unsigned Log2EEW)
+      : S(State::Known), EMUL(EMUL), Log2EEW(Log2EEW) {}
+
+  OperandInfo(State S) : S(S) {
+    assert(S != State::Known &&
+           "This constructor may only be used to construct "
+           "an Unknown OperandInfo");
+  }
+
+  bool isUnknown() { return S == State::Unknown; }
+  bool isKnown() { return S == State::Known; }
+
+  static bool EMULAndEEWAreEqual(OperandInfo A, OperandInfo B) {
+    assert(A.isKnown() && B.isKnown() && "Both operands must be known");
+    return A.Log2EEW == B.Log2EEW && A.EMUL.first == B.EMUL.first &&
+           A.EMUL.second == B.EMUL.second;
+  }
+};
+
+/// Return the RISCVII::VLMUL that is two times VLMul.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_8.
+static RISCVII::VLMUL twoTimesVLMUL(RISCVII::VLMUL VLMul) {
+  switch (VLMul) {
+  case RISCVII::VLMUL::LMUL_F8:
+    return RISCVII::VLMUL::LMUL_F4;
+  case RISCVII::VLMUL::LMUL_F4:
+    return RISCVII::VLMUL::LMUL_F2;
+  case RISCVII::VLMUL::LMUL_F2:
+    return RISCVII::VLMUL::LMUL_1;
+  case RISCVII::VLMUL::LMUL_1:
+    return RISCVII::VLMUL::LMUL_2;
+  case RISCVII::VLMUL::LMUL_2:
+    return RISCVII::VLMUL::LMUL_4;
+  case RISCVII::VLMUL::LMUL_4:
+    return RISCVII::VLMUL::LMUL_8;
+  case RISCVII::VLMUL::LMUL_8:
+  default:
+    llvm_unreachable("Could not multiply VLMul by 2");
+  }
+}
+
+/// Return the RISCVII::VLMUL that is VLMul / 2.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_MF8.
+static RISCVII::VLMUL halfVLMUL(RISCVII::VLMUL VLMul) {
+  switch (VLMul) {
+  case RISCVII::VLMUL::LMUL_F4:
+    return RISCVII::VLMUL::LMUL_F8;
+  case RISCVII::VLMUL::LMUL_F2:
+    return RISCVII::VLMUL::LMUL_F4;
+  case RISCVII::VLMUL::LMUL_1:
+    return RISCVII::VLMUL::LMUL_F2;
+  case RISCVII::VLMUL::LMUL_2:
+    return RISCVII::VLMUL::LMUL_1;
+  case RISCVII::VLMUL::LMUL_4:
+    return RISCVII::VLMUL::LMUL_2;
+  case RISCVII::VLMUL::LMUL_8:
+    return RISCVII::VLMUL::LMUL_4;
+  case RISCVII::VLMUL::LMUL_F8:
+  default:
+    llvm_unreachable("Could not divide VLMul by 2");
+  }
+}
+
+/// Return EMUL = (EEW / SEW) * LMUL where EEW comes from Log2EEW and LMUL and
+/// SEW are from the TSFlags of MI.
+static std::pair<unsigned, bool>
+getEMULEqualsEEWDivSEWTimesLMUL(unsigned Log2EEW, const MachineInstr &MI) {
+  RISCVII::VLMUL MIVLMUL = RISCVII::getLMul(MI.getDesc().TSFlags);
+  auto [MILMUL, MILMULIsFractional] = RISCVVType::decodeVLMUL(MIVLMUL);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+  unsigned MISEW = 1 << MILog2SEW;
+
+  unsigned EEW = 1 << Log2EEW;
+  // Calculate (EEW/SEW)*LMUL preserving fractions less than 1. Use GCD
+  // to put fraction in simplest form.
+  unsigned Num = EEW, Denom = MISEW;
+  int GCD = MILMULIsFractional ? std::gcd(Num, Denom * MILMUL)
+                               : std::gcd(Num * MILMUL, Denom);
+  Num = MILMULIsFractional ? Num / GCD : Num * MILMUL / GCD;
+  Denom = MILMULIsFractional ? Denom * MILMUL / GCD : Denom / GCD;
+  return std::make_pair(Num > Denom ? Num : Denom, Denom > Num);
+}
+
+/// An index segment load or store operand has the form v.*seg<nf>ei<eeew>.v.
+/// Data has EEW=SEW, EMUL=LMUL. Index has EEW=<eew>, EMUL=(EEW/SEW)*LMUL. LMUL
+/// and SEW comes from TSFlags of MI.
+static OperandInfo
+getIndexSegmentLoadStoreOperandInfo(unsigned Log2EEW, const MachineInstr &MI,
+                                    const MachineOperand &MO) {
+  // Operand 0 is data register
+  // Data vector register group has EEW=SEW, EMUL=LMUL.
+  if (MO.getOperandNo() == 0) {
+    RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+    unsigned MILog2SEW =
+        MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+    return OperandInfo(MIVLMul, MILog2SEW);
+  }
+
+  // Operand 1 is index vector register
+  // v.*seg<nf>ei<eeew>.v
+  // Index vector register group has EEW=<eew>, EMUL=(EEW/SEW)*LMUL.
+  if (MO.getOperandNo() == 2)
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+
+  llvm_unreachable("Could not get OperandInfo for non-vector register of an "
+                   "indexed segment load or store instruction");
+}
+
+/// Dest has EEW=SEW and EMUL=LMUL. Source EEW=SEW/Factor (i.e. F2 => EEW/2).
+/// Source has EMUL=(EEW/SEW)*LMUL. LMUL and SEW comes from TSFlags of MI.
+static OperandInfo getIntegerExtensionOperandInfo(unsigned Factor,
+                                                  const MachineInstr &MI,
+                                                  const MachineOperand &MO) {
+  RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+
+  if (MO.getOperandNo() == 0)
+    return OperandInfo(MIVLMul, MILog2SEW);
+
+  unsigned MISEW = 1 << MILog2SEW;
+  unsigned EEW = MISEW / Factor;
+  unsigned Log2EEW = Log2_32(EEW);
+
+  return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+}
+
+/// Check whether MO is a mask operand of MI.
+static bool isMaskOperand(const MachineInstr &MI, const MachineOperand &MO,
+                          const MachineRegisterInfo *MRI) {
+
+  if (!MO.isReg() || !isVectorRegClass(MO.getReg(), MRI))
+    return false;
+
+  const MCInstrDesc &Desc = MI.getDesc();
+  return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;
+}
+
+/// Return the OperandInfo for MO, which is an operand of MI.
+static OperandInfo getOperandInfo(const MachineInstr &MI,
+                                  const MachineOperand &MO,
+                                  const MachineRegisterInfo *MRI) {
+  const RISCVVPseudosTable::PseudoInfo *RVV =
+      RISCVVPseudosTable::getPseudoInfo(MI.getOpcode());
+  assert(RVV && "Could not find MI in PseudoTable");
+
+  // MI has a VLMUL and SEW associated with it. The RVV specification defines
+  // the LMUL and SEW of each operand and definition in relation to MI.VLMUL and
+  // MI.SEW.
+  RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+  bool IsMODef = MO.getOperandNo() == 0;
+
+  // All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
+  if (isMaskOperand(MI, MO, MRI))
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
+
+  // TODO: Pseudos that end in _MASK or _TU can have a merge operand.
+  // We bail out early for instructions that have merge operands for now.
+  // Merge operand is first operand after defs.
+  if (MO.getOperandNo() == MI.getNumExplicitDefs() && MO.isReg() && MO.isTied())
+    return OperandInfo(OperandInfo::State::Unknown);
+
+  // switch against BaseInstr to reduce number of cases that need to be
+  // considered.
+  switch (RVV->BaseInstr) {
+
+  // 6. Configuration-Setting Instructions
+  // Configuration setting instructions do not read or write vector registers
+  case RISCV::VSETIVLI:
+  case RISCV::VSETVL:
+  case RISCV::VSETVLI:
+    llvm_unreachable("Configuration setting instructions do not read or write "
+                     "vector registers");
+
+  // 7. Vector Loads and Stores
+  // 7.4. Vector Unit-Stride Instructions
+  // 7.5. Vector Strided Instructions
+  // 7.7. Unit-stride Fault-Only-First Loads
+  /// Dest EEW encoded in the instruction and EMUL=(EEW/SEW)*LMUL
+  case RISCV::VLE8_V:
+  case RISCV::VSE8_V:
+  case RISCV::VLM_V:
+  case RISCV::VSM_V:
+  case RISCV::VLSE8_V:
+  case RISCV::VSSE8_V:
+  case RISCV::VLE8FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+  case RISCV::VLE16_V:
+  case RISCV::VSE16_V:
+  case RISCV::VLSE16_V:
+  case RISCV::VSSE16_V:
+  case RISCV::VLE16FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+  case RISCV::VLE32_V:
+  case RISCV::VSE32_V:
+  case RISCV::VLSE32_V:
+  case RISCV::VSSE32_V:
+  case RISCV::VLE32FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+  case RISCV::VLE64_V:
+  case RISCV::VSE64_V:
+  case RISCV::VLSE64_V:
+  case RISCV::VSSE64_V:
+  case RISCV::VLE64FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+  // 7.6. Vector Indexed Instructions
+  // Data EEW=SEW, EMUL=LMUL. Index EEW=<eew> and EMUL=(EEW/SEW)*LMUL
+  case RISCV::VLUXEI8_V:
+  case RISCV::VLOXEI8_V:
+  case RISCV::VSUXEI8_V:
+  case RISCV::VSOXEI8_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+  case RISCV::VLUXEI16_V:
+  case RISCV::VLOXEI16_V:
+  case RISCV::VSUXEI16_V:
+  case RISCV::VSOXEI16_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+  case RISCV::VLUXEI32_V:
+  case RISCV::VLOXEI32_V:
+  case RISCV::VSUXEI32_V:
+  case RISCV::VSOXEI32_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+  case RISCV::VLUXEI64_V:
+  case RISCV::VLOXEI64_V:
+  case RISCV::VSUXEI64_V:
+  case RISCV::VSOXEI64_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+  // 7.8. Vector Load/Store Segment Instructions
+  // 7.8.1. Vector Unit-Stride Segment Loads and Stores
+  // v.*seg<nf>e<eew>.*
+  // EEW=eew, EMUL=LMUL
+  case RISCV::VLSEG2E8_V:
+  case RISCV::VLSEG2E8FF_V:
+  case RISCV::VLSEG3E8_V:
+  case RISCV::VLSEG3E8FF_V:
+  case RISCV::VLSEG4E8_V:
+  case RISCV::VLSEG4E8FF_V:
+  case RISCV::VLSEG5E8_V:
+  case RISCV::VLSEG5E8FF_V:
+  case RISCV::VLSEG6E8_V:
+  case RISCV::VLSEG6E8FF_V:
+  case RISCV::VLSEG7E8_V:
+  case RISCV::VLSEG7E8FF_V:
+  case RISCV::VLSEG8E8_V:
+  case RISCV::VLSEG8E8FF_V:
+  case RISCV::VSSEG2E8_V:
+  case RISCV::VSSEG3E8_V:
+  case RISCV::VSSEG4E8_V:
+  case RISCV::VSSEG5E8_V:
+  case RISCV::VSSEG6E8_V:
+  case RISCV::VSSEG7E8_V:
+  case RISCV::VSSEG8E8_V:
+    return OperandInfo(MIVLMul, 3);
+  case RISCV::VLSEG2E16_V:
+  case RISCV::VLSEG2E16FF_V:
+  case RISCV::VLSEG3E16_V:
+  case RISCV::VLSEG3E16FF_V:
+  case RISCV::VLSEG4E16_V:
+  case RISCV::VLSEG4E16FF_V:
+  case RISCV::VLSEG5E16_V:
+  case RISCV::VLSEG5E16FF_V:
+  case RISCV::VLSEG6E16_V:
+  case RISCV::VLSEG6E16FF_V:
+  case RISCV::VLSEG7E16_V:
+  case RISCV::VLSEG7E16FF_V:
+  case RISCV::VLSEG8E16_V:
+  case RISCV::VLSEG8E16FF_V:
+  case RISCV::VSSEG2E16_V:
+  case RISCV::VSSEG3E16_V:
+  case RISCV::VSSEG4E16_V:
+  case RISCV::VSSEG5E16_V:
+  case RISCV::VSSEG6E16_V:
+  case RISCV::VSSEG7E16_V:
+  case RISCV::VSSEG8E16_V:
+    return OperandInfo(MIVLMul, 4);
+  case RISCV::VLSEG2E32_V:
+  case RISCV::VLSEG2E32FF_V:
+  case RISCV::VLSEG3E32_V:
+  case RISCV::VLSEG3E32FF_V:
+  case RISCV::VLSEG4E32_V:
+  case RISCV::VLSEG4E32FF_V:
+  case RISCV::VLSEG5E32_V:
+  case RISCV::VLSEG5E32FF_V:
+  case RISCV::VLSEG6E32_V:
+  case RISCV::VLSEG6E32FF_V:
+  case RISCV::VLSEG7E32_V:
+  case RISCV::VLSEG7E32FF_V:
+  case RISCV::VLSEG8E32_V:
+  case RISCV::VLSEG8E32FF_V:
+  case RISCV::VSSEG2E32_V:
+  case RISCV::VSSEG3E32_V:
+  case RISCV::VSSEG4E32_V:
+  case RISCV::VSSEG5E32_V:
+  case RISCV::VSSEG6E32_V:
+  case RISCV::VSSEG7E32_V:
+  case RISCV::VSSEG8E32_V:
+    return OperandInfo(MIVLMul, 5);
+  case RISCV::VLSEG2E64_V:
+  case RISCV::VLSEG2E64FF_V:
+  case RISCV::VLSEG3E64_V:
+  case RISCV::VLSEG3E64FF_V:
+  case RISCV::VLSEG4E64_V:
+  case RISCV::VLSEG4E64FF_V:
+  case RISCV::VLSEG5E64_V:
+  case RISCV::VLSEG5E64FF_V:
+  case RISCV::VLSEG6E64_V:
+  case RISCV::VLSEG6E64FF_V:
+  case RISCV::VLSEG7E64_V:
+  case RISCV::VLSEG7E64FF_V:
+  case RISCV::VLSEG8E64_V:
+  case RISCV::VLSEG8E64FF_V:
+  case RISCV::VSSEG2E64_V:
+  case RISCV::VSSEG3E64_V:
+  case RISCV::VSSEG4E64_V:
+  case RISCV::VSSEG5E64_V:
+  case RISCV::VSSEG6E64_V:
+  case RISCV::VSSEG7E64_V:
+  case RISCV::VSSEG8E64_V:
+    return OperandInfo(MIVLMul, 6);
+
+  // 7.8.2. Vector Strided Segment Loads and Stores
+  case RISCV::VLSSEG2E8_V:
+  case RISCV::VLSSEG3E8_V:
+  case RISCV::VLSSEG4E8_V:
+  case RISCV::VLSSEG5E8_V:
+  case RISCV::VLSSEG6E8_V:
+  case RISCV::VLSSEG7E8_V:
+  case RISCV::VLSSEG8E8_V:
+  case RISCV::VSSSEG2E8_V:
+  case RISCV::VSSSEG3E8_V:
+  case RISCV::VSSSEG4E8_V:
+  case RISCV::VSSSEG5E8_V:
+  case RISCV::VSSSEG6E8_V:
+  case RISCV::VSSSEG7E8_V:
+  case RISCV::VSSSEG8E8_V:
+    return OperandInfo(MIVLMul, 3);
+  case RISCV::VLSSEG2E16_V:
+  case RISCV::VLSSEG3E16_V:
+  case RISCV::VLSSEG4E16_V:
+  case RISCV::VLSSEG5E16_V:
+  case RISCV::VLSSEG6E16_V:
+  case RISCV::VLSSEG7E16_V:
+  case RISCV::VLSSEG8E16_V:
+  case RISCV::VSSSEG2E16_V:
+  case RISCV::VSSSEG3E16_V:
+  case RISCV::VSSSEG4E16_V:
+  case RISCV::VSSSEG5E16_V:
+  case RISCV::VSSSEG6E16_V:
+  case RISCV::VSSSEG7E16_V:
+  case RISCV::VSSSEG8E16_V:
+    return OperandInfo(MIVLMul, 4);
+  case RISCV::VLSSEG2E32_V:
+  case RISCV::VLSSEG3E32_V:
+  case RISCV::VLSSEG4E32_V:
+  case RISCV::VLSSEG5E32_V:
+  case RISCV::VLSSEG6E32_V:
+  case RISCV::VLSSEG7E32_V:
+  case RISCV::VLSSEG8E32_V:
+  case RISCV::VSSSEG2E32_V:
+  case RISCV::VSSSEG3E32_V:
+  case RISCV::VSSSEG4E32_V:
+  case RISCV::VSSSEG5E32_V:
+  case RISCV::VSSSEG6E32_V:
+  case RISCV::VSSSEG7E32_V:
+  case RISCV::VSSSEG8E32_V:
+    return OperandInfo(MIVLMul, 5);
+  case RISCV::VLSSEG2E64_V:
+  case RISCV::VLSSEG3E64_V:
+  case RISCV::VLSSEG4E64_V:
+  case RISCV::VLSSEG5E64_V:
+  case RISCV::VLSSEG6E64_V:
+  case RISCV::VLSSEG7E64_V:
+  case RISCV::VLSSEG8E64_V:
+  case RISCV::VSSSEG2E64_V:
+  case RISCV::VSSSEG3E64_V:
+  case RISCV::VSSSEG4E64_V:
+  case RISCV::VSSSEG5E64_V:
+  case RISCV::VSSSEG6E64_V:
+  case RISCV::VSSSEG7E64_V:
+  case RISCV::VSSSEG8E64_V:
+    return OperandInfo(MIVLMul, 6);
+
+  // 7.8.3. Vector Indexed Segment Loads and Stores
+  case RISCV::VLUXSEG2EI8_V:
+  case RISCV::VLUXSEG3EI8_V:
+  case RISCV::VLUXSEG4EI8_V:
+  case RISCV::VLUXSEG5EI8_V:
+  case RISCV::VLUXSEG6EI8_V:
+  case RISCV::VLUXSEG7EI8_V:
+  case RISCV::VLUXSEG8EI8_V:
+  case RISCV::VLOXSEG2EI8_V:
+  case RISCV::VLOXSEG3EI8_V:
+  case RISCV::VLOXSEG4EI8_V:
+  case RISCV::VLOXSEG5EI8_V:
+  case RISCV::VLOXSEG6EI8_V:
+  case RISCV::VLOXSEG7EI8_V:
+  case RISCV::VLOXSEG8EI8_V:
+  case RISCV::VSUXSEG2EI8_V:
+  case RISCV::VSUXSEG3EI8_V:
+  case RISCV::VSUXSEG4EI8_V:
+  case RISCV::VSUXSEG5EI8_V:
+  case RISCV::VSUXSEG6EI8_V:
+  case RISCV::VSUXSEG7EI8_V:
+  case RISCV::VSUXSEG8EI8_V:
+  case RISCV::VSOXSEG2EI8_V...
[truncated]

@michaelmaitland (Contributor Author)

I have force-pushed because there had not been any review yet and I was missing some important bug fixes. I also opted to remove some functionality that I think would be better as a follow-up patch, to make the review here more straightforward.

github-actions bot commented Sep 13, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@preames (Collaborator) commented Sep 13, 2024

Hey, awesome! Really happy to see this making its way upstream. As I'm sure you know, I've been slowly expanding tryToReduceVL in RISCVVectorPeephole in this direction. We should definitely make sure we're sharing as much code as possible. I think we still need the early peephole (to drive the vmv.v.v conversions), but we should at least share code.

I'm going to take a real look at the code on Monday, but wanted to state initial support!

@wangpc-pp (Contributor) left a comment

Great! I just left some comments. I will have a detailed review later.

@preames (Collaborator) left a comment

A couple of detail comments inline, but I want to focus on high level review strategy.

Looking at the code for this, I spot a couple possible bugs (flagged in inline comments), but nothing really terrible code structure wise. I'd like to propose that we defer most of the code cleanup suggestions until after this is in tree. Specifically, I'm proposing the following:

  1. This patch is repurposed to add the transform pass off-by-default. We subset as needed to get something correct in tree with minimal delay.
  2. We iterate in tree to common code, and to add back any key features subset in step (1).
  3. We enable the pass in its own commit.

I'm very hesitant to do the usual style review here because there's so much room for subtle bugs to be introduced, and if we get into such a cycle the review could extend nearly indefinitely. I'm willing to assume good-faith follow-up here, and frankly will likely invest a bunch of time myself in getting this code into a long-term supportable state.

LLVM_DEBUG(dbgs() << "Try reduce VL for " << MI << "\n");
std::optional<Register> CommonVL;
bool CanReduceVL = true;
for (auto &UserOp : MRI->use_operands(MI.getOperand(0).getReg())) {
Collaborator

Visiting each user each time we visit something on the worklist (combined with an extra visit from the outer loop) seems potentially expensive. We could restructure this along the lines of an instcombine-style worklist which integrates the outer pass.

Contributor Author
Left a FIXME.
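
For illustration, here is a rough sketch of the instcombine-style worklist shape being suggested above, assuming helper names isCandidate and tryReduceVL and a simplified re-queue rule; it is not the posted implementation.

SetVector<MachineInstr *> Worklist;
for (MachineBasicBlock &MBB : MF)
  for (MachineInstr &MI : llvm::reverse(MBB))
    if (isCandidate(MI))
      Worklist.insert(&MI);

while (!Worklist.empty()) {
  MachineInstr &MI = *Worklist.pop_back_val();
  if (!tryReduceVL(MI))
    continue;
  // Only re-queue the producers of MI's vector sources; their users'
  // demanded VL may have just shrunk.
  for (const MachineOperand &MO : MI.uses())
    if (MO.isReg() && MO.getReg().isVirtual())
      if (MachineInstr *Def = MRI->getVRegDef(MO.getReg()))
        Worklist.insert(Def);
}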

// Instructions like reductions may use a vector register as a scalar
// register. In this case, we should treat it like a scalar register which
// does not impact the decision on whether to optimize VL.
if (isVectorOpUsedAsScalarOp(UserOp)) {
Collaborator
I think there might be a bug here. The reduction still needs the VL to be non-zero. If all other users have a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have its VL reduced) with a wrong scalar input value.

Contributor Author
I don't know if I follow your thinking here.

On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.

Contributor
If all other users have a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have its VL reduced) with a wrong scalar input value.

All the users have to have the same VL, so the reduction's VL would also be zero and it wouldn't read the scalar input value I think.

Collaborator
On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.

Zero at runtime, not statically known to be zero.

unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
// Looking for a register VL that isn't X0.
if (!VLOp.isReg() || VLOp.getReg() == RISCV::X0) {
Collaborator
The choice to only handle registers here is a surprising limitation. We can likely generalize this for immediates as well.

Contributor Author
We have this generalization written; I was going to post it as a follow-up.
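
For reference, a hedged sketch of what that generalization might look like around the quoted check; the immediate branch here is an assumption, not the follow-up patch.

unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
if (VLOp.isImm()) {
  // An immediate VL could participate too, e.g. by tracking CommonVL as a
  // MachineOperand (register or immediate) rather than a Register.
} else if (!VLOp.isReg() || VLOp.getReg() == RISCV::X0) {
  // Existing path: bail out on X0 and other non-register VLs.
}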

const MachineInstr &MI = *MO.getParent();
bool HasPassthru = RISCVII::isFirstDefTiedToFirstUse(MI.getDesc());

if (HasPassthru)
Collaborator
This silent off-by-one on the indexing is confusing, to say the least. Given this is only used in a handful of places, I'd prefer to see it inlined and specialized for each user.

case RISCV::VL8RE32_V:
case RISCV::VL8RE64_V:
case RISCV::VS8R_V:
return OperandInfo(RISCVII::VLMUL::LMUL_8, 0);
Collaborator
I believe all of these hard-coded LMULs on the whole-register stuff are equal to MIVLMul.

Collaborator
We don't have pseudo instructions for whole register load/store and we don't appear to set the TSFlags VLMul for non-pseudo instructions.

Collaborator
Ok, another idea...

Their LMUL should match the register class, right? We don't have fractional register classes, but we also don't have fractional whole-register moves? So maybe combine this with isRVVWholeLoadStore from RISCVInstrInfo.cpp for the selection, then use the register class?

(I'm just looking for ways to remove redundancy. If we can share code with other users, that increases the odds we'd catch a bug.)

Again, this is minor, and I'd be happy to iterate on this type of topic in tree.
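
A rough sketch of that idea, deriving the EMUL from the operand's register class; the helper name is made up here, and the return shape mirrors OperandInfo's (value, fractional) pair.

// Whole-register loads/stores: take EMUL from the register class rather
// than hard-coding it per opcode.
static std::pair<unsigned, bool>
getWholeRegisterEMUL(Register R, const MachineRegisterInfo *MRI) {
  const TargetRegisterClass *RC = MRI->getRegClass(R);
  if (RISCV::VRRegClass.hasSubClassEq(RC))
    return {1, /*Fractional=*/false};
  if (RISCV::VRM2RegClass.hasSubClassEq(RC))
    return {2, false};
  if (RISCV::VRM4RegClass.hasSubClassEq(RC))
    return {4, false};
  assert(RISCV::VRM8RegClass.hasSubClassEq(RC) && "unexpected register class");
  return {8, false};
}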

case RISCV::VWSUB_VV:
case RISCV::VWSUB_VX:
case RISCV::VWSLL_VI: {
unsigned Log2EEW = IsMODef ? MILog2SEW + 1 : MILog2SEW;
Collaborator
You should be able to drive these off the recently added TableGen flag for DestEEW.

Contributor Author
Is the suggestion here to use this:

unsigned Log2EEW = IsMODef ? getDestLog2EEW(MILog2SEW) : MILog2SEW;

Is that a simplification?

case RISCV::VFREDMIN_VS:
case RISCV::VFREDOSUM_VS:
case RISCV::VFREDUSUM_VS:
return OperandInfo(MIVLMul, MILog2SEW);
Collaborator
I think this (and the other reduction) result is wrong. Specifically, the destination has scalar LMUL, but the source has something else. The fact that we're returning the same regardless of op number here can't be correct.

Contributor Author

From the RVV spec:

The unsigned vwredsumu.vs instruction zero-extends the SEW-wide vector elements before summing them, then adds the 2*SEW-width scalar element, and stores the result in a 2*SEW-width scalar element.

The vwredsum.vs instruction sign-extends the SEW-wide vector elements before summing them.

The scalar input and output operands are held in element 0 of a single vector register, not a vector register group, so any vector register can be the scalar source or destination of a vector reduction regardless of LMUL setting.

Although the result is a single scalar, it's still written to a vector register. We can see here that the RVV spec refers to the elements it writes to have a SEW. I think it makes less sense to talk about LMUL here, and I think that I said to use the same relation that the SEW had for simplicity. As used, I don't think it would cause any harm since the LMUL being reported is larger than what is actually getting used (just a single element of a vector register).

Truthfully, I can't recall why I wrote it such that 14.1 instructions return Unknown, but 14.2-14.4 don't. Some ideas:

  • return Unknown
  • keep as is
  • modify OperandInfo to describe this case

I probably prefer the first option for now and we can always improve it in the future. Thoughts?

Contributor Author

In 1c35bd5 I made it return Unknown. Then, in 6785417, I updated the patch to modify OperandInfo to describe a vector register being used as a scalar.

michaelmaitland added a commit that referenced this pull request Sep 17, 2024
These tests are good candidates for VL optimization. This is a pre-commit for
PR #108640, but they could probably also be improved by the peephole VL
optimizations.
@michaelmaitland (Contributor Author) commented Sep 17, 2024

As I am working on responding to reviews, one thing has become clear: we need better tests for this. I'm not sure what the best way to test things is. It feels like there are two things here:

  1. getOperandInfo
  2. the actual VL optimization

I wonder if it would be a good idea to add unit tests for getOperandInfo. We would go through each operand of each instruction and verify that it had the expected OperandInfo. This is tedious, and I'd like to avoid it if we are going to rip it out in favor of tablegen.

Then for the actual VL optimization, I think it would be best to use lit tests to get code coverage of all paths (not including all paths through getOperandInfo, since that would be tested on its own).

Does anyone have different ideas or does this sound okay?

@preames (Collaborator) commented Sep 17, 2024

Does anyone have different ideas or does this sound okay?

First, I want to explicitly say again that I'd welcome this iteration happening in tree.

Second, I don't tend to be a huge fan of C++ unit tests for this kind of API because they end up just repeating the implementation, and are hard to update. If you want to go in this direction, maybe something along the lines of a printer pass so that diffs can be autogenerated and we only have to (manually) change one place?

Third, part of the problem with getOperandInfo is that it replicates information in other places. In particular, I think this gets a lot simpler if restructured using the recently added DestEEW mechanism.
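
As a rough illustration of the printer-pass idea (exposing getOperandInfo and giving OperandInfo a print method are both assumptions):

// Debug-only dump so getOperandInfo expectations can live in autogenerated
// lit CHECK lines rather than C++ unit tests.
static void printOperandInfo(const MachineFunction &MF,
                             const MachineRegisterInfo *MRI, raw_ostream &OS) {
  for (const MachineBasicBlock &MBB : MF)
    for (const MachineInstr &MI : MBB) {
      if (!RISCVVPseudosTable::getPseudoInfo(MI.getOpcode()))
        continue;
      OS << MI;
      for (const MachineOperand &MO : MI.operands())
        if (MO.isReg() && isVectorRegClass(MO.getReg(), MRI))
          getOperandInfo(MI, MO, MRI).print(OS); // print() is assumed
    }
}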

@michaelmaitland (Contributor Author)

First, I want to explicitly say again that I'd welcome this iteration happening in tree.

If everyone is on board for this approach, I think that would be good. But I want to highlight here that I think the test coverage is not in a place where I'd feel comfortable enabling this pass by default. I think this is a must-have before turning it on.

hamphet pushed a commit to hamphet/llvm-project that referenced this pull request Sep 18, 2024
These tests are good candidates for VL optimization. This is a pre-commit for
PR llvm#108640, but they could probably also be improved by the peephole VL
optimizations.
@lukel97 (Contributor) commented Sep 18, 2024

If everyone is on board for this approach, I think that would be good. But I want to highlight here that I think the test coverage is not in a place where I'd feel comfortable enabling this pass by default. I think this is a must-have before turning it on.

I would be happy to iterate in tree as well, we can leave it off by default and add some .mir tests. As for moving getOperandInfo out into TableGen, it was mostly a refactoring idea and I don't think we should block this PR on it. I can take a stab at it this week in parallel and see if it's feasible or not.

case RISCV::VREDXOR_VS: {
if (IsMODef || IsOp1)
return OperandInfo(MILog2SEW);
if (IsOp2)
Contributor Author

I have to check whether this is the correct operand, or if it should be op3. Will add tests.

if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
checkUsers(CommonVL, CanReduceVL, MI);
else {
CommonVL = MI.getOperand(RISCVII::getVLOpNum(MI.getDesc())).getReg();
@topperc (Collaborator) commented Sep 18, 2024

Won't this just change the VL to its own VL? What's the point? And if that's all it does, you need to detect that below and not set MadeChange or add the inputs to the worklist because it didn't really change.

Contributor Author

I removed it.

std::optional<Register> CommonVL;
bool CanReduceVL = true;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
checkUsers(CommonVL, CanReduceVL, MI);
Collaborator

Make checkUsers return CanReduceVL.
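
That is, something along these lines (a sketch of the signature change only):

std::optional<Register> CommonVL;
bool CanReduceVL = false;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
  CanReduceVL = checkUsers(CommonVL, MI);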

// vec_reg_a = ..., vl_op, sew_op
// vec_reg_a = ..., vl_op, sew_op
// scalar_reg = vector_instr vec_reg_a, vec_reg_b, vl_op, sew_op
// We'd like to reduce the vl_op on vector_instr, despite it producing
Collaborator

Is this comment still relevant?

@michaelmaitland (Contributor Author)

ping

@preames (Collaborator) left a comment

LGTM

As discussed in today's sync call, we have consensus to land this in tree (off by default!) and continue iterating. There is a good amount of code sharing and improved testing which needs to happen before this can be enabled by default. All parties are aware, and enabling that incrementally is the goal of landing now.

@ppenzin commented Oct 10, 2024

I've run this by our arch folks, and we are also in favor of the change.

@lukel97 (Contributor) left a comment

LGTM to land as is, just spotted something I want to point out before I forget

Comment on lines 401 to 402
case RISCV::VSMUL_VV:
case RISCV::VSMUL_VX:
Contributor

Are these not single width? I thought the multiply happened in SEW*2 but the destination is still SEW

Contributor Author

The signed fractional multiply instruction produces a 2*SEW product of the two SEW inputs, then shifts the result right by SEW-1 bits, rounding these bits according to vxrm, then saturates the result to fit into SEW bits.

That's a great catch. I'll add a test specifically for these before merging.

Contributor

Feel free to land first and add a test later, don't let my reviews block this!

Contributor Author

I've removed it from getOperandInfo; we can add it back with tests later.
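
For reference, a small self-contained worked example of the fixed-point behavior quoted above (SEW=8, rounding and saturation elided), showing why the stored element is SEW wide even though the intermediate product is 2*SEW:

#include <cstdint>
#include <cstdio>

int main() {
  int8_t a = 0x40, b = 0x40;              // Q0.7 fixed point: 0x40 == 0.5
  int16_t prod = int16_t(a) * int16_t(b); // 2*SEW-wide intermediate: 0x1000
  int8_t res = int8_t(prod >> 7);         // shift right by SEW-1 -> 0x20 == 0.25
  printf("%#x\n", res);                   // the destination element is SEW wide
  return 0;
}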


// All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
if (isMaskOperand(MI, MO, MRI))
return OperandInfo(RISCVVType::getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
Contributor

Do we handle compare instructions e.g. vmseq.vv where the destination operand will be a mask register, but not constrained to V0?

Contributor Author

From isMaskOperand:

  return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;

It looks like we only support the v0 case for now.
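
(For a concrete instance of the mask rule quoted above: with SEW=32 and LMUL=2 taken from the instruction, a mask operand has EEW=1, so EMUL = (EEW/SEW)*LMUL = (1/32)*2 = 1/16, which is smaller than MF8 and is why OperandInfo stores EMUL as a (value, fractional) pair rather than a RISCVII::VLMUL.)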

@michaelmaitland michaelmaitland merged commit 1c94388 into llvm:main Oct 11, 2024
5 of 7 checks passed
ichaer added a commit to splunk/ichaer-llvm-project that referenced this pull request Oct 11, 2024
…ent-indentonly

* llvm-trunk/main: (6379 commits)
  [gn build] Port 1c94388
  [RISCV] Introduce VLOptimizer pass (llvm#108640)
  [mlir][vector] Add more tests for ConvertVectorToLLVM (7/n) (llvm#111895)
  [libc++] Add output groups to run-buildbot (llvm#111739)
  [libc++abi] Remove unused LIBCXXABI_LIBCXX_INCLUDES CMake option (llvm#111824)
  [clang] Ignore inline namespace for `hasName` (llvm#109147)
  [AArch64] Disable consecutive store merging when Neon is unavailable (llvm#111519)
  [lldb] Fix finding make tool for tests (llvm#111980)
  Turn `-Wdeprecated-literal-operator` on by default (llvm#111027)
  [AMDGPU] Rewrite RegSeqNames using !foreach. NFC. (llvm#111994)
  Revert "Reland: [clang] Finish implementation of P0522 (llvm#111711)"
  Revert "[clang] CWG2398: improve overload resolution backwards compat (llvm#107350)"
  Revert "[clang] Implement TTP P0522 pack matching for deduced function template calls. (llvm#111457)"
  [Clang] Replace Intrinsic::getDeclaration with getOrInsertDeclaration (llvm#111990)
  Revert "[NVPTX] Prefer prmt.b32 over bfi.b32 (llvm#110766)"
  [RISCV] Add DAG combine to turn (sub (shl X, 8-Y), (shr X, Y)) into orc.b (llvm#111828)
  [libc] Fix compilation of new trig functions (llvm#111987)
  [NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (llvm#111752)
  [NFC][CodingStandard] Add additional example for if-else brace rule (llvm#111733)
  CodeGen: Remove redundant REQUIRES registered-target from tests (llvm#111982)
  ...
@llvm-ci (Collaborator) commented Oct 11, 2024

LLVM Buildbot has detected a new failure on builder bolt-x86_64-ubuntu-dylib running on bolt-worker while building llvm at step 6 "test-build-bolt-check-bolt".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/119/builds/2572

Here is the relevant piece of the build log for reference:
Step 6 (test-build-bolt-check-bolt) failure: test (failure)
******************** TEST 'BOLT :: perf2bolt/perf_test.test' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 5: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -fuse-ld=lld -Wl,--script=/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.lds -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -fuse-ld=lld -Wl,--script=/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.lds -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
RUN: at line 6: perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
+ perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
info: Using a maximum frequency rate of 2000 Hz
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 (18 samples) ]
RUN: at line 7: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp3 -nl -ignore-build-id 2>&1 | /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp3 -nl -ignore-build-id
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
RUN: at line 12: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -no-pie -fuse-ld=lld -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -no-pie -fuse-ld=lld -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
RUN: at line 13: perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
+ perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
info: Using a maximum frequency rate of 2000 Hz
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 (9 samples) ]
RUN: at line 14: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4 -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp6 -nl -ignore-build-id 2>&1 | /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4 -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp6 -nl -ignore-build-id
/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test:10:12: error: CHECK-NOT: excluded string found in input
CHECK-NOT: !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection.
           ^
<stdin>:26:2: note: found here
 !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may be ineffective for improving performance.
 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Input file: <stdin>
Check file: /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test

-dump-input=help explains the following input dump.

Input was:
<<<<<<
        .
        .
        .
       21: BOLT-WARNING: Running parallel work of 0 estimated cost, will switch to trivial scheduling. 
       22: PERF2BOLT: processing basic events (without LBR)... 
       23: PERF2BOLT: read 9 samples 
       24: PERF2BOLT: out of range samples recorded in unknown regions: 9 (100.0%) 
       25:  
       26:  !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may be ineffective for improving performance. 
not:10      !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                   error: no match expected
       27:  
...

@llvm-ci (Collaborator) commented Oct 11, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1360

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/35/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-9448-35-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=35 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024