
[RISCV] Introduce VLOptimizer pass #108640

Merged: 17 commits into llvm:main on Oct 11, 2024

Conversation

@michaelmaitland (Contributor) commented Sep 13, 2024

The purpose of this optimization is to make the VL argument of instructions that have one as small as possible. This is implemented by visiting each instruction in reverse order and checking, for each instruction with a VL argument, whether that VL can be reduced.

By putting this pass before VSETVLI insertion, we see three kinds of changes to generated code:

  1. Eliminate VSETVLI instructions
  2. Reduce the VL toggle on VSETVLI instructions that also change vtype
  3. Reduce the VL set by a VSETVLI instruction

For safety, the set of supported instructions is currently a whitelist. In the future, we could add more instructions to isSupportedInstr to enable even more VL optimization.
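
At a high level, the per-instruction walk looks roughly like the sketch below. This is a condensed paraphrase of the pass rather than its exact control flow; tryReduceVL and isSupportedInstr are the entry points named above, everything else here is simplified.

bool RISCVVLOptimizer::runOnMachineFunction(MachineFunction &MF) {
  MRI = &MF.getRegInfo();
  bool MadeChange = false;
  for (MachineBasicBlock &MBB : MF)
    // Visit in reverse so a user's demanded VL is seen before its producers.
    for (MachineInstr &MI : llvm::reverse(MBB))
      if (isSupportedInstr(MI))
        MadeChange |= tryReduceVL(MI);
  return MadeChange;
}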

We originally wrote this pass because vector GEP instructions do not take a VL, which leads us to emit code that uses VL=VLMAX to implement GEP in the RISC-V backend. As a result, some of the vector instructions will write to lanes, specifically between the intended VL and VLMAX, that will never be read. As an alternative to this pass, we considered adding a vector predicated GEP instruction, but this would not fit well into the intrinsic type system since GEP has a variable number of arguments, each with arbitrary types. The second approach we considered was to put this pass after VSETVLI insertion, but we found that it was more difficult to recognize optimization opportunities, especially across basic block boundaries -- the data flow analysis was also a bit more expensive and complex.

While this pass solves the GEP problem, we have expanded it to handle more cases of VL optimization, and there is opportunity for the analysis to be improved to enable even more optimization. We have a few follow up patches to post, but figured this would be a good start.

Co-authored-by: Craig Topper [email protected]
Co-authored-by: Kito Cheng [email protected]

@llvmbot (Collaborator) commented Sep 13, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes


In our downstream compiler, this was able to optimize > 5000 VLs on spec2006 and > 6000 VLs on spec2017. These measurements were taken with force-tail-folding-style=data-with-evl during vectorization. I don't have numbers for the upstream compiler.


Patch is 242.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108640.diff

29 Files Affected:

  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/RISCV.h (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+8-1)
  • (added) llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp (+1468)
  • (modified) llvm/test/CodeGen/RISCV/O3-pipeline.ll (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/narrow-shift-extend.ll (+27-27)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-int-vp.ll (+18-36)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vdiv-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vdivu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwmacc-vp.ll (+14-28)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwmsac-vp.ll (+12-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmacc-vp.ll (+15-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmsac-vp.ll (+15-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmax-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmaxu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmin-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vminu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmul-vp.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpgather-sdnode.ll (+174-209)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpscatter-sdnode.ll (+166-199)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vrem-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vremu-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-insert-crossbb.ll (-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vshl-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsra-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsrl-vp.ll (-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vssub-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vssubu-vp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vwsll-vp.ll (+30-60)
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 124bc239451aed..46ef94b4dd2931 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -59,6 +59,7 @@ add_llvm_target(RISCVCodeGen
   RISCVTargetObjectFile.cpp
   RISCVTargetTransformInfo.cpp
   RISCVVectorPeephole.cpp
+  RISCVVLOptimizer.cpp
   GISel/RISCVCallLowering.cpp
   GISel/RISCVInstructionSelector.cpp
   GISel/RISCVLegalizerInfo.cpp
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 5a94ada8f8dd46..651119d10d1119 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -99,6 +99,10 @@ void initializeRISCVO0PreLegalizerCombinerPass(PassRegistry &);
 
 FunctionPass *createRISCVPreLegalizerCombiner();
 void initializeRISCVPreLegalizerCombinerPass(PassRegistry &);
+
+FunctionPass *createRISCVVLOptimizerPass();
+void initializeRISCVVLOptimizerPass(PassRegistry &);
+
 } // namespace llvm
 
 #endif
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 794df2212dfa53..bfedc6b177332e 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -103,6 +103,10 @@ static cl::opt<bool> EnableVSETVLIAfterRVVRegAlloc(
     cl::desc("Insert vsetvls after vector register allocation"),
     cl::init(true));
 
+static cl::opt<bool> EnableVLOptimizer("riscv-enable-vloptimizer",
+                                       cl::desc("Enable the VL Optimizer pass"),
+                                       cl::init(true), cl::Hidden);
+
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeRISCVTarget() {
   RegisterTargetMachine<RISCVTargetMachine> X(getTheRISCV32Target());
   RegisterTargetMachine<RISCVTargetMachine> Y(getTheRISCV64Target());
@@ -550,8 +554,11 @@ void RISCVPassConfig::addMachineSSAOptimization() {
 
 void RISCVPassConfig::addPreRegAlloc() {
   addPass(createRISCVPreRAExpandPseudoPass());
-  if (TM->getOptLevel() != CodeGenOptLevel::None)
+  if (TM->getOptLevel() != CodeGenOptLevel::None) {
     addPass(createRISCVMergeBaseOffsetOptPass());
+    if (EnableVLOptimizer)
+      addPass(createRISCVVLOptimizerPass());
+  }
 
   addPass(createRISCVInsertReadWriteCSRPass());
   addPass(createRISCVInsertWriteVXRMPass());
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
new file mode 100644
index 00000000000000..d25db9b38bec44
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -0,0 +1,1468 @@
+//===-------------- RISCVVLOptimizer.cpp - VL Optimizer ------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// This pass reduces the VL where possible at the MI level, before VSETVLI
+// instructions are inserted.
+//
+// The purpose of this optimization is to make the VL argument, for instructions
+// that have a VL argument, as small as possible. This is implemented by
+// visiting each instruction in reverse order and checking that if it has a VL
+// argument, whether the VL can be reduced.
+//
+//===---------------------------------------------------------------------===//
+
+#include "RISCV.h"
+#include "RISCVMachineFunctionInfo.h"
+#include "RISCVSubtarget.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/InitializePasses.h"
+
+#include <algorithm>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-vl-optimizer"
+
+namespace {
+
+class RISCVVLOptimizer : public MachineFunctionPass {
+  const MachineRegisterInfo *MRI;
+  const MachineDominatorTree *MDT;
+
+public:
+  static char ID;
+
+  RISCVVLOptimizer() : MachineFunctionPass(ID) {
+    initializeRISCVVLOptimizerPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+    AU.addRequired<MachineDominatorTreeWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+
+  StringRef getPassName() const override { return "RISC-V VL Optimizer"; }
+
+private:
+  bool tryReduceVL(MachineInstr &MI);
+};
+
+} // end anonymous namespace
+
+char RISCVVLOptimizer::ID = 0;
+INITIALIZE_PASS_BEGIN(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer",
+                      false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
+INITIALIZE_PASS_END(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer", false,
+                    false)
+
+FunctionPass *llvm::createRISCVVLOptimizerPass() {
+  return new RISCVVLOptimizer();
+}
+
+/// Return true if R is a physical or virtual vector register, false otherwise.
+static bool isVectorRegClass(Register R, const MachineRegisterInfo *MRI) {
+  if (R.isPhysical())
+    return RISCV::VRRegClass.contains(R);
+  const TargetRegisterClass *RC = MRI->getRegClass(R);
+  return RISCV::VRRegClass.hasSubClassEq(RC) ||
+         RISCV::VRM2RegClass.hasSubClassEq(RC) ||
+         RISCV::VRM4RegClass.hasSubClassEq(RC) ||
+         RISCV::VRM8RegClass.hasSubClassEq(RC);
+}
+
+/// Represents the EMUL and EEW of a MachineOperand.
+struct OperandInfo {
+  enum class State {
+    Unknown,
+    Known,
+  } S;
+
+  // Represent as 1,2,4,8, ... and fractional indicator. This is because
+  // EMUL can take on values that don't map to RISCVII::VLMUL values exactly.
+  // For example, a mask operand can have an EMUL less than MF8.
+  std::pair<unsigned, bool> EMUL;
+
+  unsigned Log2EEW;
+
+  OperandInfo(RISCVII::VLMUL EMUL, unsigned Log2EEW)
+      : S(State::Known), EMUL(RISCVVType::decodeVLMUL(EMUL)), Log2EEW(Log2EEW) {
+  }
+
+  OperandInfo(std::pair<unsigned, bool> EMUL, unsigned Log2EEW)
+      : S(State::Known), EMUL(EMUL), Log2EEW(Log2EEW) {}
+
+  OperandInfo(State S) : S(S) {
+    assert(S != State::Known &&
+           "This constructor may only be used to construct "
+           "an Unknown OperandInfo");
+  }
+
+  bool isUnknown() { return S == State::Unknown; }
+  bool isKnown() { return S == State::Known; }
+
+  static bool EMULAndEEWAreEqual(OperandInfo A, OperandInfo B) {
+    assert(A.isKnown() && B.isKnown() && "Both operands must be known");
+    return A.Log2EEW == B.Log2EEW && A.EMUL.first == B.EMUL.first &&
+           A.EMUL.second == B.EMUL.second;
+  }
+};
+
+/// Return the RISCVII::VLMUL that is two times VLMul.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_8.
+static RISCVII::VLMUL twoTimesVLMUL(RISCVII::VLMUL VLMul) {
+  switch (VLMul) {
+  case RISCVII::VLMUL::LMUL_F8:
+    return RISCVII::VLMUL::LMUL_F4;
+  case RISCVII::VLMUL::LMUL_F4:
+    return RISCVII::VLMUL::LMUL_F2;
+  case RISCVII::VLMUL::LMUL_F2:
+    return RISCVII::VLMUL::LMUL_1;
+  case RISCVII::VLMUL::LMUL_1:
+    return RISCVII::VLMUL::LMUL_2;
+  case RISCVII::VLMUL::LMUL_2:
+    return RISCVII::VLMUL::LMUL_4;
+  case RISCVII::VLMUL::LMUL_4:
+    return RISCVII::VLMUL::LMUL_8;
+  case RISCVII::VLMUL::LMUL_8:
+  default:
+    llvm_unreachable("Could not multiply VLMul by 2");
+  }
+}
+
+/// Return the RISCVII::VLMUL that is VLMul / 2.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_MF8.
+static RISCVII::VLMUL halfVLMUL(RISCVII::VLMUL VLMul) {
+  switch (VLMul) {
+  case RISCVII::VLMUL::LMUL_F4:
+    return RISCVII::VLMUL::LMUL_F8;
+  case RISCVII::VLMUL::LMUL_F2:
+    return RISCVII::VLMUL::LMUL_F4;
+  case RISCVII::VLMUL::LMUL_1:
+    return RISCVII::VLMUL::LMUL_F2;
+  case RISCVII::VLMUL::LMUL_2:
+    return RISCVII::VLMUL::LMUL_1;
+  case RISCVII::VLMUL::LMUL_4:
+    return RISCVII::VLMUL::LMUL_2;
+  case RISCVII::VLMUL::LMUL_8:
+    return RISCVII::VLMUL::LMUL_4;
+  case RISCVII::VLMUL::LMUL_F8:
+  default:
+    llvm_unreachable("Could not divide VLMul by 2");
+  }
+}
+
+/// Return EMUL = (EEW / SEW) * LMUL where EEW comes from Log2EEW and LMUL and
+/// SEW are from the TSFlags of MI.
+static std::pair<unsigned, bool>
+getEMULEqualsEEWDivSEWTimesLMUL(unsigned Log2EEW, const MachineInstr &MI) {
+  RISCVII::VLMUL MIVLMUL = RISCVII::getLMul(MI.getDesc().TSFlags);
+  auto [MILMUL, MILMULIsFractional] = RISCVVType::decodeVLMUL(MIVLMUL);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+  unsigned MISEW = 1 << MILog2SEW;
+
+  unsigned EEW = 1 << Log2EEW;
+  // Calculate (EEW/SEW)*LMUL preserving fractions less than 1. Use GCD
+  // to put fraction in simplest form.
+  unsigned Num = EEW, Denom = MISEW;
+  int GCD = MILMULIsFractional ? std::gcd(Num, Denom * MILMUL)
+                               : std::gcd(Num * MILMUL, Denom);
+  Num = MILMULIsFractional ? Num / GCD : Num * MILMUL / GCD;
+  Denom = MILMULIsFractional ? Denom * MILMUL / GCD : Denom / GCD;
+  return std::make_pair(Num > Denom ? Num : Denom, Denom > Num);
+}
+
+/// An index segment load or store operand has the form v.*seg<nf>ei<eeew>.v.
+/// Data has EEW=SEW, EMUL=LMUL. Index has EEW=<eew>, EMUL=(EEW/SEW)*LMUL. LMUL
+/// and SEW comes from TSFlags of MI.
+static OperandInfo
+getIndexSegmentLoadStoreOperandInfo(unsigned Log2EEW, const MachineInstr &MI,
+                                    const MachineOperand &MO) {
+  // Operand 0 is data register
+  // Data vector register group has EEW=SEW, EMUL=LMUL.
+  if (MO.getOperandNo() == 0) {
+    RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+    unsigned MILog2SEW =
+        MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+    return OperandInfo(MIVLMul, MILog2SEW);
+  }
+
+  // Operand 1 is index vector register
+  // v.*seg<nf>ei<eeew>.v
+  // Index vector register group has EEW=<eew>, EMUL=(EEW/SEW)*LMUL.
+  if (MO.getOperandNo() == 2)
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+
+  llvm_unreachable("Could not get OperandInfo for non-vector register of an "
+                   "indexed segment load or store instruction");
+}
+
+/// Dest has EEW=SEW and EMUL=LMUL. Source EEW=SEW/Factor (i.e. F2 => EEW/2).
+/// Source has EMUL=(EEW/SEW)*LMUL. LMUL and SEW comes from TSFlags of MI.
+static OperandInfo getIntegerExtensionOperandInfo(unsigned Factor,
+                                                  const MachineInstr &MI,
+                                                  const MachineOperand &MO) {
+  RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+
+  if (MO.getOperandNo() == 0)
+    return OperandInfo(MIVLMul, MILog2SEW);
+
+  unsigned MISEW = 1 << MILog2SEW;
+  unsigned EEW = MISEW / Factor;
+  unsigned Log2EEW = Log2_32(EEW);
+
+  return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+}
+
+/// Check whether MO is a mask operand of MI.
+static bool isMaskOperand(const MachineInstr &MI, const MachineOperand &MO,
+                          const MachineRegisterInfo *MRI) {
+
+  if (!MO.isReg() || !isVectorRegClass(MO.getReg(), MRI))
+    return false;
+
+  const MCInstrDesc &Desc = MI.getDesc();
+  return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;
+}
+
+/// Return the OperandInfo for MO, which is an operand of MI.
+static OperandInfo getOperandInfo(const MachineInstr &MI,
+                                  const MachineOperand &MO,
+                                  const MachineRegisterInfo *MRI) {
+  const RISCVVPseudosTable::PseudoInfo *RVV =
+      RISCVVPseudosTable::getPseudoInfo(MI.getOpcode());
+  assert(RVV && "Could not find MI in PseudoTable");
+
+  // MI has a VLMUL and SEW associated with it. The RVV specification defines
+  // the LMUL and SEW of each operand and definition in relation to MI.VLMUL and
+  // MI.SEW.
+  RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+  unsigned MILog2SEW =
+      MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+  bool IsMODef = MO.getOperandNo() == 0;
+
+  // All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
+  if (isMaskOperand(MI, MO, MRI))
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
+
+  // TODO: Pseudos that end in _MASK or _TU can have a merge operand.
+  // We bail out early for instructions that have merge operands for now.
+  // Merge operand is first operand after defs.
+  if (MO.getOperandNo() == MI.getNumExplicitDefs() && MO.isReg() && MO.isTied())
+    return OperandInfo(OperandInfo::State::Unknown);
+
+  // switch against BaseInstr to reduce number of cases that need to be
+  // considered.
+  switch (RVV->BaseInstr) {
+
+  // 6. Configuration-Setting Instructions
+  // Configuration setting instructions do not read or write vector registers
+  case RISCV::VSETIVLI:
+  case RISCV::VSETVL:
+  case RISCV::VSETVLI:
+    llvm_unreachable("Configuration setting instructions do not read or write "
+                     "vector registers");
+
+  // 7. Vector Loads and Stores
+  // 7.4. Vector Unit-Stride Instructions
+  // 7.5. Vector Strided Instructions
+  // 7.7. Unit-stride Fault-Only-First Loads
+  /// Dest EEW encoded in the instruction and EMUL=(EEW/SEW)*LMUL
+  case RISCV::VLE8_V:
+  case RISCV::VSE8_V:
+  case RISCV::VLM_V:
+  case RISCV::VSM_V:
+  case RISCV::VLSE8_V:
+  case RISCV::VSSE8_V:
+  case RISCV::VLE8FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+  case RISCV::VLE16_V:
+  case RISCV::VSE16_V:
+  case RISCV::VLSE16_V:
+  case RISCV::VSSE16_V:
+  case RISCV::VLE16FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+  case RISCV::VLE32_V:
+  case RISCV::VSE32_V:
+  case RISCV::VLSE32_V:
+  case RISCV::VSSE32_V:
+  case RISCV::VLE32FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+  case RISCV::VLE64_V:
+  case RISCV::VSE64_V:
+  case RISCV::VLSE64_V:
+  case RISCV::VSSE64_V:
+  case RISCV::VLE64FF_V:
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+  // 7.6. Vector Indexed Instructions
+  // Data EEW=SEW, EMUL=LMUL. Index EEW=<eew> and EMUL=(EEW/SEW)*LMUL
+  case RISCV::VLUXEI8_V:
+  case RISCV::VLOXEI8_V:
+  case RISCV::VSUXEI8_V:
+  case RISCV::VSOXEI8_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+  case RISCV::VLUXEI16_V:
+  case RISCV::VLOXEI16_V:
+  case RISCV::VSUXEI16_V:
+  case RISCV::VSOXEI16_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+  case RISCV::VLUXEI32_V:
+  case RISCV::VLOXEI32_V:
+  case RISCV::VSUXEI32_V:
+  case RISCV::VSOXEI32_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+  case RISCV::VLUXEI64_V:
+  case RISCV::VLOXEI64_V:
+  case RISCV::VSUXEI64_V:
+  case RISCV::VSOXEI64_V:
+    if (MO.getOperandNo() == 0)
+      return OperandInfo(MIVLMul, MILog2SEW);
+    return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+  // 7.8. Vector Load/Store Segment Instructions
+  // 7.8.1. Vector Unit-Stride Segment Loads and Stores
+  // v.*seg<nf>e<eew>.*
+  // EEW=eew, EMUL=LMUL
+  case RISCV::VLSEG2E8_V:
+  case RISCV::VLSEG2E8FF_V:
+  case RISCV::VLSEG3E8_V:
+  case RISCV::VLSEG3E8FF_V:
+  case RISCV::VLSEG4E8_V:
+  case RISCV::VLSEG4E8FF_V:
+  case RISCV::VLSEG5E8_V:
+  case RISCV::VLSEG5E8FF_V:
+  case RISCV::VLSEG6E8_V:
+  case RISCV::VLSEG6E8FF_V:
+  case RISCV::VLSEG7E8_V:
+  case RISCV::VLSEG7E8FF_V:
+  case RISCV::VLSEG8E8_V:
+  case RISCV::VLSEG8E8FF_V:
+  case RISCV::VSSEG2E8_V:
+  case RISCV::VSSEG3E8_V:
+  case RISCV::VSSEG4E8_V:
+  case RISCV::VSSEG5E8_V:
+  case RISCV::VSSEG6E8_V:
+  case RISCV::VSSEG7E8_V:
+  case RISCV::VSSEG8E8_V:
+    return OperandInfo(MIVLMul, 3);
+  case RISCV::VLSEG2E16_V:
+  case RISCV::VLSEG2E16FF_V:
+  case RISCV::VLSEG3E16_V:
+  case RISCV::VLSEG3E16FF_V:
+  case RISCV::VLSEG4E16_V:
+  case RISCV::VLSEG4E16FF_V:
+  case RISCV::VLSEG5E16_V:
+  case RISCV::VLSEG5E16FF_V:
+  case RISCV::VLSEG6E16_V:
+  case RISCV::VLSEG6E16FF_V:
+  case RISCV::VLSEG7E16_V:
+  case RISCV::VLSEG7E16FF_V:
+  case RISCV::VLSEG8E16_V:
+  case RISCV::VLSEG8E16FF_V:
+  case RISCV::VSSEG2E16_V:
+  case RISCV::VSSEG3E16_V:
+  case RISCV::VSSEG4E16_V:
+  case RISCV::VSSEG5E16_V:
+  case RISCV::VSSEG6E16_V:
+  case RISCV::VSSEG7E16_V:
+  case RISCV::VSSEG8E16_V:
+    return OperandInfo(MIVLMul, 4);
+  case RISCV::VLSEG2E32_V:
+  case RISCV::VLSEG2E32FF_V:
+  case RISCV::VLSEG3E32_V:
+  case RISCV::VLSEG3E32FF_V:
+  case RISCV::VLSEG4E32_V:
+  case RISCV::VLSEG4E32FF_V:
+  case RISCV::VLSEG5E32_V:
+  case RISCV::VLSEG5E32FF_V:
+  case RISCV::VLSEG6E32_V:
+  case RISCV::VLSEG6E32FF_V:
+  case RISCV::VLSEG7E32_V:
+  case RISCV::VLSEG7E32FF_V:
+  case RISCV::VLSEG8E32_V:
+  case RISCV::VLSEG8E32FF_V:
+  case RISCV::VSSEG2E32_V:
+  case RISCV::VSSEG3E32_V:
+  case RISCV::VSSEG4E32_V:
+  case RISCV::VSSEG5E32_V:
+  case RISCV::VSSEG6E32_V:
+  case RISCV::VSSEG7E32_V:
+  case RISCV::VSSEG8E32_V:
+    return OperandInfo(MIVLMul, 5);
+  case RISCV::VLSEG2E64_V:
+  case RISCV::VLSEG2E64FF_V:
+  case RISCV::VLSEG3E64_V:
+  case RISCV::VLSEG3E64FF_V:
+  case RISCV::VLSEG4E64_V:
+  case RISCV::VLSEG4E64FF_V:
+  case RISCV::VLSEG5E64_V:
+  case RISCV::VLSEG5E64FF_V:
+  case RISCV::VLSEG6E64_V:
+  case RISCV::VLSEG6E64FF_V:
+  case RISCV::VLSEG7E64_V:
+  case RISCV::VLSEG7E64FF_V:
+  case RISCV::VLSEG8E64_V:
+  case RISCV::VLSEG8E64FF_V:
+  case RISCV::VSSEG2E64_V:
+  case RISCV::VSSEG3E64_V:
+  case RISCV::VSSEG4E64_V:
+  case RISCV::VSSEG5E64_V:
+  case RISCV::VSSEG6E64_V:
+  case RISCV::VSSEG7E64_V:
+  case RISCV::VSSEG8E64_V:
+    return OperandInfo(MIVLMul, 6);
+
+  // 7.8.2. Vector Strided Segment Loads and Stores
+  case RISCV::VLSSEG2E8_V:
+  case RISCV::VLSSEG3E8_V:
+  case RISCV::VLSSEG4E8_V:
+  case RISCV::VLSSEG5E8_V:
+  case RISCV::VLSSEG6E8_V:
+  case RISCV::VLSSEG7E8_V:
+  case RISCV::VLSSEG8E8_V:
+  case RISCV::VSSSEG2E8_V:
+  case RISCV::VSSSEG3E8_V:
+  case RISCV::VSSSEG4E8_V:
+  case RISCV::VSSSEG5E8_V:
+  case RISCV::VSSSEG6E8_V:
+  case RISCV::VSSSEG7E8_V:
+  case RISCV::VSSSEG8E8_V:
+    return OperandInfo(MIVLMul, 3);
+  case RISCV::VLSSEG2E16_V:
+  case RISCV::VLSSEG3E16_V:
+  case RISCV::VLSSEG4E16_V:
+  case RISCV::VLSSEG5E16_V:
+  case RISCV::VLSSEG6E16_V:
+  case RISCV::VLSSEG7E16_V:
+  case RISCV::VLSSEG8E16_V:
+  case RISCV::VSSSEG2E16_V:
+  case RISCV::VSSSEG3E16_V:
+  case RISCV::VSSSEG4E16_V:
+  case RISCV::VSSSEG5E16_V:
+  case RISCV::VSSSEG6E16_V:
+  case RISCV::VSSSEG7E16_V:
+  case RISCV::VSSSEG8E16_V:
+    return OperandInfo(MIVLMul, 4);
+  case RISCV::VLSSEG2E32_V:
+  case RISCV::VLSSEG3E32_V:
+  case RISCV::VLSSEG4E32_V:
+  case RISCV::VLSSEG5E32_V:
+  case RISCV::VLSSEG6E32_V:
+  case RISCV::VLSSEG7E32_V:
+  case RISCV::VLSSEG8E32_V:
+  case RISCV::VSSSEG2E32_V:
+  case RISCV::VSSSEG3E32_V:
+  case RISCV::VSSSEG4E32_V:
+  case RISCV::VSSSEG5E32_V:
+  case RISCV::VSSSEG6E32_V:
+  case RISCV::VSSSEG7E32_V:
+  case RISCV::VSSSEG8E32_V:
+    return OperandInfo(MIVLMul, 5);
+  case RISCV::VLSSEG2E64_V:
+  case RISCV::VLSSEG3E64_V:
+  case RISCV::VLSSEG4E64_V:
+  case RISCV::VLSSEG5E64_V:
+  case RISCV::VLSSEG6E64_V:
+  case RISCV::VLSSEG7E64_V:
+  case RISCV::VLSSEG8E64_V:
+  case RISCV::VSSSEG2E64_V:
+  case RISCV::VSSSEG3E64_V:
+  case RISCV::VSSSEG4E64_V:
+  case RISCV::VSSSEG5E64_V:
+  case RISCV::VSSSEG6E64_V:
+  case RISCV::VSSSEG7E64_V:
+  case RISCV::VSSSEG8E64_V:
+    return OperandInfo(MIVLMul, 6);
+
+  // 7.8.3. Vector Indexed Segment Loads and Stores
+  case RISCV::VLUXSEG2EI8_V:
+  case RISCV::VLUXSEG3EI8_V:
+  case RISCV::VLUXSEG4EI8_V:
+  case RISCV::VLUXSEG5EI8_V:
+  case RISCV::VLUXSEG6EI8_V:
+  case RISCV::VLUXSEG7EI8_V:
+  case RISCV::VLUXSEG8EI8_V:
+  case RISCV::VLOXSEG2EI8_V:
+  case RISCV::VLOXSEG3EI8_V:
+  case RISCV::VLOXSEG4EI8_V:
+  case RISCV::VLOXSEG5EI8_V:
+  case RISCV::VLOXSEG6EI8_V:
+  case RISCV::VLOXSEG7EI8_V:
+  case RISCV::VLOXSEG8EI8_V:
+  case RISCV::VSUXSEG2EI8_V:
+  case RISCV::VSUXSEG3EI8_V:
+  case RISCV::VSUXSEG4EI8_V:
+  case RISCV::VSUXSEG5EI8_V:
+  case RISCV::VSUXSEG6EI8_V:
+  case RISCV::VSUXSEG7EI8_V:
+  case RISCV::VSUXSEG8EI8_V:
+  case RISCV::VSOXSEG2EI8_V...
[truncated]

@michaelmaitland (Contributor Author)

I have force-pushed because there had not been any review yet and I was missing some important bug fixes. I also opted to remove some functionality that I think would be better as a follow-up patch, to make the review here more straightforward.

github-actions bot commented Sep 13, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@preames (Collaborator) commented Sep 13, 2024

Hey, awesome! Really happy to see this making its way upstream. As I'm sure you know, I've been slowly expanding tryToReduceVL in RISCVVectorPeephole in this direction. We should definitely make sure we're sharing as much code as possible. I think we still need the early peephole (to drive the vmv.v.v conversions), but we should at least share code.

I'm going to take a real look at the code on Monday, but wanted to state initial support!

@wangpc-pp (Contributor) left a comment

Great! I just left some comments. I will have a detailed review later.

@preames (Collaborator) left a comment

A couple of detail comments inline, but I want to focus on high level review strategy.

Looking at the code for this, I spot a couple possible bugs (flagged in inline comments), but nothing really terrible code structure wise. I'd like to propose that we defer most of the code cleanup suggestions until after this is in tree. Specifically, I'm proposing the following:

  1. This patch is repurposed to add the transform pass off-by-default. We subset as needed to get something correct in tree with minimal delay.
  2. We iterate in tree to common code, and to add back any key features subset in step (1).
  3. We enable the pass in its own commit.

I'm very hesitant to do the usual style review here because there's so much room for subtle bugs to be introduced, and if we get into such a cycle the review could extend nearly indefinitely. I'm willing to assume good-faith follow-up here, and frankly will likely invest a bunch of time myself in getting this code into a long-term supportable state.

LLVM_DEBUG(dbgs() << "Try reduce VL for " << MI << "\n");
std::optional<Register> CommonVL;
bool CanReduceVL = true;
for (auto &UserOp : MRI->use_operands(MI.getOperand(0).getReg())) {
Collaborator

Visiting each user each time we visit something on the worklist (combined with an extra visit from the outer loop) seems potentially expensive. We could restructure this along the lines of an instcombine-style worklist which integrates the outer pass.

Contributor Author
Left a FIXME.
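
For illustration, here is a rough sketch of the instcombine-style worklist shape being suggested above, assuming helper names isCandidate and tryReduceVL and a simplified re-queue rule; it is not the posted implementation.

SetVector<MachineInstr *> Worklist;
for (MachineBasicBlock &MBB : MF)
  for (MachineInstr &MI : llvm::reverse(MBB))
    if (isCandidate(MI))
      Worklist.insert(&MI);

while (!Worklist.empty()) {
  MachineInstr &MI = *Worklist.pop_back_val();
  if (!tryReduceVL(MI))
    continue;
  // Only re-queue the producers of MI's vector sources; their users'
  // demanded VL may have just shrunk.
  for (const MachineOperand &MO : MI.uses())
    if (MO.isReg() && MO.getReg().isVirtual())
      if (MachineInstr *Def = MRI->getVRegDef(MO.getReg()))
        Worklist.insert(Def);
}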

// Instructions like reductions may use a vector register as a scalar
// register. In this case, we should treat it like a scalar register which
// does not impact the decision on whether to optimize VL.
if (isVectorOpUsedAsScalarOp(UserOp)) {
Collaborator
I think there might be a bug here. The reduction still needs the VL to be non-zero. If all other users have a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have its VL reduced) with a wrong scalar input value.

Contributor Author
I don't know if I follow your thinking here.

On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.

Contributor
If all other users have a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have its VL reduced) with a wrong scalar input value.

All the users have to have the same VL, so the reduction's VL would also be zero and it wouldn't read the scalar input value I think.

Collaborator
On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.

Zero at runtime, not statically known to be zero.

unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
// Looking for a register VL that isn't X0.
if (!VLOp.isReg() || VLOp.getReg() == RISCV::X0) {
Collaborator
The choice to only handle registers here is a surprising limitation. We can likely generalize this for immediates as well.

Contributor Author
We have this generalization written; I was going to post it as a follow-up.
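
For reference, a hedged sketch of what that generalization might look like around the quoted check; the immediate branch here is an assumption, not the follow-up patch.

unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
if (VLOp.isImm()) {
  // An immediate VL could participate too, e.g. by tracking CommonVL as a
  // MachineOperand (register or immediate) rather than a Register.
} else if (!VLOp.isReg() || VLOp.getReg() == RISCV::X0) {
  // Existing path: bail out on X0 and other non-register VLs.
}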

const MachineInstr &MI = *MO.getParent();
bool HasPassthru = RISCVII::isFirstDefTiedToFirstUse(MI.getDesc());

if (HasPassthru)
Collaborator
This silent off-by-one on the indexing is confusing, to say the least. Given this is only used in a handful of places, I'd prefer to see it inlined and specialized for each user.

case RISCV::VL8RE32_V:
case RISCV::VL8RE64_V:
case RISCV::VS8R_V:
return OperandInfo(RISCVII::VLMUL::LMUL_8, 0);
Collaborator
I believe all of these hard-coded LMULs on the whole-register stuff are equal to MIVLMul.

Collaborator
We don't have pseudo instructions for whole register load/store and we don't appear to set the TSFlags VLMul for non-pseudo instructions.

Collaborator
Ok, another idea...

Their LMUL should match the register class, right? We don't have fractional register classes, but we also don't have fractional whole-register moves? So maybe combine this with isRVVWholeLoadStore from RISCVInstrInfo.cpp for the selection, then use the register class?

(I'm just looking for ways to remove redundancy. If we can share code with other users, that increases the odds we'd catch a bug.)

Again, this is minor, and I'd be happy to iterate on this type of topic in tree.
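
A rough sketch of that idea, deriving the EMUL from the operand's register class; the helper name is made up here, and the return shape mirrors OperandInfo's (value, fractional) pair.

// Whole-register loads/stores: take EMUL from the register class rather
// than hard-coding it per opcode.
static std::pair<unsigned, bool>
getWholeRegisterEMUL(Register R, const MachineRegisterInfo *MRI) {
  const TargetRegisterClass *RC = MRI->getRegClass(R);
  if (RISCV::VRRegClass.hasSubClassEq(RC))
    return {1, /*Fractional=*/false};
  if (RISCV::VRM2RegClass.hasSubClassEq(RC))
    return {2, false};
  if (RISCV::VRM4RegClass.hasSubClassEq(RC))
    return {4, false};
  assert(RISCV::VRM8RegClass.hasSubClassEq(RC) && "unexpected register class");
  return {8, false};
}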

case RISCV::VWSUB_VV:
case RISCV::VWSUB_VX:
case RISCV::VWSLL_VI: {
unsigned Log2EEW = IsMODef ? MILog2SEW + 1 : MILog2SEW;
Collaborator
You should be able to drive these off the recently added TableGen flag for DestEEW.

Contributor Author
Is the suggestion here to use this:

unsigned Log2EEW = IsMODef ? getDestLog2EEW(MILog2SEW) : MILog2SEW;

Is that a simplification?

case RISCV::VFREDMIN_VS:
case RISCV::VFREDOSUM_VS:
case RISCV::VFREDUSUM_VS:
return OperandInfo(MIVLMul, MILog2SEW);
Collaborator
I think this (and the other reduction) result is wrong. Specifically, the destination has scalar LMUL, but the source has something else. The fact that we're returning the same regardless of op number here can't be correct.

Contributor Author

From the RVV spec:

The unsigned vwredsumu.vs instruction zero-extends the SEW-wide vector elements before summing them, then adds the 2*SEW-width scalar element, and stores the result in a 2*SEW-width scalar element.

The vwredsum.vs instruction sign-extends the SEW-wide vector elements before summing them.

The scalar input and output operands are held in element 0 of a single vector register, not a vector register group, so any vector register can be the scalar source or destination of a vector reduction regardless of LMUL setting.

Although the result is a single scalar, it's still written to a vector register. We can see here that the RVV spec refers to the elements it writes to have a SEW. I think it makes less sense to talk about LMUL here, and I think that I said to use the same relation that the SEW had for simplicity. As used, I don't think it would cause any harm since the LMUL being reported is larger than what is actually getting used (just a single element of a vector register).

Truthfully, I can't recall why I wrote it such that 14.1 instructions return Unknown, but 14.2-14.4 don't. Some ideas:

  • return Unknown
  • keep as is
  • modify OperandInfo to describe this case

I probably prefer the first option for now and we can always improve it in the future. Thoughts?

Contributor Author

In 1c35bd5 I made it return Unknown. Then, in 6785417, I updated the patch to modify OperandInfo to describe a vector register being used as a scalar.

michaelmaitland added a commit that referenced this pull request Sep 17, 2024
These tests are good candidates for VL optimization. This is a pre-commit for
PR #108640, but they could probably also be improved by the peephole VL
optimizations.
@michaelmaitland (Contributor Author) commented Sep 17, 2024

As I am working on responding to reviews, one thing has become clear: we need better tests for this. I'm not sure what the best way to test things is. It feels like there are two things here:

  1. getOperandInfo
  2. the actual VL optimization

I wonder if it would be a good idea to add unit tests for getOperandInfo. We would go through each operand of each instruction and verify that it had the expected OperandInfo. This is tedious, and I'd like to avoid it if we are going to rip it out in favor of tablegen.

Then for the actual VL optimization, I think it would be best to use lit tests to get code coverage of all paths (not including all paths through getOperandInfo, since that would be tested on its own).

Does anyone have different ideas or does this sound okay?

@preames (Collaborator) commented Sep 17, 2024

Does anyone have different ideas or does this sound okay?

First, I want to explicitly say again that I'd welcome this iteration happening in tree.

Second, I don't tend to be a huge fan of C++ unit tests for this kind of API because they end up just repeating the implementation, and are hard to update. If you want to go in this direction, maybe something along the lines of a printer pass so that diffs can be autogenerated and we only have to (manually) change one place?

Third, part of the problem with getOperandInfo is that it replicates information in other places. In particular, I think this gets a lot simpler if restructured using the recently added DestEEW mechanism.
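
As a rough illustration of the printer-pass idea (exposing getOperandInfo and giving OperandInfo a print method are both assumptions):

// Debug-only dump so getOperandInfo expectations can live in autogenerated
// lit CHECK lines rather than C++ unit tests.
static void printOperandInfo(const MachineFunction &MF,
                             const MachineRegisterInfo *MRI, raw_ostream &OS) {
  for (const MachineBasicBlock &MBB : MF)
    for (const MachineInstr &MI : MBB) {
      if (!RISCVVPseudosTable::getPseudoInfo(MI.getOpcode()))
        continue;
      OS << MI;
      for (const MachineOperand &MO : MI.operands())
        if (MO.isReg() && isVectorRegClass(MO.getReg(), MRI))
          getOperandInfo(MI, MO, MRI).print(OS); // print() is assumed
    }
}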

@michaelmaitland (Contributor Author)

First, I want to explicitly say again that I'd welcome this iteration happening in tree.

If everyone is on board for this approach, I think that would be good. But I want to highlight here that I think the test coverage is not in a place where I'd feel comfortable enabling this pass by default. I think this is a must-have before turning it on.

hamphet pushed a commit to hamphet/llvm-project that referenced this pull request Sep 18, 2024
These tests are good candidates for VL optimization. This is a pre-commit for
PR llvm#108640, but they could probably also be improved by the peephole VL
optimizations.
@lukel97 (Contributor) commented Sep 18, 2024

If everyone is on board for this approach, I think that would be good. But I want to highlight here that I think the test coverage is not in a place where I'd feel comfortable enabling this pass by default. I think this is a must-have before turning it on.

I would be happy to iterate in tree as well, we can leave it off by default and add some .mir tests. As for moving getOperandInfo out into TableGen, it was mostly a refactoring idea and I don't think we should block this PR on it. I can take a stab at it this week in parallel and see if it's feasible or not.

case RISCV::VREDXOR_VS: {
if (IsMODef || IsOp1)
return OperandInfo(MILog2SEW);
if (IsOp2)
Contributor Author

I have to check whether this is the correct operand, or if it should be op3. Will add tests.

if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
checkUsers(CommonVL, CanReduceVL, MI);
else {
CommonVL = MI.getOperand(RISCVII::getVLOpNum(MI.getDesc())).getReg();
@topperc (Collaborator) commented Sep 18, 2024

Won't this just change the VL to its own VL? What's the point? And if that's all it does, you need to detect that below and not set MadeChange or add the inputs to the worklist because it didn't really change.

Contributor Author

I removed it.

std::optional<Register> CommonVL;
bool CanReduceVL = true;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
checkUsers(CommonVL, CanReduceVL, MI);
Collaborator

Make checkUsers return CanReduceVL.
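
That is, something along these lines (a sketch of the signature change only):

std::optional<Register> CommonVL;
bool CanReduceVL = false;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
  CanReduceVL = checkUsers(CommonVL, MI);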

// vec_reg_a = ..., vl_op, sew_op
// vec_reg_a = ..., vl_op, sew_op
// scalar_reg = vector_instr vec_reg_a, vec_reg_b, vl_op, sew_op
// We'd like to reduce the vl_op on vector_instr, despite it producing
Collaborator

Is this comment still relevant?

@michaelmaitland (Contributor Author)

ping

@preames (Collaborator) left a comment

LGTM

As discussed in today's sync call, we have consensus to land this in tree (off by default!) and continue iterating. There is a good amount of code sharing and improved testing which needs to happen before this can be enabled by default. All parties are aware, and enabling that incrementally is the goal of landing now.

@ppenzin commented Oct 10, 2024

I've run this by our arch folks, and we are also in favor of the change.

@lukel97 (Contributor) left a comment

LGTM to land as is, just spotted something I want to point out before I forget

Comment on lines 401 to 402
case RISCV::VSMUL_VV:
case RISCV::VSMUL_VX:
Contributor

Are these not single width? I thought the multiply happened in SEW*2 but the destination is still SEW

Contributor Author

The signed fractional multiply instruction produces a 2*SEW product of the two SEW inputs, then shifts the result right by SEW-1 bits, rounding these bits according to vxrm, then saturates the result to fit into SEW bits.

That's a great catch. I'll add a test specifically for these before merging.

Contributor

Feel free to land first and add a test later, don't let my reviews block this!

Contributor Author

I've removed it from getOperandInfo; we can add it back with tests later.
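
For reference, a small self-contained worked example of the fixed-point behavior quoted above (SEW=8, rounding and saturation elided), showing why the stored element is SEW wide even though the intermediate product is 2*SEW:

#include <cstdint>
#include <cstdio>

int main() {
  int8_t a = 0x40, b = 0x40;              // Q0.7 fixed point: 0x40 == 0.5
  int16_t prod = int16_t(a) * int16_t(b); // 2*SEW-wide intermediate: 0x1000
  int8_t res = int8_t(prod >> 7);         // shift right by SEW-1 -> 0x20 == 0.25
  printf("%#x\n", res);                   // the destination element is SEW wide
  return 0;
}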


// All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
if (isMaskOperand(MI, MO, MRI))
return OperandInfo(RISCVVType::getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
Contributor

Do we handle compare instructions e.g. vmseq.vv where the destination operand will be a mask register, but not constrained to V0?

Contributor Author

From isMaskOperand:

  return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;

It looks like we only support the v0 case for now.
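
(For a concrete instance of the mask rule quoted above: with SEW=32 and LMUL=2 taken from the instruction, a mask operand has EEW=1, so EMUL = (EEW/SEW)*LMUL = (1/32)*2 = 1/16, which is smaller than MF8 and is why OperandInfo stores EMUL as a (value, fractional) pair rather than a RISCVII::VLMUL.)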

@michaelmaitland michaelmaitland merged commit 1c94388 into llvm:main Oct 11, 2024
5 of 7 checks passed
ichaer added a commit to splunk/ichaer-llvm-project that referenced this pull request Oct 11, 2024
…ent-indentonly

* llvm-trunk/main: (6379 commits)
  [gn build] Port 1c94388
  [RISCV] Introduce VLOptimizer pass (llvm#108640)
  [mlir][vector] Add more tests for ConvertVectorToLLVM (7/n) (llvm#111895)
  [libc++] Add output groups to run-buildbot (llvm#111739)
  [libc++abi] Remove unused LIBCXXABI_LIBCXX_INCLUDES CMake option (llvm#111824)
  [clang] Ignore inline namespace for `hasName` (llvm#109147)
  [AArch64] Disable consecutive store merging when Neon is unavailable (llvm#111519)
  [lldb] Fix finding make tool for tests (llvm#111980)
  Turn `-Wdeprecated-literal-operator` on by default (llvm#111027)
  [AMDGPU] Rewrite RegSeqNames using !foreach. NFC. (llvm#111994)
  Revert "Reland: [clang] Finish implementation of P0522 (llvm#111711)"
  Revert "[clang] CWG2398: improve overload resolution backwards compat (llvm#107350)"
  Revert "[clang] Implement TTP P0522 pack matching for deduced function template calls. (llvm#111457)"
  [Clang] Replace Intrinsic::getDeclaration with getOrInsertDeclaration (llvm#111990)
  Revert "[NVPTX] Prefer prmt.b32 over bfi.b32 (llvm#110766)"
  [RISCV] Add DAG combine to turn (sub (shl X, 8-Y), (shr X, Y)) into orc.b (llvm#111828)
  [libc] Fix compilation of new trig functions (llvm#111987)
  [NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (llvm#111752)
  [NFC][CodingStandard] Add additional example for if-else brace rule (llvm#111733)
  CodeGen: Remove redundant REQUIRES registered-target from tests (llvm#111982)
  ...
@llvm-ci (Collaborator) commented Oct 11, 2024

LLVM Buildbot has detected a new failure on builder bolt-x86_64-ubuntu-dylib running on bolt-worker while building llvm at step 6 "test-build-bolt-check-bolt".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/119/builds/2572

Here is the relevant piece of the build log for reference:
Step 6 (test-build-bolt-check-bolt) failure: test (failure)
******************** TEST 'BOLT :: perf2bolt/perf_test.test' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 5: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -fuse-ld=lld -Wl,--script=/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.lds -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -fuse-ld=lld -Wl,--script=/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.lds -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
RUN: at line 6: perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
+ perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp
info: Using a maximum frequency rate of 2000 Hz
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 (18 samples) ]
RUN: at line 7: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp3 -nl -ignore-build-id 2>&1 | /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp2 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp3 -nl -ignore-build-id
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
RUN: at line 12: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -no-pie -fuse-ld=lld -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/clang /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/Inputs/perf_test.c -no-pie -fuse-ld=lld -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
RUN: at line 13: perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
+ perf record -Fmax -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4
info: Using a maximum frequency rate of 2000 Hz
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 (9 samples) ]
RUN: at line 14: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4 -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp6 -nl -ignore-build-id 2>&1 | /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/FileCheck /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp4 -p=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp5 -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-dylib/build/tools/bolt/test/perf2bolt/Output/perf_test.test.tmp6 -nl -ignore-build-id
/home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test:10:12: error: CHECK-NOT: excluded string found in input
CHECK-NOT: !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection.
           ^
<stdin>:26:2: note: found here
 !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may be ineffective for improving performance.
 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Input file: <stdin>
Check file: /home/worker/bolt-worker2/llvm-project/bolt/test/perf2bolt/perf_test.test

-dump-input=help explains the following input dump.

Input was:
<<<<<<
        .
        .
        .
       21: BOLT-WARNING: Running parallel work of 0 estimated cost, will switch to trivial scheduling. 
       22: PERF2BOLT: processing basic events (without LBR)... 
       23: PERF2BOLT: read 9 samples 
       24: PERF2BOLT: out of range samples recorded in unknown regions: 9 (100.0%) 
       25:  
       26:  !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may be ineffective for improving performance. 
not:10      !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                   error: no match expected
       27:  
...

@llvm-ci (Collaborator) commented Oct 11, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1360

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/35/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-9448-35-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=35 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024