[RISCV] Introduce VLOptimizer pass #108640
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

The purpose of this optimization is to make the VL argument, for instructions that have a VL argument, as small as possible. This is implemented by visiting each instruction in reverse order and checking, if it has a VL argument, whether the VL can be reduced. By putting this pass before VSETVLI insertion, we see three kinds of changes to generated code:

1. Eliminate VSETVLI instructions
2. Reduce the VL toggle on VSETVLI instructions that also change vtype
3. Reduce the VL set by a VSETVLI instruction

The list of supported instructions is currently whitelisted for safety. In the future, we could add more instructions to isSupportedInstr to support even more VL optimization.

We originally wrote this pass because vector GEP instructions do not take a VL, which leads us to emit code that uses VL=VLMAX to implement GEP in the RISC-V backend. As a result, some of the vector instructions will write to lanes, specifically between the intended VL and VLMAX, that will never be read. As an alternative to this pass, we considered adding a vector predicated GEP instruction, but this would not fit well into the intrinsic type system since GEP has a variable number of arguments, each with arbitrary types. The second approach we considered was to put this pass after VSETVLI insertion, but we found that it was more difficult to recognize optimization opportunities, especially across basic block boundaries -- the data flow analysis was also a bit more expensive and complex.

While this pass solves the GEP problem, we have expanded it to handle more cases of VL optimization, and there is opportunity for the analysis to be improved to enable even more optimization. In our downstream compiler, this was able to optimize > 5000 VLs on spec2006 and > 6000 VLs on spec2017. These measurements were using

Patch is 242.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108640.diff

29 Files Affected:
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 124bc239451aed..46ef94b4dd2931 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -59,6 +59,7 @@ add_llvm_target(RISCVCodeGen
RISCVTargetObjectFile.cpp
RISCVTargetTransformInfo.cpp
RISCVVectorPeephole.cpp
+ RISCVVLOptimizer.cpp
GISel/RISCVCallLowering.cpp
GISel/RISCVInstructionSelector.cpp
GISel/RISCVLegalizerInfo.cpp
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 5a94ada8f8dd46..651119d10d1119 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -99,6 +99,10 @@ void initializeRISCVO0PreLegalizerCombinerPass(PassRegistry &);
FunctionPass *createRISCVPreLegalizerCombiner();
void initializeRISCVPreLegalizerCombinerPass(PassRegistry &);
+
+FunctionPass *createRISCVVLOptimizerPass();
+void initializeRISCVVLOptimizerPass(PassRegistry &);
+
} // namespace llvm
#endif
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 794df2212dfa53..bfedc6b177332e 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -103,6 +103,10 @@ static cl::opt<bool> EnableVSETVLIAfterRVVRegAlloc(
cl::desc("Insert vsetvls after vector register allocation"),
cl::init(true));
+static cl::opt<bool> EnableVLOptimizer("riscv-enable-vloptimizer",
+ cl::desc("Enable the VL Optimizer pass"),
+ cl::init(true), cl::Hidden);
+
extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeRISCVTarget() {
RegisterTargetMachine<RISCVTargetMachine> X(getTheRISCV32Target());
RegisterTargetMachine<RISCVTargetMachine> Y(getTheRISCV64Target());
@@ -550,8 +554,11 @@ void RISCVPassConfig::addMachineSSAOptimization() {
void RISCVPassConfig::addPreRegAlloc() {
addPass(createRISCVPreRAExpandPseudoPass());
- if (TM->getOptLevel() != CodeGenOptLevel::None)
+ if (TM->getOptLevel() != CodeGenOptLevel::None) {
addPass(createRISCVMergeBaseOffsetOptPass());
+ if (EnableVLOptimizer)
+ addPass(createRISCVVLOptimizerPass());
+ }
addPass(createRISCVInsertReadWriteCSRPass());
addPass(createRISCVInsertWriteVXRMPass());
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
new file mode 100644
index 00000000000000..d25db9b38bec44
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -0,0 +1,1468 @@
+//===-------------- RISCVVLOptimizer.cpp - VL Optimizer ------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// This pass reduces the VL where possible at the MI level, before VSETVLI
+// instructions are inserted.
+//
+// The purpose of this optimization is to make the VL argument, for instructions
+// that have a VL argument, as small as possible. This is implemented by
+// visiting each instruction in reverse order and checking that if it has a VL
+// argument, whether the VL can be reduced.
+//
+//===---------------------------------------------------------------------===//
+
+#include "RISCV.h"
+#include "RISCVMachineFunctionInfo.h"
+#include "RISCVSubtarget.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/InitializePasses.h"
+
+#include <algorithm>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-vl-optimizer"
+
+namespace {
+
+class RISCVVLOptimizer : public MachineFunctionPass {
+ const MachineRegisterInfo *MRI;
+ const MachineDominatorTree *MDT;
+
+public:
+ static char ID;
+
+ RISCVVLOptimizer() : MachineFunctionPass(ID) {
+ initializeRISCVVLOptimizerPass(*PassRegistry::getPassRegistry());
+ }
+
+ bool runOnMachineFunction(MachineFunction &MF) override;
+
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.setPreservesCFG();
+ AU.addRequired<MachineDominatorTreeWrapperPass>();
+ MachineFunctionPass::getAnalysisUsage(AU);
+ }
+
+ StringRef getPassName() const override { return "RISC-V VL Optimizer"; }
+
+private:
+ bool tryReduceVL(MachineInstr &MI);
+};
+
+} // end anonymous namespace
+
+char RISCVVLOptimizer::ID = 0;
+INITIALIZE_PASS_BEGIN(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer",
+ false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
+INITIALIZE_PASS_END(RISCVVLOptimizer, DEBUG_TYPE, "RISC-V VL Optimizer", false,
+ false)
+
+FunctionPass *llvm::createRISCVVLOptimizerPass() {
+ return new RISCVVLOptimizer();
+}
+
+/// Return true if R is a physical or virtual vector register, false otherwise.
+static bool isVectorRegClass(Register R, const MachineRegisterInfo *MRI) {
+ if (R.isPhysical())
+ return RISCV::VRRegClass.contains(R);
+ const TargetRegisterClass *RC = MRI->getRegClass(R);
+ return RISCV::VRRegClass.hasSubClassEq(RC) ||
+ RISCV::VRM2RegClass.hasSubClassEq(RC) ||
+ RISCV::VRM4RegClass.hasSubClassEq(RC) ||
+ RISCV::VRM8RegClass.hasSubClassEq(RC);
+}
+
+/// Represents the EMUL and EEW of a MachineOperand.
+struct OperandInfo {
+ enum class State {
+ Unknown,
+ Known,
+ } S;
+
+ // Represent as 1,2,4,8, ... and fractional indicator. This is because
+ // EMUL can take on values that don't map to RISCVII::VLMUL values exactly.
+ // For example, a mask operand can have an EMUL less than MF8.
+ std::pair<unsigned, bool> EMUL;
+
+ unsigned Log2EEW;
+
+ OperandInfo(RISCVII::VLMUL EMUL, unsigned Log2EEW)
+ : S(State::Known), EMUL(RISCVVType::decodeVLMUL(EMUL)), Log2EEW(Log2EEW) {
+ }
+
+ OperandInfo(std::pair<unsigned, bool> EMUL, unsigned Log2EEW)
+ : S(State::Known), EMUL(EMUL), Log2EEW(Log2EEW) {}
+
+ OperandInfo(State S) : S(S) {
+ assert(S != State::Known &&
+ "This constructor may only be used to construct "
+ "an Unknown OperandInfo");
+ }
+
+ bool isUnknown() { return S == State::Unknown; }
+ bool isKnown() { return S == State::Known; }
+
+ static bool EMULAndEEWAreEqual(OperandInfo A, OperandInfo B) {
+ assert(A.isKnown() && B.isKnown() && "Both operands must be known");
+ return A.Log2EEW == B.Log2EEW && A.EMUL.first == B.EMUL.first &&
+ A.EMUL.second == B.EMUL.second;
+ }
+};
+
+/// Return the RISCVII::VLMUL that is two times VLMul.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_8.
+static RISCVII::VLMUL twoTimesVLMUL(RISCVII::VLMUL VLMul) {
+ switch (VLMul) {
+ case RISCVII::VLMUL::LMUL_F8:
+ return RISCVII::VLMUL::LMUL_F4;
+ case RISCVII::VLMUL::LMUL_F4:
+ return RISCVII::VLMUL::LMUL_F2;
+ case RISCVII::VLMUL::LMUL_F2:
+ return RISCVII::VLMUL::LMUL_1;
+ case RISCVII::VLMUL::LMUL_1:
+ return RISCVII::VLMUL::LMUL_2;
+ case RISCVII::VLMUL::LMUL_2:
+ return RISCVII::VLMUL::LMUL_4;
+ case RISCVII::VLMUL::LMUL_4:
+ return RISCVII::VLMUL::LMUL_8;
+ case RISCVII::VLMUL::LMUL_8:
+ default:
+ llvm_unreachable("Could not multiply VLMul by 2");
+ }
+}
+
+/// Return the RISCVII::VLMUL that is VLMul / 2.
+/// Precondition: VLMul is not LMUL_RESERVED or LMUL_MF8.
+static RISCVII::VLMUL halfVLMUL(RISCVII::VLMUL VLMul) {
+ switch (VLMul) {
+ case RISCVII::VLMUL::LMUL_F4:
+ return RISCVII::VLMUL::LMUL_F8;
+ case RISCVII::VLMUL::LMUL_F2:
+ return RISCVII::VLMUL::LMUL_F4;
+ case RISCVII::VLMUL::LMUL_1:
+ return RISCVII::VLMUL::LMUL_F2;
+ case RISCVII::VLMUL::LMUL_2:
+ return RISCVII::VLMUL::LMUL_1;
+ case RISCVII::VLMUL::LMUL_4:
+ return RISCVII::VLMUL::LMUL_2;
+ case RISCVII::VLMUL::LMUL_8:
+ return RISCVII::VLMUL::LMUL_4;
+ case RISCVII::VLMUL::LMUL_F8:
+ default:
+ llvm_unreachable("Could not divide VLMul by 2");
+ }
+}
+
+/// Return EMUL = (EEW / SEW) * LMUL where EEW comes from Log2EEW and LMUL and
+/// SEW are from the TSFlags of MI.
+static std::pair<unsigned, bool>
+getEMULEqualsEEWDivSEWTimesLMUL(unsigned Log2EEW, const MachineInstr &MI) {
+ RISCVII::VLMUL MIVLMUL = RISCVII::getLMul(MI.getDesc().TSFlags);
+ auto [MILMUL, MILMULIsFractional] = RISCVVType::decodeVLMUL(MIVLMUL);
+ unsigned MILog2SEW =
+ MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+ unsigned MISEW = 1 << MILog2SEW;
+
+ unsigned EEW = 1 << Log2EEW;
+ // Calculate (EEW/SEW)*LMUL preserving fractions less than 1. Use GCD
+ // to put fraction in simplest form.
+ unsigned Num = EEW, Denom = MISEW;
+ int GCD = MILMULIsFractional ? std::gcd(Num, Denom * MILMUL)
+ : std::gcd(Num * MILMUL, Denom);
+ Num = MILMULIsFractional ? Num / GCD : Num * MILMUL / GCD;
+ Denom = MILMULIsFractional ? Denom * MILMUL / GCD : Denom / GCD;
+ return std::make_pair(Num > Denom ? Num : Denom, Denom > Num);
+}
+
+/// An index segment load or store operand has the form v.*seg<nf>ei<eew>.v.
+/// Data has EEW=SEW, EMUL=LMUL. Index has EEW=<eew>, EMUL=(EEW/SEW)*LMUL. LMUL
+/// and SEW comes from TSFlags of MI.
+static OperandInfo
+getIndexSegmentLoadStoreOperandInfo(unsigned Log2EEW, const MachineInstr &MI,
+ const MachineOperand &MO) {
+ // Operand 0 is data register
+ // Data vector register group has EEW=SEW, EMUL=LMUL.
+ if (MO.getOperandNo() == 0) {
+ RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+ unsigned MILog2SEW =
+ MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+ return OperandInfo(MIVLMul, MILog2SEW);
+ }
+
+ // Operand 1 is index vector register
+// v.*seg<nf>ei<eew>.v
+ // Index vector register group has EEW=<eew>, EMUL=(EEW/SEW)*LMUL.
+ if (MO.getOperandNo() == 2)
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+
+ llvm_unreachable("Could not get OperandInfo for non-vector register of an "
+ "indexed segment load or store instruction");
+}
+
+/// Dest has EEW=SEW and EMUL=LMUL. Source EEW=SEW/Factor (i.e. F2 => EEW/2).
+/// Source has EMUL=(EEW/SEW)*LMUL. LMUL and SEW comes from TSFlags of MI.
+static OperandInfo getIntegerExtensionOperandInfo(unsigned Factor,
+ const MachineInstr &MI,
+ const MachineOperand &MO) {
+ RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+ unsigned MILog2SEW =
+ MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+
+ if (MO.getOperandNo() == 0)
+ return OperandInfo(MIVLMul, MILog2SEW);
+
+ unsigned MISEW = 1 << MILog2SEW;
+ unsigned EEW = MISEW / Factor;
+ unsigned Log2EEW = Log2_32(EEW);
+
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(Log2EEW, MI), Log2EEW);
+}
+
+/// Check whether MO is a mask operand of MI.
+static bool isMaskOperand(const MachineInstr &MI, const MachineOperand &MO,
+ const MachineRegisterInfo *MRI) {
+
+ if (!MO.isReg() || !isVectorRegClass(MO.getReg(), MRI))
+ return false;
+
+ const MCInstrDesc &Desc = MI.getDesc();
+ return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;
+}
+
+/// Return the OperandInfo for MO, which is an operand of MI.
+static OperandInfo getOperandInfo(const MachineInstr &MI,
+ const MachineOperand &MO,
+ const MachineRegisterInfo *MRI) {
+ const RISCVVPseudosTable::PseudoInfo *RVV =
+ RISCVVPseudosTable::getPseudoInfo(MI.getOpcode());
+ assert(RVV && "Could not find MI in PseudoTable");
+
+ // MI has a VLMUL and SEW associated with it. The RVV specification defines
+ // the LMUL and SEW of each operand and definition in relation to MI.VLMUL and
+ // MI.SEW.
+ RISCVII::VLMUL MIVLMul = RISCVII::getLMul(MI.getDesc().TSFlags);
+ unsigned MILog2SEW =
+ MI.getOperand(RISCVII::getSEWOpNum(MI.getDesc())).getImm();
+ bool IsMODef = MO.getOperandNo() == 0;
+
+ // All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
+ if (isMaskOperand(MI, MO, MRI))
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
+
+ // TODO: Pseudos that end in _MASK or _TU can have a merge operand.
+ // We bail out early for instructions that have merge operands for now.
+ // Merge operand is first operand after defs.
+ if (MO.getOperandNo() == MI.getNumExplicitDefs() && MO.isReg() && MO.isTied())
+ return OperandInfo(OperandInfo::State::Unknown);
+
+ // switch against BaseInstr to reduce number of cases that need to be
+ // considered.
+ switch (RVV->BaseInstr) {
+
+ // 6. Configuration-Setting Instructions
+ // Configuration setting instructions do not read or write vector registers
+ case RISCV::VSETIVLI:
+ case RISCV::VSETVL:
+ case RISCV::VSETVLI:
+ llvm_unreachable("Configuration setting instructions do not read or write "
+ "vector registers");
+
+ // 7. Vector Loads and Stores
+ // 7.4. Vector Unit-Stride Instructions
+ // 7.5. Vector Strided Instructions
+ // 7.7. Unit-stride Fault-Only-First Loads
+ /// Dest EEW encoded in the instruction and EMUL=(EEW/SEW)*LMUL
+ case RISCV::VLE8_V:
+ case RISCV::VSE8_V:
+ case RISCV::VLM_V:
+ case RISCV::VSM_V:
+ case RISCV::VLSE8_V:
+ case RISCV::VSSE8_V:
+ case RISCV::VLE8FF_V:
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+ case RISCV::VLE16_V:
+ case RISCV::VSE16_V:
+ case RISCV::VLSE16_V:
+ case RISCV::VSSE16_V:
+ case RISCV::VLE16FF_V:
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+ case RISCV::VLE32_V:
+ case RISCV::VSE32_V:
+ case RISCV::VLSE32_V:
+ case RISCV::VSSE32_V:
+ case RISCV::VLE32FF_V:
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+ case RISCV::VLE64_V:
+ case RISCV::VSE64_V:
+ case RISCV::VLSE64_V:
+ case RISCV::VSSE64_V:
+ case RISCV::VLE64FF_V:
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+ // 7.6. Vector Indexed Instructions
+ // Data EEW=SEW, EMUL=LMUL. Index EEW=<eew> and EMUL=(EEW/SEW)*LMUL
+ case RISCV::VLUXEI8_V:
+ case RISCV::VLOXEI8_V:
+ case RISCV::VSUXEI8_V:
+ case RISCV::VSOXEI8_V:
+ if (MO.getOperandNo() == 0)
+ return OperandInfo(MIVLMul, MILog2SEW);
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(3, MI), 3);
+ case RISCV::VLUXEI16_V:
+ case RISCV::VLOXEI16_V:
+ case RISCV::VSUXEI16_V:
+ case RISCV::VSOXEI16_V:
+ if (MO.getOperandNo() == 0)
+ return OperandInfo(MIVLMul, MILog2SEW);
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(4, MI), 4);
+ case RISCV::VLUXEI32_V:
+ case RISCV::VLOXEI32_V:
+ case RISCV::VSUXEI32_V:
+ case RISCV::VSOXEI32_V:
+ if (MO.getOperandNo() == 0)
+ return OperandInfo(MIVLMul, MILog2SEW);
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(5, MI), 5);
+ case RISCV::VLUXEI64_V:
+ case RISCV::VLOXEI64_V:
+ case RISCV::VSUXEI64_V:
+ case RISCV::VSOXEI64_V:
+ if (MO.getOperandNo() == 0)
+ return OperandInfo(MIVLMul, MILog2SEW);
+ return OperandInfo(getEMULEqualsEEWDivSEWTimesLMUL(6, MI), 6);
+
+ // 7.8. Vector Load/Store Segment Instructions
+ // 7.8.1. Vector Unit-Stride Segment Loads and Stores
+ // v.*seg<nf>e<eew>.*
+ // EEW=eew, EMUL=LMUL
+ case RISCV::VLSEG2E8_V:
+ case RISCV::VLSEG2E8FF_V:
+ case RISCV::VLSEG3E8_V:
+ case RISCV::VLSEG3E8FF_V:
+ case RISCV::VLSEG4E8_V:
+ case RISCV::VLSEG4E8FF_V:
+ case RISCV::VLSEG5E8_V:
+ case RISCV::VLSEG5E8FF_V:
+ case RISCV::VLSEG6E8_V:
+ case RISCV::VLSEG6E8FF_V:
+ case RISCV::VLSEG7E8_V:
+ case RISCV::VLSEG7E8FF_V:
+ case RISCV::VLSEG8E8_V:
+ case RISCV::VLSEG8E8FF_V:
+ case RISCV::VSSEG2E8_V:
+ case RISCV::VSSEG3E8_V:
+ case RISCV::VSSEG4E8_V:
+ case RISCV::VSSEG5E8_V:
+ case RISCV::VSSEG6E8_V:
+ case RISCV::VSSEG7E8_V:
+ case RISCV::VSSEG8E8_V:
+ return OperandInfo(MIVLMul, 3);
+ case RISCV::VLSEG2E16_V:
+ case RISCV::VLSEG2E16FF_V:
+ case RISCV::VLSEG3E16_V:
+ case RISCV::VLSEG3E16FF_V:
+ case RISCV::VLSEG4E16_V:
+ case RISCV::VLSEG4E16FF_V:
+ case RISCV::VLSEG5E16_V:
+ case RISCV::VLSEG5E16FF_V:
+ case RISCV::VLSEG6E16_V:
+ case RISCV::VLSEG6E16FF_V:
+ case RISCV::VLSEG7E16_V:
+ case RISCV::VLSEG7E16FF_V:
+ case RISCV::VLSEG8E16_V:
+ case RISCV::VLSEG8E16FF_V:
+ case RISCV::VSSEG2E16_V:
+ case RISCV::VSSEG3E16_V:
+ case RISCV::VSSEG4E16_V:
+ case RISCV::VSSEG5E16_V:
+ case RISCV::VSSEG6E16_V:
+ case RISCV::VSSEG7E16_V:
+ case RISCV::VSSEG8E16_V:
+ return OperandInfo(MIVLMul, 4);
+ case RISCV::VLSEG2E32_V:
+ case RISCV::VLSEG2E32FF_V:
+ case RISCV::VLSEG3E32_V:
+ case RISCV::VLSEG3E32FF_V:
+ case RISCV::VLSEG4E32_V:
+ case RISCV::VLSEG4E32FF_V:
+ case RISCV::VLSEG5E32_V:
+ case RISCV::VLSEG5E32FF_V:
+ case RISCV::VLSEG6E32_V:
+ case RISCV::VLSEG6E32FF_V:
+ case RISCV::VLSEG7E32_V:
+ case RISCV::VLSEG7E32FF_V:
+ case RISCV::VLSEG8E32_V:
+ case RISCV::VLSEG8E32FF_V:
+ case RISCV::VSSEG2E32_V:
+ case RISCV::VSSEG3E32_V:
+ case RISCV::VSSEG4E32_V:
+ case RISCV::VSSEG5E32_V:
+ case RISCV::VSSEG6E32_V:
+ case RISCV::VSSEG7E32_V:
+ case RISCV::VSSEG8E32_V:
+ return OperandInfo(MIVLMul, 5);
+ case RISCV::VLSEG2E64_V:
+ case RISCV::VLSEG2E64FF_V:
+ case RISCV::VLSEG3E64_V:
+ case RISCV::VLSEG3E64FF_V:
+ case RISCV::VLSEG4E64_V:
+ case RISCV::VLSEG4E64FF_V:
+ case RISCV::VLSEG5E64_V:
+ case RISCV::VLSEG5E64FF_V:
+ case RISCV::VLSEG6E64_V:
+ case RISCV::VLSEG6E64FF_V:
+ case RISCV::VLSEG7E64_V:
+ case RISCV::VLSEG7E64FF_V:
+ case RISCV::VLSEG8E64_V:
+ case RISCV::VLSEG8E64FF_V:
+ case RISCV::VSSEG2E64_V:
+ case RISCV::VSSEG3E64_V:
+ case RISCV::VSSEG4E64_V:
+ case RISCV::VSSEG5E64_V:
+ case RISCV::VSSEG6E64_V:
+ case RISCV::VSSEG7E64_V:
+ case RISCV::VSSEG8E64_V:
+ return OperandInfo(MIVLMul, 6);
+
+ // 7.8.2. Vector Strided Segment Loads and Stores
+ case RISCV::VLSSEG2E8_V:
+ case RISCV::VLSSEG3E8_V:
+ case RISCV::VLSSEG4E8_V:
+ case RISCV::VLSSEG5E8_V:
+ case RISCV::VLSSEG6E8_V:
+ case RISCV::VLSSEG7E8_V:
+ case RISCV::VLSSEG8E8_V:
+ case RISCV::VSSSEG2E8_V:
+ case RISCV::VSSSEG3E8_V:
+ case RISCV::VSSSEG4E8_V:
+ case RISCV::VSSSEG5E8_V:
+ case RISCV::VSSSEG6E8_V:
+ case RISCV::VSSSEG7E8_V:
+ case RISCV::VSSSEG8E8_V:
+ return OperandInfo(MIVLMul, 3);
+ case RISCV::VLSSEG2E16_V:
+ case RISCV::VLSSEG3E16_V:
+ case RISCV::VLSSEG4E16_V:
+ case RISCV::VLSSEG5E16_V:
+ case RISCV::VLSSEG6E16_V:
+ case RISCV::VLSSEG7E16_V:
+ case RISCV::VLSSEG8E16_V:
+ case RISCV::VSSSEG2E16_V:
+ case RISCV::VSSSEG3E16_V:
+ case RISCV::VSSSEG4E16_V:
+ case RISCV::VSSSEG5E16_V:
+ case RISCV::VSSSEG6E16_V:
+ case RISCV::VSSSEG7E16_V:
+ case RISCV::VSSSEG8E16_V:
+ return OperandInfo(MIVLMul, 4);
+ case RISCV::VLSSEG2E32_V:
+ case RISCV::VLSSEG3E32_V:
+ case RISCV::VLSSEG4E32_V:
+ case RISCV::VLSSEG5E32_V:
+ case RISCV::VLSSEG6E32_V:
+ case RISCV::VLSSEG7E32_V:
+ case RISCV::VLSSEG8E32_V:
+ case RISCV::VSSSEG2E32_V:
+ case RISCV::VSSSEG3E32_V:
+ case RISCV::VSSSEG4E32_V:
+ case RISCV::VSSSEG5E32_V:
+ case RISCV::VSSSEG6E32_V:
+ case RISCV::VSSSEG7E32_V:
+ case RISCV::VSSSEG8E32_V:
+ return OperandInfo(MIVLMul, 5);
+ case RISCV::VLSSEG2E64_V:
+ case RISCV::VLSSEG3E64_V:
+ case RISCV::VLSSEG4E64_V:
+ case RISCV::VLSSEG5E64_V:
+ case RISCV::VLSSEG6E64_V:
+ case RISCV::VLSSEG7E64_V:
+ case RISCV::VLSSEG8E64_V:
+ case RISCV::VSSSEG2E64_V:
+ case RISCV::VSSSEG3E64_V:
+ case RISCV::VSSSEG4E64_V:
+ case RISCV::VSSSEG5E64_V:
+ case RISCV::VSSSEG6E64_V:
+ case RISCV::VSSSEG7E64_V:
+ case RISCV::VSSSEG8E64_V:
+ return OperandInfo(MIVLMul, 6);
+
+ // 7.8.3. Vector Indexed Segment Loads and Stores
+ case RISCV::VLUXSEG2EI8_V:
+ case RISCV::VLUXSEG3EI8_V:
+ case RISCV::VLUXSEG4EI8_V:
+ case RISCV::VLUXSEG5EI8_V:
+ case RISCV::VLUXSEG6EI8_V:
+ case RISCV::VLUXSEG7EI8_V:
+ case RISCV::VLUXSEG8EI8_V:
+ case RISCV::VLOXSEG2EI8_V:
+ case RISCV::VLOXSEG3EI8_V:
+ case RISCV::VLOXSEG4EI8_V:
+ case RISCV::VLOXSEG5EI8_V:
+ case RISCV::VLOXSEG6EI8_V:
+ case RISCV::VLOXSEG7EI8_V:
+ case RISCV::VLOXSEG8EI8_V:
+ case RISCV::VSUXSEG2EI8_V:
+ case RISCV::VSUXSEG3EI8_V:
+ case RISCV::VSUXSEG4EI8_V:
+ case RISCV::VSUXSEG5EI8_V:
+ case RISCV::VSUXSEG6EI8_V:
+ case RISCV::VSUXSEG7EI8_V:
+ case RISCV::VSUXSEG8EI8_V:
+ case RISCV::VSOXSEG2EI8_V...
[truncated]
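The driver loop itself falls in the truncated part of RISCVVLOptimizer.cpp. As a rough sketch only (not the committed code), the shape described in the summary -- visit instructions bottom-up and try to shrink each VL -- could look roughly like this, reusing the helpers visible in the diff above:

// Illustrative sketch; the real implementation lives in the truncated
// portion of the patch and may differ.
bool RISCVVLOptimizer::runOnMachineFunction(MachineFunction &MF) {
  if (skipFunction(MF.getFunction()))
    return false;

  MRI = &MF.getRegInfo();
  MDT = &getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();

  const RISCVSubtarget &ST = MF.getSubtarget<RISCVSubtarget>();
  if (!ST.hasVInstructions())
    return false;

  bool MadeChange = false;
  for (MachineBasicBlock &MBB : MF) {
    // Visit bottom-up so a user's (possibly already reduced) VL is seen
    // before the instruction that defines its operands.
    for (MachineInstr &MI : llvm::reverse(MBB)) {
      if (!RISCVII::hasVLOp(MI.getDesc().TSFlags))
        continue;
      MadeChange |= tryReduceVL(MI);
    }
  }
  return MadeChange;
}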
(force-pushed from bddad03 to f52913b)
I have force-pushed because there had not been any review yet and I was missing some important bug fixes. I also opted to remove some functionality that I think would be better as a follow-up patch, to make the review here more straightforward.
✅ With the latest revision this PR passed the C/C++ code formatter.
Hey, awesome! Really happy to see this making its way upstream. As I'm sure you know, I've been slowly expanding tryToReduceVL in RISCVVectorPeephole in this direction. We should definitely make sure we're sharing as much code as possible. I think we still need the early peephole (to drive the vmv.v.v conversions), but we should at least share code. I'm going to take a real look at the code on Monday, but wanted to state initial support!
Great! I just left some comments. I will have a detailed review later.
A couple of detail comments inline, but I want to focus on high-level review strategy.
Looking at the code for this, I spot a couple of possible bugs (flagged in inline comments), but nothing really terrible code-structure-wise. I'd like to propose that we defer most of the code cleanup suggestions until after this is in tree. Specifically, I'm proposing the following:
- This patch is repurposed to add the transform pass off-by-default. We subset as needed to get something correct in tree with minimal delay.
- We iterate in tree to common code, and to add back any key features subset in step (1).
- We enable the pass in its own commit.
I'm very hesitant to do the usual style review here because there's so much room for subtle bugs to be introduced, and if we get into such a cycle the review could extend nearly indefinitely. I'm willing to assume good-faith follow-up here, and frankly will likely invest a bunch of time myself in getting this code into a long-term supportable state.
LLVM_DEBUG(dbgs() << "Try reduce VL for " << MI << "\n");
std::optional<Register> CommonVL;
bool CanReduceVL = true;
for (auto &UserOp : MRI->use_operands(MI.getOperand(0).getReg())) { |
Visiting each user every time we visit something on the worklist (combined with an extra visit from the outer loop) seems potentially expensive. We could restructure this along the lines of an instcombine-style worklist which integrates the outer pass.
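For illustration, a minimal sketch of the instcombine-style worklist structure being suggested (names and structure are assumptions, not code from this patch):

SetVector<MachineInstr *> Worklist;
for (MachineBasicBlock &MBB : MF)
  for (MachineInstr &MI : llvm::reverse(MBB))
    if (RISCVII::hasVLOp(MI.getDesc().TSFlags))
      Worklist.insert(&MI);

bool MadeChange = false;
while (!Worklist.empty()) {
  MachineInstr *MI = Worklist.pop_back_val();
  if (!tryReduceVL(*MI))
    continue;
  MadeChange = true;
  // Only requeue the defs feeding the instruction that just changed,
  // instead of rescanning every user of every instruction on each outer
  // iteration.
  for (const MachineOperand &MO : MI->uses())
    if (MO.isReg() && MO.getReg().isVirtual())
      if (MachineInstr *Def = MRI->getVRegDef(MO.getReg()))
        Worklist.insert(Def);
}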
Left a FIXME.
// Instructions like reductions may use a vector register as a scalar
// register. In this case, we should treat it like a scalar register which
// does not impact the decision on whether to optimize VL.
if (isVectorOpUsedAsScalarOp(UserOp)) { |
I think there might be a bug here. The reduction still needs the VL to be non-zero. If all other users use a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have the VL reduced) with a wrong scalar input value.
I don't know if I follow your thinking here.
On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.
If all other users use a VL register which happens to be zero at runtime, I think this can result in a value reaching the reduction (which does not itself have the VL reduced) with a wrong scalar input value.
All the users have to have the same VL, so the reduction's VL would also be zero and it wouldn't read the scalar input value, I think.
On a side note, if the VL is zero, we cannot reduce it any further. We shouldn't even be calling tryReduceVL.
Zero at runtime, not statically known to be zero.
unsigned VLOpNum = RISCVII::getVLOpNum(Desc);
const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
// Looking for a register VL that isn't X0.
if (!VLOp.isReg() || VLOp.getReg() == RISCV::X0) { |
The choice to only handle registers here is a surprising limitation. We can likely generalize this for immediates as well.
We have this generalization written; I was going to post it as a follow-up.
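Not the actual follow-up, but a hedged sketch of what accepting immediate VLs in addition to registers might look like (assuming CommonVL is generalized from Register to a MachineOperand):

const MachineOperand &VLOp = UserMI.getOperand(VLOpNum);
// Accept either an immediate VL (but not the VLMAX sentinel) or a register
// VL that isn't X0.
bool IsUsableVL =
    (VLOp.isImm() && VLOp.getImm() != RISCV::VLMaxSentinel) ||
    (VLOp.isReg() && VLOp.getReg() != RISCV::X0);
if (!IsUsableVL) {
  CanReduceVL = false;
  break;
}
// Sketch only: CommonVL would need to track a MachineOperand (register or
// immediate) rather than a Register for the rest of the pass to use this.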
const MachineInstr &MI = *MO.getParent();
bool HasPassthru = RISCVII::isFirstDefTiedToFirstUse(MI.getDesc());

if (HasPassthru)
This silent off-by-one on the indexing is confusing, to say the least. Given this is only used in a handful of places, I'd prefer to see this inlined and specialized for each user.
case RISCV::VL8RE32_V:
case RISCV::VL8RE64_V:
case RISCV::VS8R_V:
  return OperandInfo(RISCVII::VLMUL::LMUL_8, 0);
I believe all of these hard-coded LMULs for the whole-register instructions are equal to MIVLMul.
We don't have pseudo instructions for whole register load/store and we don't appear to set the TSFlags VLMul for non-pseudo instructions.
Ok, another idea...
Their LMUL should match the register class, right? We don't have fractional register classes, but we also don't have fractional whole-register moves? So maybe combine this with isRVVWholeLoadStore from RISCVInstrInfo.cpp for the selection, then use the register class?
(I'm just looking for ways to remove redundancy. If we can share code with other users, that increases the odds we'd catch a bug.)
Again, this is minor, and I'd be happy to iterate on this type of topic in tree.
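A rough sketch of the register-class idea, assuming virtual registers and the OperandInfo constructors from this patch (illustrative only, not a concrete proposal):

// Sketch: derive EMUL for whole-register loads/stores from the register
// class instead of hard-coding it per opcode.
static OperandInfo getWholeRegOperandInfo(const MachineOperand &MO,
                                          const MachineRegisterInfo *MRI,
                                          unsigned Log2EEW) {
  // Virtual registers only; physical registers would need the TRI instead.
  const TargetRegisterClass *RC = MRI->getRegClass(MO.getReg());
  RISCVII::VLMUL LMul = RISCVII::VLMUL::LMUL_1;
  if (RISCV::VRM2RegClass.hasSubClassEq(RC))
    LMul = RISCVII::VLMUL::LMUL_2;
  else if (RISCV::VRM4RegClass.hasSubClassEq(RC))
    LMul = RISCVII::VLMUL::LMUL_4;
  else if (RISCV::VRM8RegClass.hasSubClassEq(RC))
    LMul = RISCVII::VLMUL::LMUL_8;
  return OperandInfo(LMul, Log2EEW);
}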
case RISCV::VWSUB_VV:
case RISCV::VWSUB_VX:
case RISCV::VWSLL_VI: {
  unsigned Log2EEW = IsMODef ? MILog2SEW + 1 : MILog2SEW;
You should be able to drive these off the recently added TableGen flag for DestEEW.
Is the suggestion here to use this:
unsigned Log2EEW = IsMODef ? getDestLog2EEW(MILog2SEW) : MILog2SEW;
Is that a simplification?
case RISCV::VFREDMIN_VS:
case RISCV::VFREDOSUM_VS:
case RISCV::VFREDUSUM_VS:
  return OperandInfo(MIVLMul, MILog2SEW);
I think this (and the other reduction) result is wrong. Specifically, the destination has scalar LMUL, but the source has something else. The fact that we're returning the same thing regardless of operand number here can't be correct.
From the RVV spec:
The unsigned vwredsumu.vs instruction zero-extends the SEW-wide vector elements before summing them, then adds the 2*SEW-width scalar element, and stores the result in a 2*SEW-width scalar element.
The vwredsum.vs instruction sign-extends the SEW-wide vector elements before summing them.
The scalar input and output operands are held in element 0 of a single vector register, not a vector register group, so any vector register can be the scalar source or destination of a vector reduction regardless of LMUL setting.
Although the result is a single scalar, it's still written to a vector register. We can see here that the RVV spec refers to the element it writes as having a SEW. I think it makes less sense to talk about LMUL here, and I think that I said to use the same relation that the SEW had for simplicity. As used, I don't think it would cause any harm, since the LMUL being reported is larger than what is actually getting used (just a single element of a vector register).
Truthfully, I can't recall why I wrote it such that 14.1 instructions return Unknown, but 14.2-14.4 don't. Some ideas:
- return Unknown
- keep as is
- modify OperandInfo to describe this case
I probably prefer the first option for now and we can always improve it in the future. Thoughts?
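Roughly, the first option would just fold the reduction opcodes into the conservative path (sketch only; opcode list abbreviated and illustrative):

  // Sketch of option 1: treat reduction operands conservatively for now,
  // since the scalar source/dest lives in element 0 of a vector register.
  case RISCV::VREDSUM_VS:
  case RISCV::VWREDSUM_VS:
  case RISCV::VFREDOSUM_VS:
  case RISCV::VFREDUSUM_VS:
    // ...remaining reduction opcodes elided...
    return OperandInfo(OperandInfo::State::Unknown);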
These tests are good candidates for VL optimization. This is a pre-commit for PR #108640, but they could probably also be improved by the peephole VL optimizations.
As I am working on responding to reviews, one thing has become clear: we need better tests for this. I'm not sure what the best way to test things is. It feels like there are two things here:
I wonder if it would be a good idea to add unit tests for getOperandInfo. Then for the actual VL optimization, I think it would be best to use lit tests to get code coverage of all paths (not including all paths through getOperandInfo, since that would be tested on its own). Does anyone have different ideas, or does this sound okay?
First, I want to explicitly say again that I'd welcome this iteration happening in tree. Second, I don't tend to be a huge fan of C++ unit tests for this kind of API because they end up just repeating the implementation, and are hard to update. If you want to go in this direction, maybe something along the lines of a printer pass so that diffs can be autogenerated and we only have to (manually) change one place? Third, part of the problem with getOperandInfo is that it replicates information in other places. In particular, I think this gets a lot simpler if restructured using the recently added DestEEW mechanism.
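As a hedged sketch of the printer-pass idea (all names here are illustrative, not an existing pass), something like this could dump getOperandInfo for every vector operand so that lit/FileCheck tests can be auto-generated:

// Sketch only: a debug printer that makes getOperandInfo testable via lit
// instead of C++ unit tests.
struct RISCVVLOperandInfoPrinter : public MachineFunctionPass {
  static char ID;
  RISCVVLOperandInfoPrinter() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    const MachineRegisterInfo *MRI = &MF.getRegInfo();
    for (const MachineBasicBlock &MBB : MF) {
      for (const MachineInstr &MI : MBB) {
        // Only RVV pseudos are handled by getOperandInfo.
        if (!RISCVVPseudosTable::getPseudoInfo(MI.getOpcode()))
          continue;
        for (const MachineOperand &MO : MI.operands()) {
          if (!MO.isReg() || !isVectorRegClass(MO.getReg(), MRI))
            continue;
          OperandInfo OI = getOperandInfo(MI, MO, MRI);
          errs() << MI << "  operand " << MO.getOperandNo() << ": "
                 << (OI.isKnown() ? "known" : "unknown") << '\n';
        }
      }
    }
    return false;
  }
};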
(force-pushed from 4873453 to b49d910)
If everyone is on board with this approach, I think that would be good. But I want to highlight here that I think the test coverage is not in a place where I'd feel comfortable enabling this pass by default. I think this is a must-have before turning it on.
I would be happy to iterate in tree as well; we can leave it off by default and add some .mir tests. As for moving getOperandInfo out into TableGen, it was mostly a refactoring idea and I don't think we should block this PR on it. I can take a stab at it this week in parallel and see if it's feasible or not.
case RISCV::VREDXOR_VS: {
  if (IsMODef || IsOp1)
    return OperandInfo(MILog2SEW);
  if (IsOp2)
I have to check whether this is the correct operand, or if it should be op3. Will add tests.
(force-pushed from 6785417 to bfce099)
(force-pushed from fb1deca to 0266e30)
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
  checkUsers(CommonVL, CanReduceVL, MI);
else {
  CommonVL = MI.getOperand(RISCVII::getVLOpNum(MI.getDesc())).getReg();
Won't this just change the VL to its own VL? What's the point? And if that's all it does, you need to detect that below and not set MadeChange or add the inputs to the worklist because it didn't really change.
I removed it.
std::optional<Register> CommonVL;
bool CanReduceVL = true;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
  checkUsers(CommonVL, CanReduceVL, MI);
Make checkUsers return CanReduceVL.
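For example (a sketch of the suggested signature change; illustrative only):

// Suggested signature:
//   bool checkUsers(std::optional<Register> &CommonVL, MachineInstr &MI);
// ...so the call site becomes:
bool CanReduceVL = true;
if (isVectorRegClass(MI.getOperand(0).getReg(), MRI))
  CanReduceVL = checkUsers(CommonVL, MI);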
// vec_reg_a = ..., vl_op, sew_op
// vec_reg_a = ..., vl_op, sew_op
// scalar_reg = vector_instr vec_reg_a, vec_reg_b, vl_op, sew_op
// We'd like to reduce the vl_op on vector_instr, despite it producing
Is this comment still relevant?
(force-pushed from 350afb0 to 8dc125f)
ping
LGTM
As discussed in today's sync call, we have consensus to land this in tree (off by default!) and continue iterating. There is a good amount of code sharing and improved testing which needs to happen before this can be enabled by default. All parties are aware, and enabling that incrementally is the goal of landing now.
I've run this by our arch folks and we are also in favor of the change.
LGTM to land as-is; just spotted something I want to point out before I forget.
case RISCV::VSMUL_VV:
case RISCV::VSMUL_VX:
Are these not single-width? I thought the multiply happened in SEW*2, but the destination is still SEW.
The signed fractional multiply instruction produces a 2*SEW product of the two SEW inputs, then shifts the result right by SEW-1 bits, rounding these bits according to vxrm, then saturates the result to fit into SEW bits.
That's a great catch. I'll add a test specifically for these before merging.
Feel free to land first and add a test later, don't let my reviews block this!
I've removed it from getOperandInfo; we can add it back with tests later.
// All mask operands have EEW=1, EMUL=(EEW/SEW)*LMUL
if (isMaskOperand(MI, MO, MRI))
  return OperandInfo(RISCVVType::getEMULEqualsEEWDivSEWTimesLMUL(0, MI), 0);
Do we handle compare instructions e.g. vmseq.vv where the destination operand will be a mask register, but not constrained to V0?
from isMaskOperand
return Desc.operands()[MO.getOperandNo()].RegClass == RISCV::VMV0RegClassID;
It looks like we only support the v0 case for now.
…nd Saturation from getOperandInfo
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/119/builds/2572 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1360 Here is the relevant piece of the build log for the reference
The purpose of this optimization is to make the VL argument, for instructions that have a VL argument, as small as possible. This is implemented by visiting each instruction in reverse order and checking, if it has a VL argument, whether the VL can be reduced.
By putting this pass before VSETVLI insertion, we see three kinds of changes to generated code:
1. Eliminate VSETVLI instructions
2. Reduce the VL toggle on VSETVLI instructions that also change vtype
3. Reduce the VL set by a VSETVLI instruction
The list of supported instructions is currently whitelisted for safety. In the future, we could add more instructions to isSupportedInstr to support even more VL optimization.
We originally wrote this pass because vector GEP instructions do not take a VL, which leads us to emit code that uses VL=VLMAX to implement GEP in the RISC-V backend. As a result, some of the vector instructions will write to lanes, specifically between the intended VL and VLMAX, that will never be read. As an alternative to this pass, we considered adding a vector predicated GEP instruction, but this would not fit well into the intrinsic type system since GEP has a variable number of arguments, each with arbitrary types. The second approach we considered was to put this pass after VSETVLI insertion, but we found that it was more difficult to recognize optimization opportunities, especially across basic block boundaries -- the data flow analysis was also a bit more expensive and complex.
While this pass solves the GEP problem, we have expanded it to handle more cases of VL optimization, and there is opportunity for the analysis to be improved to enable even more optimization. We have a few follow up patches to post, but figured this would be a good start.
Co-authored-by: Craig Topper [email protected]
Co-authored-by: Kito Cheng [email protected]