AMDGPU/GlobalISel: AMDGPURegBankSelect #112863
base: users/petar-avramovic/new-rbs-skeleton
Conversation
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite.
@llvm/pr-subscribers-llvm-globalisel

Author: Petar Avramovic (petar-avramovic)

Changes

Assign register banks to virtual registers. Does not use generic RegBankSelect. After register bank selection, all register operands of G_ instructions have LLT and register banks exclusively; if they had a register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1, values
Vcc - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider available instructions and, in some cases, G_ instructions with some register bank assignment can't be inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:

S32/S64 lane masks:
- need to end up with an sgpr register class after instruction selection
- in most cases uniformity analysis declares them as uniform (forced by tablegen), resulting in the sgpr S32/S64 reg bank
- when uniformity analysis declares them as divergent (some phis), use the intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that the assigned bank is vgpr
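In short, the def assignment described above reduces to the following shape (a condensed sketch using names from the patch below, not a separate implementation):

  // Uniform defs (and S32/S64 lane masks) go to the sgpr bank; divergent S1
  // defs are lane masks and go to vcc; remaining divergent defs go to vgpr.
  if (MUI.isUniform(DefReg) || ILMA.isS32S64LaneMask(DefReg))
    setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
  else if (MRI.getType(DefReg) == LLT::scalar(1))
    setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
  else
    setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));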
Patch is 118.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112863.diff 5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
index a98d4488bf77fe..6f6ad5cf82cae1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
@@ -7,13 +7,16 @@
//===----------------------------------------------------------------------===//
#include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
#include "GCNSubtarget.h"
#include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
#include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
#include "llvm/CodeGenTypes/LowLevelType.h"
#include "llvm/IR/Constants.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
using namespace llvm;
+using namespace AMDGPU;
using namespace MIPatternMatch;
std::pair<Register, unsigned>
@@ -69,3 +72,38 @@ AMDGPU::getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
return std::pair(Reg, 0);
}
+
+IntrinsicLaneMaskAnalyzer::IntrinsicLaneMaskAnalyzer(MachineFunction &MF)
+ : MRI(MF.getRegInfo()) {
+ initLaneMaskIntrinsics(MF);
+}
+
+bool IntrinsicLaneMaskAnalyzer::isS32S64LaneMask(Register Reg) {
+ return S32S64LaneMask.contains(Reg);
+}
+
+void IntrinsicLaneMaskAnalyzer::initLaneMaskIntrinsics(MachineFunction &MF) {
+ for (auto &MBB : MF) {
+ for (auto &MI : MBB) {
+ if (MI.getOpcode() == AMDGPU::G_INTRINSIC &&
+ MI.getOperand(MI.getNumExplicitDefs()).getIntrinsicID() ==
+ Intrinsic::amdgcn_if_break) {
+ S32S64LaneMask.insert(MI.getOperand(3).getReg());
+ findLCSSAPhi(MI.getOperand(0).getReg());
+ }
+
+ if (MI.getOpcode() == AMDGPU::SI_IF ||
+ MI.getOpcode() == AMDGPU::SI_ELSE) {
+ findLCSSAPhi(MI.getOperand(0).getReg());
+ }
+ }
+ }
+}
+
+void IntrinsicLaneMaskAnalyzer::findLCSSAPhi(Register Reg) {
+ S32S64LaneMask.insert(Reg);
+ for (auto &LCSSAPhi : MRI.use_instructions(Reg)) {
+ if (LCSSAPhi.isPHI())
+ S32S64LaneMask.insert(LCSSAPhi.getOperand(0).getReg());
+ }
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
index 5972552b9a4fe8..4d504d0204d81a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
@@ -9,6 +9,8 @@
#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
#define LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/Register.h"
#include <utility>
@@ -26,6 +28,26 @@ std::pair<Register, unsigned>
getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
GISelKnownBits *KnownBits = nullptr,
bool CheckNUW = false);
+
+// Currently finds S32/S64 lane masks that can be declared as divergent by
+// uniformity analysis (all are phis at the moment).
+// These are defined as i32/i64 in some IR intrinsics (not as i1).
+// Tablegen forces (by declaring lane-mask IR intrinsics uniform) most of the
+// S32/S64 lane masks to be uniform, as this results in them ending up with an
+// sgpr reg class after instruction-select, so we don't search for all of them.
+class IntrinsicLaneMaskAnalyzer {
+ DenseSet<Register> S32S64LaneMask;
+ MachineRegisterInfo &MRI;
+
+public:
+ IntrinsicLaneMaskAnalyzer(MachineFunction &MF);
+ bool isS32S64LaneMask(Register Reg);
+
+private:
+ void initLaneMaskIntrinsics(MachineFunction &MF);
+ // This will not be needed when we turn off LCSSA for global-isel.
+ void findLCSSAPhi(Register Reg);
+};
}
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
index c53a68ff72a8ad..905ad432fe6e0d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
@@ -16,7 +16,12 @@
//===----------------------------------------------------------------------===//
#include "AMDGPU.h"
+#include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
#include "llvm/InitializePasses.h"
#define DEBUG_TYPE "rb-select"
@@ -39,6 +44,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
StringRef getPassName() const override { return "AMDGPU RB select"; }
void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<MachineUniformityAnalysisPass>();
MachineFunctionPass::getAnalysisUsage(AU);
}
@@ -54,6 +60,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
INITIALIZE_PASS_BEGIN(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
false)
+INITIALIZE_PASS_DEPENDENCY(MachineUniformityAnalysisPass)
INITIALIZE_PASS_END(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
false)
@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;
FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }
-bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
+bool shouldRBSelect(MachineInstr &MI) {
+ if (isTargetSpecificOpcode(MI.getOpcode()) && !MI.isPreISelOpcode())
+ return false;
+
+ if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
+ return false;
+
+ if (MI.isInlineAsm())
+ return false;
+
+ return true;
+}
+
+void setRB(MachineInstr &MI, MachineOperand &DefOP, MachineIRBuilder B,
+ MachineRegisterInfo &MRI, const RegisterBank &RB) {
+ Register Reg = DefOP.getReg();
+ // A register that already has a register class got it during pre-inst
+ // selection of another instruction. Maybe a cross-bank copy was required, so
+ // we insert a copy that can be removed later. This simplifies the
+ // post-rb-legalize artifact combiner and avoids the need to special-case
+ // some patterns.
+ if (MRI.getRegClassOrNull(Reg)) {
+ LLT Ty = MRI.getType(Reg);
+ Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+ DefOP.setReg(NewReg);
+
+ auto &MBB = *MI.getParent();
+ B.setInsertPt(MBB, MI.isPHI() ? MBB.getFirstNonPHI()
+ : std::next(MI.getIterator()));
+ B.buildCopy(Reg, NewReg);
+
+ // The problem was discovered for a uniform S1 that was used as both a
+ // lane mask (vcc) and a regular sgpr S1.
+ // - The lane-mask (vcc) use was by si_if; this use is divergent and requires
+ //   a non-trivial sgpr-S1-to-vcc copy. But pre-inst-selection of si_if sets
+ //   sreg_64_xexec (S1) on the def of the uniform S1, making it a lane mask.
+ // - The regular sgpr S1 (uniform) instruction is now broken since it uses
+ //   sreg_64_xexec (S1), which is divergent.
+
+ // "Clear" reg classes from uses on generic instructions and but register
+ // banks instead.
+ for (auto &UseMI : MRI.use_instructions(Reg)) {
+ if (shouldRBSelect(UseMI)) {
+ for (MachineOperand &Op : UseMI.operands()) {
+ if (Op.isReg() && Op.isUse() && Op.getReg() == Reg)
+ Op.setReg(NewReg);
+ }
+ }
+ }
+
+ } else {
+ MRI.setRegBank(Reg, RB);
+ }
+}
+
+void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
+ MachineRegisterInfo &MRI, const RegisterBank &RB) {
+ Register Reg = UseOP.getReg();
+
+ LLT Ty = MRI.getType(Reg);
+ Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+ UseOP.setReg(NewReg);
+
+ if (MI.isPHI()) {
+ auto DefMI = MRI.getVRegDef(Reg)->getIterator();
+ MachineBasicBlock *DefMBB = DefMI->getParent();
+ B.setInsertPt(*DefMBB, DefMBB->SkipPHIsAndLabels(std::next(DefMI)));
+ } else {
+ B.setInstr(MI);
+ }
+
+ B.buildCopy(NewReg, Reg);
+}
+
+// Temporal divergence copy: COPY to vgpr with implicit use of $exec inside of
+// the cycle
+// Note: uniformity analysis does not consider that registers with vgpr def are
+// divergent (you can have uniform value in vgpr).
+// - TODO: implicit use of $exec could be implemented as indicator that
+// instruction is divergent
+bool isTemporalDivergenceCopy(Register Reg, MachineRegisterInfo &MRI) {
+ MachineInstr *MI = MRI.getVRegDef(Reg);
+ if (MI->getOpcode() == AMDGPU::COPY) {
+ for (auto Op : MI->implicit_operands()) {
+ if (!Op.isReg())
+ continue;
+ Register Reg = Op.getReg();
+ if (Reg == AMDGPU::EXEC) {
+ return true;
+ }
+ }
+ }
+
+ return false;
+}
+
+Register getVReg(MachineOperand &Op) {
+ if (!Op.isReg())
+ return 0;
+
+ Register Reg = Op.getReg();
+ if (!Reg.isVirtual())
+ return 0;
+
+ return Reg;
+}
+
+bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) {
+ MachineUniformityInfo &MUI =
+ getAnalysis<MachineUniformityAnalysisPass>().getUniformityInfo();
+ AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(MF);
+ MachineRegisterInfo &MRI = MF.getRegInfo();
+ const RegisterBankInfo &RBI = *MF.getSubtarget().getRegBankInfo();
+
+ MachineIRBuilder B(MF);
+
+ // Assign register banks to ALL def registers on G_ instructions.
+ // Same for copies if they have no register bank or class on def.
+ for (MachineBasicBlock &MBB : MF) {
+ for (MachineInstr &MI : MBB) {
+ if (!shouldRBSelect(MI))
+ continue;
+
+ for (MachineOperand &DefOP : MI.defs()) {
+ Register DefReg = getVReg(DefOP);
+ if (!DefReg)
+ continue;
+
+ // Copies can have register class on def registers.
+ if (MI.isCopy() && MRI.getRegClassOrNull(DefReg)) {
+ continue;
+ }
+
+ if (MUI.isUniform(DefReg) || ILMA.isS32S64LaneMask(DefReg)) {
+ setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+ } else {
+ if (MRI.getType(DefReg) == LLT::scalar(1))
+ setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+ else
+ setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+ }
+ }
+ }
+ }
+
+ // At this point all virtual registers have register class or bank
+ // - Defs of G_ instructions have register banks.
+ // - Defs and uses of inst-selected instructions have register class.
+ // - Defs and uses of copies can have either register class or bank
+ // and most notably
+ // - Uses of G_ instructions can have either register class or bank
+
+ // Reassign uses of G_ instructions to only have register banks.
+ for (MachineBasicBlock &MBB : MF) {
+ for (MachineInstr &MI : MBB) {
+ if (!shouldRBSelect(MI))
+ continue;
+
+ // Copies can have register class on use registers.
+ if (MI.isCopy())
+ continue;
+
+ for (MachineOperand &UseOP : MI.uses()) {
+ Register UseReg = getVReg(UseOP);
+ if (!UseReg)
+ continue;
+
+ if (!MRI.getRegClassOrNull(UseReg))
+ continue;
+
+ if (!isTemporalDivergenceCopy(UseReg, MRI) &&
+ (MUI.isUniform(UseReg) || ILMA.isS32S64LaneMask(UseReg))) {
+ setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+ } else {
+ if (MRI.getType(UseReg) == LLT::scalar(1))
+ setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+ else
+ setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+ }
+ }
+ }
+ }
+
+ // Defs and uses of G_ instructions have register banks exclusively.
+
+ return true;
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
index 880057813adf54..208bf686c98ba8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
@@ -11,22 +11,22 @@ body: |
; CHECK-LABEL: name: uniform_in_vgpr
; CHECK: liveins: $sgpr0, $sgpr1, $vgpr0, $vgpr1
; CHECK-NEXT: {{ $}}
- ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
- ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
- ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr0
- ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr1
- ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
- ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[COPY]](s32)
- ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY1]]
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+ ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[COPY]](s32)
+ ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY1]]
; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
; CHECK-NEXT: S_ENDPGM 0
- %0:_(s32) = COPY $sgpr0
- %1:_(s32) = COPY $sgpr1
- %3:_(s32) = COPY $vgpr0
- %4:_(s32) = COPY $vgpr1
- %2:_(p1) = G_MERGE_VALUES %3(s32), %4(s32)
- %6:_(s32) = G_FPTOUI %0(s32)
- %7:_(s32) = G_ADD %6, %1
+ %0:sgpr(s32) = COPY $sgpr0
+ %1:sgpr(s32) = COPY $sgpr1
+ %3:vgpr(s32) = COPY $vgpr0
+ %4:vgpr(s32) = COPY $vgpr1
+ %2:vgpr(p1) = G_MERGE_VALUES %3(s32), %4(s32)
+ %6:sgpr(s32) = G_FPTOUI %0(s32)
+ %7:sgpr(s32) = G_ADD %6, %1
G_STORE %7(s32), %2(p1) :: (store (s32), addrspace 1)
S_ENDPGM 0
...
@@ -41,26 +41,26 @@ body: |
; CHECK-LABEL: name: back_to_back_uniform_in_vgpr
; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $vgpr0, $vgpr1
; CHECK-NEXT: {{ $}}
- ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
- ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
- ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
- ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr0
- ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr1
- ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
- ; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[COPY1]]
- ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[FADD]](s32)
- ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY2]]
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+ ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
+ ; CHECK-NEXT: [[FADD:%[0-9]+]]:sgpr(s32) = G_FADD [[COPY]], [[COPY1]]
+ ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[FADD]](s32)
+ ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY2]]
; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
; CHECK-NEXT: S_ENDPGM 0
- %0:_(s32) = COPY $sgpr0
- %1:_(s32) = COPY $sgpr1
- %2:_(s32) = COPY $sgpr2
- %4:_(s32) = COPY $vgpr0
- %5:_(s32) = COPY $vgpr1
- %3:_(p1) = G_MERGE_VALUES %4(s32), %5(s32)
- %7:_(s32) = G_FADD %0, %1
- %8:_(s32) = G_FPTOUI %7(s32)
- %9:_(s32) = G_ADD %8, %2
+ %0:sgpr(s32) = COPY $sgpr0
+ %1:sgpr(s32) = COPY $sgpr1
+ %2:sgpr(s32) = COPY $sgpr2
+ %4:vgpr(s32) = COPY $vgpr0
+ %5:vgpr(s32) = COPY $vgpr1
+ %3:vgpr(p1) = G_MERGE_VALUES %4(s32), %5(s32)
+ %7:sgpr(s32) = G_FADD %0, %1
+ %8:sgpr(s32) = G_FPTOUI %7(s32)
+ %9:sgpr(s32) = G_ADD %8, %2
G_STORE %9(s32), %3(p1) :: (store (s32), addrspace 1)
S_ENDPGM 0
...
@@ -75,36 +75,36 @@ body: |
; CHECK-LABEL: name: buffer_load_uniform
; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4, $vgpr0, $vgpr1
; CHECK-NEXT: {{ $}}
- ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
- ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
- ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
- ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
- ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
- ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $sgpr4
- ; CHECK-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr0
- ; CHECK-NEXT: [[COPY6:%[0-9]+]]:_(s32) = COPY $vgpr1
- ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
- ; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
- ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
- ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
- ; CHECK-NEXT: [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32), [[UV3:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
- ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[UV1]], [[C1]]
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:sgpr(s32) = COPY $sgpr3
+ ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:sgpr(s32) = COPY $sgpr4
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+ ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
+ ; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 0
+ ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+ ; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 1
+ ; CHECK-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32), [[UV2:%[0-9]+]]:sgpr(s32), [[UV3:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
+ ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[UV1]], [[C1]]
; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
; CHECK-NEXT: S_ENDPGM 0
- %3:_(s32) = COPY $sgpr0
- %4:_(s32) = COPY $sgpr1
- %5:_(s32) = COPY $sgpr2
- %6:_(s32) = COPY $sgpr3
- %0:_(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
- %1:_(s32) = COPY $sgpr4
- %7:_(s32) = COPY $vgpr0
- %8:_(s32) = COPY $vgpr1
- %2:_(p1) = G_MERGE_VALUES %7(s32), %8(s32)
- %11:_(s32) = G_CONSTANT i32 0
- %10:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
- %13:_(s32) = G_CONSTANT i32 1
- %15:_(s32), %16:_(s32), %17:_(s32), %18:_(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
- %14:_(s32) = G_ADD %16, %13
+ %3:sgpr(s32) = COPY $sgpr0
+ %4:sgpr(s32) = COPY $sgpr1
+ %5:sgpr(s32) = COPY $sgpr2
+ %6:sgpr(s32) = COPY $sgpr3
+ %0:sgpr(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
+ %1:sgpr(s32) = COPY $sgpr4
+ %7:vgpr(s32) = COPY $vgpr0
+ %8:vgpr(s32) = COPY $vgpr1
+ %2:vgpr(p1) = G_MERGE_VALUES %7(s32), %8(s32)
+ %11:sgpr(s32) = G_CONSTANT i32 0
+ %10:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+ %13:sgpr(s32) = G_CONSTANT i32 1
+ %15:sgpr(s32), %16:sgpr(s32), %17:sgpr(s32), %18:sgpr(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
+ %14:sgpr(s32) = G_ADD %16, %13
G_STORE %14(s32), %2(p1) :: (store (s32), addrspace 1)
S_ENDPGM 0
...
@@ -119,36 +119,36 @@ body: |
; CHECK-LABEL: name: buffer_load_divergent
; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0, $vgpr1, $vgpr2
; CHECK-NEXT: {{ $}}
- ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
- ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
- ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
- ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
- ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]...
[truncated]
@llvm/pr-subscribers-backend-amdgpu
Don't forget about AGPRs
if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
  return false;
These should have failed isPreISelOpcode
I copied that from existing regbankselect. MI.isPreISelOpcode() || MI.isCopy() also works
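A minimal sketch of that simplification, assuming PHI, IMPLICIT_DEF, and inline asm all fail isPreISelOpcode() as noted in the review:

  static bool shouldRBSelect(MachineInstr &MI) {
    // Keep generic (G_) instructions plus COPYs; everything already
    // inst-selected fails isPreISelOpcode().
    return MI.isPreISelOpcode() || MI.isCopy();
  }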
if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
  return false;

if (MI.isInlineAsm())
These should have failed isPreISelOpcode
setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
} else {
  if (MRI.getType(DefReg) == LLT::scalar(1))
    setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
Can you directly use the pointer to the const regbank struct?
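What that suggestion could look like (a sketch; the patch currently re-queries RBI by bank ID at every call site):

  // Look each bank up once and pass const RegisterBank pointers around.
  const RegisterBank *SgprRB = &RBI.getRegBank(AMDGPU::SGPRRegBankID);
  const RegisterBank *VccRB = &RBI.getRegBank(AMDGPU::VCCRegBankID);
  const RegisterBank *VgprRB = &RBI.getRegBank(AMDGPU::VGPRRegBankID);
  setRB(MI, DefOP, B, MRI, *VccRB);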
if (MI.isCopy())
  continue;

for (MachineOperand &UseOP : MI.uses()) {
What about the defs?
Previous for loop assigned RegBanks to all defs.
This loop prepares uses for RBLegalize to have register banks only.
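The invariant those two loops establish could be spelled out as a hypothetical end-of-pass check (not part of the patch):

  // After both loops, every virtual register operand of a G_ instruction
  // must carry a register bank, never a register class.
  for (MachineBasicBlock &MBB : MF)
    for (MachineInstr &MI : MBB)
      if (MI.isPreISelOpcode())
        for (MachineOperand &Op : MI.operands())
          if (Op.isReg() && Op.getReg().isVirtual())
            assert(MRI.getRegBankOrNull(Op.getReg()) &&
                   "G_ operand without a register bank");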
}
}

void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
static. Also don't pass MachineIRBuilder by value
member function
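Either variant amounts to a signature change along these lines (a sketch of the review suggestion):

  // As a static helper, the builder travels by reference ...
  static void setRBUse(MachineInstr &MI, MachineOperand &UseOP,
                       MachineIRBuilder &B, MachineRegisterInfo &MRI,
                       const RegisterBank &RB);
  // ... or, as a member function of the pass, B and MRI could live in the
  // pass itself and drop out of the parameter list entirely.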
}

} else {
  MRI.setRegBank(Reg, RB);
Do you need to call the observer?
Since this is our pass I felt there was no need to complicate it with observers
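For reference, the observer pattern under discussion would bracket each mutation like this (standard GlobalISel idiom; getObserver() is a made-up placeholder, and the author argues none of this is needed in a stand-alone pass):

  GISelChangeObserver &Observer = getObserver(); // hypothetical accessor
  Observer.changingInstr(MI);
  DefOP.setReg(NewReg);
  Observer.changedInstr(MI);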
@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;

FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }

bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
bool shouldRBSelect(MachineInstr &MI) {
Why free-standing functions, when it is your register bank select pass?
Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection, all register operands of
G_ instructions have LLT and register banks exclusively. If they had a
register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1, values
Vcc - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:

S32/S64 lane masks:
- need to end up with an sgpr register class after instruction selection
- in most cases uniformity analysis declares them as uniform (forced by
  tablegen), resulting in the sgpr S32/S64 reg bank
- when uniformity analysis declares them as divergent (some phis), use
  the intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that the assigned bank is vgpr

Note: uniformity analysis does not consider that registers with a vgpr def
are divergent (you can have a uniform value in a vgpr).
- TODO: implicit use of $exec could be implemented as an indicator
  that an instruction is divergent