AMDGPU/GlobalISel: AMDGPURegBankSelect #112863

Open
wants to merge 1 commit into base: users/petar-avramovic/new-rbs-skeleton

Conversation

petar-avramovic
Collaborator

@petar-avramovic petar-avramovic commented Oct 18, 2024

Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection all register operands of
G_ instructions have LLT and register banks exclusively. If they had a
register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1, values
Vcc - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:

S32/S64 lane masks:

  • need to end up with the sgpr register class after instruction selection
  • in most cases uniformity analysis declares them as uniform
    (forced by tablegen), resulting in the sgpr S32/S64 reg bank
  • when uniformity analysis declares them as divergent (some phis),
    use the intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:

  • COPY to vgpr with implicit use of $exec inside of the cycle
  • this copy is declared as uniform by uniformity analysis
  • make sure that the assigned bank is vgpr
    Note: uniformity analysis does not consider registers with a vgpr def
    to be divergent (you can have a uniform value in a vgpr).
  • TODO: implicit use of $exec could be implemented as an indicator
    that the instruction is divergent
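
To make the rule above concrete, here is a condensed sketch of the per-register decision (simplified from the pass code in the diff below; pickBank is a hypothetical helper, the real pass applies this logic inline to defs and uses):

const RegisterBank *pickBank(Register Reg, MachineUniformityInfo &MUI,
                             AMDGPU::IntrinsicLaneMaskAnalyzer &ILMA,
                             const MachineRegisterInfo &MRI,
                             const RegisterBankInfo &RBI) {
  // Uniform values, and the S32/S64 lane masks found by the analyzer, go to sgpr.
  if (MUI.isUniform(Reg) || ILMA.isS32S64LaneMask(Reg))
    return &RBI.getRegBank(AMDGPU::SGPRRegBankID);
  // Divergent S1 values are lane masks and go to vcc.
  if (MRI.getType(Reg) == LLT::scalar(1))
    return &RBI.getRegBank(AMDGPU::VCCRegBankID);
  // All other divergent values go to vgpr.
  return &RBI.getRegBank(AMDGPU::VGPRRegBankID);
}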

Collaborator Author

petar-avramovic commented Oct 18, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.


@llvmbot
Collaborator

llvmbot commented Oct 18, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: Petar Avramovic (petar-avramovic)

Changes

Assign register banks to virtual registers.
Defs and uses of G_ instructions have register banks exclusively;
if they had a register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
SGPR - uniform values and some lane masks
VGPR - divergent, non-S1, values
VCC - divergent S1 values (lane masks)

RBSelect does not consider available instructions and, in some cases, G_
instructions with some register bank assignment can't be inst-selected.
This is solved in RBLegalize.

Exceptions when uniformity analysis does not work:

S32/S64 lane masks:

  • need to end up with the SGPR register class after instruction selection
  • in most cases uniformity analysis declares them as uniform
    (forced by tablegen), resulting in the sgpr S32/S64 reg bank
  • when uniformity analysis declares them as divergent (some phis),
    use the intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:

  • COPY to vgpr with implicit use of $exec inside of the cycle
  • this copy is declared as uniform by uniformity analysis
  • make sure that the assigned bank is vgpr
    Note: uniformity analysis does not consider registers with a vgpr def
    to be divergent (you can have a uniform value in a vgpr).
  • TODO: implicit use of $exec could be implemented as an indicator
    that the instruction is divergent

Patch is 118.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112863.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp (+38)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp (+193-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir (+480-458)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-select.mir (+238-227)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
index a98d4488bf77fe..6f6ad5cf82cae1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
@@ -7,13 +7,16 @@
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
 #include "GCNSubtarget.h"
 #include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGenTypes/LowLevelType.h"
 #include "llvm/IR/Constants.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
 
 using namespace llvm;
+using namespace AMDGPU;
 using namespace MIPatternMatch;
 
 std::pair<Register, unsigned>
@@ -69,3 +72,38 @@ AMDGPU::getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
 
   return std::pair(Reg, 0);
 }
+
+IntrinsicLaneMaskAnalyzer::IntrinsicLaneMaskAnalyzer(MachineFunction &MF)
+    : MRI(MF.getRegInfo()) {
+  initLaneMaskIntrinsics(MF);
+}
+
+bool IntrinsicLaneMaskAnalyzer::isS32S64LaneMask(Register Reg) {
+  return S32S64LaneMask.contains(Reg);
+}
+
+void IntrinsicLaneMaskAnalyzer::initLaneMaskIntrinsics(MachineFunction &MF) {
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::G_INTRINSIC &&
+          MI.getOperand(MI.getNumExplicitDefs()).getIntrinsicID() ==
+              Intrinsic::amdgcn_if_break) {
+        S32S64LaneMask.insert(MI.getOperand(3).getReg());
+        findLCSSAPhi(MI.getOperand(0).getReg());
+      }
+
+      if (MI.getOpcode() == AMDGPU::SI_IF ||
+          MI.getOpcode() == AMDGPU::SI_ELSE) {
+        findLCSSAPhi(MI.getOperand(0).getReg());
+      }
+    }
+  }
+}
+
+void IntrinsicLaneMaskAnalyzer::findLCSSAPhi(Register Reg) {
+  S32S64LaneMask.insert(Reg);
+  for (auto &LCSSAPhi : MRI.use_instructions(Reg)) {
+    if (LCSSAPhi.isPHI())
+      S32S64LaneMask.insert(LCSSAPhi.getOperand(0).getReg());
+  }
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
index 5972552b9a4fe8..4d504d0204d81a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
@@ -9,6 +9,8 @@
 #ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
 #define LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
 
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/Register.h"
 #include <utility>
 
@@ -26,6 +28,26 @@ std::pair<Register, unsigned>
 getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
                           GISelKnownBits *KnownBits = nullptr,
                           bool CheckNUW = false);
+
+// Currently finds S32/S64 lane masks that can be declared as divergent by
+// uniformity analysis (all are phis at the moment).
+// These are defined as i32/i64 in some IR intrinsics (not as i1).
+// Tablegen forces (by marking lane mask IR intrinsics as uniform) most of the
+// S32/S64 lane masks to be uniform, as this results in them ending up with the
+// sgpr reg class after instruction-select, so we don't search for all of them.
+class IntrinsicLaneMaskAnalyzer {
+  DenseSet<Register> S32S64LaneMask;
+  MachineRegisterInfo &MRI;
+
+public:
+  IntrinsicLaneMaskAnalyzer(MachineFunction &MF);
+  bool isS32S64LaneMask(Register Reg);
+
+private:
+  void initLaneMaskIntrinsics(MachineFunction &MF);
+  // This will not be needed when we turn off LCSSA for global-isel.
+  void findLCSSAPhi(Register Reg);
+};
 }
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
index c53a68ff72a8ad..905ad432fe6e0d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
@@ -16,7 +16,12 @@
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPU.h"
+#include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
 #include "llvm/InitializePasses.h"
 
 #define DEBUG_TYPE "rb-select"
@@ -39,6 +44,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
   StringRef getPassName() const override { return "AMDGPU RB select"; }
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineUniformityAnalysisPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -54,6 +60,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
 
 INITIALIZE_PASS_BEGIN(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
                       false)
+INITIALIZE_PASS_DEPENDENCY(MachineUniformityAnalysisPass)
 INITIALIZE_PASS_END(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
                     false)
 
@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;
 
 FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }
 
-bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
+bool shouldRBSelect(MachineInstr &MI) {
+  if (isTargetSpecificOpcode(MI.getOpcode()) && !MI.isPreISelOpcode())
+    return false;
+
+  if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
+    return false;
+
+  if (MI.isInlineAsm())
+    return false;
+
+  return true;
+}
+
+void setRB(MachineInstr &MI, MachineOperand &DefOP, MachineIRBuilder B,
+           MachineRegisterInfo &MRI, const RegisterBank &RB) {
+  Register Reg = DefOP.getReg();
+  // Register that already has Register class got it during pre-inst selection
+  // of another instruction. Maybe cross bank copy was required so we insert a
+  // copy trat can be removed later. This simplifies post-rb-legalize artifact
+  // combiner and avoids need to special case some patterns.
+  if (MRI.getRegClassOrNull(Reg)) {
+    LLT Ty = MRI.getType(Reg);
+    Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+    DefOP.setReg(NewReg);
+
+    auto &MBB = *MI.getParent();
+    B.setInsertPt(MBB, MI.isPHI() ? MBB.getFirstNonPHI()
+                                  : std::next(MI.getIterator()));
+    B.buildCopy(Reg, NewReg);
+
+    // The problem was discovered for uniform S1 that was used as both a
+    // lane mask (vcc) and a regular sgpr S1.
+    // - lane-mask(vcc) use was by si_if, this use is divergent and requires
+    //   non-trivial sgpr-S1-to-vcc copy. But pre-inst-selection of si_if sets
+    //   sreg_64_xexec(S1) on def of uniform S1 making it lane-mask.
+    // - the regular sgpr S1 (uniform) instruction is now broken since
+    //   it uses sreg_64_xexec(S1) which is divergent.
+
+    // "Clear" reg classes from uses on generic instructions and put register
+    // banks instead.
+    for (auto &UseMI : MRI.use_instructions(Reg)) {
+      if (shouldRBSelect(UseMI)) {
+        for (MachineOperand &Op : UseMI.operands()) {
+          if (Op.isReg() && Op.isUse() && Op.getReg() == Reg)
+            Op.setReg(NewReg);
+        }
+      }
+    }
+
+  } else {
+    MRI.setRegBank(Reg, RB);
+  }
+}
+
+void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
+              MachineRegisterInfo &MRI, const RegisterBank &RB) {
+  Register Reg = UseOP.getReg();
+
+  LLT Ty = MRI.getType(Reg);
+  Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+  UseOP.setReg(NewReg);
+
+  if (MI.isPHI()) {
+    auto DefMI = MRI.getVRegDef(Reg)->getIterator();
+    MachineBasicBlock *DefMBB = DefMI->getParent();
+    B.setInsertPt(*DefMBB, DefMBB->SkipPHIsAndLabels(std::next(DefMI)));
+  } else {
+    B.setInstr(MI);
+  }
+
+  B.buildCopy(NewReg, Reg);
+}
+
+// Temporal divergence copy: COPY to vgpr with implicit use of $exec inside of
+// the cycle
+// Note: uniformity analysis does not consider that registers with vgpr def are
+// divergent (you can have uniform value in vgpr).
+// - TODO: implicit use of $exec could be implemented as indicator that
+//   instruction is divergent
+bool isTemporalDivergenceCopy(Register Reg, MachineRegisterInfo &MRI) {
+  MachineInstr *MI = MRI.getVRegDef(Reg);
+  if (MI->getOpcode() == AMDGPU::COPY) {
+    for (auto Op : MI->implicit_operands()) {
+      if (!Op.isReg())
+        continue;
+      Register Reg = Op.getReg();
+      if (Reg == AMDGPU::EXEC) {
+        return true;
+      }
+    }
+  }
+
+  return false;
+}
+
+Register getVReg(MachineOperand &Op) {
+  if (!Op.isReg())
+    return 0;
+
+  Register Reg = Op.getReg();
+  if (!Reg.isVirtual())
+    return 0;
+
+  return Reg;
+}
+
+bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) {
+  MachineUniformityInfo &MUI =
+      getAnalysis<MachineUniformityAnalysisPass>().getUniformityInfo();
+  AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(MF);
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const RegisterBankInfo &RBI = *MF.getSubtarget().getRegBankInfo();
+
+  MachineIRBuilder B(MF);
+
+  // Assign register banks to ALL def registers on G_ instructions.
+  // Same for copies if they have no register bank or class on def.
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      if (!shouldRBSelect(MI))
+        continue;
+
+      for (MachineOperand &DefOP : MI.defs()) {
+        Register DefReg = getVReg(DefOP);
+        if (!DefReg)
+          continue;
+
+        // Copies can have register class on def registers.
+        if (MI.isCopy() && MRI.getRegClassOrNull(DefReg)) {
+          continue;
+        }
+
+        if (MUI.isUniform(DefReg) || ILMA.isS32S64LaneMask(DefReg)) {
+          setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+        } else {
+          if (MRI.getType(DefReg) == LLT::scalar(1))
+            setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+          else
+            setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+        }
+      }
+    }
+  }
+
+  // At this point all virtual registers have register class or bank
+  // - Defs of G_ instructions have register banks.
+  // - Defs and uses of inst-selected instructions have register class.
+  // - Defs and uses of copies can have either register class or bank
+  // and most notably
+  // - Uses of G_ instructions can have either register class or bank
+
+  // Reassign uses of G_ instructions to only have register banks.
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      if (!shouldRBSelect(MI))
+        continue;
+
+      // Copies can have register class on use registers.
+      if (MI.isCopy())
+        continue;
+
+      for (MachineOperand &UseOP : MI.uses()) {
+        Register UseReg = getVReg(UseOP);
+        if (!UseReg)
+          continue;
+
+        if (!MRI.getRegClassOrNull(UseReg))
+          continue;
+
+        if (!isTemporalDivergenceCopy(UseReg, MRI) &&
+            (MUI.isUniform(UseReg) || ILMA.isS32S64LaneMask(UseReg))) {
+          setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+        } else {
+          if (MRI.getType(UseReg) == LLT::scalar(1))
+            setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+          else
+            setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+        }
+      }
+    }
+  }
+
+  // Defs and uses of G_ instructions have register banks exclusively.
+
+  return true;
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
index 880057813adf54..208bf686c98ba8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
@@ -11,22 +11,22 @@ body: |
     ; CHECK-LABEL: name: uniform_in_vgpr
     ; CHECK: liveins: $sgpr0, $sgpr1, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
-    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[COPY]](s32)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY1]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[COPY]](s32)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY1]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %0:_(s32) = COPY $sgpr0
-    %1:_(s32) = COPY $sgpr1
-    %3:_(s32) = COPY $vgpr0
-    %4:_(s32) = COPY $vgpr1
-    %2:_(p1) = G_MERGE_VALUES %3(s32), %4(s32)
-    %6:_(s32) = G_FPTOUI %0(s32)
-    %7:_(s32) = G_ADD %6, %1
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %3:vgpr(s32) = COPY $vgpr0
+    %4:vgpr(s32) = COPY $vgpr1
+    %2:vgpr(p1) = G_MERGE_VALUES %3(s32), %4(s32)
+    %6:sgpr(s32) = G_FPTOUI %0(s32)
+    %7:sgpr(s32) = G_ADD %6, %1
     G_STORE %7(s32), %2(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -41,26 +41,26 @@ body: |
     ; CHECK-LABEL: name: back_to_back_uniform_in_vgpr
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
-    ; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[COPY1]]
-    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[FADD]](s32)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY2]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
+    ; CHECK-NEXT: [[FADD:%[0-9]+]]:sgpr(s32) = G_FADD [[COPY]], [[COPY1]]
+    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[FADD]](s32)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY2]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %0:_(s32) = COPY $sgpr0
-    %1:_(s32) = COPY $sgpr1
-    %2:_(s32) = COPY $sgpr2
-    %4:_(s32) = COPY $vgpr0
-    %5:_(s32) = COPY $vgpr1
-    %3:_(p1) = G_MERGE_VALUES %4(s32), %5(s32)
-    %7:_(s32) = G_FADD %0, %1
-    %8:_(s32) = G_FPTOUI %7(s32)
-    %9:_(s32) = G_ADD %8, %2
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %2:sgpr(s32) = COPY $sgpr2
+    %4:vgpr(s32) = COPY $vgpr0
+    %5:vgpr(s32) = COPY $vgpr1
+    %3:vgpr(p1) = G_MERGE_VALUES %4(s32), %5(s32)
+    %7:sgpr(s32) = G_FADD %0, %1
+    %8:sgpr(s32) = G_FPTOUI %7(s32)
+    %9:sgpr(s32) = G_ADD %8, %2
     G_STORE %9(s32), %3(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -75,36 +75,36 @@ body: |
     ; CHECK-LABEL: name: buffer_load_uniform
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
-    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
-    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $sgpr4
-    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
-    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
-    ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
-    ; CHECK-NEXT: [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32), [[UV3:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[UV1]], [[C1]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:sgpr(s32) = COPY $sgpr3
+    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:sgpr(s32) = COPY $sgpr4
+    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
+    ; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 0
+    ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+    ; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 1
+    ; CHECK-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32), [[UV2:%[0-9]+]]:sgpr(s32), [[UV3:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[UV1]], [[C1]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %3:_(s32) = COPY $sgpr0
-    %4:_(s32) = COPY $sgpr1
-    %5:_(s32) = COPY $sgpr2
-    %6:_(s32) = COPY $sgpr3
-    %0:_(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
-    %1:_(s32) = COPY $sgpr4
-    %7:_(s32) = COPY $vgpr0
-    %8:_(s32) = COPY $vgpr1
-    %2:_(p1) = G_MERGE_VALUES %7(s32), %8(s32)
-    %11:_(s32) = G_CONSTANT i32 0
-    %10:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
-    %13:_(s32) = G_CONSTANT i32 1
-    %15:_(s32), %16:_(s32), %17:_(s32), %18:_(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
-    %14:_(s32) = G_ADD %16, %13
+    %3:sgpr(s32) = COPY $sgpr0
+    %4:sgpr(s32) = COPY $sgpr1
+    %5:sgpr(s32) = COPY $sgpr2
+    %6:sgpr(s32) = COPY $sgpr3
+    %0:sgpr(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
+    %1:sgpr(s32) = COPY $sgpr4
+    %7:vgpr(s32) = COPY $vgpr0
+    %8:vgpr(s32) = COPY $vgpr1
+    %2:vgpr(p1) = G_MERGE_VALUES %7(s32), %8(s32)
+    %11:sgpr(s32) = G_CONSTANT i32 0
+    %10:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+    %13:sgpr(s32) = G_CONSTANT i32 1
+    %15:sgpr(s32), %16:sgpr(s32), %17:sgpr(s32), %18:sgpr(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
+    %14:sgpr(s32) = G_ADD %16, %13
     G_STORE %14(s32), %2(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -119,36 +119,36 @@ body: |
     ; CHECK-LABEL: name: buffer_load_divergent
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0, $vgpr1, $vgpr2
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
-    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]...
[truncated]

@llvmbot
Collaborator

llvmbot commented Oct 18, 2024

@llvm/pr-subscribers-backend-amdgpu


Contributor

@arsenm arsenm left a comment


Don't forget about AGPRs

Comment on lines 77 to 78
if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
return false;
Contributor


These should have failed isPreISelOpcode

Collaborator Author


I copied that from existing regbankselect. MI.isPreISelOpcode() || MI.isCopy() also works
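
For reference, the simplified check mentioned here would look roughly like this (a sketch, not code from the patch):

static bool shouldRBSelect(MachineInstr &MI) {
  // PHI, IMPLICIT_DEF and inline asm are not pre-isel generic opcodes, so they
  // fail isPreISelOpcode(); only generic G_ instructions and copies pass.
  return MI.isPreISelOpcode() || MI.isCopy();
}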

if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
return false;

if (MI.isInlineAsm())
Contributor


These should have failed isPreISelOpcode

setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
} else {
if (MRI.getType(DefReg) == LLT::scalar(1))
setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
Contributor


Can you directly use the pointer to the const regbank struct?
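
One way to act on this suggestion (a sketch; it assumes the lookups are hoisted out of the loops, and the variable names are illustrative):

const RegisterBank *SgprRB = &RBI.getRegBank(AMDGPU::SGPRRegBankID);
const RegisterBank *VccRB = &RBI.getRegBank(AMDGPU::VCCRegBankID);
const RegisterBank *VgprRB = &RBI.getRegBank(AMDGPU::VGPRRegBankID);
// ... later, e.g. for a divergent S1 def, pass the cached bank directly:
setRB(MI, DefOP, B, MRI, *VccRB);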

if (MI.isCopy())
continue;

for (MachineOperand &UseOP : MI.uses()) {
Contributor


What about the defs?

Collaborator Author


Previous for loop assigned RegBanks to all defs.
This loop prepares uses for RBLegalize to have register banks only.
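
In outline, the two passes this reply refers to (a sketch; assignBanksToDefs and rewriteClassUsesToBanks are hypothetical names for the loop bodies in the patch):

for (MachineBasicBlock &MBB : MF)
  for (MachineInstr &MI : MBB)
    assignBanksToDefs(MI); // pass 1: every G_ def gets a register bank

for (MachineBasicBlock &MBB : MF)
  for (MachineInstr &MI : MBB)
    rewriteClassUsesToBanks(MI); // pass 2: uses that still carry a register
                                 // class are rewritten through copies so that
                                 // RBLegalize only sees banks on G_ operands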

}
}

void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
Contributor


static. Also don't pass MachineIRBuilder by value

Member


member function
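
Taken together, the suggestions here amount to making the helper a member of the pass (or at least static) and passing the builder by reference, for example (a sketch of the suggested signature, not code from the patch):

// Inside class AMDGPURBSelect:
void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder &B,
              MachineRegisterInfo &MRI, const RegisterBank &RB);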

}

} else {
MRI.setRegBank(Reg, RB);
Contributor


Do you need to call the observer?

Collaborator Author


Since this is our pass I felt there was no need to complicate it with observers

@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;

FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }

bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
bool shouldRBSelect(MachineInstr &MI) {
Member


Why free-standing functions, when it is your register bank select pass?

@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from ff34aa1 to 0c40f68 Compare October 22, 2024 16:20
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 2124eb3 to df50c85 Compare October 22, 2024 16:20
@petar-avramovic petar-avramovic changed the title from AMDGPU/GlobalISel: RBSelect to AMDGPU/GlobalISel: StandaloneRegBankSelect Oct 23, 2024
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch 2 times, most recently from 8bf0a23 to 84284ba Compare October 28, 2024 14:48
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from df50c85 to 36c8a96 Compare October 28, 2024 14:48
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 84284ba to 3b0aaef Compare October 28, 2024 14:57
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 36c8a96 to 69dde87 Compare October 28, 2024 14:57
@petar-avramovic petar-avramovic changed the title from AMDGPU/GlobalISel: StandaloneRegBankSelect to AMDGPU/GlobalISel: AMDGPURegBankSelect Oct 28, 2024
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 3b0aaef to 07055a7 Compare October 28, 2024 16:03
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 69dde87 to 9048f0d Compare October 28, 2024 16:03
Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection all register operands of
G_ instructions have LLT and register banks exclusively. If they had a
register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1, values
Vcc  - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:
S32/S64 lane masks:
- need to end up with sgpr register class after instruction selection
- In most cases Uniformity analysis declares them as uniform
  (forced by tablegen) resulting in sgpr S32/S64 reg bank
- When Uniformity analysis declares them as divergent (some phis),
  use intrinsic lane mask analyzer to still assign sgpr register bank
temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside of the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that assigned bank is vgpr
Note: uniformity analysis does not consider that registers with vgpr def
are divergent (you can have uniform value in vgpr).
- TODO: implicit use of $exec could be implemented as indicator
  that instruction is divergent
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 07055a7 to 623266f Compare October 30, 2024 14:45
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 9048f0d to d17ca95 Compare October 30, 2024 14:45