Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post RA software pipeliner #146

Open
wants to merge 8 commits into
base: aie-public
Choose a base branch
from
Open

Conversation

martien-de-jong
Copy link
Collaborator

First workable version of the post-RegAlloc pipeliner.
It creates a DAG that simulates an unrolled loop, and it check whether all latencies can be met by scheduling every copy in the same cycle modulo II.
Since the DAG considers the physical register deps, there's no register renaming to be done, and it uses negative latencies in the same way as the regular post-scheduler.
The results look promising for underfull loops, typically load-operate-store, where the load latency should span quite a few loop iterations to get the resources saturated.

There are still some rough edges, and more tests need to be added, especially to cover the cases where SWP fails.

UP.Threshold = 200;
BaseT::getUnrollingPreferences(L, SE, UP, ORE);
UP.Partial = UP.Runtime = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! Now our decisions are not changed by BaseT implementation! Maybe those small changes can be merged in small PRs.

assert(MBB.pred_size() == 1 && "MBB contains more than 1 predecessor");
MachineBasicBlock *SinglePredMBB = *MBB.predecessors().begin();
return SinglePredMBB;
MachineBasicBlock *getLoopPredecessor(const MachineBasicBlock &MBB) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: good candidate for LoopUtils.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant, since it only applies to epilogues that have been selected by the classification of loop-aware candidates here.


// We use ADD_NC, which allows PostPipeliner to tweak it by modifying the
// immediate value.
BuildMI(*MBB, Start, Start->getDebugLoc(), TII->get(AIE2::ADD_NC), AIE2::LC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We merged e33bbe1 as a bug fix for the trip count adjustment. As this change also overcomes the same problem, we can replace later (revert the first fix, for example).

<< CurrentBlockState->FixPoint.NumIters
<< " II=" << CurrentBlockState->FixPoint.II);
}
namespace {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this file is growing, maybe we can move this class to AIEPostPipeliner* files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not the best criterion to do this. It implements an interface to the pipeliner that is dedicated to the interblock data structures. All that knowledge would need to be exported to the postpipeliner, which would clutter it.

bool InLoop = false;

void startPrologue() override {
// Nothing at this time, but let's keep the override around
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have some future plan for this method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only vaguely. Currently we insert an opaque region which is not seen by the normal scheduler. In future we may need to do some boundary signaling.

@andcarminati
Copy link
Collaborator

andcarminati commented Aug 26, 2024

Hi @martien-de-jong, some commits from this PR were merged:

#152
#162

@@ -4,14 +4,19 @@
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
// (c) Copyright 2023-2024 Advanced Micro Devices, Inc. or its affiliates
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is as old as 2023

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you fix this in a future commit? The diff is still introduced in 46a9ac2

BS.FixPoint.LatencyMargin++;
BS.FixPoint.NumIters++;
int &II = BS.FixPoint.II;
if (!II) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's now a big function with lots of ifs. How about introducing two functions for checking the fixpoint state? E.g.

BS.FixPoint.NumIters++;

if (State== Scheduling)
  return checkLoopAwareScheduling();
return checkPipelining()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(That would also help limit the noise in diff lines)

if (Move) {
BB->remove_instr(MI);
}
BB->insert(Before, MI);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an equivalent insert_instr? Or is it on purpose so that we insert inside an existing bundle?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can never get my head around these issues. My assumption is that at this point we have no bundles, and we insert instructions before other instructions, leaving BundledWithPred markers to really create the bundles afterwards.

BoundaryEdges = std::make_unique<InterBlockEdges>(Context);
if (Regions.size() == 1) {
// Don't worry, this just constructs a mostly empty container class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this note for the anxious programmer 😄

PseudoJ_jump_imm %bb.2

bb.2 (align 16):
PseudoRET implicit $lr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: if the exit is already dedicated, I guess we do not need a new fall-through? But I guess that is not important because we later optimize the CFG anyway?

Copy link
Collaborator Author

@martien-de-jong martien-de-jong Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the fallthrough also makes the body single-regioned, Programming around that is sordid. Also note that hwloop lowering sits just before postmisched.

; CHECK-NEXT: .LBB0_2: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: .L_LEnd0:
; CHECK-NEXT: nopb ; nopa ; st r1, [p0], #4; add r1, r1, #1; nopm ; nopv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

; CHECK-NEXT: .LBB0_2: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: .L_LEnd0:
; CHECK-NEXT: nopb ; nopa ; st r1, [p0], #4; add r1, r1, #1; nopm ; nopv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vldb wh0, [p0, #32]; nopa ; vst wh1, [p1, #32]; nopx ; vbneg_ltz.s16 x1, r21, x0; nopv
; CHECK-NEXT: .L_LEnd0:
; CHECK-NEXT: vldb wl0, [p0], #64; nopa ; vst wl1, [p1], #64; nopxm ; nopv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 🎉 🎉

@@ -664,6 +669,8 @@ class ZeroOverheadLoop : public AIEBasePipelinerLoopInfo {
SmallVectorImpl<MachineOperand> &Cond) override;

bool canAcceptII(SMSchedule &SMS) override;

bool shouldUseSchedule(SwingSchedulerDAG &SSD, SMSchedule &SMS) override;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably update canAcceptII as well. Typically, this will increase the II until we have a low enough stage count. So I think that if we find a schedule with a low II and high stage count, we should immediately refuse it sop the post-pipeliner can pick it; and not increase the II.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interface change I think? it should be able to say yes, higherII and stop. That would definitely save some time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we would accept the II if we are confident we can pick up the loop in the post-pipeliner. And for those loops, shouldUseSchedule would return false.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow that makes sense, except for the name of canAcceptII. I will add a fat comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, something can accept it, just not the pre-pipeliner 😄

Copy link
Collaborator

@gbossu gbossu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm at a point where I understand all individual pieces, and now I need to go over the code a final time to make sure I understand how it's all connected. But I couldn't find any blocker so far, great work!

Martien de Jong added 8 commits October 1, 2024 16:50
Copyright notice, namespace markings, variable casing, ...
-advance() cleared the wrong cycle
-cleaner interface for reset
-dumper printer required instead of reserved
We use this universally for tripcount update, which allows us
to pipeline loops in both pre and postpipeliner
unittest with complicated binary operator
baseline test for example broken by unnecessary WAW
baseline test for ptradd/load combine with PHI node user
add vmov example
This is quite an elaborate change, since it is interwoven with the
fixed-point loop in interblock scheduling,

The general approach is to first run loop-aware scheduling, then try to
pipeline selected loops.

When loop aware has converged, each fixed point iteration for the loop block
will increase II until a modulo schedule can be found or it fails.
When we find a modulo schedule with more than one stage, we push out bundled
regions into the prologue block and the epilogue block, update the (only)
region of the loop. The prologue and epilogue regions are copied without
further scheduling when commiting the block schedules.

The pipeliner checks min itercount to guarantee that the LC can be corrected
while staying positive.

There is an unconnected name change, from CurrentBlock to CurrentBlockState
to avoid confusion

We force a fallthrough block for loopend. This avoids missing opportunities
because of 'bad' block ordering
BS.FixPoint.NumIters++;
auto &BS = *CurrentBlockState;
switch (updateFixPoint(BS)) {
case SchedulingStage::SchedulingNotConverged:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is just for completeness, but we will have crashed before checking SchedulingStage::SchedulingNotConverged here?

// Try to wrap the linear schedule within II.
// We virtually unroll the body by the stagecount, computed from rounding
// up the length divided by II.
NCopies = (BS.getScheduleLength() + II - 1) / II;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "standard" schedule has been achieved using different scheduling techniques. Can we guarantee there will be enough copies if the linear schedule used for pipelining is different?

int Cycle = -Depth + LocalCycle;
LLVM_DEBUG(dbgs() << " Emit in " << Cycle << "\n");
HR.emitInScoreboard(Scoreboard, MI->getDesc(), MemoryBanks, MI->operands(),
MI->getMF()->getRegInfo(), Cycle);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not debating for correctness, but I wonder if we should not immediately emit the resources at C+II, C+2II, etc. Otherwise, we will only see conflicts due to resources that "wrap around the II" when scheduling the next iterations. At that point, it's too late and we will try a larger II. Maybe if we could anticipate those resources, we could find a better linear schedule that has less conflicts when scheduling all copies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants