
[Coro] Amortize debug info processing cost in CoroSplit #109032

Open · wants to merge 15 commits into main from artempyanykh:fast-coro-upstream

Conversation

artempyanykh (Contributor)

More details about this stack are in this topic on Discourse.

TL;DR: In large modules with many coroutines, the cost of debug info metadata processing can get very high (e.g. adding over a minute to the compile time). This stack refactors function cloning and caches some module-level calculations to speed up the pass.

Anecdata for a sample C++ source file:

|                  | Baseline | IdentityMD set | Prebuilt GlobalDI | Cached CU DIFinder |
|------------------+----------+----------------+-------------------+--------------------|
| CoroSplitPass    | 306ms    | 221ms          | 68ms              | 17ms               |
| CoroCloner       | 101ms    | 72ms           | 0.5ms             | 0.5ms              |
| CollectGlobalDI  | -        | -              | 63ms              | 13ms               |
|------------------+----------+----------------+-------------------+--------------------|
| Overall speed up | 1x       | 1.4x           | 4.5x              | 18x                |

Each commit is individually buildable and reviewable. I can extract them into separate PRs if the overall direction makes sense.

Summary:
A helper (2 overloads) that consolidates CoroCloner creation and the actual cloning.
The helpers create a TimeTraceScope to make it easier to see how long the cloning takes.
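
For illustration, a minimal sketch of the timing part using LLVM's TimeTraceScope from llvm/Support/TimeProfiler.h; the wrapper function here is hypothetical, not the PR's actual helper:

```cpp
#include "llvm/Support/TimeProfiler.h"

// Hypothetical wrapper: everything that runs while Scope is alive is
// attributed to one named entry in -ftime-trace output, which is how the
// per-scope timings quoted in this PR were gathered.
static void cloneCoroutineBody(/* cloner arguments */) {
  llvm::TimeTraceScope Scope("CoroCloner");
  // ... create the CoroCloner and perform the actual cloning ...
}
```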

Test Plan:
ninja check-llvm-unit
Summary:
Consolidate the logic in a single function. We do an extra pass over the instructions, but this is
necessary to untangle things and extract metadata cloning in a future diff.

Test Plan:
ninja check-llvm-unit
Summary:
There was a single use of CloneBasicBlock with a non-default DebugInfoFinder inside
CloneFunctionInto, and it was refactored away in the previous diff.

Let's remove the parameter to keep the scope of the function more focused.

Test Plan:
ninja check-llvm-unit
…neFunctionInto

Summary:
This makes the flow of the function a bit more straightforward and makes it easier to extract more
into helper functions.

Test Plan:
ninja check-llvm-unit
Summary:
Extract the logic to build up a metadata map for use in metadata cloning into a separate function.

Test Plan:
ninja check-llvm-unit
Summary:
One potentially contentious point in the new function's API is that it expects the caller to
populate the VMap rather than doing metadata cloning wholesale inside it.

We'll need it this way for a subsequent change.

Test Plan:
ninja check-llvm-unit
Summary:
This and previously extracted functions will be used in a later diff.

Test Plan:
ninja check-llvm-unit
Summary:
Previously, we'd add all SPs distinct from the cloned one into a set. Then when cloning a local
scope we'd check if it's from one of those 'distinct' SPs by checking if it's in the set.

We don't need to do that. We can just check against the cloned SP directly and drop the set.
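
Schematically, the check reduces to a direct comparison; this is a sketch with illustrative names, not the patch's exact code:

```cpp
#include "llvm/IR/DebugInfoMetadata.h"

// A local scope needs cloning only if it belongs to the subprogram being
// cloned; everything else maps to itself. A direct pointer comparison
// replaces the membership test against a set of all other subprograms.
static bool mustCloneScope(const llvm::DILocalScope *Scope,
                           const llvm::DISubprogram *ClonedSP) {
  return Scope->getSubprogram() == ClonedSP;
}
```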

Test Plan:
ninja check-llvm-unit
Summary:
The typedef was probably there because the type alias in ValueMap was private.

Test Plan:
ninja check-llvm-unit
Summary:
To avoid cloning 'global' debug info, the CloneFunction implementation used to eagerly identity map
a known subset of global debug info into ValueMap's MD map. In larger modules with a meaningful
volume of debug info this gets very expensive.

By passing such global metadata via an IdentityMD set for the ValueMapper to map on first use, we
get several benefits:

1. Mapping metadata is not cheap, particularly because of tracking. When cloning a Function we
identity map lots of global module-level metadata to avoid cloning it, while only a fraction of it
is actually used by the function. Mapping on first use is a lot faster for modules with a meaningful
amount of debug info (see the sketch after this list).

2. Eagerly identity mapping metadata makes it harder to cache module-level data (e.g. the set of
metadata nodes in a DICompileUnit). With this patch we can cache certain module-level metadata
calculations to speed things up further.
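
A hand-written sketch of the idea, not ValueMapper's real code: global nodes go into a side set and are identity mapped lazily at lookup time instead of eagerly up front.

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/IR/Metadata.h"
using namespace llvm;

DenseMap<Metadata *, Metadata *> MDMap; // per-clone mapping table
DenseSet<Metadata *> IdentityMD;        // global DI that must map to itself

Metadata *mapMD(Metadata *MD) {
  if (auto It = MDMap.find(MD); It != MDMap.end())
    return It->second;     // already mapped (cloned or identity)
  if (IdentityMD.contains(MD))
    return MDMap[MD] = MD; // identity map on first use, not eagerly
  // ... otherwise clone the node recursively, as the mapper normally does ...
  return nullptr;          // placeholder for the cloning path
}
```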

Anecdata from compiling a sample cpp file with full debug info shows that this moderately speeds up
CoroSplitPass which is one of the heavier users of cloning:

|                 | Baseline | IdentityMD set |
|-----------------+----------+----------------|
| CoroSplitPass   | 306ms    | 221ms          |
| CoroCloner      | 101ms    | 72ms           |
|-----------------+----------+----------------|
| Speed up        | 1x       | 1.4x           |

Test Plan:
ninja check-llvm-unit
ninja check-llvm
…utine clones

Summary:
CoroCloner, by calling into CloneFunctionInto, does a lot of repeated work priming DIFinder and building
a list of global debug info metadata. For programs compiled with full debug info this gets very
expensive.

This diff builds the data once and shares it between all clones.
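
The shape of the change as a sketch; every name below (MetadataSet, collectGlobalDebugInfo, createClone) is illustrative rather than the PR's actual API:

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Metadata.h"
using MetadataSet = llvm::SmallPtrSet<llvm::Metadata *, 16>;

// Stand-ins for the real machinery (illustrative declarations only).
MetadataSet collectGlobalDebugInfo(llvm::Function &F); // primes DIFinder once
void createClone(llvm::Function &F, int CloneKind, const MetadataSet &GlobalDI);

void splitCoroutine(llvm::Function &F) {
  // Build the global debug info set once per coroutine...
  MetadataSet GlobalDI = collectGlobalDebugInfo(F);
  // ...and share it across the resume/destroy/cleanup clones.
  for (int Kind : {0, 1, 2})
    createClone(F, Kind, GlobalDI);
}
```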

Anecdata for a sample cpp source file compiled with full debug info:

|                 | Baseline | IdentityMD set | Prebuilt GlobalDI (cur.) |
|-----------------+----------+----------------+--------------------------|
| CoroSplitPass   | 306ms    | 221ms          | 68ms                     |
| CoroCloner      | 101ms    | 72ms           | 0.5ms                    |
| CollectGlobalDI | -        | -              | 63ms                     |
|-----------------+----------+----------------+--------------------------|
| Speed up        | 1x       | 1.4x           | 4.5x                     |

Note that CollectGlobalDI happens once *per coroutine* rather than per clone.

Test Plan:
ninja check-llvm-unit
ninja check-llvm

Compiled a sample internal source file, checked time trace output for scope timings.
Summary:
The analysis simply primes and caches DebugInfoFinders for each DICompileUnit in a module. This
allows (future) callers like CoroSplitPass to compute global debug info metadata (required for
coroutine function cloning) much faster. Specifically, we pay the price of DICompileUnit processing
only once per compile unit, rather than once per coroutine.
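
Conceptually, the analysis amounts to something like this sketch; DebugInfoCacheSketch and prime are illustrative names, while DebugInfoFinder, Module::debug_compile_units, and processModule are real LLVM APIs:

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/Module.h"

// One primed DebugInfoFinder per DICompileUnit, reused by every later caller.
struct DebugInfoCacheSketch {
  llvm::DenseMap<const llvm::DICompileUnit *, llvm::DebugInfoFinder> Finders;

  void prime(const llvm::Module &M) {
    for (const llvm::DICompileUnit *CU : M.debug_compile_units()) {
      // processModule stands in for the per-CU priming the analysis does;
      // the point is that this traversal runs once per CU, not per coroutine.
      Finders[CU].processModule(M);
    }
  }
};
```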

Test Plan:
Added a smoke test for the new analysis
ninja check-llvm-unit
Summary:
We can use a DebugInfoFinder from DebugInfoCache which is already primed on a compile unit to speed
up collection of global debug info.

The pass could likely be another 2x+ faster if we avoid rebuilding the set of global debug
info. This needs further massaging of CloneFunction and ValueMapper, though, and can be done
incrementally on top of this.
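
Inside the pass, usage would look roughly like this sketch, reusing the hypothetical DebugInfoCacheSketch from above rather than the PR's exact interface:

```cpp
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/Function.h"

// Sketch: fetch the finder already primed for this coroutine's compile unit
// instead of re-running debug info discovery per coroutine. Assumes the
// cache was primed for every CU in the module beforehand.
void collectGlobalDIForCoroutine(llvm::Function &F,
                                 DebugInfoCacheSketch &Cache) {
  const llvm::DISubprogram *SP = F.getSubprogram();
  const llvm::DebugInfoFinder &DIFinder =
      Cache.Finders.find(SP->getUnit())->second;
  // ... assemble the global debug info set from DIFinder's contents ...
}
```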

Comparing performance of CoroSplitPass at various points in this stack, this is anecdata from a sample
cpp file compiled with full debug info:
|                 | Baseline | IdentityMD set | Prebuilt GlobalDI | Cached CU DIFinder (cur.) |
|-----------------+----------+----------------+-------------------+---------------------------|
| CoroSplitPass   | 306ms    | 221ms          | 68ms              | 17ms                      |
| CoroCloner      | 101ms    | 72ms           | 0.5ms             | 0.5ms                     |
| CollectGlobalDI | -        | -              | 63ms              | 13ms                      |
|-----------------+----------+----------------+-------------------+---------------------------|
| Speed up        | 1x       | 1.4x           | 4.5x              | 18x                       |

Test Plan:
ninja check-llvm-unit
ninja check-llvm

Compiled a sample cpp file with time trace to get the avg. duration of the pass and inner scopes.

github-actions bot commented Sep 17, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

adrian-prantl (Collaborator) left a comment

This sounds like generally a good idea and the implementation looks clean as well.
What are the behavior changes we expect from this patch series?

artempyanykh (Contributor, Author) commented Sep 27, 2024

> This sounds like generally a good idea and the implementation looks clean as well.

Thanks for reviewing @adrian-prantl!

> What are the behavior changes we expect from this patch series?

  • I expect no functional changes, only faster compile times for code with coroutines.
  • The new DebugInfoCache analysis is not conditioned on the presence of coroutines in a module, but it is cheap to run, and its results can be used to speed up function cloning in places other than coroutine splitting.
  • I ran a compile time benchmark on a few thousand internal source files:
    • There were no perf regressions; code without coroutines saw a neutral compile-time impact.
    • Source files that use coroutines compiled 7% faster on average, with many cases 25%, 50%, and even 65% faster. The larger the module and the more coroutines it has, the greater the speed-up.

felipepiovezan (Contributor)

Thanks for working on this and for making each commit reviewable / committable on its own; that's really great work!
I will start looking at them ASAP, wish I had seen this sooner! :)

artempyanykh (Contributor, Author)

Thanks @felipepiovezan! There's some commentary around the commits in this discourse topic, in case it's useful.

artempyanykh (Contributor, Author)

@adrian-prantl, @felipepiovezan have you had a chance to take a closer look? Could you advise on what would be the best way to proceed with reviewing this PR?

adrian-prantl (Collaborator) left a comment

@felipepiovezan Can you apply this patch to the LLVM in swiftlang and run the tests with it as a sanity check?

felipepiovezan (Contributor) commented Oct 17, 2024

Hi @artempyanykh, while I follow Adrian's suggestions, you could extract the first commit into a separate PR and we'll start reviewing/merging the NFC stuff one by one

felipepiovezan (Contributor)

> Hi @artempyanykh, while I follow Adrian's suggestions, you could extract the first commit into a separate PR and we'll start reviewing/merging the NFC stuff one by one

The second commit can also be a PR on its own right now, it doesn't depend on the first one AFAICT, so we can review them in parallel


// Copy all attributes other than those stored in the AttributeList. We need
// to remap the parameter indices of the AttributeList.
Contributor:

I'm not sure this comment is making a lot of sense as a function-level comment. To be honest even the original comment was confusing: which AttributeList? If you know the answer to this, could you take the chance provided by this NFC commit to improve the comment?

Contributor:

This second commit could also use a brief commit message describing why this change is useful

artempyanykh (Contributor, Author):

I think the comment refers to the AttributeList field (the only such) of a Function which has e.g. parameters and return value attributes. I updated the comment in #112976 to (hopefully) make it a bit clearer.

DISubprogram *SPClonedWithinModule = nullptr;
if (Changes < CloneFunctionChangeType::DifferentModule) {
  SPClonedWithinModule = F.getSubprogram();
}
Contributor:

LLVM's coding guidelines prescribe not using {} around single-line blocks.
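
That is, the quoted snippet would become:

```cpp
DISubprogram *SPClonedWithinModule = nullptr;
if (Changes < CloneFunctionChangeType::DifferentModule)
  SPClonedWithinModule = F.getSubprogram();
```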

@@ -205,6 +205,11 @@ void CloneAndPruneFunctionInto(Function *NewFunc, const Function *OldFunc,
                               const char *NameSuffix = "",
                               ClonedCodeInfo *CodeInfo = nullptr);

/// Process debug information from function's subprogram attachment.
DISubprogram *ProcessSubprogramAttachment(const Function &F,
Contributor:

Could you elaborate on "Process"? This is more about collecting debug info, right?

artempyanykh (Contributor, Author):

That's right. This function hydrates the passed DebugInfoFinder by visiting relevant components of the function's subprogram attachment. I'll update the name/comment when we get to extracting that commit in a separate PR.

felipepiovezan (Contributor) commented Oct 21, 2024

Sounds good, now that the other two PRs are open & approved, let's proceed with this commit!

artempyanykh (Contributor, Author):

Great, yes! @felipepiovezan a silly question -- now that those PRs are approved, how/when do they get merged? Stacking unmerged PRs on top of each other doesn't have the best UX in GitHub.

felipepiovezan (Contributor) commented Oct 24, 2024

So I think the best option here is to merge the two PRs that are approved (#112976 and #112948). Both of those stand on their own and don't depend on each other, so it's fine to merge them in whatever order, right?

And then you can grab the third commit (or more, if the subsequent commits are also independent of each other) of this PR and open a new PR (or multiple PRs, if you grabbed more than one commit), so it will look fine on top of main. My understanding is that the fourth commit depends on the third, though.

Regarding what to do about this umbrella PR in the meantime, you have two options: leave it as is (with the understanding that it is not meant to be merged, merely to serve as a reference for all the work in the proposal), or rebase on top of main, which will get rid of the two already-merged commits, and force push to your branch artempyanykh:fast-coro-upstream, which will mean this PR serves as a reference of the work that's left to be merged, also with the understanding that we're not merging it.

With all that, we'd never have to stack unmerged PRs on top of each other. Does that make sense?

Contributor:

By the way, do you have commit access?

artempyanykh (Contributor, Author):

> So I think the best option here is to merge the two PRs that are approved (#112976 and #112948). Both of those stand on their own and don't depend on each other, so it's fine to merge them in whatever order, right?

That's right, yes.

> And then you can grab the third commit (or more, if the subsequent commits are also independent of each other) of this PR and open a new PR (or multiple PRs, if you grabbed more than one commit), so it will look fine on top of main. My understanding is that the fourth commit depends on the third, though.

After the first 2 are merged, extracting the third one in a separate PR is straightforward. With the rest of the stack there's quite a bit of sequencing, so we may need to rinse and repeat.

> rebase on top of main, which will get rid of the two already-merged commits and force push to your branch artempyanykh:fast-coro-upstream, which will mean this PR serves as a reference of the work that's left to be merged, also with the understanding that we're not merging it

+1 to this. I was planning to keep rebasing this stack anyway to make sure things keep working end-to-end while the chunks are being extracted and merged.

> By the way, do you have commit access?

No, not sure what the process for this is. Why?

Contributor:

Sorry for the delay, the LLVM dev conference happened last week and we were mostly out.

> No, not sure what the process for this is. Why?

Just wondering if you needed help merging the open PRs. I've merged the first two for you.
The process for getting commit access is described here: https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access
If you don't want to do that, you can always let the reviewers know you need someone to press the merge button for you.

artempyanykh (Contributor, Author)

@felipepiovezan I extracted the first 2 commits into #112948 and #112976.

felipepiovezan pushed a commit that referenced this pull request Oct 30, 2024
A helper (2 overloads) that consolidates CoroCloner creation and the
actual cloning. The helpers create a TimeTraceScope to make it easier to
see how long the cloning takes.

Extracted from #109032 (commit 1)
felipepiovezan pushed a commit that referenced this pull request Oct 30, 2024
…to (#112976)

This patch is a part of step-by-step refactoring of CloneFunctionInto.
The goal is to extract reusable pieces out of it that will be later used
to optimize function cloning e.g. in coroutine processing.

Extracted from #109032 (commit 2)