A big part of FCS analysis time is spent in GC #13039
8 comments · 14 replies
-
I'd encourage a focus on performance, not allocations. Allocations happen; they are not necessarily bad. Often the gains come from avoiding running code at all rather than from removing allocations. Historically the biggest gains by @TIHan have been from removing transient LOH allocations. Those are really important to find and nail down.
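For context, here is a minimal sketch (not actual compiler code) of what a transient LOH allocation looks like and how pooling avoids it; the buffer size and the `read` callback are illustrative assumptions:

```fsharp
open System.Buffers

// Arrays over ~85,000 bytes land on the Large Object Heap, so creating
// a fresh large buffer per operation gives the GC expensive work.
let processFileNaive (read: byte[] -> int) =
    let buffer = Array.zeroCreate<byte> 200_000   // transient LOH allocation
    read buffer |> ignore

// Renting from the shared pool reuses large buffers across calls.
let processFilePooled (read: byte[] -> int) =
    let buffer = ArrayPool<byte>.Shared.Rent 200_000
    try
        read buffer |> ignore
    finally
        ArrayPool<byte>.Shared.Return buffer
```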
-
I agree that performance in general should be a high priority. Allocations aren't bad until there are too many of them. We often see about 30% of the time spent in GC, and that does indeed look bad: it's tens of seconds to minutes that our users have to wait for analysis to finish in an IDE.

Yes, that would improve analysis time, but if the remaining code still allocates too much, it would still spend a considerable amount of time in GC. I think we should tackle both of these things.

@dsyme I agree that in most cases allocations aren't that bad, but in the case of FCS analysis we'd like to cut every wasted second, since we expect the analysis to be as fast as possible (and our users do as well). Could you reconsider the importance of reducing allocations? I think we should try not to ignore this problem.
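To make the "time spent in GC" numbers reproducible outside a profiler, here is a minimal in-process sketch; `GC.GetTotalPauseDuration()` requires .NET 7 or later, and `analyse` stands in for a real FCS analysis call:

```fsharp
open System
open System.Diagnostics

// Measure what fraction of one analysis run was spent paused for GC.
let measureGcShare (analyse: unit -> unit) =
    let pauseBefore = GC.GetTotalPauseDuration ()
    let sw = Stopwatch.StartNew ()
    analyse ()
    sw.Stop ()
    let gcPause = GC.GetTotalPauseDuration () - pauseBefore
    printfn "total %O, GC pause %O (%.1f%%)"
        sw.Elapsed gcPause
        (100.0 * gcPause.TotalMilliseconds / sw.Elapsed.TotalMilliseconds)
```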
-
Here's another recent snapshot, where GC took 26% of the time, about 6 seconds out of ~22 seconds. It looks similar to this in most of the snapshots we look at.
-
One thing I'd like to add: I believe `node` should move away from async to some custom lightweight state-machine type. I tried using a funky `() -> Task<'a>` type instead of async and it really cleaned up compiler stack traces; with async it's impossible to see what's going on because of async's bindings everywhere. I couldn't get this `() -> Task` thing to work properly, though, because of CancellationTokens and the async-aware behaviour of existing code, but I believe a custom type could fix this problem.

Also, there are places where F# emits ETW events, and even though the event can be skipped, the string concatenation and allocations still happen.

Many of the allocations in the compiler come from lists, tuples and lambdas/closures. Maybe a custom allocation-free seq builder could help. Also, array and list equality using the core `=` operator allocates; a custom equality operator for internal usage might be better.
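On the ETW point, the usual fix is to guard payload construction behind `IsEnabled()` so the concatenation never runs when nobody is listening. A minimal sketch, with a hypothetical event source rather than the compiler's actual one:

```fsharp
open System.Diagnostics.Tracing

[<EventSource(Name = "Demo-FSharp-Compiler")>]
type CompilerEvents() =
    inherit EventSource()
    static member val Log = new CompilerEvents()
    [<Event(1)>]
    member this.FileChecked(message: string) = this.WriteEvent(1, message)

let logFileChecked (fileName: string) (durationMs: int64) =
    let log = CompilerEvents.Log
    // The allocating concatenation runs only when a listener is attached.
    if log.IsEnabled () then
        log.FileChecked (fileName + " checked in " + string durationMs + "ms")
```

The same "avoid running the code at all" idea applies to the `=` point: a hand-rolled comparison loop sidesteps the boxing that generic structural equality can introduce for arrays of value-type elements.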
-
@auduchinok The point I'm making is that it's not allocations that matter, it's performance. If we find places where reducing allocations gives better performance, that's great. But don't take the approach that reducing allocations is a good in itself: there's a long history of trying to do that in the compiler, and it rarely gave measurable benefits (because Gen1 allocations and GC are rarely the bottleneck), and plenty of times the allocation reductions led to either more complicated code, more copying, or less data-structure sharing. LOH allocations were a significant exception to this rule, plus of course "stupid" allocations in tight loops.

@En3Tho Yes, agreed. https://github.com/TheAngryByrd/IcedTasks shows how to define async-like cold-start tasks that pass cancellation tokens explicitly, and likewise synchronous code that passes cancellation tokens. We should make an internal copy of this that gives better debugging for async/cancellable code.
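For reference, the cold-start shape IcedTasks builds on is, roughly, a function from a cancellation token to a task. A hand-written sketch of the idea (the `bind` helper is illustrative, not IcedTasks' actual implementation):

```fsharp
open System.Threading
open System.Threading.Tasks

// Nothing runs until a CancellationToken is supplied, and the token is
// threaded through every bind explicitly rather than captured by async's
// ambient machinery, which keeps stack traces much flatter.
type CancellableTask<'T> = CancellationToken -> Task<'T>

let bind (f: 'T -> CancellableTask<'U>) (body: CancellableTask<'T>) : CancellableTask<'U> =
    fun ct ->
        task {
            ct.ThrowIfCancellationRequested ()
            let! x = body ct
            return! f x ct
        }
```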
-
On (imm)arrays:
Other:
I don't agree with only looking for bad cases and then optimizing those. I think there are a lot of classes of performance hygiene that do not increase complexity or result in significantly more copying and less sharing.
-
I agree with the sentiment @auduchinok and others present. While I completely agree that allocations per se are nothing bad, I think everyone who mentions the goal of reducing allocations means speeding up overall performance by reducing the total time blocked by GC.

As mentioned in #12526, I think working on a framework and CI for benchmarking the FSharp.Compiler.Service codebase would be very useful mid- and long-term. It would help support these kinds of discussions, where there is no obvious answer, and back them with easily available, objective data.

Naively, and without any results to prove it, I would think that widely using more optimized (time- and memory-wise) data structures in the codebase could have a number of benefits.
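A minimal sketch of what one benchmark in such a suite could look like, assuming BenchmarkDotNet; the benchmark body is a placeholder, not a real FSharp.Compiler.Service call:

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]   // report allocated bytes and GC counts per operation
type AnalysisBenchmarks() =
    [<Benchmark>]
    member _.ParseAndCheckSampleProject() =
        // stand-in for e.g. FSharpChecker.ParseAndCheckProject on a fixed input
        System.Threading.Thread.Sleep 0

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<AnalysisBenchmarks>() |> ignore
    0
```

Running something like this in CI on every PR would turn "does this change reduce GC pressure?" into a number rather than a debate.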
-
Here's the issue in the performance repo: dotnet/performance#2457
-
I'm looking at various perf snapshots and seeing that in many of them about 20–30% of the time FCS takes to analyse a project graph is spent in GC. Here's the latest example, where GC takes 41 seconds:
Could we work on finding a systematic approach that would allow us to reduce allocations where possible? There are a number of possible things to try.
Some of these things aren't very idiomatic in a functional language, but when we're talking about a compiler and an editor analysis engine, I think better performance should matter more than being completely idiomatic.
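As one hedged illustration of that trade-off (my example, not necessarily one the author had in mind): struct tuples and `ValueOption` keep intermediate results off the heap, at the cost of copy-by-value semantics and a more imperative style:

```fsharp
// 'Some (i, x)' would allocate both an option cell and a reference tuple;
// ValueSome with a struct tuple allocates nothing on the heap.
let tryFindWithIndex (predicate: 'T -> bool) (xs: 'T[]) : ValueOption<struct (int * 'T)> =
    let mutable i = 0
    let mutable result = ValueNone
    while result.IsNone && i < xs.Length do
        if predicate xs.[i] then
            result <- ValueSome (struct (i, xs.[i]))
        i <- i + 1
    result
```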