
performance notes with and without tree reductions etc. #34

o-smirnov opened this issue Apr 23, 2020 · 38 comments


@o-smirnov

Continuing on from #29, just more systematically.

With tree reduction, 1e+10 points.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 10000 --bmap rainbow

Tops out at ~250 GB, runs in 145s.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 50000 --bmap rainbow

Blows out my RAM.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

Tops out at ~70 GB, runs in 245s (but the run had competition on the box). Going to get more exact numbers from ResourceProfiler shortly.

So @sjperkins, first take: mission accomplished insofar as getting the RAM use under control. Memory-parched users can just dial their chunk size down.
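
As an aside, a minimal sketch of the kind of ResourceProfiler measurement mentioned above, assuming the computation is available here as a dask collection called result (shadems drives this internally via its --profile option):

    from dask.diagnostics import Profiler, ResourceProfiler, visualize

    # Sample CPU and memory usage while the graph executes.
    with Profiler() as prof, ResourceProfiler(dt=0.5) as rprof:
        result.compute()

    # rprof.results holds the time/memory/CPU samples; this writes a bokeh report.
    visualize([prof, rprof], filename="profile.html", show=False)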

@o-smirnov

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

1065s, RAM touched 300 GB but I'll take it, it's a tough plot.

@o-smirnov commented Apr 23, 2020

Same settings but 16 colors. Note this is 4e+10 points!

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 16 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

707s (the box was a bit contended, though). RAM ~200 GB.

2e+10 points. I think I've broken corr selection, though, as it plotted all four (but why not 4e+10 then? Odd):

shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 1000

830s. Didn't notice the RAM. :) OK time to put a proper profiler in.

@sjperkins commented Apr 23, 2020

1065s, RAM touched 300 GB but I'll take it, it's a tough plot.

I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:

2 x 1000 x 4096 x 4 x 16 bytes (broadcasted MS data) + 1024 x 1024 x 64 x 8 bytes (image) ~ 1012 MB per thread

Multiply by 64 threads ~ 64 GB
Let's say a fudge factor of 2 ~ 128 GB
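
For reference, a quick sketch reproducing the arithmetic above, with the sizes taken from the estimate:

    # Back-of-the-envelope memory estimate, using the numbers quoted above.
    rows, chans, corrs = 1000, 4096, 4               # one MS chunk
    data = 2 * rows * chans * corrs * 16             # complex128 visibilities ("broadcasted MS data")

    nx, ny, categories = 1024, 1024, 64
    image = nx * ny * categories * 8                 # float64 aggregate image

    per_thread_mb = (data + image) / 2**20           # ~1012 MB per thread
    total_gb = per_thread_mb * 64 / 1024             # ~63 GB over 64 threads
    print(f"{per_thread_mb:.0f} MB per thread, ~{total_gb:.0f} GB over 64 threads")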

My instinct would be to check the broadcast. The fix @JSKenyon put into ragavi avoided a lot of extraneous communication in the graph.

@sjperkins

Updated estimate to include categories in image

@sjperkins

Updated again to cater for 64 categories

@o-smirnov commented Apr 23, 2020

[plot]

OK, systematically now. This is a 1e+10-point plot with minimal I/O (only UVW) and no colors. Ran for chunk sizes of 1000, 5000, 10000 like so:

shadems ms-4k-cal.ms -x U -y V -z 10000 -s "z{row_chunk_size}.tree" --profile

Tree vs original reduction (chunk sizes 1000, 5000, 10000):

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@o-smirnov commented Apr 23, 2020

[plot]

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 -z 10000 -s "z{row_chunk_size}.tree" --profile

Now with 16 colours, chunk size 1000, 5000, 10000. Note how the original reduction code breaks down at lower chunk sizes:

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@sjperkins commented Apr 23, 2020

The task ordering looks pretty good for 10,000 rows on minsize.ms. Colour indicates priority (in the round nodes), with red prioritised first and blue last. There are four independent colour streams, corresponding to four MS chunks.

[task graph with nodes coloured by priority]

Here's another view of the same graph, with task names in the round nodes.

task-names.pdf

I'm fiddling to try and get both names and priority in the same plot; it's a bit difficult to decipher both at the same time.

Edit: generated by calling the visualize method on a dask array/dataframe collection:

    R.visualize("graph.pdf")
    R.visualize("graph.png", order="color")

@o-smirnov commented Apr 23, 2020

OK, now we make it harder. 5e+9 points, but they have to come from disk.

shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 10000 -s "z{row_chunk_size}.tree" --profile

[plot]

Chunk size 1000, 5000, 10000. Same picture really:

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@o-smirnov

Another very realistic case. 5e+9 points in total, two fields iterated over.

shadems ms-4k-cal.ms -x CORRECTED_DATA:I:real -y CORRECTED_DATA:I:imag -c ANTENNA1 --cnum 64 --iter-field -z 10000 -s z{row_chunk_size}.tree --profile
[plots for the two fields]

Chunk size 10000, tree reduction (separate profiles for two fields):

[memory profiles for the two fields]

@o-smirnov commented Apr 23, 2020

Another realistic case, two fields, UV-plot coloured by phase.

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:I:phase --cmin -2 --cmax 2 --cnum 64 --iter-field -z 10000 -s "z{row_chunk_size}.tree" --profile

Chunk size 10000, 1000, tree reduction.

[plots and memory profiles for fields 0408 and gaincal, chunk sizes 10000 and 1000]

Chunk size 10000, original reduction. Faster but hungrier:

[plots for fields 0408 and gaincal, chunk size 10000, original reduction]

Chunk size 1000 with the original reduction blew out my 512 GB of RAM, so I gave up.

@Mulan-94 commented Apr 23, 2020

@o-smirnov These look really cool. Are these last ones also from the tree reduction?
Also, what is the importance of colouring by phase, if I may ask?

@o-smirnov

The top set is for tree reduction (sorry, editing the comment under you!), the bottom set is for the original reduction.

Phase should be ==0 on properly corrected calibrator data, so a good plot of this kind is a bland plot. The stripy pattern in the left column suggests a problem in 0408 -- most likely an unmodelled field source contributing a fringe.

@Mulan-94 commented Apr 23, 2020

Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?

@o-smirnov

No, but @sjperkins has also been wondering...

@sjperkins commented Apr 24, 2020

Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?

@Mulan-94 Are you referring to this kind of plot (taken from #34 (comment))?

[memory profile plot]

If so, the climb in memory at the end is almost certainly the np.stack in the combine function of the original reduction.

Having said that, I'm bothered by this kind of pattern in the tree reduction (#34 (comment)).

[memory profile plot]

I would've hoped for a flatter heartbeat pattern, without those two peaks. I'll try to block off some time next week to look at this.
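
To illustrate the np.stack point with a toy example (not datashader's actual code, just the shape of the problem): stacking every per-chunk aggregate before reducing materialises all of them at once, so the combine step's transient footprint grows with the number of chunks.

    import numpy as np

    n_chunks = 8                                # hundreds of chunks in a real run
    agg_shape = (64, 64, 4)                     # real case is more like (1024, 1024, 64), ~512 MB each

    # Per-chunk aggregate images, as produced before the final combine.
    aggs = [np.random.rand(*agg_shape) for _ in range(n_chunks)]

    stacked = np.stack(aggs)                    # one array holding every chunk's aggregate
    combined = stacked.sum(axis=0)              # the final result is only a single image
    print(stacked.nbytes // combined.nbytes)    # the transient is n_chunks times larger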

@JSKenyon

As a weird check, could someone try running a test using dask==2.9.1 as opposed to the latest version? While the ordering plot Simon included looks correct, I would be interested to see if the ordering changes introduced after 2.9.1 are affecting the results.

@JSKenyon

Oh, and for ordering plots, I was informed of the following useful invocation:

    import dask

    dask.visualize(dsk,
                   color='order',
                   cmap='autumn',
                   filename='output.pdf',
                   node_attr={'penwidth': '10'})

It just makes the colours more obvious.

@JSKenyon

So, in a weird coincidence, I was doing some profiling of CubiCalV2 this morning and noticed some very familiar features - specifically those beautiful mountainous ramps in memory usage followed by a precipitous drop. Here are some plots:
[memory profile of CubiCalV2]
Now this genuinely bothered me as CubiCalV2 is supposed to be clean - it is almost completely functional and all results are written to disk as part of the graph. This was also distressingly similar to #359. As I did in the other issue, I tried embedding a manual garbage collector call in the graph. Behold the results:
[memory profile of CubiCalV2 with a manual gc call in the graph]
Of course, I cannot guarantee that this is the same as the features seen in some of the plots above, but it might be worth checking.

@sjperkins commented Apr 24, 2020

Yes, I'd think the next step would be to try embedding gc.collect calls in the tree reduction wrapped_combine method -- no need to hang on to MS data or leaf images beyond the chunk or wrapped_combine stages.

@o-smirnov or @Mulan-94, would you be willing to pursue the above? Otherwise I'll look at it next week when I've cleared my stack a bit.

I dislike the idea of embedding gc.collect() calls in the graph (although this is not a criticism of the need to do so in order to figure this problem out). It may be better to tune this with gc.set_threshold, especially the thresholds for the older generations (sketched after the list below). My working hypothesis for this memory-ramping behaviour is now:

  1. We're allocating very large arrays on which we do a lot of compute.
  2. They live for a fair amount of time.
  3. This means they survive collection of the first generation and make it into the next generation.
  4. Which means that subsequent (automatic) gc passes don't clear out the later generations promptly.
  5. Which means that large arrays get bumped into the even older generation.
  6. This leads to the memory ramping -- I suspect the peaks happen when the gc finally decides to clear out the oldest generation.
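
A minimal sketch of the threshold-tuning idea above (the values are illustrative only, not recommendations):

    import gc

    # Defaults are typically (700, 10, 10): t0 net allocations trigger a gen-0 pass,
    # every t1-th gen-0 pass also scans gen 1, every t2-th gen-1 pass also scans gen 2.
    t0, t1, t2 = gc.get_threshold()

    # Collect the older generations more often, so large long-lived arrays that have
    # become garbage don't wait for a rare full collection.
    gc.set_threshold(t0, 5, 5)

    # ... run the computation ...

    gc.set_threshold(t0, t1, t2)   # restore the defaults afterwards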

@o-smirnov

I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!

@JSKenyon, garbage collection is really the answer to everything, isn't it. :)

The only time I got a flat heartbeat was in this case, top left: #34 (comment)

@sjperkins

I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!

Argh, apologies; I was assuming familiarity with the subject matter.

The generational garbage collection bits of this article might explain it quickly: https://stackify.com/python-garbage-collection/.

@JSKenyon commented Apr 24, 2020

I have played around with the threshold settings, and I actually think that having manual GC calls is safer/better for applications which don't do much allocation. There are three threshold levels corresponding to different generations; let's call them t0, t1, and t2. The GC is triggered when the number of allocations minus the number of deallocations exceeds t0 (700 by default on my installation). Objects which survive collection are bumped into an older generation. When t0 has been triggered t1 times (10 by default on my installation), the GC is triggered again and examines both generation 0 and generation 1. Again, objects which survive are bumped into the third and final generation. Finally, when t1 has been triggered t2 times (also 10 by default), a complete GC is triggered, which will clear up any lingering objects.

However, if you have a very low number of allocations/deallocations, big arrays on which we do lots of compute will almost certainly end up in the oldest generation. And if we don't do much allocation/deallocation, they will stay there basically indefinitely. While it is plausible to manhandle the threshold settings to alleviate the problem, I would argue that it is more brittle and more prone to failure than having a single GC call at the point where cleanup is a necessity, e.g. at the end of processing a chunk.
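
As a sketch of the alternative argued for here (hypothetical process_chunk and chunks names, not shadems or CubiCalV2 code): a single explicit collection at the point where a chunk's large intermediaries are known to be dead.

    import gc

    def run(chunks, process_chunk):
        results = []
        for chunk in chunks:
            results.append(process_chunk(chunk))   # the chunk's big temporaries die here
            gc.collect()                           # reclaim them now, rather than waiting
        return results                             # for a rare full (gen-2) collection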

@sjperkins commented Apr 24, 2020

Yes, I think applications such as CubiCalV2 and shadems have more leeway in using the garbage collector as they wish.

Unfortunately, in this case, I think the optimal place to put the collect calls is in the datashader tree reduction, which is an internal API. Publishing code like that in an API is a hard no for me.

Of course this is Python so we can monkeypatch everything as a workaround, within reason ;-)

@JSKenyon

Ah, that does make sense: if the goal is taming a dependency, then the thresholds are probably the way to go.

@o-smirnov

I'm happy to monkeypatch it in for now, and if it works and solves the problem, then we can discuss how, and whether, to get it into datashader properly.

@sjperkins, where is this stack() call you speak of happening?

@sjperkins

@sjperkins, where is this stack() call you speak of happening?

In datashader's combine function; there was more detail on this here:

#29 (comment)

The tree reduction still calls combine but in batches with far fewer arrays (roughly, the split_every parameter in dask.array.reduction).
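
A minimal, self-contained sketch of that batching (not datashader's code): with dask.array.reduction, combine only ever sees at most split_every partial results at a time.

    import dask.array as da
    import numpy as np

    x = da.random.random((100_000, 100), chunks=(1_000, 100))

    def chunk(block, axis=None, keepdims=False, **kwargs):
        # Per-chunk partial result.
        return np.sum(block, axis=axis, keepdims=keepdims)

    def combine(parts, axis=None, keepdims=False, **kwargs):
        # Merge a small batch of partial results (at most split_every of them).
        return np.sum(parts, axis=axis, keepdims=keepdims)

    total = da.reduction(x, chunk, aggregate=combine, combine=combine,
                         split_every=4, dtype=x.dtype)
    print(total.compute())   # ~5,000,000 for uniform random data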

@o-smirnov

I added gc.collect() to both entry and exit of wrapped_combine, but it doesn't seem to make much of a difference (compare to top two plots in #34 (comment)):

[memory profiles with gc.collect() added to wrapped_combine]
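
For the record, a hypothetical sketch of that experiment: wrapping whatever combine callable the tree reduction uses so that collection happens on entry and exit (wrapped_combine below is a stand-in name, not datashader's actual function):

    import functools
    import gc

    def gc_wrapped(combine_fn):
        # Wrap a combine callback so garbage is collected before and after it runs.
        @functools.wraps(combine_fn)
        def wrapper(*args, **kwargs):
            gc.collect()                      # clear leftovers from earlier batches
            try:
                return combine_fn(*args, **kwargs)
            finally:
                gc.collect()                  # and the just-consumed inputs on the way out
        return wrapper

    # e.g. wrapped_combine = gc_wrapped(wrapped_combine)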

o-smirnov changed the title from "performance notes with and without tree reductions" to "performance notes with and without tree reductions etc." on Jun 2, 2020
@o-smirnov

@sjperkins, if you've re-stocked your beer supplies, tonight would be a good night to open one. Here's a memory profile with dataframe_factory:

[memory profiles: df-factory vs. previous version]

@sjperkins commented Jun 2, 2020 via email

@sjperkins

Also possibly: Big-O space complexity for the win: #34 (comment)

I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:

2 x 1000 x 4096 x 4 x 16 bytes (broadcasted MS data) + 1024 x 1024 x 64 x 8 bytes (image) ~ 1012 MB per thread

Multiply by 64 threads ~ 64 GB
Let's say a fudge factor of 2 ~ 128 GB

I'm not sure if the above figures are right for this:

[memory profiles: df-factory vs. previous version]

but if they are it's pretty close. It looks like shadems is running 72 threads. On average they're using 500MB each (36GB total), but at peak memory usage they're using ~1GB (75GB total).

@o-smirnov, was the above run done with 1000 rows, 4096 channels, a ~1024^2 image and 64 categories?

That double-peak in the memory pattern is retained in the new version. I wonder what it is? Images being aggregated to form a final value? Does datashader run through the data twice?

@sjperkins

/cc'ing @rubyvanrooyen, who may also find #34 (comment) useful.

@o-smirnov

@sjperkins: it was 722610 rows, 4k channels, 4 corrs, 16 categories.

The double-peak is there because there's a first run-through to determine the data ranges. We can eliminate that if we fix the plot limits, but we don't always know what to fix them to. #55 will help.

@sjperkins

it was 722610 rows, 4k channels, 4 corrs, 16 categories.

Ah, but what was the -z option, 1000, 10000?

@o-smirnov

10000

@sjperkins

Then maybe we're doing quite well. Let's remove the factor of 2 on the visibilities, because I think numpy broadcasting functionality doesn't actually expand the underlying array to full resolution: it uses 0 strides to give the impression of it. Then:

10,000 x 4,096 x 4 x 16 bytes + 1024 x 1024 x 16 x 8 bytes ~ 2.5 GB max per thread

The memory profile is suggesting ~36GB over 72 threads in the average case (i.e. an average of ~500MB per thread) and ~75GB over 72 threads in the peak case (i.e. an average of ~1GB per thread).

I guess the visibility data doesn't stay in memory all that long -- it gets converted to an image before the tree reduction step.

All speculation, but useful to start with some sort of model and refine it.
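
A quick numpy check of the zero-strides point (illustrative only):

    import numpy as np

    x = np.ones((10_000, 1, 1))                      # e.g. a per-row quantity
    xb = np.broadcast_to(x, (10_000, 4096, 4))       # "expand" to rows x chans x corrs

    print(x.strides)     # (8, 8, 8)
    print(xb.strides)    # (8, 0, 0): the broadcast axes have stride 0, so nothing is copied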

@o-smirnov commented Nov 5, 2020

I did a few more benchmarks with and without tree reduction ("tree" versus "master") for varying problem sizes and dask chunk sizes: https://www.dropbox.com/sh/m0fch390vliqkkf/AACBcAHkHCZyzsFW3U7dnXcOa?dl=0

Observations:

  • Tree reduction is always better or equivalent (both in terms of RAM footprint and runtime)

  • On small problems, there's not much in the comparison

  • With small chunks (chunk size 1000), performance deteriorates sharply in all cases (but tree reduction still does better). The long problem did not complete for the "master" version; I killed it after 20 minutes (it was using >200 GB and only a single core).

  • RAM footprint increases with chunk size, while runtime decreases slightly. At a chunk size of 20000, RAM usage is comparable (with or without tree reduction), but with smaller chunks, tree reduction always wins.

  • The above use 16 categories. I also did a test with a big problem and 64 categories. This is where the non-tree-reduction version breaks apart. The chunk-size-1000 test ran out of memory. The rest ran, but were very memory-hungry:

[memory profiles: master version vs. tree reduction]

