
performance notes with and without tree reductions etc. #34

o-smirnov opened this issue Apr 23, 2020 · 38 comments


@o-smirnov

Continuing on from #29, just more systematically.

With tree reduction, 1e+10 points.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 10000 --bmap rainbow

Tops out at ~250 GB, runs in 145s.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 50000 --bmap rainbow

Blows out my RAM.

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

Tops out at ~70 GB, runs in 245s (but the run had competition on the box). Going to get more exact numbers from ResourceProfiler shortly.

So @sjperkins, first take: mission accomplished insofar as getting the RAM use under control. Memory-parched users can just dial their chunk size down.
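
As an aside, a minimal sketch of the kind of ResourceProfiler measurement mentioned above, assuming the computation is available here as a dask collection called result (shadems drives this internally via its --profile option):

    from dask.diagnostics import Profiler, ResourceProfiler, visualize

    # Sample CPU and memory usage while the graph executes.
    with Profiler() as prof, ResourceProfiler(dt=0.5) as rprof:
        result.compute()

    # rprof.results holds the time/memory/CPU samples; this writes a bokeh report.
    visualize([prof, rprof], filename="profile.html", show=False)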

@o-smirnov

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

1065s, RAM touched 300 GB but I'll take it, it's a tough plot.

@o-smirnov commented Apr 23, 2020

Same settings but 16 colors. Note this is 4e+10 points!

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 16 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow

707s (the box was a bit contended, though). RAM ~200 GB.

2e+10 points. I think I've broken corr selection, though, as it plotted all four (but why not 4e+10 then? Odd):

shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 1000

830s. Didn't notice the RAM. :) OK time to put a proper profiler in.

@sjperkins commented Apr 23, 2020

1065s, RAM touched 300 GB but I'll take it, it's a tough plot.

I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:

2 x 1000 x 4096 x 4 x 16 bytes (broadcasted MS data) + 1024 x 1024 x 64 x 8 bytes (image) ~ 1012 MB per thread

Multiply by 64 threads ~ 64 GB
Let's say a fudge factor of 2 ~ 128 GB
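
For reference, a quick sketch reproducing the arithmetic above, with the sizes taken from the estimate:

    # Back-of-the-envelope memory estimate, using the numbers quoted above.
    rows, chans, corrs = 1000, 4096, 4               # one MS chunk
    data = 2 * rows * chans * corrs * 16             # complex128 visibilities ("broadcasted MS data")

    nx, ny, categories = 1024, 1024, 64
    image = nx * ny * categories * 8                 # float64 aggregate image

    per_thread_mb = (data + image) / 2**20           # ~1012 MB per thread
    total_gb = per_thread_mb * 64 / 1024             # ~63 GB over 64 threads
    print(f"{per_thread_mb:.0f} MB per thread, ~{total_gb:.0f} GB over 64 threads")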

My instinct would be to check the broadcast. The fix @JSKenyon put into ragavi avoided a lot of extraneous communication in the graph.

@sjperkins

Updated estimate to include categories in image

@sjperkins

Updated again to cater for 64 categories

@o-smirnov commented Apr 23, 2020

[plot]

OK, systematically now. This is a 1e+10-point plot with minimal I/O (only UVW) and no colors. Ran for chunk sizes of 1000, 5000, 10000 like so:

shadems ms-4k-cal.ms -x U -y V -z 10000 -s "z{row_chunk_size}.tree" --profile

Tree vs original reduction (chunk sizes 1000, 5000, 10000):

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@o-smirnov commented Apr 23, 2020

[plot]

shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 -z 10000 -s "z{row_chunk_size}.tree" --profile

Now with 16 colours, chunk size 1000, 5000, 10000. Note how the original reduction code breaks down at lower chunk sizes:

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@sjperkins commented Apr 23, 2020

The task ordering looks pretty good for 10,000 rows on minsize.ms. Colour indicates priority (in the round nodes), with red prioritised first and blue last. There are four independent colour streams, corresponding to four MS chunks.

[task graph with nodes coloured by priority]

Here's another view of the same graph, with task names in the round nodes.

task-names.pdf

I'm fiddling to try and get both names and priority in the same plot; it's a bit difficult to decipher both at the same time.

Edit: generated by calling the visualize method on a dask array/dataframe collection:

    R.visualize("graph.pdf")
    R.visualize("graph.png", order="color")

@o-smirnov commented Apr 23, 2020

OK, now we make it harder. 5e+9 points, but they have to come from disk.

shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 10000 -s "z{row_chunk_size}.tree" --profile

[plot]

Chunk size 1000, 5000, 10000. Same picture really:

[memory profiles, tree reduction vs. original reduction, for chunk sizes 1000, 5000, 10000]

@o-smirnov

Another very realistic case. 5e+9 points in total, two fields iterated over.

shadems ms-4k-cal.ms -x CORRECTED_DATA:I:real -y CORRECTED_DATA:I:imag -c ANTENNA1 --cnum 64 --iter-field -z 10000 -s z{row_chunk_size}.tree --profile
[plots for the two fields]

Chunk size 10000, tree reduction (separate profiles for two fields):

[memory profiles for the two fields]

@o-smirnov commented Apr 23, 2020

Another realistic case, two fields, UV-plot coloured by phase.

shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:I:phase --cmin -2 --cmax 2 --cnum 64 --iter-field -z 10000 -s "z{row_chunk_size}.tree" --profile

Chunk size 10000, 1000, tree reduction.

[plots and memory profiles for fields 0408 and gaincal, chunk sizes 10000 and 1000]

Chunk size 10000, original reduction. Faster but hungrier:

[plots for fields 0408 and gaincal, chunk size 10000, original reduction]

Chunk size 1000 with the original reduction blew out my 512 GB of RAM, so I gave up.

@Mulan-94 commented Apr 23, 2020

@o-smirnov These look really cool. Are these last ones also from the tree reduction?
Also, what is the importance of colouring by phase, if I may ask?

@o-smirnov

The top set is for tree reduction (sorry, editing the comment under you!), the bottom set is for the original reduction.

Phase should be ==0 on properly corrected calibrator data, so a good plot of this kind is a bland plot. The stripy pattern in the left column suggests a problem in 0408 -- most likely an unmodelled field source contributing a fringe.

@Mulan-94 commented Apr 23, 2020

Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?

@o-smirnov

No, but @sjperkins has also been wondering...

@sjperkins commented Apr 24, 2020

Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?

@Mulan-94 Are you referring to this kind of plot (taken from #34 (comment))?

[memory profile plot]

If so, the climb in memory at the end is almost certainly the np.stack in the combine function of the original reduction.

Having said that, I'm bothered by this kind of pattern in the tree reduction (#34 (comment)).

[memory profile plot]

I would've hoped for a flatter heartbeat pattern, without those two peaks. I'll try to block off some time next week to look at this.
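
To illustrate the np.stack point with a toy example (not datashader's actual code, just the shape of the problem): stacking every per-chunk aggregate before reducing materialises all of them at once, so the combine step's transient footprint grows with the number of chunks.

    import numpy as np

    n_chunks = 8                                # hundreds of chunks in a real run
    agg_shape = (64, 64, 4)                     # real case is more like (1024, 1024, 64), ~512 MB each

    # Per-chunk aggregate images, as produced before the final combine.
    aggs = [np.random.rand(*agg_shape) for _ in range(n_chunks)]

    stacked = np.stack(aggs)                    # one array holding every chunk's aggregate
    combined = stacked.sum(axis=0)              # the final result is only a single image
    print(stacked.nbytes // combined.nbytes)    # the transient is n_chunks times larger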

@JSKenyon

As a weird check, could someone try running a test using dask==2.9.1 as opposed to the latest version? While the ordering plot Simon included looks correct, I would be interested to see if the ordering changes introduced after 2.9.1 are affecting the results.

@JSKenyon

Oh, and for ordering plots, I was informed of the following useful invocation:

    import dask

    dask.visualize(dsk,
                   color='order',
                   cmap='autumn',
                   filename='output.pdf',
                   node_attr={'penwidth': '10'})

It just makes the colours more obvious.

@JSKenyon

So, in a weird coincidence, I was doing some profiling of CubiCalV2 this morning and noticed some very familiar features - specifically those beautiful mountainous ramps in memory usage followed by a precipitous drop. Here are some plots:
[memory profile of CubiCalV2]
Now this genuinely bothered me as CubiCalV2 is supposed to be clean - it is almost completely functional and all results are written to disk as part of the graph. This was also distressingly similar to #359. As I did in the other issue, I tried embedding a manual garbage collector call in the graph. Behold the results:
[memory profile of CubiCalV2 with a manual gc call in the graph]
Of course, I cannot guarantee that this is the same as the features seen in some of the plots above, but it might be worth checking.

@sjperkins commented Apr 24, 2020

Yes, I'd think the next step would be to try embedding gc.collect calls in the tree reduction wrapped_combine method -- no need to hang on to MS data or leaf images beyond the chunk or wrapped_combine stages.

@o-smirnov or @Mulan-94, would you be willing to pursue the above? Otherwise I'll look at it next week when I've cleared my stack a bit.

I dislike the idea of embedding gc.collect() calls in the graph (although this is not a criticism of the need to do so in order to figure this problem out). It may be better to tune this with gc.set_threshold, especially the thresholds for the older generations (sketched after the list below). My working hypothesis for this memory-ramping behaviour is now:

  1. We're allocating very large arrays on which we do a lot of compute.
  2. They live for a fair amount of time.
  3. This means they survive collection of the first generation and make it into the next generation.
  4. Which means that subsequent (automatic) gc passes don't clear out the later generations promptly.
  5. Which means that large arrays get bumped into the even older generation.
  6. This leads to the memory ramping -- I suspect the peaks happen when the gc finally decides to clear out the oldest generation.
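
A minimal sketch of the threshold-tuning idea above (the values are illustrative only, not recommendations):

    import gc

    # Defaults are typically (700, 10, 10): t0 net allocations trigger a gen-0 pass,
    # every t1-th gen-0 pass also scans gen 1, every t2-th gen-1 pass also scans gen 2.
    t0, t1, t2 = gc.get_threshold()

    # Collect the older generations more often, so large long-lived arrays that have
    # become garbage don't wait for a rare full collection.
    gc.set_threshold(t0, 5, 5)

    # ... run the computation ...

    gc.set_threshold(t0, t1, t2)   # restore the defaults afterwards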

@o-smirnov

I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!

@JSKenyon, garbage collection is really the answer to everything, isn't it. :)

The only time I got a flat heartbeat was in this case, top left: #34 (comment)

@sjperkins

I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!

Argh, apologies; I was assuming familiarity with the subject matter.

The generational garbage collection bits of this article might explain it quickly: https://stackify.com/python-garbage-collection/.

@JSKenyon commented Apr 24, 2020

I have played around with the threshold settings, and I actually think that having manual GC calls is safer/better for applications which don't do much allocation. There are three threshold levels corresponding to different generations; let's call them t0, t1, and t2. The GC is triggered when the number of allocations minus the number of deallocations exceeds t0 (700 by default on my installation). Objects which survive collection are bumped into an older generation. When t0 has been triggered t1 times (10 by default on my installation), the GC is triggered again and examines both generation 0 and generation 1. Again, objects which survive are bumped into the third and final generation. Finally, when t1 has been triggered t2 times (also 10 by default), a complete GC is triggered, which will clear up any lingering objects.

However, if you have a very low number of allocations/deallocations, big arrays on which we do lots of compute will almost certainly end up in the oldest generation. And if we don't do much allocation/deallocation, they will stay there basically indefinitely. While it is plausible to manhandle the threshold settings to alleviate the problem, I would argue that it is more brittle and more prone to failure than having a single GC call at the point where cleanup is a necessity, e.g. at the end of processing a chunk.
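
As a sketch of the alternative argued for here (hypothetical process_chunk and chunks names, not shadems or CubiCalV2 code): a single explicit collection at the point where a chunk's large intermediaries are known to be dead.

    import gc

    def run(chunks, process_chunk):
        results = []
        for chunk in chunks:
            results.append(process_chunk(chunk))   # the chunk's big temporaries die here
            gc.collect()                           # reclaim them now, rather than waiting
        return results                             # for a rare full (gen-2) collection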

@sjperkins commented Apr 24, 2020

Yes, I think applications such as CubiCalV2 and shadems have more leeway in using the garbage collector as they wish.

Unfortunately, in this case, I think the optimal place to put the collect calls is in the datashader tree reduction, which is an internal API. Publishing code like that in an API is a hard no for me.

Of course this is Python so we can monkeypatch everything as a workaround, within reason ;-)

@JSKenyon

Ah, that does make sense: if the goal is taming a dependency, then the thresholds are probably the way to go.

@o-smirnov

I'm happy to monkeypatch it in for now, and if it works and solves the problem, then we can discuss how, and whether, to get it into datashader properly.

@sjperkins, where is this stack() call you speak of happening?

@sjperkins

@sjperkins, where is this stack() call you speak of happening?

In datashader's combine function; there was more detail on this here:

#29 (comment)

The tree reduction still calls combine but in batches with far fewer arrays (roughly, the split_every parameter in dask.array.reduction).
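
A minimal, self-contained sketch of that batching (not datashader's code): with dask.array.reduction, combine only ever sees at most split_every partial results at a time.

    import dask.array as da
    import numpy as np

    x = da.random.random((100_000, 100), chunks=(1_000, 100))

    def chunk(block, axis=None, keepdims=False, **kwargs):
        # Per-chunk partial result.
        return np.sum(block, axis=axis, keepdims=keepdims)

    def combine(parts, axis=None, keepdims=False, **kwargs):
        # Merge a small batch of partial results (at most split_every of them).
        return np.sum(parts, axis=axis, keepdims=keepdims)

    total = da.reduction(x, chunk, aggregate=combine, combine=combine,
                         split_every=4, dtype=x.dtype)
    print(total.compute())   # ~5,000,000 for uniform random data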

@o-smirnov

I added gc.collect() to both entry and exit of wrapped_combine, but it doesn't seem to make much of a difference (compare to top two plots in #34 (comment)):

[memory profiles with gc.collect() added to wrapped_combine]
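
For the record, a hypothetical sketch of that experiment: wrapping whatever combine callable the tree reduction uses so that collection happens on entry and exit (wrapped_combine below is a stand-in name, not datashader's actual function):

    import functools
    import gc

    def gc_wrapped(combine_fn):
        # Wrap a combine callback so garbage is collected before and after it runs.
        @functools.wraps(combine_fn)
        def wrapper(*args, **kwargs):
            gc.collect()                      # clear leftovers from earlier batches
            try:
                return combine_fn(*args, **kwargs)
            finally:
                gc.collect()                  # and the just-consumed inputs on the way out
        return wrapper

    # e.g. wrapped_combine = gc_wrapped(wrapped_combine)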

o-smirnov changed the title from "performance notes with and without tree reductions" to "performance notes with and without tree reductions etc." on Jun 2, 2020
@o-smirnov

@sjperkins, if you've re-stocked your beer supplies, tonight would be a good night to open one. Here's a memory profile with dataframe_factory:

[memory profiles: df-factory vs. previous version]

@sjperkins commented Jun 2, 2020 via email

@sjperkins

Also possibly: Big-O space complexity for the win: #34 (comment)

I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:

2 x 1000 x 4096 x 4 x 16 bytes (broadcasted MS data) + 1024 x 1024 x 64 x 8 bytes (image) ~ 1012 MB per thread

Multiply by 64 threads ~ 64 GB
Let's say a fudge factor of 2 ~ 128 GB

I'm not sure if the above figures are right for this:

[memory profiles: df-factory vs. previous version]

but if they are it's pretty close. It looks like shadems is running 72 threads. On average they're using 500MB each (36GB total), but at peak memory usage they're using ~1GB (75GB total).

@o-smirnov, was the above run done with 1000 rows, 4096 channels, a ~1024^2 image and 64 categories?

That double-peak in the memory pattern is retained in the new version. I wonder what it is? Images being aggregated to form a final value? Does datashader run through the data twice?

@sjperkins

/cc'ing @rubyvanrooyen, who may also find #34 (comment) useful.

@o-smirnov

@sjperkins: it was 722610 rows, 4k channels, 4 corrs, 16 categories.

The double-peak is there because there's a first run-through to determine the data ranges. We can eliminate that if we fix the plot limits, but we don't always know what to fix them to. #55 will help.

@sjperkins

it was 722610 rows, 4k channels, 4 corrs, 16 categories.

Ah, but what was the -z option, 1000, 10000?

@o-smirnov

10000

@sjperkins

Then maybe we're doing quite well. Let's remove the factor of 2 on the visibilities, because I think numpy broadcasting functionality doesn't actually expand the underlying array to full resolution: it uses 0 strides to give the impression of it. Then:

10,000 x 4,096 x 4 x 16 bytes + 1024 x 1024 x 16 x 8 bytes ~ 2.5 GB max per thread

The memory profile is suggesting ~36GB over 72 threads in the average case (i.e. an average of ~500MB per thread) and ~75GB over 72 threads in the peak case (i.e. an average of ~1GB per thread).

I guess the visibility data doesn't stay in memory all that long -- it gets converted to an image before the tree reduction step.

All speculation, but useful to start with some sort of model and refine it.
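
A quick numpy check of the zero-strides point (illustrative only):

    import numpy as np

    x = np.ones((10_000, 1, 1))                      # e.g. a per-row quantity
    xb = np.broadcast_to(x, (10_000, 4096, 4))       # "expand" to rows x chans x corrs

    print(x.strides)     # (8, 8, 8)
    print(xb.strides)    # (8, 0, 0): the broadcast axes have stride 0, so nothing is copied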

@o-smirnov commented Nov 5, 2020

I did a few more benchmarks with and without tree reduction ("tree" versus "master") for varying problem sizes and dask chunk sizes: https://www.dropbox.com/sh/m0fch390vliqkkf/AACBcAHkHCZyzsFW3U7dnXcOa?dl=0

Observations:

  • Tree reduction is always better or equivalent (both in terms of RAM footprint and runtime)

  • On small problems, there's not much in the comparison

  • With small chunks (chunk size 1000), performance deteriorates sharply in all cases (but tree reduction still does better). The long problem did not complete for the "master" version; I killed it after 20 minutes (it was using >200 GB and only a single core).

  • RAM footprint increases with chunk size, while runtime decreases slightly. At a chunk size of 20000, RAM usage is comparable (with or without tree reduction), but with smaller chunks, tree reduction always wins.

  • The above use 16 categories. I also did a test with a big problem and 64 categories. This is where the non-tree-reduction version breaks apart. The chunk-size-1000 test ran out of memory. The rest ran, but were very memory-hungry:

[memory profiles: master version vs. tree reduction]

