An explanation for cascading variants #1

Open
patmorin opened this issue Sep 15, 2015 · 1 comment

patmorin commented Sep 15, 2015

I'm currently writing a paper about searching in ordered arrays that contains an explanation of why your cascading/prefetching variant of binary heapsort is so fast. It also contains an idea that will probably make it faster: rather than prefetching a child of the current node, prefetch one of its 4 grandchildren, 8 great-grandchildren, or 16 great-great-grandchildren. (Which one to use depends on your element size; in particular, on how many elements fit into a cache line.)

You can read about this in the draft paper. Section 3.2 and (to a lesser extent) Section 4.1 contain the relevant information.
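
In sketch form, the idea is something like this (my illustration, not code from the paper; __builtin_prefetch is the GCC/Clang builtin):

    #include <algorithm>
    #include <cstddef>

    // Sketch of the grandchild-prefetching idea, not code from the paper.
    // In a 1-based implicit binary heap, the descendants d levels below
    // `index` occupy indices [index << d, (index << d) + (1 << d)): the 4
    // grandchildren start at index << 2, the 8 great-grandchildren at
    // index << 3, and so on. For small elements, several of those
    // descendants share one cache line, so a single prefetch covers most
    // of the comparisons that will be needed d levels from now.
    template <unsigned Depth> // 2, 3, or 4, depending on the cache-line size
    inline void prefetchDescendants(int const* a, std::size_t index,
                                    std::size_t end) {
        __builtin_prefetch(a + std::min(index << Depth, end));
    }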

tarsa (Owner) commented Sep 15, 2015

Hi Pat,
Thank you for your input.

Actually, I've already implemented deep prefetching. Look into e.g. sortheapbinaryaheadsimplevarianta.hpp and you'll see code like this:

    // prefetch the cache line three levels below the left child,
    // clamped so the address stays inside the heap
    prefetch<1, 0>(a + std::min(left * 8, end));
    // branchless descent: step toward the larger of the two children
    ssize_t const newIndex = index * 2
            + compOp(a[left], Below, a[right]);

If the prefetching depth is N levels, then up to N prefetches can be in flight at once: each step of the sift-down issues one prefetch N levels ahead, and it takes N more steps before the prefetched node is actually compared.
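
For illustration, a depth-3 version of such a sift-down loop could look like this (a simplified sketch, not the repository's exact code):

    #include <algorithm>
    #include <cstddef>

    // Sketch: sift-down over a 1-based max-heap a[1..end] with 3-level-deep
    // prefetching. The prefetches issued in the last three iterations are
    // always ahead of the comparisons, so up to 3 can be in flight at once.
    void siftDownPrefetch3(int* a, std::size_t index, std::size_t end) {
        int const item = a[index];
        std::size_t left;
        while ((left = index * 2) + 1 <= end) { // both children exist
            // needed three iterations from now; clamp to stay in the heap
            __builtin_prefetch(a + std::min(left * 8, end));
            std::size_t const next = left + (a[left] < a[left + 1] ? 1 : 0);
            if (a[next] <= item) { a[index] = item; return; }
            a[index] = a[next];
            index = next;
        }
        if (left <= end && a[left] > item) { // lone left child at the bottom
            a[index] = a[left];
            index = left;
        }
        a[index] = item;
    }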

In the cascading variants, for a heap of depth N there can also be up to about N simultaneous prefetches, but for a different reason: cascading heapsort runs about N sift-downs simultaneously, each with single-level-deep prefetching.
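
Schematically, the cascading mechanism works something like this (a simplified sketch, not the actual code from this repository):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Each pending sift-down remembers the item being placed and the
    // current hole index in a 1-based max-heap a[1..end].
    struct SiftState {
        int item;
        std::size_t hole;
    };

    // Advance one pending sift-down by a single level, issuing a
    // one-level-deep prefetch for its next step. Returns false when done.
    bool stepSiftDown(int* a, std::size_t end, SiftState& s) {
        std::size_t const left = s.hole * 2;
        if (left > end) { a[s.hole] = s.item; return false; }
        std::size_t const next =
            (left < end && a[left] < a[left + 1]) ? left + 1 : left;
        __builtin_prefetch(a + std::min(next * 2, end)); // children of next
        if (a[next] <= s.item) { a[s.hole] = s.item; return false; }
        a[s.hole] = a[next];
        s.hole = next;
        return true;
    }

    // Round-robin driver: with N sift-downs pending, every pass issues up
    // to N prefetches, so roughly N memory requests are in flight at once.
    void drainCascade(int* a, std::size_t end, std::vector<SiftState>& pending) {
        while (!pending.empty()) {
            for (std::size_t i = 0; i < pending.size();) {
                if (stepSiftDown(a, end, pending[i])) ++i;
                else { pending[i] = pending.back(); pending.pop_back(); }
            }
        }
    }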

In practice, it seems that prefetching doesn't do any good within the processor caches (i.e. in the top of the heap), which is probably what you found, and usually the vast majority of heap levels fit in the processor caches. We're also limited by the memory subsystem in how many simultaneous prefetches it can sustain, so increasing their number doesn't improve performance linearly. I've done some experiments with random array accesses and found that prefetching improved performance by about 2.5x, on a Core 2 Duo IIRC.
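
A minimal version of that kind of random-access experiment (my sketch here, not the original benchmark; the 2.5x figure depends heavily on the machine):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sum array elements in a random order, prefetching a fixed number of
    // iterations ahead; compile with and without the prefetch to compare.
    int main() {
        std::size_t const n = std::size_t(1) << 26; // ~256 MiB of ints
        int const ahead = 8;                        // prefetch distance (tunable)
        std::vector<int> data(n, 1);
        std::vector<std::size_t> order(n);
        std::mt19937_64 rng(12345);
        for (std::size_t i = 0; i < n; ++i) order[i] = rng() % n;

        long long sum = 0;
        auto const start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i) {
            if (i + ahead < n)
                __builtin_prefetch(&data[order[i + ahead]]); // hide the miss latency
            sum += data[order[i]];
        }
        auto const stop = std::chrono::steady_clock::now();
        auto const ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            stop - start).count();
        std::printf("sum = %lld, time = %lld ms\n", sum, (long long)ms);
        return 0;
    }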

I was discussing the benchmark at http://encode.su/threads/1914-Sorting-Algorithms-Benchmark and you can find more of my findings there.
