
Optimize p_box3x3_f32 function #149

Open · wants to merge 2 commits into master
Conversation

dsebasti

The current box filter function is replaced with one that runs in just over 50% of the original time; its throughput is ~150% that of the OpenCV equivalent (i.e. ~2/3 of OpenCV's running time).

Signed-off-by: Daniel Sebastian <[email protected]>
@aolofsson
Member

What is your take on removing the first and last row processing?

@dsebasti
Author

Not entirely sure I understand your question, but I extracted the first- and last-row handling into separate loops mostly to avoid evaluating a few 'if' statements that would otherwise be checked on every iteration even though they only matter for the edge rows. I didn't measure the performance impact of taking that route.

@ebadi
Contributor

ebadi commented Jun 16, 2015

You can remove many parts of your code, as I think only the "Process (rows - 4) middle rows" part is needed. (If you don't see why, try comparing the output array with that of the standard implementation.)

Can you explain your performance statistics? It seems the inner block is executed (almost) as many times as in the previous implementation, so I think it may even be slower!

```c
for (i = 4; i < rows; i++)
    for (j = 3; j < cols; j++)
        /* [inner block] */
```

@dsebasti
Author

I tried naively removing the 4 extracted loops, and it crashes for me. Even if it didn't, I'm fairly sure that making that change without introducing some 'if' statements in the main loop would not be correct: for a 10-column-wide image, the first calculation in the main loop would try to store a value to pr[-8].

As for performance, I can understand why you might think it would be slower, but it improves speed mostly by reducing overall memory accesses. For a 10x10 input image, the first iteration of the naive implementation reads values from the input array at positions 0, 1, 2, 10, 11, 12, 20, 21, and 22. The second iteration will read values from positions 1, 2, 3, 11, 12, 13, 21, 22, and 23; 6 values have been re-accessed. In contrast, my implementation does not re-access any input values, and it only requires 3 stores to each output position (compared to 1 for the naive implementation), so overall, my implementation makes fewer memory accesses.

@ebadi
Contributor

ebadi commented Jun 16, 2015

If memory access time is much slower than all those extra operations you introduced in that block, then your algorithm works faster. :)

Note: I think the number of memory accesses is higher than 3, since reads of variables (j, sum, cols in the inner block) also count as memory accesses.

@dsebasti
Author

Good point about those extra accesses; I guess I was just thinking about the image arrays. As a sample of the timings of the naive implementation vs. mine: I made 4 calls on the same 5000x10000 test image; the naive runs take a total of 8.4s, and mine takes 2.9s, so those array accesses must not come cheap. ;)

@dsebasti
Author

I think I owe you guys an apology. I did a quick re-implementation with the extracted loops reintegrated in the form of several scattered 'if' statements; the result was a ~10% performance increase (2.76s vs. 2.52s using the same test from my last comment). It seems a bit less readable to me, but I'll add a few comments and get that committed ASAP. Sorry about that, guys. =)

Edit: On further inspection (and after a bit of bug squashing), switching Qt Creator from Debug to Release builds causes the performance difference between the two versions to disappear (<1% difference). At this point I'm open to contributing whichever version you guys would prefer; just let me know if I should leave it as is, or submit the condensed version. (And if I should submit the new version, do I do that by adding another commit to this pull request, creating a new one, or something else? Sorry; I'm a bit new to GitHub.)

The new version is mostly the same, with the four outer loops
(which handled the first and last rows) now integrated into the
center loop, using 'if' statements rather than separate loops to
get the desired behavioral changes for the outer rows.

Signed-off-by: Daniel Sebastian <[email protected]>
@dsebasti
Author

I went ahead and added the new version. It is a bit smaller (which helps with the code-size minimization requirement), and its performance ranges from indistinguishable to 10% better, so it's at least as good as (if not better than) the other version anyway.
