
Optimize p_box3x3_f32 function #149

Open · wants to merge 2 commits into master
Conversation

dsebasti

The current box filter function is replaced with one that runs in just over 50% of the original time; its throughput is ~150% that of the OpenCV equivalent (i.e. ~2/3 of OpenCV's running time).

Signed-off-by: Daniel Sebastian <[email protected]>
@aolofsson
Member

What is your take on removing the first and last row processing?

@dsebasti
Author

Not entirely sure I understand your question, but I extracted the first- and last-row handling into separate loops mostly to avoid evaluating a few 'if' statements that would otherwise be checked on every iteration even though they only matter for the edge rows. I didn't measure the performance impact of taking that route.

@ebadi
Contributor

ebadi commented Jun 16, 2015

You can remove many parts of your code, as I think only the "Process (rows - 4) middle rows" part is needed. (If you don't see why, try comparing the output array with that of the standard implementation.)

Can you explain your performance statistics? It seems the inner block is executed (almost) as many times as in the previous implementation, so I think it may even be slower!

```c
for (i = 4; i < rows; i++)
    for (j = 3; j < cols; j++)
        /* [inner block] */
```

@dsebasti
Author

I tried naively removing the 4 extracted loops, and it crashes for me. Even if it didn't, I'm fairly sure that making that change without introducing some 'if' statements in the main loop would not be correct: for a 10-column-wide image, the first calculation in the main loop would try to store a value to pr[-8].

As for performance, I can understand why you might think it would be slower, but it improves speed mostly by reducing overall memory accesses. For a 10x10 input image, the first iteration of the naive implementation reads values from the input array at positions 0, 1, 2, 10, 11, 12, 20, 21, and 22. The second iteration will read values from positions 1, 2, 3, 11, 12, 13, 21, 22, and 23; 6 values have been re-accessed. In contrast, my implementation does not re-access any input values, and it only requires 3 stores to each output position (compared to 1 for the naive implementation), so overall, my implementation makes fewer memory accesses.

@ebadi
Contributor

ebadi commented Jun 16, 2015

If memory access time is much slower than all those extra operations you introduced in that block, then your algorithm works faster. :)

Note: I think the number of memory accesses is higher than 3, since reads of variables (j, sum, cols in the inner block) also count as memory accesses.

@dsebasti
Author

Good point about those extra accesses; I guess I was just thinking about the image arrays. As a sample of the timings of the naive implementation vs. mine: I made 4 calls on the same 5000x10000 test image; the naive runs take a total of 8.4s, and mine takes 2.9s, so those array accesses must not come cheap. ;)

@dsebasti
Author

I think I owe you guys an apology. I did a quick re-implementation with the extracted loops reintegrated in the form of several scattered 'if' statements; the result was a ~10% performance increase (2.76s vs. 2.52s using the same test from my last comment). It seems a bit less readable to me, but I'll add a few comments and get that committed ASAP. Sorry about that, guys. =)

Edit: On further inspection (and after a bit of bug squashing), switching Qt Creator from Debug to Release builds causes the performance difference between the two versions to disappear (<1% difference). At this point I'm open to contributing whichever version you guys would prefer; just let me know if I should leave it as is, or submit the condensed version. (And if I should submit the new version, do I do that by adding another commit to this pull request, creating a new one, or something else? Sorry; I'm a bit new to GitHub.)

The new version is mostly the same, with the four outer loops
(which handled the first and last rows) now integrated into the
center loop, using 'if' statements rather than separate loops to
get the desired behavioral changes for the outer rows.

Signed-off-by: Daniel Sebastian <[email protected]>
@dsebasti
Author

I went ahead and added the new version. It is a bit smaller (which helps with the code-size minimization requirement), and its performance ranges from indistinguishable to 10% better, so it's at least as good as (if not better than) the other version anyway.
