
Optimize the Akaze feature detector #67

Closed
5 tasks done
stephanemagnenat opened this issue Jun 9, 2023 · 26 comments · Fixed by #68

Comments

@stephanemagnenat
Contributor

stephanemagnenat commented Jun 9, 2023

While working on fixing #63, I'm going through all of Akaze's source code quite deeply, and so as not to lose that understanding, I'm collecting here the low-hanging-fruit optimization opportunities I'm discovering. These are:

  • In create_nonlinear_scale_space, the Lx and Ly images could be computed in parallel, as they only read Lsmooth. This should be useful for the first evolutions, when images are still of significant size.
  • The descriptor could be computed just after the angle is calculated, to improve cache locality when accessing the evolutions.
  • In the angle and descriptor computations, the window is iterated x-major, while images are stored y-major in memory. This is bad for cache locality.
  • Improved filter functions with better loop structures.
  • SIMD in filters.
  • In find_scale_space_extrema, there is an O(n²) loop for testing duplicates. This could be optimized to O(n) using some form of spatial hashing.

A note here: I did not consider micro-optimizations, such as parallelizing loops within simple image computations. More could be done there, but the benefit is not fully clear due to the overhead of spawning threads.
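For the duplicate-testing bullet, a spatial hash could look roughly like the following. This is a minimal sketch with hypothetical names and types (plain `(f32, f32)` points rather than the actual akaze keypoint structures, and ignoring response-based tie-breaking): points are bucketed into radius-sized grid cells, so each candidate only needs to check the 3x3 neighborhood of its cell, giving expected O(n) instead of O(n²).

```rust
use std::collections::HashMap;

/// Keep only points that have no already-kept neighbor within `radius`.
/// Points are bucketed into a grid of radius-sized cells, so each point is
/// compared only against the 3x3 neighborhood of its cell: O(n) expected.
/// (Sketch only; the real akaze duplicate test also considers responses.)
fn dedup_points(points: &[(f32, f32)], radius: f32) -> Vec<(f32, f32)> {
    let mut grid: HashMap<(i32, i32), Vec<(f32, f32)>> = HashMap::new();
    let mut kept = Vec::new();
    for &(x, y) in points {
        let (cx, cy) = ((x / radius).floor() as i32, (y / radius).floor() as i32);
        // Look for an existing close point in the surrounding cells.
        let mut duplicate = false;
        'outer: for dx in -1..=1 {
            for dy in -1..=1 {
                if let Some(cell) = grid.get(&(cx + dx, cy + dy)) {
                    for &(px, py) in cell {
                        if (px - x).powi(2) + (py - y).powi(2) < radius * radius {
                            duplicate = true;
                            break 'outer;
                        }
                    }
                }
            }
        }
        if !duplicate {
            grid.entry((cx, cy)).or_default().push((x, y));
            kept.push((x, y));
        }
    }
    kept
}

fn main() {
    let pts = [(0.0, 0.0), (0.5, 0.5), (10.0, 10.0)];
    // (0.5, 0.5) is within radius 2.0 of (0.0, 0.0), so only 2 points remain.
    assert_eq!(dedup_points(&pts, 2.0).len(), 2);
}
```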

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 12, 2023

@vadixidav and @xd009642 on my AMD Ryzen 9 7950X I get a 2x performance improvement with the changes in #68, essentially by using Rayon in obvious places. I believe the code can be improved further, but serious profiling is needed to do so. Modern processors are so dependent on thermal conditions that it is hard to just compare execution timings.

For the record, I tried replacing all azip calls with their parallel versions and performance decreased, so that is not included in #68 at the moment.

Note: the PR will be clearer once #65 is merged.

@vadixidav
Member

@stephanemagnenat I previously found that it was more efficient from a throughput perspective to perform akaze extraction on multiple frames in parallel rather than parallelizing the individual frames, which optimizes for latency. If you add rayon parallelization, please make rayon an optional dependency and gate out the rayon code when it isn't enabled, using ndarray or similar instead. For robotics, it may be preferable to use rayon for low latency odometry, whereas for mapping purposes it may be preferable to boost throughput.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

Of course, rayon is an optional dependency. I always write the normal code first, then the rayon-specific code (see #68). Also, my feeling at the moment is that rayon should be used judiciously, for example by not spawning threads when processing tiny images.
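The serial-first, feature-gated pattern described here can be sketched as follows (illustrative only, not the actual #68 code; `row_sums` is a made-up example function):

```rust
/// Compute per-row sums of a row-major image.
/// With the `rayon` feature enabled, rows are processed in parallel;
/// otherwise the plain serial loop is used. One could also fall back to
/// the serial path below a size threshold to avoid thread overhead on
/// tiny images. (Sketch, not the actual akaze code.)
#[cfg(feature = "rayon")]
fn row_sums(image: &[f32], width: usize) -> Vec<f32> {
    use rayon::prelude::*;
    image.par_chunks(width).map(|row| row.iter().sum()).collect()
}

#[cfg(not(feature = "rayon"))]
fn row_sums(image: &[f32], width: usize) -> Vec<f32> {
    image.chunks(width).map(|row| row.iter().sum()).collect()
}

fn main() {
    let image = vec![1.0f32; 8]; // 2 rows of 4 pixels
    assert_eq!(row_sums(&image, 4), vec![4.0, 4.0]);
}
```

Without the feature, the rayon path is compiled out entirely, so the crate builds with no rayon dependency at all.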

For my application, which is related to real-time vision, latency is key. I will probably also do some more detailed profiling, because one target is WASM, where rayon is generally not available, and currently the performance of the normal code is much worse than the equivalent OpenCV CPU code.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

Some profiling shows that the hottest function is imageproc::filter::separable_filter (~50% of the time), which is not vectorized, even in release mode. So now we need to find a middle ground between the incorrect previous version and the correct-but-slow version in imageproc.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

So I have rewritten the filters with the memory constraints in mind. They seem to be about 15% faster than the imageproc versions, and are unit-tested against them.

They are still far slower than the OpenCV equivalents, mostly due to the lack of user-exposed fast math in stable Rust, which prevents Rust from vectorizing the inner loops. There is an interesting related discussion on the Rust users forum. If we could force fast math, we could expect a 5x speed-up, but that is only possible on nightly! Alternatively, we can wait for portable SIMD to stabilize and use that instead. My versions are already a step in that direction.

That state of affairs is a bit sad though...
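For context, a separable filter splits a 2D convolution into a horizontal and a vertical pass. A minimal scalar sketch of the memory-friendly loop structure (inner loops along x so accesses are contiguous in a row-major image; clamped borders, normalized odd-sized kernel assumed; not the actual akaze or imageproc code):

```rust
/// Separable filter over a row-major f32 image: horizontal pass, then
/// vertical pass. In both passes the innermost loop runs along x, so the
/// memory accesses are contiguous, which is what the autovectorizer (or
/// explicit SIMD) wants. Borders are clamped; the kernel is assumed
/// normalized and odd-sized. (Sketch, not the actual akaze code.)
fn separable_filter(img: &[f32], w: usize, h: usize, kernel: &[f32]) -> Vec<f32> {
    let r = kernel.len() / 2;
    let clamp = |i: isize, n: usize| i.clamp(0, n as isize - 1) as usize;
    // Horizontal pass.
    let mut tmp = vec![0.0f32; w * h];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0.0;
            for (k, &kv) in kernel.iter().enumerate() {
                let sx = clamp(x as isize + k as isize - r as isize, w);
                acc += img[y * w + sx] * kv;
            }
            tmp[y * w + x] = acc;
        }
    }
    // Vertical pass: still iterate x in the inner loop for contiguity,
    // accumulating one kernel tap at a time across whole rows.
    let mut out = vec![0.0f32; w * h];
    for y in 0..h {
        for (k, &kv) in kernel.iter().enumerate() {
            let sy = clamp(y as isize + k as isize - r as isize, h);
            for x in 0..w {
                out[y * w + x] += tmp[sy * w + x] * kv;
            }
        }
    }
    out
}

fn main() {
    // A constant image stays constant under a normalized kernel.
    let img = vec![2.0f32; 4 * 3];
    let out = separable_filter(&img, 4, 3, &[0.25, 0.5, 0.25]);
    assert!(out.iter().all(|&v| (v - 2.0).abs() < 1e-6));
}
```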

@stephanemagnenat stephanemagnenat changed the title Low-hanging fruits optimizations in Akaze feature detector Optimize the Akaze feature detector Jun 14, 2023
@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 14, 2023

@vadixidav @xd009642 I've implemented "obvious" SIMD versions of the image filters. The first, the horizontal filter, shows a 2x improvement for small kernels (7 wide) and 5x for large ones (71 wide). For the vertical filter, the improvement ranges from 20% (7 wide) to 4x (71 wide). I believe there is still some room for improvement (manually unrolling the inner loop a bit), but that is for later. It might also be worth waiting for the portable SIMD initiative to land before doing that.
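The lane structure of such a SIMD kernel can be illustrated with plain `[f32; 4]` accumulators (a sketch, not the actual wide-based code from the PR; with the wide crate, `f32x4` would replace the array and turn the four lane adds into a single instruction):

```rust
/// Horizontal filter over one image row, computing four adjacent output
/// pixels per iteration. The [f32; 4] accumulator mirrors the lanes of
/// wide::f32x4 (or portable SIMD). Only interior pixels are written;
/// border handling is left to the caller, and `out` must be at least as
/// long as `row`. (Illustrative sketch, not the actual akaze code.)
fn filter_row_4lanes(row: &[f32], kernel: &[f32], out: &mut [f32]) {
    let r = kernel.len() / 2;
    let end = row.len() - r; // first x whose window would leave the row
    let mut x = r;
    // Vector body: four output pixels at a time.
    while x + 4 <= end {
        let mut acc = [0.0f32; 4];
        for (k, &kv) in kernel.iter().enumerate() {
            for lane in 0..4 {
                acc[lane] += row[x + k - r + lane] * kv;
            }
        }
        out[x..x + 4].copy_from_slice(&acc);
        x += 4;
    }
    // Scalar tail for the remaining interior pixels.
    while x < end {
        let mut acc = 0.0;
        for (k, &kv) in kernel.iter().enumerate() {
            acc += row[x + k - r] * kv;
        }
        out[x] = acc;
        x += 1;
    }
}

fn main() {
    let row = [1.0f32; 12];
    let mut out = [0.0f32; 12];
    filter_row_4lanes(&row, &[0.25, 0.5, 0.25], &mut out);
    // Interior pixels of a constant row stay constant (kernel sums to 1).
    assert!(out[1..11].iter().all(|&v| (v - 1.0).abs() < 1e-6));
}
```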

This brings several questions on which I'd like your opinion:

  • These functions should ideally go in imageproc, but there is little responsiveness from maintainers there. I opened an issue to ask about their availability. Until something moves there, should we keep them here for now?
  • To develop them, I need to benchmark. To benchmark, I need to have these as public functions, due to limits in criterion. This means that horizontal_filter, vertical_filter, separable_filter and gaussian_kernel become public. I do not like this much, especially in light of their eventual inclusion in imageproc. But given the current reality, I do not see another path. What do you think?

Please note that I'm happy to split this PR, for example to merge the SIMD filters earlier than the use of rayon, if they are less controversial.

@stephanemagnenat
Contributor Author

I think that this is ready to be merged. There are two kinds of changes:

    1. Performance optimization on a single thread: mostly SIMD optimization of the filters (using the wide crate while waiting for the portable SIMD initiative to land), plus some minor code changes (memory-access reorganization).
    2. Parallelization using Rayon, behind the rayon feature.

My most pressing need is 1., as I wish to deploy Akaze in an embedded WASM application. But as 2. is behind a feature flag, I think that it makes sense to merge as well.

@vadixidav and @xd009642 anything else you need from me before merging this?

@xd009642
Member

I don't think there's anything extra, I just need to find the time to go over and review the code. One thought: if the CI doesn't test that the crate builds against any wasm targets with the desired features, it may be worth adding that to CI, to make sure nothing added in the future breaks it.

@xd009642
Member

Oh there is a Cargo.toml conflict after I merged the serde PR, so that is one thing needed.

Btw I'm away this weekend so may not get to it until next week. Just trying to get through some stuff before I leave 😄

@stephanemagnenat
Contributor Author

Rebased and made clippy happy.

@stephanemagnenat
Contributor Author

Should we also test the "rayon" feature in the CI?

@xd009642
Member

Ideally we should, to avoid fixes being applied to one and not the other.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 19, 2023

Done in 721790e. The PR is ready for review.

@stephanemagnenat
Contributor Author

Please note that due to a limitation in GitHub cargo actions, I could not run the test for the rayon feature of akaze in the akaze sub-directory only, so I had to run the whole cargo test again with the --features akaze/rayon argument. This works but is slow, as unrelated parts of CV, such as SFM, get compiled again as well. If anyone knows how to streamline that, it would be welcome! Of course, later on we might want a CV-wide rayon feature; in that case it won't matter much.

@stephanemagnenat
Contributor Author

@xd009642 now that the unit test is added for the rayon feature, anything missing to merge this PR?

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

I just realized that std::time::Instant is not implemented and panics on the WASM platform. Since I'm currently using it, I will thus make it an optional feature, or use a replacement.

@xd009642
Member

I'd rather use a replacement than make it optional; wrapping all the instants is a bit of code noise, and ideally features should be additive and not conflict with each other.

@stephanemagnenat
Contributor Author

Ok, fine for me, I'll go for the replacement then! Thanks for the input!

@xd009642
Member

Sure and I'll try to find some time today or tomorrow for the (hopefully) final review before merge 👍

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

So the replacement doesn't have proper support for no_std, although there is a PR for it. I think that for simplicity, at the moment, I will disable timings on wasm32. I'll use a macro, to make it easy to abstract later.
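Such a macro could look like this (a hypothetical sketch, not the exact PR code): on wasm32 the timing statements are compiled out and only the expression itself runs.

```rust
/// Evaluate an expression, printing the elapsed time with a label.
/// On wasm32, where std::time::Instant panics, the timing statements are
/// compiled out entirely and the expression runs untimed.
/// (Hypothetical sketch, not the actual PR code.)
macro_rules! timed {
    ($label:expr, $e:expr) => {{
        #[cfg(not(target_arch = "wasm32"))]
        let start = std::time::Instant::now();
        let result = $e;
        #[cfg(not(target_arch = "wasm32"))]
        println!("{}: {:?}", $label, start.elapsed());
        result
    }};
}

fn main() {
    let sum: u64 = timed!("sum of 0..1000", (0..1_000u64).sum());
    assert_eq!(sum, 499_500);
}
```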

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

One question @xd009642, this PR adds the serde optional feature to Akaze. At the moment, it is with the serde name, which is very common in the Rust world and also what bitarray uses for example.

But elsewhere in CV there is a serde-serialize feature that activates both serde itself and serde on dependencies. This is confusing, because there are then two serde features: serde, which just uses the serde crate, and serde-serialize, which also enables serde on dependencies.

That is actually not necessary, as dependencies can have serde enabled by specifying serde in [features] like this: serde = ["dep:serde", "bitarray/serde"].

So I assume that it is historical, and so far I have been using just serde. I could change that if we want, although I feel it makes things needlessly complicated. Alternatively, we could change the rest of CV to the simpler option.
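The simpler scheme amounts to a Cargo.toml along these lines (illustrative fragment; versions elided, dependency list abbreviated):

```toml
# One `serde` feature that enables the optional serde dependency itself
# and forwards the feature to dependencies that support it.
[features]
serde = ["dep:serde", "bitarray/serde"]

[dependencies]
serde = { version = "1", optional = true }
```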

@xd009642
Member

Yeah, the serde thing is historical. I'd prefer the simpler serde feature, and updating the other crates to do the same thing (in a future PR, though).

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

Ok, good, we keep it this way then.

The PR is updated with a replacement on WASM target if the optional web-sys feature is enabled.

I had to do this because the WASM target is currently wasm32-unknown-unknown, so in principle we cannot assume any specific platform support; that is why Instant doesn't work in the first place. The web-sys feature assumes we target the web, and links to web-sys to use the browser's Performance.now() method.

@stephanemagnenat
Contributor Author

I'm closing the issue as the related PR has been merged.

@kalwalt

kalwalt commented Jun 23, 2023


@stephanemagnenat I did a test in the past to export to wasm: https://github.com/kalwalt/wasm-akaze-example. I will try with these recent changes and let you know.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 25, 2023

@kalwalt great, on my side I successfully deployed to WASM (with the SIMD extension) for our application (a coloring-book extension to Candli). On my phone (a Galaxy S9), it takes about 2s to capture an image, find the Akaze points, match them to a template, find the homography with some outlier rejection, extract a sub-part of the image, and rectify and display that part. The phone camera is configured via getUserMedia to get a 720p video feed if possible.

This is nowhere near native performance, but it is usable.
