
Optimize the Akaze feature detector #67

Closed
5 tasks done
stephanemagnenat opened this issue Jun 9, 2023 · 26 comments · Fixed by #68

Comments

@stephanemagnenat
Contributor

stephanemagnenat commented Jun 9, 2023

While working on fixing #63, I'm going through all of Akaze's source code quite deeply, and so as not to lose that understanding, I'm collecting here the low-hanging-fruit optimization opportunities I'm discovering. These are:

  • In create_nonlinear_scale_space, the Lx and Ly images could be computed in parallel, as they only read Lsmooth. This should be useful for the first evolutions, when images are still of significant size.
  • The descriptor could be computed just after the angle is calculated, to improve cache locality when accessing the evolutions.
  • In the angle and descriptor computations, the window is iterated x-major, while images are stored y-major in memory. This is bad for cache locality.
  • Improved filter functions with better loop structures.
  • SIMD in filters.
  • In find_scale_space_extrema, there is an O(n²) loop for testing duplicates. This could be optimized to O(n) using some form of spatial hashing.

A note here: I did not consider micro-optimizations, such as parallelizing loops within simple image computations. More could be done there, but the benefit is not fully clear due to the overhead of spawning threads.
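For the duplicate-testing bullet, a spatial hash could look roughly like the following. This is a minimal sketch with hypothetical names and types (plain `(f32, f32)` points rather than the actual akaze keypoint structures, and ignoring response-based tie-breaking): points are bucketed into radius-sized grid cells, so each candidate only needs to check the 3x3 neighborhood of its cell, giving expected O(n) instead of O(n²).

```rust
use std::collections::HashMap;

/// Keep only points that have no already-kept neighbor within `radius`.
/// Points are bucketed into a grid of radius-sized cells, so each point is
/// compared only against the 3x3 neighborhood of its cell: O(n) expected.
/// (Sketch only; the real akaze duplicate test also considers responses.)
fn dedup_points(points: &[(f32, f32)], radius: f32) -> Vec<(f32, f32)> {
    let mut grid: HashMap<(i32, i32), Vec<(f32, f32)>> = HashMap::new();
    let mut kept = Vec::new();
    for &(x, y) in points {
        let (cx, cy) = ((x / radius).floor() as i32, (y / radius).floor() as i32);
        // Look for an existing close point in the surrounding cells.
        let mut duplicate = false;
        'outer: for dx in -1..=1 {
            for dy in -1..=1 {
                if let Some(cell) = grid.get(&(cx + dx, cy + dy)) {
                    for &(px, py) in cell {
                        if (px - x).powi(2) + (py - y).powi(2) < radius * radius {
                            duplicate = true;
                            break 'outer;
                        }
                    }
                }
            }
        }
        if !duplicate {
            grid.entry((cx, cy)).or_default().push((x, y));
            kept.push((x, y));
        }
    }
    kept
}

fn main() {
    let pts = [(0.0, 0.0), (0.5, 0.5), (10.0, 10.0)];
    // (0.5, 0.5) is within radius 2.0 of (0.0, 0.0), so only 2 points remain.
    assert_eq!(dedup_points(&pts, 2.0).len(), 2);
}
```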

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 12, 2023

@vadixidav and @xd009642 on my AMD Ryzen 9 7950X I get a 2x performance improvement with the changes in #68, essentially by using Rayon in obvious places. I believe the code can be improved further, but serious profiling is needed to do so. Modern processors are so dependent on thermal conditions that it is hard to just compare execution timings.

For the record, I tried replacing all azip calls with their parallel versions and performance decreased, so that is not included in #68 at the moment.

Note: the PR will be clearer once #65 is merged.

@vadixidav
Member

@stephanemagnenat I previously found that it was more efficient from a throughput perspective to perform akaze extraction on multiple frames in parallel rather than parallelizing the individual frames, which optimizes for latency. If you add rayon parallelization, please make rayon an optional dependency and gate out the rayon code when it isn't enabled, using ndarray or similar instead. For robotics, it may be preferable to use rayon for low latency odometry, whereas for mapping purposes it may be preferable to boost throughput.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

Of course, rayon is an optional dependency. I always write the normal code first, then the rayon-specific code (see #68). Also, my feeling at the moment is that rayon should be used judiciously, for example by not spawning threads when processing tiny images.
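The serial-first, feature-gated pattern described here can be sketched as follows (illustrative only, not the actual #68 code; `row_sums` is a made-up example function):

```rust
/// Compute per-row sums of a row-major image.
/// With the `rayon` feature enabled, rows are processed in parallel;
/// otherwise the plain serial loop is used. One could also fall back to
/// the serial path below a size threshold to avoid thread overhead on
/// tiny images. (Sketch, not the actual akaze code.)
#[cfg(feature = "rayon")]
fn row_sums(image: &[f32], width: usize) -> Vec<f32> {
    use rayon::prelude::*;
    image.par_chunks(width).map(|row| row.iter().sum()).collect()
}

#[cfg(not(feature = "rayon"))]
fn row_sums(image: &[f32], width: usize) -> Vec<f32> {
    image.chunks(width).map(|row| row.iter().sum()).collect()
}

fn main() {
    let image = vec![1.0f32; 8]; // 2 rows of 4 pixels
    assert_eq!(row_sums(&image, 4), vec![4.0, 4.0]);
}
```

Without the feature, the rayon path is compiled out entirely, so the crate builds with no rayon dependency at all.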

For my application, which is related to real-time vision, latency is key. I will probably also do some more detailed profiling, because one target is WASM, where rayon is generally not available, and currently the performance of the normal code is much worse than the equivalent OpenCV CPU code.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

Some profiling shows that the hottest function is imageproc::filter::separable_filter (~50% of the time), which is not vectorized, even in release mode. So now we need to find a middle ground between the incorrect previous version and the correct-but-slow version in imageproc.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 13, 2023

So I have rewritten the filters with the memory constraints in mind. They seem to be about 15% faster than the imageproc versions, and are unit-tested against them.

They are still far slower than the OpenCV equivalents, mostly due to the lack of user-exposed fast math in stable Rust, which prevents Rust from vectorizing the inner loops. There is an interesting related discussion on the Rust users forum. If we could force fast math, we could expect a 5x speed-up, but that is only possible on nightly! Alternatively, we can wait for portable SIMD to stabilize and use that instead. My versions are already a step in that direction.

That state of affairs is a bit sad though...
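For context, a separable filter splits a 2D convolution into a horizontal and a vertical pass. A minimal scalar sketch of the memory-friendly loop structure (inner loops along x so accesses are contiguous in a row-major image; clamped borders, normalized odd-sized kernel assumed; not the actual akaze or imageproc code):

```rust
/// Separable filter over a row-major f32 image: horizontal pass, then
/// vertical pass. In both passes the innermost loop runs along x, so the
/// memory accesses are contiguous, which is what the autovectorizer (or
/// explicit SIMD) wants. Borders are clamped; the kernel is assumed
/// normalized and odd-sized. (Sketch, not the actual akaze code.)
fn separable_filter(img: &[f32], w: usize, h: usize, kernel: &[f32]) -> Vec<f32> {
    let r = kernel.len() / 2;
    let clamp = |i: isize, n: usize| i.clamp(0, n as isize - 1) as usize;
    // Horizontal pass.
    let mut tmp = vec![0.0f32; w * h];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0.0;
            for (k, &kv) in kernel.iter().enumerate() {
                let sx = clamp(x as isize + k as isize - r as isize, w);
                acc += img[y * w + sx] * kv;
            }
            tmp[y * w + x] = acc;
        }
    }
    // Vertical pass: still iterate x in the inner loop for contiguity,
    // accumulating one kernel tap at a time across whole rows.
    let mut out = vec![0.0f32; w * h];
    for y in 0..h {
        for (k, &kv) in kernel.iter().enumerate() {
            let sy = clamp(y as isize + k as isize - r as isize, h);
            for x in 0..w {
                out[y * w + x] += tmp[sy * w + x] * kv;
            }
        }
    }
    out
}

fn main() {
    // A constant image stays constant under a normalized kernel.
    let img = vec![2.0f32; 4 * 3];
    let out = separable_filter(&img, 4, 3, &[0.25, 0.5, 0.25]);
    assert!(out.iter().all(|&v| (v - 2.0).abs() < 1e-6));
}
```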

@stephanemagnenat stephanemagnenat changed the title Low-hanging fruits optimizations in Akaze feature detector Optimize the Akaze feature detector Jun 14, 2023
@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 14, 2023

@vadixidav @xd009642 I've implemented "obvious" SIMD versions of the image filters. The first, the horizontal filter, shows a 2x improvement for small kernels (7 wide) and 5x for large ones (71 wide). For the vertical filter, the improvement ranges from 20% (7 wide) to 4x (71 wide). I believe there is still some room for improvement (manually unrolling the inner loop a bit), but that is for later. It might also be worth waiting for the portable SIMD initiative to land before doing that.
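The lane structure of such a SIMD kernel can be illustrated with plain `[f32; 4]` accumulators (a sketch, not the actual wide-based code from the PR; with the wide crate, `f32x4` would replace the array and turn the four lane adds into a single instruction):

```rust
/// Horizontal filter over one image row, computing four adjacent output
/// pixels per iteration. The [f32; 4] accumulator mirrors the lanes of
/// wide::f32x4 (or portable SIMD). Only interior pixels are written;
/// border handling is left to the caller, and `out` must be at least as
/// long as `row`. (Illustrative sketch, not the actual akaze code.)
fn filter_row_4lanes(row: &[f32], kernel: &[f32], out: &mut [f32]) {
    let r = kernel.len() / 2;
    let end = row.len() - r; // first x whose window would leave the row
    let mut x = r;
    // Vector body: four output pixels at a time.
    while x + 4 <= end {
        let mut acc = [0.0f32; 4];
        for (k, &kv) in kernel.iter().enumerate() {
            for lane in 0..4 {
                acc[lane] += row[x + k - r + lane] * kv;
            }
        }
        out[x..x + 4].copy_from_slice(&acc);
        x += 4;
    }
    // Scalar tail for the remaining interior pixels.
    while x < end {
        let mut acc = 0.0;
        for (k, &kv) in kernel.iter().enumerate() {
            acc += row[x + k - r] * kv;
        }
        out[x] = acc;
        x += 1;
    }
}

fn main() {
    let row = [1.0f32; 12];
    let mut out = [0.0f32; 12];
    filter_row_4lanes(&row, &[0.25, 0.5, 0.25], &mut out);
    // Interior pixels of a constant row stay constant (kernel sums to 1).
    assert!(out[1..11].iter().all(|&v| (v - 1.0).abs() < 1e-6));
}
```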

This brings several questions on which I'd like your opinion:

  • These functions should ideally go in imageproc, but there is little responsiveness from maintainers there. I opened an issue to ask about their availability. Until something moves there, should we keep them here for now?
  • To develop them, I need to benchmark. To benchmark, I need to have these as public functions, due to limits in criterion. This means that horizontal_filter, vertical_filter, separable_filter and gaussian_kernel become public. I do not like this much, especially in light of their eventual inclusion in imageproc. But given the current reality, I do not see another path. What do you think?

Please note that I'm happy to split this PR, for example to merge the SIMD filters earlier than the use of rayon, if they are less controversial.

@stephanemagnenat
Contributor Author

I think that this is ready to be merged. There are two kinds of changes:

    1. Performance optimization on a single thread: mostly SIMD optimization of the filters (using the wide crate while waiting for the portable SIMD initiative to land), plus some minor code changes (memory-access reorganization).
    2. Parallelization using Rayon, behind the rayon feature.

My most pressing need is 1., as I wish to deploy Akaze in an embedded WASM application. But as 2. is behind a feature flag, I think that it makes sense to merge as well.

@vadixidav and @xd009642 anything else you need from me before merging this?

@xd009642
Member

I don't think there's anything extra, I just need to find the time to go over and review the code. One thought: if the CI doesn't test that the crate builds against any wasm targets with the desired features, it may be worth adding that to CI, to make sure nothing added in the future breaks it.

@xd009642
Member

Oh there is a Cargo.toml conflict after I merged the serde PR, so that is one thing needed.

Btw I'm away this weekend so may not get to it until next week. Just trying to get through some stuff before I leave 😄

@stephanemagnenat
Contributor Author

Rebased and made clippy happy.

@stephanemagnenat
Contributor Author

Should we also test the "rayon" feature in the CI?

@xd009642
Member

Ideally we should, to avoid fixes being applied to one and not the other.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 19, 2023

Done in 721790e. The PR is ready for review.

@stephanemagnenat
Contributor Author

Please note that due to a limitation in GitHub cargo actions, I could not run the test for the rayon feature of akaze in the akaze sub-directory only, so I had to run the whole cargo test again with the --features akaze/rayon argument. This works but is slow, as unrelated parts of CV, such as SFM, get compiled again as well. If anyone knows how to streamline that, it would be welcome! Of course, later on we might want a CV-wide rayon feature; in that case it won't matter much.

@stephanemagnenat
Contributor Author

@xd009642 now that the unit test is added for the rayon feature, anything missing to merge this PR?

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

I just realized that std::time::Instant is not implemented and panics on the WASM platform. Since I'm currently using it, I will thus make it an optional feature, or use a replacement.

@xd009642
Member

I'd rather use a replacement than make it optional; wrapping all the instants is a bit of code noise, and ideally features should be additive and not conflict with each other.

@stephanemagnenat
Contributor Author

Ok, fine for me, I'll go for the replacement then! Thanks for the input!

@xd009642
Member

Sure and I'll try to find some time today or tomorrow for the (hopefully) final review before merge 👍

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

So the replacement doesn't have proper support for no_std, although there is a PR for it. I think that for simplicity, at the moment, I will disable timings on wasm32. I'll use a macro, to make it easy to abstract later.
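Such a macro could look like this (a hypothetical sketch, not the exact PR code): on wasm32 the timing statements are compiled out and only the expression itself runs.

```rust
/// Evaluate an expression, printing the elapsed time with a label.
/// On wasm32, where std::time::Instant panics, the timing statements are
/// compiled out entirely and the expression runs untimed.
/// (Hypothetical sketch, not the actual PR code.)
macro_rules! timed {
    ($label:expr, $e:expr) => {{
        #[cfg(not(target_arch = "wasm32"))]
        let start = std::time::Instant::now();
        let result = $e;
        #[cfg(not(target_arch = "wasm32"))]
        println!("{}: {:?}", $label, start.elapsed());
        result
    }};
}

fn main() {
    let sum: u64 = timed!("sum of 0..1000", (0..1_000u64).sum());
    assert_eq!(sum, 499_500);
}
```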

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

One question @xd009642, this PR adds the serde optional feature to Akaze. At the moment, it is with the serde name, which is very common in the Rust world and also what bitarray uses for example.

But elsewhere in CV there is a serde-serialize feature that activates both serde itself and serde on dependencies. This is confusing, because there are then two serde features: serde, which just uses the serde crate, and serde-serialize, which also enables serde on dependencies.

That is actually not necessary, as dependencies can have serde enabled by specifying serde in [features] like this: serde = ["dep:serde", "bitarray/serde"].

So I assume that it is historical, and so far I have been using just serde. I could change that if we want, although I feel it makes things needlessly complicated. Alternatively, we could change the rest of CV to the simpler option.
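The simpler scheme amounts to a Cargo.toml along these lines (illustrative fragment; versions elided, dependency list abbreviated):

```toml
# One `serde` feature that enables the optional serde dependency itself
# and forwards the feature to dependencies that support it.
[features]
serde = ["dep:serde", "bitarray/serde"]

[dependencies]
serde = { version = "1", optional = true }
```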

@xd009642
Member

Yeah, the serde thing is historical. I'd prefer the simpler serde feature, and updating the other crates to do the same thing (in a future PR, though).

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 21, 2023

Ok, good, we keep it this way then.

The PR is updated with a replacement on WASM target if the optional web-sys feature is enabled.

I had to do this because the WASM target is currently wasm32-unknown-unknown, so in principle we cannot assume any specific platform support; that is why Instant doesn't work in the first place. The web-sys feature assumes we target the web, and links to web-sys to use the browser's Performance.now() method.

@stephanemagnenat
Contributor Author

I'm closing the issue as the related PR has been merged.

@kalwalt

kalwalt commented Jun 23, 2023


@stephanemagnenat I did a test in the past to export to wasm: https://github.com/kalwalt/wasm-akaze-example. I will try with these recent changes and let you know.

@stephanemagnenat
Contributor Author

stephanemagnenat commented Jun 25, 2023

@kalwalt great, on my side I successfully deployed to WASM (with the SIMD extension) for our application (a coloring-book extension to Candli). On my phone (a Galaxy S9), it takes about 2s to capture an image, find the Akaze points, match them to a template, find the homography with some outlier rejection, extract a sub-part of the image, and rectify and display that part. The phone camera is configured via getUserMedia to get a 720p video feed if possible.

This is nowhere near native performance, but it is usable.
