Optimize float32 IQ code path using ARM NEON #89

Open · wants to merge 4 commits into master
Commits on Oct 15, 2022

  1. Optimize float32 IQ code path using ARM NEON

    Detection of NEON instruction availability is done implicitly (from
    the build process point of view) based on the __ARM_NEON preprocessor
    macro. This should not add any restrictions on the portability of the
    final library, since the compiler's optimizer itself may already be
    using NEON in the same configuration.
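
    A minimal sketch of the implicit detection described above (the macro
    check mirrors the commit message; the helper macro name is purely
    illustrative):

    ```c
    /* NEON support is detected from the compiler-provided __ARM_NEON
     * macro, so no extra build-system configuration is required. */
    #if defined(__ARM_NEON)
    #  include <arm_neon.h>
    #  define LIB_HAVE_NEON 1
    #else
    #  define LIB_HAVE_NEON 0
    #endif
    ```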
    
    Benchmarking was done on Seeed reTerminal hardware, which is based on
    the Raspberry Pi CM4. Both the GCC and Clang toolchains were tested
    using the `-O3 -march=armv8-a+crc -mtune=cortex-a72` flags.
    
    The time of signal processing in the consumer thread prior to the
    callback invocation was measured.
    
               Base       NEON     Speedup
    GCC-10     2.6488     2.5597   4%
    Clang-13   2.9867     2.7528   8%
    
    The speedup is not linear with the register width because both GCC
    and Clang already perform auto-vectorization at the -O3 optimization
    level.
    
    The time measurement code is not included in this patch.
    
    Further speed improvements are possible for other sample types, but
    those can happen as follow-up development. It should also be possible
    to close the gap between GCC and Clang, but that is likewise outside
    the scope of this patch.
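
    As an illustration of the general technique (not the exact code in
    this patch), a NEON loop over interleaved float32 I/Q samples might
    look like the sketch below; the buffer layout and the per-channel
    gain operation are assumptions made for the example only.

    ```c
    #include <arm_neon.h>
    #include <stddef.h>

    /* Hypothetical example: apply separate gains to the I and Q parts of
     * an interleaved float32 IQ buffer, four complex samples at a time. */
    static void scale_iq_f32(float *iq, size_t num_samples,
                             float gain_i, float gain_q) {
      const float32x4_t vgi = vdupq_n_f32(gain_i);
      const float32x4_t vgq = vdupq_n_f32(gain_q);
      size_t i = 0;
      for (; i + 4 <= num_samples; i += 4) {
        /* vld2q_f32 de-interleaves: val[0] holds I, val[1] holds Q. */
        float32x4x2_t s = vld2q_f32(iq + 2 * i);
        s.val[0] = vmulq_f32(s.val[0], vgi);
        s.val[1] = vmulq_f32(s.val[1], vgq);
        vst2q_f32(iq + 2 * i, s);
      }
      /* Scalar tail for a sample count that is not a multiple of 4. */
      for (; i < num_samples; ++i) {
        iq[2 * i + 0] *= gain_i;
        iq[2 * i + 1] *= gain_q;
      }
    }
    ```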
    sergeyvfx committed Oct 15, 2022 · commit 8916439

Commits on Jun 18, 2023

  1. Commit 2dc5463
  2. Fix vaddvq_f32() used on 32-bit ARM platforms

    This intrinsic is only available on 64-bit ARM platforms.
    
    Re-implemented this function using 32-bit intrinsics, which seemed to
    be faster than a store-and-sum approach in benchmarks I did for
    something similar in another project.
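
    A sketch of what such a 32-bit-compatible horizontal sum could look
    like (the helper name is made up; on AArch64 vaddvq_f32() can still
    be used directly):

    ```c
    #include <arm_neon.h>

    /* Horizontal sum of a float32x4_t. vaddvq_f32() only exists on
     * 64-bit ARM, so on 32-bit targets the sum is built from pairwise
     * adds instead of a store followed by a scalar sum. */
    static inline float sum_f32x4(const float32x4_t v) {
    #if defined(__aarch64__)
      return vaddvq_f32(v);
    #else
      float32x2_t r = vadd_f32(vget_low_f32(v), vget_high_f32(v));
      r = vpadd_f32(r, r);
      return vget_lane_f32(r, 0);
    #endif
    }
    ```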
    sergeyvfx committed Jun 18, 2023 · commit 9467f39

Commits on Jun 27, 2023

  1. Use multiple accumulators to further improve performance

    Pointed out by @dernasherbrezon; reportedly this gives a ~2x
    performance boost in process_fir_taps().
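
    A hedged sketch of the idea as it could apply to a FIR tap sum (the
    signature is only a guess at process_fir_taps(); the point is the
    independent accumulators that break the dependency chain between
    consecutive multiply-accumulates):

    ```c
    #include <arm_neon.h>
    #include <stddef.h>

    /* Dot product of samples and taps using two independent NEON
     * accumulators, so back-to-back vmlaq_f32 calls do not stall on
     * each other's results. */
    static float fir_dot_f32(const float *samples, const float *taps,
                             size_t num_taps) {
      float32x4_t acc0 = vdupq_n_f32(0.0f);
      float32x4_t acc1 = vdupq_n_f32(0.0f);
      size_t i = 0;
      for (; i + 8 <= num_taps; i += 8) {
        acc0 = vmlaq_f32(acc0, vld1q_f32(samples + i), vld1q_f32(taps + i));
        acc1 = vmlaq_f32(acc1, vld1q_f32(samples + i + 4),
                         vld1q_f32(taps + i + 4));
      }
      float32x4_t acc = vaddq_f32(acc0, acc1);
      /* 32-bit friendly horizontal sum (see the previous commit). */
      float32x2_t r = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
      r = vpadd_f32(r, r);
      float sum = vget_lane_f32(r, 0);
      for (; i < num_taps; ++i) {
        sum += samples[i] * taps[i];
      }
      return sum;
    }
    ```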
    sergeyvfx committed Jun 27, 2023 · commit 204228a