Optimize float32 IQ code path using ARM NEON #89

sergeyvfx · 2022-10-15T12:39:39Z

The availability of NEON instruction is performed implicitly (from the building process point of view) based on __ARM_NEON preprocessor value. This should not be adding restrictions on the portability of the final library since the optimizer itself might be using NEON in the same configuration.

Benchmarking was done on Seeed reTerminal hardware which is based on Raspberry Pi CM 4. Both GCC and Clang toolchain was tested using the -O3 -march=armv8-a+crc -mtune=cortex-a72 flags.

The time of signal processing in the consumer thread prior to the callback invocation was measured.

           Base       NEON     Speedup
GCC-10     2.6488     2.5597   4%
Clang-13   2.9867     2.7528   8%

The speedup is not liner from the register width due to both GCC and Clang performing auto-vectorization when O3 optimization level is used.

The time measurement is not included into this patch.

Further speed improvement is possible to cover other sample types, but those can happen as a followup development. It should also be possible to close the gap between GCC and Clang, but this is also not related to this patch.

The availability of NEON instruction is performed implicitly (from the building process point of view) based on __ARM_NEON preprocessor value. This should not be adding restrictions on the portability of the final library since the optimizer itself might be using NEON in the same configuration. Benchmarking was done on Seeed reTerminal hardware which is based on Raspberry Pi CM 4. Both GCC and Clang toolchain was tested using the `-O3 -march=armv8-a+crc -mtune=cortex-a72` flags. The time of signal processing in the consumer thread prior to the callback invocation was measured. Base NEON Speedup GCC-10 2.6488 2.5597 4% Clang-13 2.9867 2.7528 8% The speedup is not liner from the register width due to both GCC and Clang performing auto-vectorization when O3 optimization level is used. The time measurement is not included into this patch. Further speed improvement is possible to cover other sample types, but those can happen as a followup development. It should also be possible to close the gap between GCC and Clang, but this is also not related to this patch.

touil · 2022-10-15T14:28:17Z

I think we first need a robust test fixture that compares the candidate implementations with the "canonical" implementation. This could take form of a raw dump file from the hardware that passes through the both DSP chains, then we compare the resulting IQ data. Note that the compiler may rearrange the instructions and result in slightly different values for the samples. This is still perfectly fine as it only the bits way below the noise floor of the desired signal. This means the test must be smart enough to validate the time domain and spectral contents instead of doing a plain binary comparison.
EDIT: The compiler options like -ffast-math and its equivalents can get you in the same situation where the binary comparison doesn't pass while the data is still valid.

sergeyvfx · 2022-10-15T15:19:55Z

That is a good point about an automated test of some sort. Maybe even some benchmarking fixture :)

Indeed need to be careful with the exact comparison. I think something like comparison of the actual IQ processor output with the ground-truth with some floating point epsilon will suffice.

Did you have some plans or ideas how to tackle the testing? As in, maybe there is already a testing framework you'd like to be used? Or maybe there is some work going on for getting the regression/unit tests integrated? Otherwise I can give it a go and try to get it going without pulling any external dependencies in.

dernasherbrezon · 2023-06-18T20:33:26Z

libairspy/src/iqconverter_float.c

@@ -208,6 +238,8 @@ static _inline float process_fir_taps(const float *kernel, const float *queue, i
 	float sum = acc.m128_f32[0];
 #endif

+#elif defined(USE_NEON)
+	float sum = vaddvq_f32(acc);


Compilation fails on RaspberryPI. This instruction is from A64 instruction set (aarch64), while RaspberryPI is 32bit.https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vaddvq_f32

Good catch!
I've committed some tweaks, which should hopefully fix the issue. Unfortunately, do not currently have 32bit ARM platform handy, so could not verify whether there are other fixes needed.

This instruction is only available on 64bit ARM platforms. Re-implemented this function using 32bit intrinsics, which seemed to be faster than store and sum in benchmarks I was doing for something similar in another project.

dernasherbrezon · 2023-06-26T23:01:30Z

libairspy/src/iqconverter_float.c

+			const float32x4_t kern2 = vld1q_f32(kernel + 4);
+
+			acc = vmlaq_f32(acc, kern1, head1);
+			acc = vmlaq_f32(acc, kern2, head2);


Actually I think I found a performance issue here. If you use the same accumulator for both FMLA operations, then CPU won't be able to parallelise them. Even if all quad registers are parallel. Try setting up 2 accumulators and sum them up after the loop. It should give you ~2x performance boost.

I did some analysis here for different number of quad registers: https://dernasherbrezon.com/posts/fir-filter-optimization-simd/

Thanks! Nice to see someone is digging deeper into the patch! :)

Indeed using multiple accumulators will help in this function. I've committed a small change for it. So you might want to give it another whirl!

Btw, the article is pretty cool! There is one thing to clarify though: our time measurements are different. I was measuring the overall processing time which happens in the consumer_threadproc when running ./airspy-tools/src/airspy_rx -t 0 -f 50 -r output.wav, and not the speedup of individual functions. It seemed to be more practical measure, which more closely resembles the amount of freed-up resources for other calculations.

The code I used for benchmarking is in the benchmark branch of my fork: sergeyvfx/airspyone_host@2b6f827f714
One thing to note is that while it seemed to work fine on Raspberry Pi back then, now I was unable to get reliable number on Apple M2. There might be some stupid mistake in the timing code.

Edit: It might worth mentioning. In the airspy_rx test I've mentioned above the process_fir_taps does not seem to be used, it is things like fir_interleaved_24 seems to be a hotspot.
Edit 2: It actually worth double-checking which of the code paths are used with the default configuration. Because the default filter size is 47, so it is not really obvious why the 24 element specialization is used. Don't currently have access to the hardware to verify.
Edit 3: Turned out to be easy: cnv->len = len / 2 + 1; in the iqconverter_float_create.

iqconverter_float_process is the bottleneck. You can run Instruments application from Xcode and connect to the running airspy_rx.

FIR filter can be optimised, but one of the most heavily loaded function remove_dc cannot. It is not SIMD-friendly because on each iteration it uses the result from the previous operation. I tried to figure out how to algorithmically change it, but cannot understand how it removes DC.

One of the ways to remove DC from a signal is to remove its average from the signal. That is effectively what is happening in the remove_dc. It calculates the average by some sort of simplified exponential moving average. Typically you'd see a Lerp, but this FMA style of average update works good enough and is cheaper.

It also seems memmove has considerable contribution to the timing? Perhaps something like double-bufferred circular buffer will help. Will probably help much more on Raspberry Pi, where memory transfers are not that fast. The tricky part is that such approach will not be usable for SSE path, as it ruins data alignment.

@dernasherbrezon

Pointed out by @dernasherbrezon, and reportedly should give ~2x performance boost in the process_fir_taps().

dernasherbrezon · 2023-07-14T19:30:46Z

I spent considerable amount of time trying to optimize this algorithm and got mixed results.

process_fir_taps - is not used by default. Default filter is 47 taps which is processed by fir_interleaved_24
fir_interleaved_24 - is rather optimised because it takes into consideration that filter is symmetrical and can save ~12 multiplications. I tried manually unrolling it into 4 items at a time (https://github.com/dernasherbrezon/arm-tests/blob/main/dot_prod_12.c#L55) and it gave me ~23% improvement on Raspberry pi with "-O3 -mfloat-abi=hard -march=armv8-a -mfpu=neon -ftree-vectorize -ffast-math -funroll-loops" flags.
As I mentioned previously DC removal algorithm has a data dependency issue so cannot be optimized further for SIMD.
I tried combining all algorithms into single function (https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/src/iqconverter_float_tuned.c#L118). The idea was to eliminate memory copy here https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L364 which happens every 47 * 32 samples. And also remove branching over here: https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L436
I wrote a performance test which can compare old implementation with the new one: https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/test/test_iqconverter_float.c

On MacBook Air M1: 0.027605 (old) vs 0.016459 (tuned) improvement. ~40%.
On Raspberry PI 3b+: 0.272147 (old) vs 0.242934 (tuned) improvement. ~11%

So even if the performance is slightly better I don't think it worth the hassle. There might be other areas of improvement:

Remove DC-removal algorithm which can be suitable for certain applications.
Use INT16_REAL types that should be much faster but suspect to quantisation noise.
Use lower sampling rates at the firmware. Which in turn slightly defeat the purpose of using airspy with oversampled signal.
Decimate as soon as data reached application

I will also check libusb integration. Currently iqconverter_float_process takes 200ms for RaspberryPI to process 6msps rate which should be 20% CPU. On practice I see airspy_rx into /dev/null takes 80% CPU ( airspy_rx -f 472.2 -a 6000000 -r /dev/null -t 0)

touil · 2023-07-14T19:54:31Z

I don't think it worth the hassle.

That's what she said. The code is already hard to read. By specializing it for every hardware you lose the readability completely.

Remove DC-removal algorithm which can be suitable for certain applications.

Could be fine for most apps. A spike will show up in the upper edge of the spectrum, but that's perfectly fine. Mind you, the DC removal can be a couple of adds, a sub and a delay in INT16. Still recursive tho.

Use INT16_REAL types that should be much faster but suspect to quantisation noise.

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

Use lower sampling rates at the firmware. Which in turn slightly defeat the purpose of using airspy with oversampled signal.

Decimate as soon as data reached application

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

dernasherbrezon · 2023-07-14T20:50:45Z

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

I meant something different though:

12bit -> 16bit -> float -> (filtering) -> float
12bit -> 16bit -> (filtering) -> 16bit

The second one will loose some precision definitely. But will it be noticeable? Have a practical impact? Unlikely?

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

Do you have some in mind? The code has fir_interleaved_4, fir_interleaved_8, fir_interleaved_12 that might suggest you tested some.

touil · 2023-07-14T20:52:55Z

12bit -> 16bit -> (filtering) -> 16bit

That's exactly what I commented.

bvernoux mentioned this pull request May 20, 2023

libairspy is slow when receiving float32 iq on Apple M1 #91

Closed

dernasherbrezon reviewed Jun 18, 2023

View reviewed changes

sergeyvfx added 2 commits June 18, 2023 23:00

Merge branch 'master' into simd_neon

2dc5463

Fix vaddvq_f32() used on 32bit ARM platform

9467f39

This instruction is only available on 64bit ARM platforms. Re-implemented this function using 32bit intrinsics, which seemed to be faster than store and sum in benchmarks I was doing for something similar in another project.

dernasherbrezon reviewed Jun 26, 2023

View reviewed changes

Use multiple accumulators to further improve performance

204228a

Pointed out by @dernasherbrezon, and reportedly should give ~2x performance boost in the process_fir_taps().

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize float32 IQ code path using ARM NEON #89

Optimize float32 IQ code path using ARM NEON #89

sergeyvfx commented Oct 15, 2022 •

edited

Loading

touil commented Oct 15, 2022 •

edited

Loading

sergeyvfx commented Oct 15, 2022

dernasherbrezon Jun 18, 2023

sergeyvfx Jun 18, 2023

dernasherbrezon Jun 26, 2023

sergeyvfx Jun 27, 2023 •

edited

Loading

dernasherbrezon Jun 27, 2023

sergeyvfx Jun 28, 2023

dernasherbrezon commented Jul 14, 2023

touil commented Jul 14, 2023 •

edited

Loading

dernasherbrezon commented Jul 14, 2023

touil commented Jul 14, 2023

Optimize float32 IQ code path using ARM NEON #89

Are you sure you want to change the base?

Optimize float32 IQ code path using ARM NEON #89

Conversation

sergeyvfx commented Oct 15, 2022 • edited Loading

touil commented Oct 15, 2022 • edited Loading

sergeyvfx commented Oct 15, 2022

dernasherbrezon Jun 18, 2023

Choose a reason for hiding this comment

sergeyvfx Jun 18, 2023

Choose a reason for hiding this comment

dernasherbrezon Jun 26, 2023

Choose a reason for hiding this comment

sergeyvfx Jun 27, 2023 • edited Loading

Choose a reason for hiding this comment

dernasherbrezon Jun 27, 2023

Choose a reason for hiding this comment

sergeyvfx Jun 28, 2023

Choose a reason for hiding this comment

dernasherbrezon commented Jul 14, 2023

touil commented Jul 14, 2023 • edited Loading

dernasherbrezon commented Jul 14, 2023

touil commented Jul 14, 2023

sergeyvfx commented Oct 15, 2022 •

edited

Loading

touil commented Oct 15, 2022 •

edited

Loading

sergeyvfx Jun 27, 2023 •

edited

Loading

touil commented Jul 14, 2023 •

edited

Loading