Optimize float32 IQ code path using ARM NEON #89
base: master
Conversation
The availability of NEON instructions is detected implicitly (from the build-process point of view) based on the `__ARM_NEON` preprocessor value. This should not add restrictions on the portability of the final library, since the optimizer itself might be using NEON in the same configuration.

Benchmarking was done on Seeed reTerminal hardware, which is based on the Raspberry Pi CM4. Both GCC and Clang toolchains were tested using the `-O3 -march=armv8-a+crc -mtune=cortex-a72` flags. The time of signal processing in the consumer thread prior to the callback invocation was measured.

| Toolchain | Base | NEON | Speedup |
| --- | --- | --- | --- |
| GCC-10 | 2.6488 | 2.5597 | 4% |
| Clang-13 | 2.9867 | 2.7528 | 8% |

The speedup is not linear in the register width because both GCC and Clang perform auto-vectorization when the O3 optimization level is used. The time measurement is not included in this patch.

Further speed improvements are possible to cover other sample types, but those can happen as follow-up development. It should also be possible to close the gap between GCC and Clang, but that is also not related to this patch.
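As a rough illustration of what the implicit detection implies (a sketch, not necessarily the patch's exact code; `USE_NEON` is an assumed guard name):

```c
/* Sketch of the implicit NEON detection described above: the
 * compiler-defined __ARM_NEON macro gates the optimized code path,
 * so no build-system changes are required. */
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  #define USE_NEON
#endif
```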
I think we first need a robust test fixture that compares the candidate implementations with the "canonical" implementation. This could take the form of a raw dump file from the hardware that passes through both DSP chains, after which we compare the resulting IQ data. Note that the compiler may rearrange the instructions and produce slightly different values for the samples. This is still perfectly fine, as it only affects bits way below the noise floor of the desired signal. This means the test must be smart enough to validate the time-domain and spectral contents instead of doing a plain binary comparison.
That is a good point about an automated test of some sort. Maybe even some benchmarking fixture :) Indeed, we need to be careful with the exact comparison. I think something like comparing the actual IQ processor output against the ground truth with some floating-point epsilon will suffice. Did you have plans or ideas for how to tackle the testing? As in, maybe there is already a testing framework you'd like to be used? Or maybe there is some work going on to get regression/unit tests integrated? Otherwise, I can give it a go and try to get it going without pulling in any external dependencies.
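For illustration, the epsilon comparison could look something like this minimal sketch (function name and tolerance handling are hypothetical):

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 when every candidate sample is within `epsilon` of the
 * reference ("canonical") implementation's output, 0 otherwise. */
static int iq_outputs_match(const float *reference, const float *candidate,
                            size_t count, float epsilon)
{
	for (size_t i = 0; i < count; i++) {
		if (fabsf(reference[i] - candidate[i]) > epsilon) {
			return 0;
		}
	}
	return 1;
}
```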
libairspy/src/iqconverter_float.c
```diff
@@ -208,6 +238,8 @@ static _inline float process_fir_taps(const float *kernel, const float *queue, int len)
 	float sum = acc.m128_f32[0];
+#elif defined(USE_NEON)
+	float sum = vaddvq_f32(acc);
 #endif
```
Compilation fails on Raspberry Pi. This instruction is from the A64 instruction set (AArch64), while Raspberry Pi is 32-bit. https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vaddvq_f32
Good catch!
I've committed some tweaks, which should hopefully fix the issue. Unfortunately, I do not currently have a 32-bit ARM platform handy, so I could not verify whether other fixes are needed.
This instruction is only available on 64-bit ARM platforms. I re-implemented this function using 32-bit intrinsics, which seemed to be faster than a store-and-sum in benchmarks I was doing for something similar in another project.
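For reference, a horizontal sum that stays within the AArch32-compatible intrinsics could look like this (a sketch; the committed code may differ):

```c
#include <arm_neon.h>

/* Sum the four lanes of a float32x4_t without the A64-only vaddvq_f32:
 * add the high and low halves, then pairwise-add the remaining pair. */
static inline float horizontal_sum_f32(float32x4_t acc)
{
	float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
	sum2 = vpadd_f32(sum2, sum2); /* lane 0 now holds the total */
	return vget_lane_f32(sum2, 0);
}
```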
libairspy/src/iqconverter_float.c
```c
const float32x4_t kern2 = vld1q_f32(kernel + 4);

acc = vmlaq_f32(acc, kern1, head1);
acc = vmlaq_f32(acc, kern2, head2);
```
Actually, I think I found a performance issue here. If you use the same accumulator for both FMLA operations, the CPU won't be able to parallelise them, even though all the quad registers are independent. Try setting up two accumulators and summing them after the loop. It should give you a ~2x performance boost.
I did some analysis here for different numbers of quad registers: https://dernasherbrezon.com/posts/fir-filter-optimization-simd/
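To sketch the idea (hypothetical names; assumes `len` is a multiple of 8 and the buffers are valid):

```c
#include <arm_neon.h>

/* FIR dot product with two independent accumulators so the two FMLA
 * chains can execute in parallel; they are only combined after the loop. */
static float fir_dot_two_accumulators(const float *kernel,
                                      const float *queue, int len)
{
	float32x4_t acc0 = vdupq_n_f32(0.0f);
	float32x4_t acc1 = vdupq_n_f32(0.0f);

	for (int i = 0; i < len; i += 8) {
		acc0 = vmlaq_f32(acc0, vld1q_f32(kernel + i), vld1q_f32(queue + i));
		acc1 = vmlaq_f32(acc1, vld1q_f32(kernel + i + 4),
		                 vld1q_f32(queue + i + 4));
	}

	/* Combine the accumulators, then reduce the four lanes to a scalar. */
	float32x4_t acc = vaddq_f32(acc0, acc1);
	float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
	sum2 = vpadd_f32(sum2, sum2);
	return vget_lane_f32(sum2, 0);
}
```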
Thanks! Nice to see someone digging deeper into the patch! :)
Indeed, using multiple accumulators will help in this function. I've committed a small change for it, so you might want to give it another whirl!
Btw, the article is pretty cool! There is one thing to clarify, though: our time measurements are different. I was measuring the overall processing time in `consumer_threadproc` when running `./airspy-tools/src/airspy_rx -t 0 -f 50 -r output.wav`, and not the speedup of individual functions. It seemed to be a more practical measure, which more closely resembles the amount of freed-up resources for other calculations.
The code I used for benchmarking is in the `benchmark` branch of my fork: sergeyvfx/airspyone_host@2b6f827f714
One thing to note is that while it seemed to work fine on the Raspberry Pi back then, I was now unable to get reliable numbers on an Apple M2. There might be some stupid mistake in the timing code.
Edit: It might be worth mentioning that in the `airspy_rx` test mentioned above, `process_fir_taps` does not seem to be used; things like `fir_interleaved_24` seem to be the hotspot.
Edit 2: It is actually worth double-checking which of the code paths are used with the default configuration. The default filter size is 47, so it is not really obvious why the 24-element specialization is used. I don't currently have access to the hardware to verify.
Edit 3: Turned out to be easy: `cnv->len = len / 2 + 1;` in `iqconverter_float_create`.
`iqconverter_float_process` is the bottleneck. You can run the Instruments application from Xcode and connect to the running `airspy_rx`.
The FIR filter can be optimised, but one of the most heavily loaded functions, `remove_dc`, cannot. It is not SIMD-friendly because on each iteration it uses the result of the previous operation. I tried to figure out how to change it algorithmically, but I cannot understand how it removes DC.
One of the ways to remove DC from a signal is to subtract its average from the signal. That is effectively what happens in `remove_dc`: it calculates the average via a sort of simplified exponential moving average. Typically you'd see a lerp, but this FMA-style average update works well enough and is cheaper.
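In sketch form (the update constant and names here are illustrative, not the library's exact values):

```c
/* Exponential-moving-average DC removal: subtract the running average,
 * then nudge the average toward the corrected sample. Each iteration
 * depends on the previous one, which is what makes it SIMD-unfriendly. */
static void remove_dc_sketch(float *samples, int len, float *avg_state)
{
	const float rate = 0.01f; /* adaptation rate (assumed value) */
	float avg = *avg_state;

	for (int i = 0; i < len; i++) {
		samples[i] -= avg;        /* remove the current DC estimate */
		avg += rate * samples[i]; /* FMA-style average update */
	}

	*avg_state = avg;
}
```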
It also seems `memmove` has a considerable contribution to the timing? Perhaps something like a double-buffered circular buffer would help. It would probably help much more on the Raspberry Pi, where memory transfers are not that fast. The tricky part is that such an approach would not be usable for the SSE path, as it ruins data alignment.
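The double-buffered (mirrored) circular buffer idea, roughly — a sketch with hypothetical names, ignoring the alignment caveat above:

```c
/* Each sample is stored twice, `len` apart, so the newest `len` samples
 * are always contiguous at &data[pos] and no memmove is needed. */
typedef struct {
	float *data; /* backing storage of 2 * len floats */
	int len;     /* filter history length */
	int pos;     /* next write position, 0 <= pos < len */
} mirror_ring_t;

static void mirror_ring_push(mirror_ring_t *r, float sample)
{
	r->data[r->pos] = sample;
	r->data[r->pos + r->len] = sample; /* mirrored copy */
	r->pos = (r->pos + 1) % r->len;
}

/* After a push, read the window as: const float *w = &r->data[r->pos]; */
```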
Pointed out by @dernasherbrezon; reportedly this should give a ~2x performance boost in `process_fir_taps()`.
I spent a considerable amount of time trying to optimize this algorithm and got mixed results.
On a MacBook Air M1: 0.027605 (old) vs 0.016459 (tuned), a ~40% improvement. So even if the performance is slightly better, I don't think it is worth the hassle. There might be other areas of improvement:
I will also check the libusb integration. Currently
That's what she said. The code is already hard to read; by specializing it for every piece of hardware, you lose readability completely.
Could be fine for most apps. A spike will show up at the upper edge of the spectrum, but that's perfectly fine. Mind you, the DC removal can be a couple of adds, a sub and a delay in INT16. Still recursive, though.
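One classic shape for such an integer DC blocker (a sketch of the well-known leaky first-order design, not necessarily what the commenter has in mind) is `y[n] = x[n] - x[n-1] + y[n-1] - (y[n-1] >> k)`:

```c
#include <stdint.h>

/* Leaky recursive DC blocker: a couple of adds, a sub, a shift and a
 * one-sample delay. The leak (>> 5 here) sets the cutoff near DC. */
static int16_t dc_block_int16(int16_t x, int16_t *x_prev, int32_t *y_prev)
{
	int32_t y = (int32_t)x - (int32_t)*x_prev + *y_prev - (*y_prev >> 5);
	*x_prev = x;
	*y_prev = y;
	return (int16_t)y;
}
```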
No problem with the quantization of 12-bit data in a 16-bit container. The RF noise overrides the quantization noise in a properly working system.
Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum, then decimate. Much, much cheaper.
I meant something different, though:
The second one will definitely lose some precision. But will it be noticeable? Will it have a practical impact? Unlikely?
Do you have some in mind? The code has
That's exactly what I commented.