
Optimize float32 IQ code path using ARM NEON #89

Open
wants to merge 4 commits into master
Conversation

sergeyvfx
Contributor

@sergeyvfx sergeyvfx commented Oct 15, 2022

The availability of NEON instructions is detected implicitly (from the build-process point of view) based on the `__ARM_NEON` preprocessor symbol. This should not add restrictions on the portability of the final library, since the optimizer itself might be using NEON in the same configuration.

Benchmarking was done on Seeed reTerminal hardware, which is based on the Raspberry Pi CM4. Both the GCC and Clang toolchains were tested using the `-O3 -march=armv8-a+crc -mtune=cortex-a72` flags.

The time of signal processing in the consumer thread prior to the callback invocation was measured.

           Base       NEON     Speedup
GCC-10     2.6488     2.5597   4%
Clang-13   2.9867     2.7528   8%

The speedup is not linear in the register width because both GCC and Clang perform auto-vectorization at the -O3 optimization level.

The time measurement code is not included in this patch.

Further speed improvements are possible for other sample types, but those can happen as follow-up development. It should also be possible to close the gap between GCC and Clang, but that is likewise outside the scope of this patch.

@touil
Member

touil commented Oct 15, 2022

I think we first need a robust test fixture that compares the candidate implementations with the "canonical" implementation. This could take the form of a raw dump file from the hardware that passes through both DSP chains, after which we compare the resulting IQ data. Note that the compiler may rearrange the instructions and produce slightly different values for the samples. This is still perfectly fine, as it only affects bits well below the noise floor of the desired signal. This means the test must be smart enough to validate the time-domain and spectral contents instead of doing a plain binary comparison.
EDIT: Compiler options like -ffast-math and its equivalents can get you into the same situation, where the binary comparison doesn't pass while the data is still valid.

@sergeyvfx
Contributor Author

That is a good point about an automated test of some sort. Maybe even some benchmarking fixture :)

Indeed, we need to be careful with exact comparison. I think something like comparing the actual IQ processor output against the ground truth with some floating-point epsilon will suffice.

Did you have plans or ideas for how to tackle the testing? As in, is there already a testing framework you'd like to be used? Or is there some work going on to get regression/unit tests integrated? Otherwise I can give it a go and try to get it going without pulling in any external dependencies.

@@ -208,6 +238,8 @@ static _inline float process_fir_taps(const float *kernel, const float *queue, i
float sum = acc.m128_f32[0];
#endif

#elif defined(USE_NEON)
float sum = vaddvq_f32(acc);


Compilation fails on Raspberry Pi. This instruction is from the A64 instruction set (AArch64), while the Raspberry Pi is 32-bit: https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vaddvq_f32

Contributor Author


Good catch!
I've committed some tweaks, which should hopefully fix the issue. Unfortunately, I do not currently have a 32-bit ARM platform handy, so I could not verify whether other fixes are needed.

This instruction is only available on 64-bit ARM platforms.

Re-implemented this function using 32-bit intrinsics, which
seemed to be faster than store-and-sum in the benchmarks I was
doing for something similar in another project.
const float32x4_t kern2 = vld1q_f32(kernel + 4);

acc = vmlaq_f32(acc, kern1, head1);
acc = vmlaq_f32(acc, kern2, head2);


Actually, I think I found a performance issue here. If you use the same accumulator for both FMLA operations, the CPU won't be able to parallelise them, even though the quad registers themselves are independent. Try setting up two accumulators and summing them after the loop. It should give you a ~2x performance boost.

I did some analysis here for different numbers of quad registers: https://dernasherbrezon.com/posts/fir-filter-optimization-simd/

[Screenshot omitted (2023-06-27)]

Contributor Author

@sergeyvfx sergeyvfx Jun 27, 2023


Thanks! Nice to see someone is digging deeper into the patch! :)

Indeed using multiple accumulators will help in this function. I've committed a small change for it. So you might want to give it another whirl!

By the way, the article is pretty cool! There is one thing to clarify, though: our time measurements are different. I was measuring the overall processing time in consumer_threadproc when running ./airspy-tools/src/airspy_rx -t 0 -f 50 -r output.wav, not the speedup of individual functions. That seemed a more practical measure, since it more closely reflects the amount of freed-up resources for other computations.

The code I used for benchmarking is in the benchmark branch of my fork: sergeyvfx/airspyone_host@2b6f827f714
One thing to note is that while it seemed to work fine on the Raspberry Pi back then, I was now unable to get reliable numbers on an Apple M2. There might be some silly mistake in the timing code.

Edit: It might be worth mentioning that in the airspy_rx test above, process_fir_taps does not seem to be used; things like fir_interleaved_24 seem to be the hotspot.
Edit 2: It is actually worth double-checking which code paths are used with the default configuration, because the default filter size is 47, so it is not obvious why the 24-element specialization is used. I don't currently have access to the hardware to verify.
Edit 3: Turned out to be easy: cnv->len = len / 2 + 1; in iqconverter_float_create.


iqconverter_float_process is the bottleneck. You can run the Instruments application from Xcode and attach to the running airspy_rx.

[Screenshot omitted (2023-06-27)]

The FIR filter can be optimised, but one of the most heavily loaded functions, remove_dc, cannot. It is not SIMD-friendly because each iteration uses the result of the previous one. I tried to figure out how to restructure it algorithmically, but I cannot understand how it removes DC.

Contributor Author


One way to remove DC from a signal is to subtract its average. That is effectively what happens in remove_dc: it estimates the average with a sort of simplified exponential moving average. Typically you'd see a lerp, but this FMA-style average update works well enough and is cheaper.

It also seems memmove contributes considerably to the timing? Perhaps something like a double-buffered circular buffer would help. It would probably help much more on a Raspberry Pi, where memory transfers are not that fast. The tricky part is that such an approach would not be usable for the SSE path, as it ruins data alignment.

Pointed out by @dernasherbrezon; reportedly it should give a
~2x performance boost in process_fir_taps().
@dernasherbrezon

I spent a considerable amount of time trying to optimize this algorithm and got mixed results.

  1. process_fir_taps is not used by default. The default filter is 47 taps, which is processed by fir_interleaved_24.
  2. fir_interleaved_24 is already fairly optimised, because it takes into account that the filter is symmetrical, which saves ~12 multiplications. I tried manually unrolling it 4 items at a time (https://github.com/dernasherbrezon/arm-tests/blob/main/dot_prod_12.c#L55), and it gave me a ~23% improvement on a Raspberry Pi with the "-O3 -mfloat-abi=hard -march=armv8-a -mfpu=neon -ftree-vectorize -ffast-math -funroll-loops" flags.
  3. As I mentioned previously, the DC-removal algorithm has a data-dependency issue, so it cannot be optimized further for SIMD.
  4. I tried combining all the algorithms into a single function (https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/src/iqconverter_float_tuned.c#L118). The idea was to eliminate the memory copy at https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L364, which happens every 47 * 32 samples, and also to remove the branching at https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L436.
  5. I wrote a performance test which compares the old implementation with the new one: https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/test/test_iqconverter_float.c

On a MacBook Air M1: 0.027605 (old) vs 0.016459 (tuned), a ~40% improvement.
On a Raspberry Pi 3B+: 0.272147 (old) vs 0.242934 (tuned), a ~11% improvement.

So even though the performance is slightly better, I don't think it's worth the hassle. There might be other areas of improvement:

  1. Remove the DC-removal algorithm, which can be acceptable for certain applications.
  2. Use the INT16_REAL types, which should be much faster but susceptible to quantisation noise.
  3. Use lower sampling rates in the firmware, which in turn slightly defeats the purpose of using the Airspy with an oversampled signal.
  4. Decimate as soon as the data reaches the application.

I will also check the libusb integration. Currently iqconverter_float_process takes 200 ms on a Raspberry Pi to process a 6 Msps rate, which should be ~20% CPU. In practice I see airspy_rx into /dev/null take 80% CPU (airspy_rx -f 472.2 -a 6000000 -r /dev/null -t 0).

@touil
Member

touil commented Jul 14, 2023

I don't think it's worth the hassle.

That's what she said. The code is already hard to read; by specializing it for every hardware target, you lose the readability completely.

  • Remove the DC-removal algorithm, which can be acceptable for certain applications.

Could be fine for most apps. A spike will show up at the upper edge of the spectrum, but that's perfectly fine. Mind you, the DC removal can be a couple of adds, a sub, and a delay in INT16. Still recursive, though.

  • Use the INT16_REAL types, which should be much faster but susceptible to quantisation noise.

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

  • Use lower sampling rates in the firmware, which in turn slightly defeats the purpose of using the Airspy with an oversampled signal.
  • Decimate as soon as the data reaches the application.

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

@dernasherbrezon

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

I meant something different though:

  • 12bit -> 16bit -> float -> (filtering) -> float
  • 12bit -> 16bit -> (filtering) -> 16bit

The second one will definitely lose some precision. But will it be noticeable? Will it have a practical impact? Unlikely?

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

Do you have one in mind? The code has fir_interleaved_4, fir_interleaved_8, and fir_interleaved_12, which might suggest you tested some.

@touil
Member

touil commented Jul 14, 2023

  • 12bit -> 16bit -> (filtering) -> 16bit

That's exactly what I commented.
