Optimize float32 IQ code path using ARM NEON #89

Open · wants to merge 4 commits into master
Commits on Oct 15, 2022

  1. Optimize float32 IQ code path using ARM NEON

    Detection of NEON instruction availability is done implicitly (from
    the build process point of view) based on the __ARM_NEON preprocessor
    macro. This should not add any restrictions on the portability of the
    final library, since the compiler's optimizer itself may already be
    using NEON in the same configuration.
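
    A minimal sketch of the implicit detection described above (the macro
    check mirrors the commit message; the helper macro name is purely
    illustrative):

    ```c
    /* NEON support is detected from the compiler-provided __ARM_NEON
     * macro, so no extra build-system configuration is required. */
    #if defined(__ARM_NEON)
    #  include <arm_neon.h>
    #  define LIB_HAVE_NEON 1
    #else
    #  define LIB_HAVE_NEON 0
    #endif
    ```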
    
    Benchmarking was done on Seeed reTerminal hardware, which is based on
    the Raspberry Pi CM4. Both the GCC and Clang toolchains were tested
    using the `-O3 -march=armv8-a+crc -mtune=cortex-a72` flags.
    
    The time of signal processing in the consumer thread prior to the
    callback invocation was measured.
    
               Base       NEON     Speedup
    GCC-10     2.6488     2.5597   4%
    Clang-13   2.9867     2.7528   8%
    
    The speedup is not linear with the register width because both GCC
    and Clang already perform auto-vectorization at the -O3 optimization
    level.
    
    The time measurement code is not included in this patch.
    
    Further speed improvements are possible for other sample types, but
    those can happen as follow-up development. It should also be possible
    to close the gap between GCC and Clang, but that is likewise outside
    the scope of this patch.
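
    As an illustration of the general technique (not the exact code in
    this patch), a NEON loop over interleaved float32 I/Q samples might
    look like the sketch below; the buffer layout and the per-channel
    gain operation are assumptions made for the example only.

    ```c
    #include <arm_neon.h>
    #include <stddef.h>

    /* Hypothetical example: apply separate gains to the I and Q parts of
     * an interleaved float32 IQ buffer, four complex samples at a time. */
    static void scale_iq_f32(float *iq, size_t num_samples,
                             float gain_i, float gain_q) {
      const float32x4_t vgi = vdupq_n_f32(gain_i);
      const float32x4_t vgq = vdupq_n_f32(gain_q);
      size_t i = 0;
      for (; i + 4 <= num_samples; i += 4) {
        /* vld2q_f32 de-interleaves: val[0] holds I, val[1] holds Q. */
        float32x4x2_t s = vld2q_f32(iq + 2 * i);
        s.val[0] = vmulq_f32(s.val[0], vgi);
        s.val[1] = vmulq_f32(s.val[1], vgq);
        vst2q_f32(iq + 2 * i, s);
      }
      /* Scalar tail for a sample count that is not a multiple of 4. */
      for (; i < num_samples; ++i) {
        iq[2 * i + 0] *= gain_i;
        iq[2 * i + 1] *= gain_q;
      }
    }
    ```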
    sergeyvfx committed Oct 15, 2022 · commit 8916439

Commits on Jun 18, 2023

  1. Commit 2dc5463
  2. Fix vaddvq_f32() used on 32-bit ARM platforms

    This intrinsic is only available on 64-bit ARM platforms.
    
    Re-implemented this function using 32-bit intrinsics, which seemed to
    be faster than a store-and-sum approach in benchmarks I did for
    something similar in another project.
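
    A sketch of what such a 32-bit-compatible horizontal sum could look
    like (the helper name is made up; on AArch64 vaddvq_f32() can still
    be used directly):

    ```c
    #include <arm_neon.h>

    /* Horizontal sum of a float32x4_t. vaddvq_f32() only exists on
     * 64-bit ARM, so on 32-bit targets the sum is built from pairwise
     * adds instead of a store followed by a scalar sum. */
    static inline float sum_f32x4(const float32x4_t v) {
    #if defined(__aarch64__)
      return vaddvq_f32(v);
    #else
      float32x2_t r = vadd_f32(vget_low_f32(v), vget_high_f32(v));
      r = vpadd_f32(r, r);
      return vget_lane_f32(r, 0);
    #endif
    }
    ```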
    sergeyvfx committed Jun 18, 2023 · commit 9467f39

Commits on Jun 27, 2023

  1. Use multiple accumulators to further improve performance

    Pointed out by @dernasherbrezon; reportedly this gives a ~2x
    performance boost in process_fir_taps().
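
    A hedged sketch of the idea as it could apply to a FIR tap sum (the
    signature is only a guess at process_fir_taps(); the point is the
    independent accumulators that break the dependency chain between
    consecutive multiply-accumulates):

    ```c
    #include <arm_neon.h>
    #include <stddef.h>

    /* Dot product of samples and taps using two independent NEON
     * accumulators, so back-to-back vmlaq_f32 calls do not stall on
     * each other's results. */
    static float fir_dot_f32(const float *samples, const float *taps,
                             size_t num_taps) {
      float32x4_t acc0 = vdupq_n_f32(0.0f);
      float32x4_t acc1 = vdupq_n_f32(0.0f);
      size_t i = 0;
      for (; i + 8 <= num_taps; i += 8) {
        acc0 = vmlaq_f32(acc0, vld1q_f32(samples + i), vld1q_f32(taps + i));
        acc1 = vmlaq_f32(acc1, vld1q_f32(samples + i + 4),
                         vld1q_f32(taps + i + 4));
      }
      float32x4_t acc = vaddq_f32(acc0, acc1);
      /* 32-bit friendly horizontal sum (see the previous commit). */
      float32x2_t r = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
      r = vpadd_f32(r, r);
      float sum = vget_lane_f32(r, 0);
      for (; i < num_taps; ++i) {
        sum += samples[i] * taps[i];
      }
      return sum;
    }
    ```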
    sergeyvfx committed Jun 27, 2023 · commit 204228a