
Improve matrix multiplication using the Java Vector API on Apple silicon. #7

mukel opened this issue Jul 21, 2024 · 3 comments


mukel commented Jul 21, 2024

llama.cpp runs incredibly fast on Apple silicon; I ran a pure-CPU build and it gets close to the memory-bandwidth limit, e.g. 28 tokens/s on an M3 Pro.
llama3.java seems rather slow on Apple silicon: Q8_0 runs only as fast as Q4_0, at about 4 tokens/s, so something is off. On PC it's within ~10% of llama.cpp.


stephanj commented Jul 30, 2024

⚠️ Claude Sonnet 3.5 feedback (hopefully this helps?):

"Based on your observations, it seems there may be some optimization opportunities for llama3.java on Apple Silicon platforms. Let's analyze this situation and consider some potential reasons and solutions:

  1. Vectorization:
    llama.cpp likely uses highly optimized SIMD (Single Instruction, Multiple Data) instructions available on Apple Silicon (ARM NEON). The Java implementation may not be taking full advantage of these vector instructions.

Possible solution: Ensure that the Java Vector API is being utilized effectively. You might want to check if the JVM is properly recognizing and using the NEON instructions on Apple Silicon.
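
For reference, here is a minimal sketch of a Vector API dot product (illustrative only, not llama3.java's actual FloatTensor.vectorDot). It must be run with --add-modules jdk.incubator.vector; on Apple silicon, SPECIES_PREFERRED should resolve to a 128-bit (4-lane float) NEON species:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorDot {
    // On Apple silicon this should resolve to a 128-bit NEON species.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum
        for (; i < a.length; i++) { // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Note that with only 128-bit vectors, a single accumulator serializes the loop on FMA latency; unrolling with several independent accumulators is a common way to keep NEON pipelines busy.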

  2. Memory access patterns:
    Efficient cache usage and memory access patterns are crucial for performance, especially when dealing with large models and matrices.

Possible solution: Analyze and optimize the memory access patterns in the Java code. This could involve restructuring data to improve cache locality or using more efficient data structures.
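
As a generic illustration (not llama3.java's actual layout), switching the textbook i-j-k matrix-multiply loop order to i-k-j turns the innermost loop into sequential row accesses:

```java
// Illustrative: C += A * B with row-major n x n matrices stored as float[].
// The i-k-j order streams through rows of B and C sequentially, instead of
// striding down a column of B on every inner iteration.
static void matmulIKJ(float[] a, float[] b, float[] c, int n) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```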

  3. JIT compilation:
    The Java Just-In-Time (JIT) compiler might not be optimizing the code as effectively as the native C++ compiler used for llama.cpp.

Possible solution: You could try running the Java application for longer periods to allow the JIT compiler more time to optimize the hot code paths. You might also experiment with different JVM flags to tune the JIT compiler's behavior.
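
For example, standard HotSpot diagnostics can show what the JIT is doing with the hot loops (the main class name here stands in for however you launch llama3.java):

```sh
# Show JIT compilation and inlining decisions for the hot methods:
java --add-modules jdk.incubator.vector \
     -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining \
     Llama3 ...

# Dump the generated machine code (requires the hsdis disassembler plugin):
java --add-modules jdk.incubator.vector \
     -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
     Llama3 ...
```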

  4. Quantization implementation:
    The implementation of the quantization methods (Q4_0 and Q8_0) in Java might not be as efficient as the C++ counterparts.

Possible solution: Review and optimize the quantization code, possibly by profiling to identify bottlenecks.
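
For context, GGUF's Q8_0 format stores blocks of 32 signed bytes with one float16 scale per block. A scalar sketch of a quantized dot product (hypothetical code, not llama3.java's actual implementation; assumes a little-endian buffer positioned at the row start and a length that is a multiple of the block size):

```java
import java.nio.ByteBuffer;

final class Q8Dot {
    static final int BLOCK_SIZE = 32; // Q8_0 block: float16 scale + 32 int8 weights

    // Hypothetical sketch: dot(quantized row, float activations).
    static float dot(ByteBuffer row, float[] x) {
        float result = 0f;
        for (int b = 0; b < x.length / BLOCK_SIZE; b++) {
            float scale = Float.float16ToFloat(row.getShort()); // per-block scale
            float sum = 0f;
            for (int j = 0; j < BLOCK_SIZE; j++) {
                sum += row.get() * x[b * BLOCK_SIZE + j]; // int8 weight * activation
            }
            result += scale * sum; // apply the scale once per block, not per element
        }
        return result;
    }
}
```

The per-block scale is the structure llama.cpp exploits: it can accumulate integer products with SIMD and multiply by the scale once per block, which is much cheaper than dequantizing every weight to float first.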

  5. Native code integration:
    llama.cpp might be using some platform-specific optimizations that are difficult to replicate in pure Java.

Possible solution: Consider using JNI (Java Native Interface) to integrate some critical C++ code directly, especially for the most performance-sensitive parts of the application.
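
A minimal sketch of what that boundary could look like (library and class names here are hypothetical):

```java
// Hypothetical JNI binding: only the hot kernel crosses the native boundary.
public final class NativeOps {
    static {
        System.loadLibrary("llamakernels"); // hypothetical native library
    }

    // Implemented in C/C++ as Java_NativeOps_vectorDot, e.g. delegating to
    // Accelerate's cblas_sdot or a hand-written NEON kernel.
    public static native float vectorDot(float[] a, float[] b, int length);
}
```

On JDK 22+, the Foreign Function & Memory API (Project Panama) is an alternative that avoids writing JNI glue code.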

  6. Garbage collection:
    Java's garbage collection could be introducing pauses or overhead that's not present in the C++ version.

Possible solution: Profile the application to see if garbage collection is a significant factor. If so, you might need to optimize object allocation and consider using different GC algorithms or tuning GC parameters.
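
A quick check is to run with -Xlog:gc and watch for collections during generation. If allocation turns out to be the issue, the usual fix is to preallocate and reuse buffers (illustrative field names below):

```java
// Illustrative: allocate all per-token scratch state once, up front,
// and reuse it for every token so the generation loop stays allocation-free.
final class InferenceState {
    final float[] hidden;
    final float[] logits;

    InferenceState(int dim, int vocabSize) {
        this.hidden = new float[dim];
        this.logits = new float[vocabSize];
    }
}
```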

  7. Threading model:
    The threading and parallelization strategy in the Java version might not be as efficient as the C++ version on Apple Silicon.

Possible solution: Review the parallelization approach, ensuring it's well-suited to the specific architecture of Apple Silicon chips.
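
For example, a common pattern is one task per output row of a matrix-vector product (illustrative sketch; the common fork-join pool sizes itself to the available cores, including Apple's mix of performance and efficiency cores):

```java
import java.util.stream.IntStream;

final class ParallelMatVec {
    // Illustrative: out = W * x, parallelized across independent output rows.
    static void matvec(float[] w, float[] x, float[] out, int rows, int cols) {
        IntStream.range(0, rows).parallel().forEach(i -> {
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += w[i * cols + j] * x[j];
            }
            out[i] = sum;
        });
    }
}
```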

To address these issues:

  1. Profile the Java application thoroughly to identify the main bottlenecks.
  2. Compare the assembly output of critical sections between llama.cpp and the JIT-compiled Java code to spot any significant differences in instruction usage.
  3. Experiment with different JVM flags and versions to see if there are significant performance variations.
  4. Consider reaching out to the Java community or Apple's developer resources for insights on optimizing Java performance on Apple Silicon.

By systematically addressing these potential issues, you may be able to significantly improve the performance of llama3.java on Apple Silicon, bringing it closer to the performance levels you're seeing with llama.cpp."

RainerXE commented

Hi,
I profiled the app on GraalVM 24ea16 with VisualVM in JIT mode, and it seems that most of the time (94+%) is spent in FloatTensor.vectorDot(). Hope this helps.

RainerXE commented

There must be something wrong with GraalVM 24 EA, at least on the M1 Mac. I tested the benchmarks from this site: https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama#complete-benchmark-results and got MUCH worse results using JEP 338 (the Vector API) than using the baseline (see attached).
m1macbookpro-graalvm 24ea16.txt
