[QUESTION] Weird consumption of CPU/GPU (or lack of) #147

Open
jcpraud opened this issue Oct 4, 2024 · 2 comments

jcpraud commented Oct 4, 2024

Question or Issue

Hi all,

I just installed and began to test NexaAI on my Win11 laptop... I first tested Qwen2.5:7b LLM, and...

  • The same tokens/s as running the same LLM on Ollama in the same setup, but:
  • CPU usage between 5-20%
  • GPU usage (CUDA, with an RTX 3050 with 4 GB VRAM): ZERO %
  • Ollama with the same model (and same prompts) consumes about 30% GPU and 50-70% CPU on EXACTLY the same setup.

What kind of magic is this?
Or more probably, what did I miss?

As a security specialist, and thus paranoid, I even ran my tests without any network connection to prevent cheating with online LLMs ;)

Cheers,
JC

OS

Windows 11

Python Version

3.12.7

Nexa SDK Version

0.0.8.6

GPU (if using one)

NVIDIA RTX 3050 Ti

@zhycheng614
Collaborator

Thank you so much for bringing this question up. First of all, I would like to make sure that you are using the CUDA version of the Nexa SDK for Windows:

$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
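A quick way to sanity-check this (a minimal sketch; it assumes `pip` and the NVIDIA driver's `nvidia-smi` are on your PATH) is to confirm which build is installed and then watch GPU utilization while a prompt is generating:

```powershell
# Show the installed nexaai version and its install location.
pip show nexaai

# Refresh GPU utilization/VRAM stats every second; run a prompt in
# another terminal. Sustained non-zero GPU-Util means CUDA offload
# is actually being used.
nvidia-smi -l 1
```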

If you confirm the installation, then we can move forward with your question. However, as a developer of this SDK, I can guarantee that all inference (which can be verified in our open-source repo) runs completely on-device, with no online LLM involved :).

@zhycheng614 zhycheng614 self-assigned this Oct 4, 2024
@jcpraud
Author

jcpraud commented Oct 5, 2024

Yes, I used this command line for the install (as explained on the project page: https://github.com/NexaAI/nexa-sdk):

set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
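One caveat worth noting (an assumption on my part that it applies to this install): in cmd, `set` stores the surrounding quotes as part of the variable's value, which can garble the flags that reach CMake; the PowerShell form from the previous comment doesn't have this problem. A quick demonstration:

```bat
:: cmd keeps the quotes in the value:
set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"
echo %CMAKE_ARGS%
:: prints "-DGGML_CUDA=ON -DSD_CUBLAS=ON" (quotes included)

:: the quote-free form stores exactly the flags:
set CMAKE_ARGS=-DGGML_CUDA=ON -DSD_CUBLAS=ON
```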

I tested gemma2-9b; there are activity spikes on the GPU: 100% usage every 3-4 (and up to 10) seconds. 3.8 GB of the GPU's 4 GB VRAM is used, and CPU usage is at 17%.
The GPU consumption remains lower than when running the same model on Ollama, which continuously consumes 50% CPU and 30-40% GPU, and only 2.8 GB of GPU VRAM.
Token output seems 1.5 to 2x quicker on Ollama than on Nexa, so no magic, in fact :)
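For a repeatable side-by-side comparison (a sketch, assuming a recent NVIDIA driver where `nvidia-smi` is available), per-second utilization and memory figures for both runs can be logged from the driver side:

```powershell
# Log GPU utilization (u) and frame-buffer memory usage (m) once per
# second; redirect to a file per run, then compare the two logs.
nvidia-smi dmon -s um -d 1
```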

I'm planning to test further at work next week, on a VM with more CPU and RAM but no GPU. Ollama is of course far slower in that environment; I'll compare Nexa's loss of performance with the same models and prompts.
