[QUESTION] Weird consumption of CPU/GPU (or lack of) #147

Open
jcpraud opened this issue Oct 4, 2024 · 2 comments

jcpraud commented Oct 4, 2024

Question or Issue

Hi all,

I just installed and began to test NexaAI on my Win11 laptop... I first tested Qwen2.5:7b LLM, and...

  • The same tokens/s as running the same LLM on Ollama in the same setup, but:
  • CPU usage between 5-20%
  • GPU usage (CUDA, with an RTX 3050 with 4 GB VRAM): ZERO %
  • Ollama with the same model (and same prompts) consumes about 30% GPU and 50-70% CPU on EXACTLY the same setup.

What kind of magic is this?
Or more probably, what did I miss?

As a security specialist, and thus paranoid, I even ran my tests without any network connection to prevent cheating with online LLMs ;)

Cheers,
JC

OS

Windows 11

Python Version

3.12.7

Nexa SDK Version

0.0.8.6

GPU (if using one)

NVIDIA RTX 3050 Ti

@zhycheng614
Collaborator

Thank you so much for bringing this question up. First of all, I would like to make sure that you are using the CUDA version of the Nexa SDK for Windows:

$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
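A quick way to sanity-check this (a minimal sketch; it assumes `pip` and the NVIDIA driver's `nvidia-smi` are on your PATH) is to confirm which build is installed and then watch GPU utilization while a prompt is generating:

```powershell
# Show the installed nexaai version and its install location.
pip show nexaai

# Refresh GPU utilization/VRAM stats every second; run a prompt in
# another terminal. Sustained non-zero GPU-Util means CUDA offload
# is actually being used.
nvidia-smi -l 1
```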

If you confirm the installation, then we can move forward with your question. However, as a developer of this SDK, I can guarantee that all inference (which can be verified in our open-source repo) runs completely on-device, with no online LLM involved :).

@zhycheng614 zhycheng614 self-assigned this Oct 4, 2024
@jcpraud
Author

jcpraud commented Oct 5, 2024

Yes, I used this command line for the install (as explained on the project page: https://github.com/NexaAI/nexa-sdk):

set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
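One caveat worth noting (an assumption on my part that it applies to this install): in cmd, `set` stores the surrounding quotes as part of the variable's value, which can garble the flags that reach CMake; the PowerShell form from the previous comment doesn't have this problem. A quick demonstration:

```bat
:: cmd keeps the quotes in the value:
set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"
echo %CMAKE_ARGS%
:: prints "-DGGML_CUDA=ON -DSD_CUBLAS=ON" (quotes included)

:: the quote-free form stores exactly the flags:
set CMAKE_ARGS=-DGGML_CUDA=ON -DSD_CUBLAS=ON
```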

I tested gemma2-9b; there are activity spikes on the GPU: 100% usage every 3-4 (and up to 10) seconds. 3.8 GB of the GPU's 4 GB VRAM is used, and CPU usage is at 17%.
The GPU consumption remains lower than when running the same model on Ollama, which continuously consumes 50% CPU and 30-40% GPU, and only 2.8 GB of GPU VRAM.
Token output seems 1.5 to 2x quicker on Ollama than on Nexa, so no magic, in fact :)
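For a repeatable side-by-side comparison (a sketch, assuming a recent NVIDIA driver where `nvidia-smi` is available), per-second utilization and memory figures for both runs can be logged from the driver side:

```powershell
# Log GPU utilization (u) and frame-buffer memory usage (m) once per
# second; redirect to a file per run, then compare the two logs.
nvidia-smi dmon -s um -d 1
```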

I'm planning to test further at work next week, on a VM with more CPU and RAM but no GPU. Ollama is of course far slower in that environment; I'll compare Nexa's loss of performance with the same models and prompts.
