GPU not being used during inference #112
Comments
The problem appears to be related to llama-cpp-python not being built with CUDA support; rebuilding it with the CUDA flags enabled got the GPU working for me. To help others experiencing the same issue, this was my path to success.
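For reference, a forced CUDA rebuild of llama-cpp-python typically looks roughly like the sketch below. This is an assumption about the usual commands, not the exact steps from this thread, and the CMake flag name differs between llama-cpp-python versions.

```bash
# Sketch of a CUDA-enabled rebuild of llama-cpp-python (typical commands, not
# the exact ones used in this thread). Older llama-cpp-python releases use
# -DLLAMA_CUBLAS=on; newer ones use -DGGML_CUDA=on instead.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```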
Thanks for taking the time to investigate this. I will update setvars.sh and the documentation based on your findings. You are right: llama.cpp is compiled during installation and checks these flags to enable GPU support. The project will continue to support llama.cpp going forward, but I would encourage people to use external frameworks that expose an OpenAI-compatible API for inference (e.g. LiteLLM + Ollama). That lets me focus on the core RAG functionality, which is the main purpose of this project.
There wasn't an error when using Python 3.11; it just didn't compile with CUDA support, apparently. I didn't dig into it. It could have been a faulty Python environment or similar. I was just happy to finally have CUDA working. Do you plan on adding example install and config instructions for using LiteLLM / Ollama? I usually just copy and paste from the documentation to get the project up and running as quickly as possible to kick the tires.
The relevant config is here: https://llm-search.readthedocs.io/en/latest/configure_model.html#ollama-litellm, but I can add more elaborate instructions.
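As a general illustration of the LiteLLM + Ollama route mentioned above (this is not llm-search's own config; the model name and endpoint are assumptions, and the actual wiring into the project is described at the link):

```python
# Minimal sketch of calling a local Ollama model through LiteLLM.
# Assumes an Ollama server is running on the default port (11434) and that a
# model such as "llama3" has already been pulled with `ollama pull llama3`.
from litellm import completion

response = completion(
    model="ollama/llama3",              # LiteLLM routes this to the Ollama backend
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
    api_base="http://localhost:11434",  # local Ollama endpoint
)
print(response.choices[0].message.content)
```

Because the inference then happens inside Ollama (or whatever OpenAI-compatible server you point LiteLLM at), GPU offloading is handled by that server rather than by the llama-cpp-python build inside this project.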
Version affected: v0.7.1 (current main)
I initially assumed the issue was with my system: outdated NVIDIA drivers, CUDA, etc. But after trying on four separate machines running different mixes of Debian 11 and 12 and Ubuntu 20.04 and 22.04, I haven't been able to get it working properly.
The GPU is being used during indexing but not during interaction/inference.
The GPU is used by other projects like https://github.com/oobabooga/text-generation-webui, YOLO, CVAT, etc.
I've tried different combinations of Python 3.10 and 3.11 using venv and conda environments. I've installed PyTorch manually. I've tried several CUDA versions from 11.8 to 12.4.
Older installs of llmsearch work fine and use the GPU; I believe my first install was v0.4.x. So it seems the issue may be a specific environment variable I've missed, or possibly the pyllmsearch package itself.
I've installed llmsearch via pip and built it from the GitHub repo; there was no difference. I'm not seeing any warnings or notices that might indicate an issue with my machines' configuration.
To reiterate, the four machines I've used all have working CUDA installs and are able to use the GPU when running projects like YOLO, CVAT, and text-generation-webui.
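For anyone debugging a similar setup, a quick sanity check of whether the Python environment itself can see the GPU (independent of this project) might look like the sketch below. Note it only verifies the CUDA runtime as seen by PyTorch; a CPU-only llama-cpp-python build will still pass it, so it narrows the problem down rather than ruling it out.

```python
# Environment sanity check: can this Python env see the GPU at all?
# Passing this check does not prove llama-cpp-python was built with GPU offload.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime seen by PyTorch:", torch.version.cuda)
```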
Here is the output of one of the interaction sessions:
You can see that only the CPU is being used, with no mention of GPU caching or offloading.
I am open to any ideas or suggestions, thanks!