BUG:GPU calling failure when starting first clustering #743

Open
Alchemist-Y opened this issue Jul 23, 2024 · 14 comments

@Alchemist-Y

Alchemist-Y commented Jul 23, 2024

Describe the issue:

Everything runs normally during spike extraction, but when it reaches the first clustering step, Kilosort appears to run the computation on the CPU rather than the GPU.

Reproduce the bug:

The GPU is not used during the first clustering step.

Error message:

No response

Version information:

kilosort: latest version; CUDA toolkit: 11.8; NVIDIA driver: Windows Server 2022 Standard; GPU: NVIDIA RTX A2000 12GB

@Alchemist-Y changed the title from "BUG:GPU calling failure when starting first clusting" to "BUG:GPU calling failure when starting first clustering" on Jul 23, 2024
@jacobpennington
Collaborator

I'm sorry, I'm having a hard time understanding the issue. What do you mean it "automatically chooses the cpu?" How are you determining that? Are you getting an error during sorting? What version of Kilosort4 is this?

@Alchemist-Y
Author

Alchemist-Y commented Jul 24, 2024

@jacobpennington Hi, I'm sorry about my unclear description. The Kilosort version is 4.0.13. What I mean is that I selected the GPU as the PyTorch device in the GUI, but during the clustering step the GPU shows no activity at all in Task Manager, while the CPU runs at 100%.

@jacobpennington
Collaborator

The task manager is not an accurate way to check GPU usage for python processes. If you want to check usage while sorting is running, you can use the nvidia-smi command in a terminal or powershell.
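
As a rough illustration of that suggestion, here is a minimal Python sketch that polls `nvidia-smi` while sorting runs in another process. It assumes `nvidia-smi` is on the PATH. The helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical and are not part of Kilosort.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def _to_int(field):
    """Convert one CSV field to int, or None if nvidia-smi reports no value."""
    try:
        return int(field.strip())
    except ValueError:
        # Some GPUs or drivers report e.g. "[N/A]" for a field.
        return None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, used, total = (_to_int(field) for field in line.split(","))
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def query_gpu_stats():
    """Run nvidia-smi once; return parsed stats, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` every few seconds during the slow step gives a simple record of utilization and memory use to attach to a report like this one.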

@Alchemist-Y
Author

@jacobpennington Hi, here is detailed information about GPU usage during the spike extraction step and the clustering step.
image
image

@jacobpennington
Collaborator

Okay. Can you please clarify if you get an error during sorting, or think it's taking too long, or some other problem? Not all steps of the sorting process use the GPU.

@Alchemist-Y
Author

> Okay. Can you please clarify if you get an error during sorting, or think it's taking too long, or some other problem? Not all steps of the sorting process use the GPU.

Yes, the first clustering step takes too long. On my own computer, which has an NVIDIA 1650S 4GB GPU, the entire sorting finishes in about 3 hours. On this Windows server, the first clustering step alone ran for 2 hours without making any progress.

@jacobpennington
Collaborator

jacobpennington commented Jul 27, 2024

Are you sure your pytorch installation is set up correctly? You said you're using CUDA toolkit version 11.8, but your screenshots show version 12.4. If you are using 12.4, I would recommend you try setting up a new environment using toolkit version 11.8 and then see if the issue persists. Some other users have reported difficulties using 12.4.
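
One way to check this from inside the sorting environment is to ask PyTorch directly. The sketch below is only a diagnostic aid; the function name `cuda_report` is made up for illustration. Note that the "CUDA Version" in the `nvidia-smi` header is the highest version the driver supports, so it can differ from the version PyTorch was built against.

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in this environment."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # False means PyTorch cannot use the GPU from this environment.
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the installed PyTorch is most likely a CPU-only build or cannot see the driver.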

@Alchemist-Y
Author

@jacobpennington Hi, the version shown in the figure above is the highest CUDA version my server supports. What I have installed is CUDA 11.8, as shown below.
image

@jacobpennington
Collaborator

Ah, I see. Unfortunately, I'm not sure what else I can try to debug for you if the sorting works fine on a different machine. The next thing I would try is the following, just to make sure nothing got messed up with the installation and there aren't any conflicts with other packages:

  1. Restart the machine.
  2. Create a new conda environment to use only for Kilosort (it looks like you're using the base environment right now).
  3. Follow the steps in the readme again to install Kilosort and pytorch.
  4. Retry the sorting.

If you still get the same error after trying that, please upload kilosort4.log from the results directory from the new sorting attempt. Screenshots of the Kilosort4 GUI with that recording loaded might also be helpful, if you're using the GUI.
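
To confirm that sorting is really running from the new environment rather than `base`, a small check like the following can help. It is a sketch that relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the function name `describe_environment` is hypothetical.

```python
import os
import sys


def describe_environment(environ=None):
    """Report which conda environment (if any) the interpreter is running in."""
    environ = os.environ if environ is None else environ
    name = environ.get("CONDA_DEFAULT_ENV")  # set by `conda activate`
    return {
        "conda_env": name,
        "is_base": name == "base",
        # Path of the running interpreter, useful to spot a mismatched install.
        "python": sys.executable,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that launches Kilosort shows whether the dedicated environment is active.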

@Alchemist-Y
Author

Hi Jacob, here is my log. The first clustering step was taking too long, so I interrupted it.
kilosort4.log

@RobertoDF
Contributor

RobertoDF commented Aug 29, 2024

I have exactly the same problem: "first clustering" is extremely slow (>200 s/it), probably because the GPU is not being used, while the first 2 steps run at 2 it/s as expected. In Task Manager I can isolate CUDA usage, and it drops at the "first clustering" step. I am calling Kilosort directly, not via SpikeInterface. I reinstalled the CUDA toolkit and PyTorch, but saw no improvement.

@RobertoDF
Contributor

I was able to solve this by using a totally fresh environment, not just uninstalling and reinstalling. However, now I get the CUDA out of memory error 😰
