Fixed bug where cuda reserved memory climbs throughout process while allocated memory stays low #758

Closed
Lathomas42 wants to merge 2 commits

Conversation

@Lathomas42 commented Aug 7, 2024

This seems to be a bug in torch. It appears that when a fragment of a large torch tensor is copied or referenced, the whole tensor stays in GPU memory as reserved memory even after the variable goes out of scope (Xg in cluster). You can force torch to release this memory by calling empty_cache; a minimal sketch of the behavior is shown below the specs. I am not sure if this is specific to my setup, but my system specs are:

GPU: 1080 Ti
OS: ubuntu 20.04
Cuda: 11.8
Torch: 2.3.1+cu118
Kilosort: 0.1.dev1248+gc664741
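
Here is a minimal standalone sketch of the kind of behavior I mean (not Kilosort code; the sizes and names are just illustrative):

import torch

device = torch.device('cuda')

def make_fragment():
    Xg = torch.zeros((20000, 1000), device=device)  # large working tensor (~80 MB)
    return Xg[:10].clone()  # keep only a small copy; Xg goes out of scope on return

frag = make_fragment()
print(torch.cuda.memory_allocated(device))  # small: only the fragment is live
print(torch.cuda.memory_reserved(device))   # still large: the old block stays cached

torch.cuda.empty_cache()                    # hand unused cached blocks back to the driver
print(torch.cuda.memory_reserved(device))   # much lower now

Calling empty_cache after the fragment is made brings reserved memory back down, which is essentially what this change does on each clustering iteration.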

The impact of this change is easy to see by adding the following after the cluster call:

torch.cuda.reset_peak_memory_stats(device=device)  # reset the peak counters
print(torch.cuda.memory_reserved(device=device))   # memory held by torch's caching allocator
# or print(torch.cuda.memory_summary()) for a full breakdown

I think this is related to bugs:
#746
#670
#743

After this change I can sort a file that would fail 100% of the time without it; when I revert the change, sorting fails again. My GPU memory consumption is actually drastically lower with this change.

… climb. Some local variables appear not to be properly released by the garbage collector, probably due to fragmentation.
@jacobpennington self-requested a review August 7, 2024 22:03
@jacobpennington (Collaborator) commented Aug 7, 2024

@Lathomas42 Are you able to share the data you're seeing this issue with? Also, can you please share the error message you're getting without making that change?

I want to look into this more before making that change, because that is not how reserved memory works. Clearing the reserved cache on each iteration can slow down clustering substantially, because it forces pytorch to request new memory each time to allocate the new tensors, whereas reserved memory is already available for allocating new tensors. It does not mean that a tensor is still occupying that memory. Most likely this is instead pointing to a memory fragmentation issue that might be fixable without a performance hit.
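
To illustrate the distinction with a standalone sketch (sizes arbitrary): a freed tensor shows up as reserved-but-not-allocated memory, and the allocator reuses that cached block for new tensors without going back to the CUDA driver.

import torch

device = torch.device('cuda')

a = torch.zeros((20000, 1000), device=device)
del a
# The tensor is gone: allocated drops to ~0, but its block stays reserved (cached).
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))

# A new tensor of the same size is served from that cached block with no new
# request to the driver; this reuse is what clearing the cache every iteration gives up.
b = torch.zeros((20000, 1000), device=device)
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))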

@Lathomas42 (Author)

@jacobpennington Sounds good. I figured that was the case; it only slowed down my clustering by a small percentage. I totally understand your desire to fix it properly. I spent a long time trying to track down exactly where this issue comes from using torch's memory profiling tools and still have no idea. I was more or less putting this here so that people hitting these bugs can add these lines of code and get their data sorted. Not being able to sort this file held me up for a week or so.

My error message is the same as in #746; it usually crashes on some line in the cluster function, such as vexp = 2 * Xg @ Xc.T - (Xc**2).sum(1).

@jacobpennington (Collaborator) commented Aug 11, 2024

I'm going to close this because I've added the change as an optional feature with v4.0.15 (using the clear_cache argument in run_kilosort or through the GUI). I will continue looking into where the memory fragmentation could be coming from, but that should address the issue in the meantime.
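
For anyone finding this later, a rough sketch of how the option is used from Python. The settings values below are placeholders in the style of the README example, so adjust them for your own recording and probe:

from kilosort import run_kilosort

# Placeholder settings; replace with your own data path and channel count,
# and specify a probe as you normally would.
settings = {'data_dir': '/path/to/binary/data', 'n_chan_bin': 385}

# clear_cache=True enables the periodic torch.cuda.empty_cache() call discussed
# in this PR, trading some clustering speed for lower reserved-memory growth.
results = run_kilosort(settings=settings, clear_cache=True)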
