Fixed bug where cuda reserved memory climbs throughout process while allocated memory stays low #758

Closed
Lathomas42 wants to merge 2 commits

Conversation

@Lathomas42 commented Aug 7, 2024

This seems to be a bug in torch. It appears that when a fragment of a large torch tensor is copied or referenced, the whole tensor stays in GPU memory as reserved memory even after the variable goes out of scope (Xg in cluster). You can force torch to release this memory by calling empty_cache; a minimal sketch of the behavior is shown below the specs. I am not sure if this is specific to my setup, but my system specs are:

GPU: 1080 Ti
OS: ubuntu 20.04
Cuda: 11.8
Torch: 2.3.1+cu118
Kilosort: 0.1.dev1248+gc664741
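
Here is a minimal standalone sketch of the kind of behavior I mean (not Kilosort code; the sizes and names are just illustrative):

import torch

device = torch.device('cuda')

def make_fragment():
    Xg = torch.zeros((20000, 1000), device=device)  # large working tensor (~80 MB)
    return Xg[:10].clone()  # keep only a small copy; Xg goes out of scope on return

frag = make_fragment()
print(torch.cuda.memory_allocated(device))  # small: only the fragment is live
print(torch.cuda.memory_reserved(device))   # still large: the old block stays cached

torch.cuda.empty_cache()                    # hand unused cached blocks back to the driver
print(torch.cuda.memory_reserved(device))   # much lower now

Calling empty_cache after the fragment is made brings reserved memory back down, which is essentially what this change does on each clustering iteration.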

The impact of this change is easy to see by adding the following after the cluster call:

torch.cuda.reset_peak_memory_stats(device=device)  # reset the peak counters
print(torch.cuda.memory_reserved(device=device))   # memory held by torch's caching allocator
# or print(torch.cuda.memory_summary()) for a full breakdown

I think this is related to bugs:
#746
#670
#743

After this change I can sort a file that would fail 100% of the time without it; when I revert the change, sorting fails again. My GPU memory consumption is actually drastically lower with this change.

… climb. Some local variables appear not to be properly released by the garbage collector, probably due to fragmentation.
@jacobpennington self-requested a review August 7, 2024 22:03
@jacobpennington (Collaborator) commented Aug 7, 2024

@Lathomas42 Are you able to share the data you're seeing this issue with? Also, can you please share the error message you're getting without making that change?

I want to look into this more before making that change, because that is not how reserved memory works. Clearing the reserved cache on each iteration can slow down clustering substantially, because it forces pytorch to request new memory each time to allocate the new tensors, whereas reserved memory is already available for allocating new tensors. It does not mean that a tensor is still occupying that memory. Most likely this is instead pointing to a memory fragmentation issue that might be fixable without a performance hit.
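
To illustrate the distinction with a standalone sketch (sizes arbitrary): a freed tensor shows up as reserved-but-not-allocated memory, and the allocator reuses that cached block for new tensors without going back to the CUDA driver.

import torch

device = torch.device('cuda')

a = torch.zeros((20000, 1000), device=device)
del a
# The tensor is gone: allocated drops to ~0, but its block stays reserved (cached).
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))

# A new tensor of the same size is served from that cached block with no new
# request to the driver; this reuse is what clearing the cache every iteration gives up.
b = torch.zeros((20000, 1000), device=device)
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))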

@Lathomas42 (Author)

@jacobpennington Sounds good. I figured that was the case; it only slowed down my clustering by a small percentage. I totally understand your desire to fix it properly. I spent a long time trying to track down exactly where this issue comes from using torch's memory profiling tools and still have no idea. I was more or less putting this here so that people hitting these bugs can add these lines of code and get their data sorted. Not being able to sort this file held me up for a week or so.

My error message is the same as in #746; it usually crashes on some line in the cluster function, such as vexp = 2 * Xg @ Xc.T - (Xc**2).sum(1).

@jacobpennington (Collaborator) commented Aug 11, 2024

I'm going to close this because I've added the change as an optional feature with v4.0.15 (using the clear_cache argument in run_kilosort or through the GUI). I will continue looking into where the memory fragmentation could be coming from, but that should address the issue in the meantime.
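
For anyone finding this later, a rough sketch of how the option is used from Python. The settings values below are placeholders in the style of the README example, so adjust them for your own recording and probe:

from kilosort import run_kilosort

# Placeholder settings; replace with your own data path and channel count,
# and specify a probe as you normally would.
settings = {'data_dir': '/path/to/binary/data', 'n_chan_bin': 385}

# clear_cache=True enables the periodic torch.cuda.empty_cache() call discussed
# in this PR, trading some clustering speed for lower reserved-memory growth.
results = run_kilosort(settings=settings, clear_cache=True)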
