CUDA out of memory #18

Open
ybu-lxd opened this issue May 16, 2024 · 1 comment
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


ybu-lxd commented May 16, 2024

import torch
import torch.nn as nn
from kan import KAN  # KAN layer; import path assumed for the pykan package


class KanMLP(nn.Module):
    """MLP block built from KAN layers instead of nn.Linear."""
    def __init__(self,
                 in_features=1152,
                 hidden_features=None,
                 out_features=None,
                 drop=0.):
        super().__init__()

        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.mlp = nn.ModuleDict(
            dict(
                c_fc=KAN(width=[in_features, hidden_features]),
                c_proj=KAN(width=[hidden_features, out_features]),
                act=nn.GELU(approximate="tanh"),  # tanh-approximate GELU (the original NewGELU was undefined)
                dropout=nn.Dropout(drop),
            )
        )
        m = self.mlp
        self.mlpf = lambda x: m.dropout(
            m.c_proj(m.act(m.c_fc(x)))
        )  # MLP forward

    def forward(self, x):
        x = self.mlpf(x)
        return x


net = KanMLP(1152, 1152 * 4).to("cuda")
x = torch.rand(size=(4, 4096 * 4, 1152)).to("cuda")
net(x)  # fixed typo: was nex(x)

When the number of tokens reaches a certain size, the following error occurs:

 CUDA out of memory.

ybu-lxd added the bug and help wanted labels on May 16, 2024
@mfrederico

Hello! Can you answer these questions?

  • What GPU are you using?
  • RAM size of GPU
  • Model (if applicable)

I dropped your code into Claude, and hopefully this gives you some indication:

The main reason you're running out of CUDA memory is the large size of your input tensor. Let's break down the memory usage:

Input tensor x:
  • Shape: (4, 4096*4, 1152) = (4, 16384, 1152)
  • Elements: 4 * 16384 * 1152 = 75,497,472
  • Assuming float32 (4 bytes per element), this tensor alone requires about 302 MB of memory.

Network parameters:
  • Input size: 1152
  • Hidden size: 1152 * 4 = 4608
  • This results in two large matrices in the KAN layers, each potentially using significant memory.

Intermediate activations:
  • The forward pass will create several large intermediate tensors, further increasing memory usage.
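As a quick sanity check of the input-tensor estimate, here is the same arithmetic expressed in PyTorch:

import torch

x = torch.rand(size=(4, 4096 * 4, 1152))                        # same shape as in the report
n_bytes = x.nelement() * x.element_size()                       # 75,497,472 elements * 4 bytes (float32)
print(f"{x.nelement():,} elements, ~{n_bytes / 1e6:.0f} MB")    # ~302 MB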

To address this issue, you can try the following approaches:

Reduce batch size:
Instead of processing 4 samples at once, try reducing it to 1 or 2:

x = torch.rand(size=(1, 4096*4, 1152)).to("cuda")

Use gradient accumulation:
If you need to process larger batches for training stability, you can use gradient accumulation. This involves processing smaller sub-batches and accumulating gradients before performing an optimization step.
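A rough sketch of what that can look like for this model (the optimizer, loss, and targets below are placeholders for illustration, not taken from the original post):

opt = torch.optim.AdamW(net.parameters())
loss_fn = nn.MSELoss()           # placeholder loss
accum_steps = 4                  # accumulate over 4 sub-batches of size 1

opt.zero_grad()
for i, sub_x in enumerate(x.split(1, dim=0)):          # process one sample at a time
    target = torch.zeros_like(sub_x)                   # placeholder target
    loss = loss_fn(net(sub_x), target) / accum_steps   # scale so accumulated gradients average out
    loss.backward()                                    # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        opt.step()                                     # one optimizer step per accumulated batch
        opt.zero_grad()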
Use mixed precision training:
Utilize float16 (half-precision) computations to reduce memory usage. You can use NVIDIA's Apex library or PyTorch's native AMP (Automatic Mixed Precision):

from torch.cuda.amp import autocast

with autocast():
    output = net(x)
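If this runs inside a training loop rather than a single forward pass, the usual AMP pattern also uses a GradScaler; a rough sketch (the optimizer and loss are placeholders, not from the original post):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
opt = torch.optim.AdamW(net.parameters())

opt.zero_grad()
with autocast():                      # forward pass in mixed precision
    output = net(x)
    loss = output.float().mean()      # placeholder loss
scaler.scale(loss).backward()         # scale the loss to avoid float16 gradient underflow
scaler.step(opt)                      # unscales gradients, then runs the optimizer step
scaler.update()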

Optimize your model architecture:
Consider if you can reduce the size of your hidden layers or use more memory-efficient architectures.
Use gradient checkpointing:
This technique trades computation for memory by not storing all intermediate activations.
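A minimal sketch of what that could look like for the block above (the subclass name is just for illustration; checkpointing only helps while gradients are being computed):

from torch.utils.checkpoint import checkpoint

class CheckpointedKanMLP(KanMLP):
    def forward(self, x):
        # recompute the block's intermediate activations during backward instead of storing them
        return checkpoint(self.mlpf, x, use_reentrant=False)

net = CheckpointedKanMLP(1152, 1152 * 4).to("cuda")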
If possible, process your data in smaller chunks:
Instead of processing the entire 4096*4 sequence length at once, you might be able to process it in smaller segments.
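For instance, assuming the block is applied independently per token (as a transformer MLP block would be), the sequence dimension can be split and reassembled; a sketch with an arbitrary chunk size of 2048, using torch.no_grad() for an inference-only pass:

chunks = []
with torch.no_grad():                       # inference-only example
    for sub_x in x.split(2048, dim=1):      # split the 16384-token sequence into chunks
        chunks.append(net(sub_x))
out = torch.cat(chunks, dim=1)              # reassemble to shape (4, 16384, 1152)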
