
CUDA Out of Memory Error on a 32GB GPU when Running trainer.py for Fine-Tuning #141

xlnn opened this issue Oct 29, 2024 · 8 comments


xlnn commented Oct 29, 2024

Description:
Hello, I encountered a torch.cuda.OutOfMemoryError while fine-tuning a model using trainer.py. My setup includes only a single GPU with 32GB of memory, and the error occurs even at the beginning of training.

My modified trainer.py:

import argparse, os, sys, datetime
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from omegaconf import OmegaConf
from transformers import logging as transf_logging
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.trainer import Trainer
import torch
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils.utils import instantiate_from_config
from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy
from utils_train import set_logger, init_workspace, load_checkpoints


def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything")
    parser.add_argument("--name", "-n", type=str, default="training_1024_v1.0", help="experiment name, as saving folder")

    # parser.add_argument("--base", "-b", nargs="*", metavar="base_config.yaml", help="paths to base configs. Loaded from left-to-right. "
    #                         "Parameters can be overwritten or added with command-line options of the form `--key value`.", default=list())

    parser.add_argument(
        "--base",
        "-b",
        nargs="*",
        metavar="base_config.yaml",
        help=(
            "Paths to base configs. Loaded from left-to-right. "
            "Parameters can be overwritten or added with command-line options of the form `--key value`."
        ),
        default=["/home/cherry2025/DynamiCrafter/configs/training_1024_v1.0/config.yaml"]
    )

    parser.add_argument("--train", "-t", action='store_true', default=True, help='train')
    parser.add_argument("--val", "-v", action='store_true', default=False, help='val')
    parser.add_argument("--test", action='store_true', default=False, help='test')

    parser.add_argument("--logdir", "-l", type=str, default="/home/cherry2025/DynamiCrafter/train_check", help="directory for logging dat shit")
    parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint")
    parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint")
    parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging")

    return parser

def get_nondefault_trainer_args(args):
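    ## compare the parsed args against a default Trainer namespace and return
    ## the names of Trainer options that were overridden on the command line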
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    default_trainer_args = parser.parse_args([])
    return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k))


if __name__ == "__main__":
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    ## read the distributed ranks from the launcher environment, defaulting to a
    ## single-process run; WORLD_SIZE defaults to 1 so the learning-rate scaling
    ## below never multiplies by zero when the variable is unset
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    global_rank = int(os.environ.get('RANK', 0))
    num_rank = int(os.environ.get('WORLD_SIZE', 1))

    parser = get_parser()
    ## Extends existing argparse by default Trainer attributes
    parser = Trainer.add_argparse_args(parser)
    args, unknown = parser.parse_known_args()
    ## disable transformer warning
    transf_logging.set_verbosity_error()
    seed_everything(args.seed)

    ## yaml configs: "model" | "data" | "lightning"
    configs = [OmegaConf.load(cfg) for cfg in args.base]

    cli = OmegaConf.from_dotlist(unknown)
    config = OmegaConf.merge(*configs, cli)
    lightning_config = config.pop("lightning", OmegaConf.create())
    trainer_config = lightning_config.get("trainer", OmegaConf.create()) 

    ## setup workspace directories
    workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank)
    logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now)))
    logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__))  

    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Model *****")
    config.model.params.logdir = workdir
    model = instantiate_from_config(config.model)

    ## load checkpoints
    model = load_checkpoints(model, config.model)

    ## register_schedule again to make ZTSNR work
    if model.rescale_betas_zero_snr:
        model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps,
                                linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s)

    ## update trainer config
    for k in get_nondefault_trainer_args(args):
        trainer_config[k] = getattr(args, k)
        
    ## hardcoded for this single-node, single-GPU run; the config's
    ## trainer.num_nodes and trainer.devices values are bypassed here
    num_nodes = 1
    ngpu_per_node = 1
    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")

    ## setup learning rate
    base_lr = config.model.base_learning_rate
    bs = config.data.params.batch_size
    if getattr(config.model, 'scale_lr', True):
        model.learning_rate = num_rank * bs * base_lr
    else:
        model.learning_rate = base_lr
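    ## worked example: with WORLD_SIZE=4, batch_size=1 and base_lr=1.0e-05,
    ## scale_lr=True would give learning_rate = 4 * 1 * 1.0e-05 = 4.0e-05;
    ## the config in this issue sets scale_lr: False, so it stays at 1.0e-05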


    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Data *****")
    data = instantiate_from_config(config.data)
    data.setup()
    for k in data.datasets:
        logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")


    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Trainer *****")
    if "accelerator" not in trainer_config:
        trainer_config["accelerator"] = "gpu"

    ## setup trainer args: pl-logger and callbacks
    trainer_kwargs = dict()
    trainer_kwargs["num_sanity_val_steps"] = 0
    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
    
    ## setup callbacks
    callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger)
    trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
    strategy_cfg = get_trainer_strategy(lightning_config)
    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg)
    trainer_kwargs['precision'] = lightning_config.get('precision', 32)
    trainer_kwargs["sync_batchnorm"] = False

    ## trainer config: others

    trainer_args = argparse.Namespace(**trainer_config)
    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)

    ## allow checkpointing via USR1
    def melk(*args, **kwargs):
        ## run all checkpoint hooks
        if trainer.global_rank == 0:
            print("Summoning checkpoint.")
            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
            trainer.save_checkpoint(ckpt_path)

    def divein(*args, **kwargs):
        if trainer.global_rank == 0:
            import pudb
            pudb.set_trace()

    import signal
    signal.signal(signal.SIGUSR1, melk)
    signal.signal(signal.SIGUSR2, divein)
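    ## usage note: from a shell, `kill -USR1 <pid>` asks rank 0 to save
    ## last_summoning.ckpt, and `kill -USR2 <pid>` drops rank 0 into the
    ## pudb debugger (assumes the pudb package is installed)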

    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Running the Loop *****")
    if args.train:
        try:
            if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'):
                logger.info("<Training in DeepSpeed Mode>")
                ## deepspeed
                if trainer_kwargs['precision'] == 16:
                    with torch.cuda.amp.autocast():
                        trainer.fit(model, data)
                else:
                    trainer.fit(model, data)
            else:
                logger.info("<Training in DDPSharded Mode>") ## this is default
                ## ddpsharded
                trainer.fit(model, data)
        except Exception:
            # melk()  ## optionally snapshot a checkpoint before re-raising
            raise

    # if args.val:
    #     trainer.validate(model, data)
    # if args.test or not trainer.interrupted:
    #     trainer.test(model, data)

Error Message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 31.73 GiB total capacity; 29.84 GiB already allocated; 80.19 MiB free; 30.33 GiB reserved in total by PyTorch). 
If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
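
For reference, the max_split_size_mb setting mentioned in the message is controlled through an environment variable that must be set before training starts. A minimal sketch (the 512 MiB cap is an arbitrary starting value to tune, not a verified fix for this model):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
python ./main/trainer.py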

This error occurred at:

Epoch 0:   0%|          | 1/80000 [00:53<1195:43:13, 53.81s/it, loss=0.347, v_num=6, train/loss_simple_step=0.347, train/loss_vlb_step=0.347, train/loss_step=0.347]

Steps to Reproduce:

  1. Run trainer.py with a single 32GB GPU.
  2. Start the fine-tuning process.
  3. Error occurs shortly after the first epoch begins.

My Setup:

  • GPU: Single 32GB GPU

Solutions Tried:

  1. Reduced the batch size to minimize memory usage.

Thank you for your assistance!


Doubiiu commented Oct 29, 2024

Hi, what is your batch size (bs) and your config.yaml? And can you try with more GPUs?


xlnn commented Oct 29, 2024

> Hi, what is your batch size (bs) and your config.yaml? And can you try with more GPUs?

Hello, thank you for your response.

  • Batch Size: My batch size is set to 1.

  • Config File (config.yaml): Below is the content of my config file:

    model:
      pretrained_checkpoint: /home/cherry2025/DynamiCrafter/checkpoints/dynamicrafter_1024_v1/model.ckpt
      base_learning_rate: 1.0e-05
      scale_lr: False
      target: lvdm.models.ddpm3d.LatentVisualDiffusion
      params:
        rescale_betas_zero_snr: True
        parameterization: "v"
        linear_start: 0.00085
        linear_end: 0.012
        num_timesteps_cond: 1
        log_every_t: 200
        timesteps: 1000
        first_stage_key: video
        cond_stage_key: caption
        cond_stage_trainable: False
        image_proj_model_trainable: True
        conditioning_key: hybrid
        image_size: [72, 128]
        channels: 4
        scale_by_std: False
        scale_factor: 0.18215
        use_ema: False
        uncond_prob: 0.05
        uncond_type: 'empty_seq'
        rand_cond_frame: true
        use_dynamic_rescale: true
        base_scale: 0.3
        fps_condition_type: 'fps'
        perframe_ae: True
        unet_config:
          target: lvdm.modules.networks.openaimodel3d.UNetModel
          params:
            in_channels: 8
            out_channels: 4
            model_channels: 320
            attention_resolutions: [4, 2, 1]
            num_res_blocks: 2
            channel_mult: [1, 2, 4, 4]
            dropout: 0.1
            num_head_channels: 64
            transformer_depth: 1
            context_dim: 1024
            use_linear: true
            use_checkpoint: True
            temporal_conv: True
            temporal_attention: True
            temporal_selfatt_only: true
            use_relative_position: false
            use_causal_attention: False
            temporal_length: 16
            addition_attention: true
            image_cross_attention: true
            default_fs: 10
            fs_condition: true
        first_stage_config:
          target: lvdm.models.autoencoder.AutoencoderKL
          params:
            embed_dim: 4
            monitor: val/rec_loss
            ddconfig:
              double_z: True
              z_channels: 4
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult: [1, 2, 4, 4]
              num_res_blocks: 2
              attn_resolutions: []
              dropout: 0.0
            lossconfig:
              target: torch.nn.Identity
        cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
          params:
            freeze: true
            layer: "penultimate"
        img_cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
          params:
            freeze: true
        image_proj_stage_config:
          target: lvdm.modules.encoders.resampler.Resampler
          params:
            dim: 1024
            depth: 4
            dim_head: 64
            heads: 12
            num_queries: 16
            embedding_dim: 1280
            output_dim: 1024
            ff_mult: 4
            video_length: 16
    data:
      target: utils_data.DataModuleFromConfig
      params:
        batch_size: 1
        num_workers: 2
        wrap: false
        train:
          target: lvdm.data.webvid.WebVid
          params:
            data_dir: "/home/cherry2025/DynamiCrafter/train_data/img1"
            meta_path: "/home/cherry2025/DynamiCrafter/train_data/webvid10m_mini_80k.csv"
            video_length: 16
            frame_stride: 6
            load_raw_resolution: true
            resolution: [1024]
            spatial_transform: resize_center_crop
            random_fs: true
    lightning:
      precision: 16
      trainer:
        benchmark: True
        accumulate_grad_batches: 2
        max_steps: 100000
        log_every_n_steps: 50
        val_check_interval: 0.5
        gradient_clip_algorithm: 'norm'
        gradient_clip_val: 0.5
      callbacks:
        model_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            every_n_train_steps: 9000
            filename: "{epoch}-{step}"
            save_weights_only: True
        metrics_over_trainsteps_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            filename: '{epoch}-{step}'
            save_weights_only: True
            every_n_train_steps: 10000
        batch_logger:
          target: callbacks.ImageLogger
          params:
            batch_frequency: 500
            to_local: False
            max_images: 8
            log_images_kwargs:
              ddim_steps: 50
              unconditional_guidance_scale: 7.5
              timestep_spacing: uniform_trailing
              guidance_rescale: 0.7
  • Dataset: My training dataset consists of 80,000 images.

  • GPU Setup: Currently, I have access to a server with three 24GB GPUs.

Would this configuration be sufficient for the training process with this dataset size?

Thank you for your assistance!


Doubiiu commented Oct 29, 2024

Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.


xlnn commented Oct 30, 2024

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.

Thank you!


xlnn commented Oct 30, 2024

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.

Hi,

I'm encountering a CUDA out of memory error while fine-tuning my model, even though I'm using 4 GPUs, each with 32GB of memory. Here’s the error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 1; 31.73 GiB total capacity; 29.01 GiB already allocated; 102.19 MiB free; 31.14 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "./main/trainer.py", line 396, in <module>
    trainer.fit(model, data)

Configuration:

  • Using 4 GPUs (each with 32GB memory).
  • Running a fine-tuning task with PyTorch Lightning.

What I’ve Tried:

  1. Reduced batch_size: lowered the batch size to 1 to minimize memory usage.

Despite these efforts, the error persists. Any insights into why this might be happening, or additional suggestions to troubleshoot?
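
One knob the trainer excerpt above already special-cases is a DeepSpeed strategy: the training loop checks lightning_config['strategy'] for names starting with 'deepspeed', and a plain string strategy is handed to the Trainer unchanged. A minimal sketch of the lightning section using PyTorch Lightning's built-in deepspeed_stage_2 strategy name (whether ZeRO stage 2 actually fits this model into 4x32GB is an untested assumption):

lightning:
  strategy: deepspeed_stage_2
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2

ZeRO stage 2 shards optimizer states and gradients across the GPUs, which is where much of the fine-tuning memory usually goes.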


Doubiiu commented Oct 30, 2024

Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?


xlnn commented Oct 31, 2024

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hi, I am trying that now.


xlnn commented Oct 31, 2024

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hello. The issue persists; what is going on?
Thank you!
