
CUDA Out of Memory Error on a 32GB GPU when Running trainer.py for Fine-Tuning #141

xlnn opened this issue Oct 29, 2024 · 8 comments


xlnn commented Oct 29, 2024

Description:
Hello, I encountered a torch.cuda.OutOfMemoryError while fine-tuning a model using trainer.py. My setup includes only a single GPU with 32GB of memory, and the error occurs even at the beginning of training.

My modified trainer.py:

import argparse, os, sys, datetime
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from omegaconf import OmegaConf
from transformers import logging as transf_logging
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.trainer import Trainer
import torch
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils.utils import instantiate_from_config
from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy
from utils_train import set_logger, init_workspace, load_checkpoints


def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything")
    parser.add_argument("--name", "-n", type=str, default="training_1024_v1.0", help="experiment name, as saving folder")

    # parser.add_argument("--base", "-b", nargs="*", metavar="base_config.yaml", help="paths to base configs. Loaded from left-to-right. "
    #                         "Parameters can be overwritten or added with command-line options of the form `--key value`.", default=list())

    parser.add_argument(
        "--base",
        "-b",
        nargs="*",
        metavar="base_config.yaml",
        help=(
            "Paths to base configs. Loaded from left-to-right. "
            "Parameters can be overwritten or added with command-line options of the form `--key value`."
        ),
        default=["/home/cherry2025/DynamiCrafter/configs/training_1024_v1.0/config.yaml"]
    )

    parser.add_argument("--train", "-t", action='store_true', default=True, help='train')
    parser.add_argument("--val", "-v", action='store_true', default=False, help='val')
    parser.add_argument("--test", action='store_true', default=False, help='test')

    parser.add_argument("--logdir", "-l", type=str, default="/home/cherry2025/DynamiCrafter/train_check", help="directory for logging dat shit")
    parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint")
    parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint")
    parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging")

    return parser

def get_nondefault_trainer_args(args):
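    ## compare the parsed args against a default Trainer namespace and return
    ## the names of Trainer options that were overridden on the command line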
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    default_trainer_args = parser.parse_args([])
    return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k))


if __name__ == "__main__":
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    ## read the distributed ranks from the launcher environment, defaulting to a
    ## single-process run; WORLD_SIZE defaults to 1 so the learning-rate scaling
    ## below never multiplies by zero when the variable is unset
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    global_rank = int(os.environ.get('RANK', 0))
    num_rank = int(os.environ.get('WORLD_SIZE', 1))

    parser = get_parser()
    ## Extends existing argparse by default Trainer attributes
    parser = Trainer.add_argparse_args(parser)
    args, unknown = parser.parse_known_args()
    ## disable transformer warning
    transf_logging.set_verbosity_error()
    seed_everything(args.seed)

    ## yaml configs: "model" | "data" | "lightning"
    configs = [OmegaConf.load(cfg) for cfg in args.base]

    cli = OmegaConf.from_dotlist(unknown)
    config = OmegaConf.merge(*configs, cli)
    lightning_config = config.pop("lightning", OmegaConf.create())
    trainer_config = lightning_config.get("trainer", OmegaConf.create()) 

    ## setup workspace directories
    workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank)
    logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now)))
    logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__))  

    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Model *****")
    config.model.params.logdir = workdir
    model = instantiate_from_config(config.model)

    ## load checkpoints
    model = load_checkpoints(model, config.model)

    ## register_schedule again to make ZTSNR work
    if model.rescale_betas_zero_snr:
        model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps,
                                linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s)

    ## update trainer config
    for k in get_nondefault_trainer_args(args):
        trainer_config[k] = getattr(args, k)
        
    ## hardcoded for this single-node, single-GPU run; the config's
    ## trainer.num_nodes and trainer.devices values are bypassed here
    num_nodes = 1
    ngpu_per_node = 1
    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")

    ## setup learning rate
    base_lr = config.model.base_learning_rate
    bs = config.data.params.batch_size
    if getattr(config.model, 'scale_lr', True):
        model.learning_rate = num_rank * bs * base_lr
    else:
        model.learning_rate = base_lr
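    ## worked example: with WORLD_SIZE=4, batch_size=1 and base_lr=1.0e-05,
    ## scale_lr=True would give learning_rate = 4 * 1 * 1.0e-05 = 4.0e-05;
    ## the config in this issue sets scale_lr: False, so it stays at 1.0e-05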


    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Data *****")
    data = instantiate_from_config(config.data)
    data.setup()
    for k in data.datasets:
        logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")


    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Trainer *****")
    if "accelerator" not in trainer_config:
        trainer_config["accelerator"] = "gpu"

    ## setup trainer args: pl-logger and callbacks
    trainer_kwargs = dict()
    trainer_kwargs["num_sanity_val_steps"] = 0
    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
    
    ## setup callbacks
    callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger)
    trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
    strategy_cfg = get_trainer_strategy(lightning_config)
    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg)
    trainer_kwargs['precision'] = lightning_config.get('precision', 32)
    trainer_kwargs["sync_batchnorm"] = False

    ## trainer config: others

    trainer_args = argparse.Namespace(**trainer_config)
    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)

    ## allow checkpointing via USR1
    def melk(*args, **kwargs):
        ## run all checkpoint hooks
        if trainer.global_rank == 0:
            print("Summoning checkpoint.")
            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
            trainer.save_checkpoint(ckpt_path)

    def divein(*args, **kwargs):
        if trainer.global_rank == 0:
            import pudb
            pudb.set_trace()

    import signal
    signal.signal(signal.SIGUSR1, melk)
    signal.signal(signal.SIGUSR2, divein)
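    ## usage note: from a shell, `kill -USR1 <pid>` asks rank 0 to save
    ## last_summoning.ckpt, and `kill -USR2 <pid>` drops rank 0 into the
    ## pudb debugger (assumes the pudb package is installed)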

    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Running the Loop *****")
    if args.train:
        try:
            if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'):
                logger.info("<Training in DeepSpeed Mode>")
                ## deepspeed
                if trainer_kwargs['precision'] == 16:
                    with torch.cuda.amp.autocast():
                        trainer.fit(model, data)
                else:
                    trainer.fit(model, data)
            else:
                logger.info("<Training in DDPSharded Mode>") ## this is default
                ## ddpsharded
                trainer.fit(model, data)
        except Exception:
            # melk()  ## optionally snapshot a checkpoint before re-raising
            raise

    # if args.val:
    #     trainer.validate(model, data)
    # if args.test or not trainer.interrupted:
    #     trainer.test(model, data)

Error Message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 31.73 GiB total capacity; 29.84 GiB already allocated; 80.19 MiB free; 30.33 GiB reserved in total by PyTorch). 
If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
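
For reference, the max_split_size_mb setting mentioned in the message is controlled through an environment variable that must be set before training starts. A minimal sketch (the 512 MiB cap is an arbitrary starting value to tune, not a verified fix for this model):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
python ./main/trainer.py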

This error occurred at:

Epoch 0:   0%|          | 1/80000 [00:53<1195:43:13, 53.81s/it, loss=0.347, v_num=6, train/loss_simple_step=0.347, train/loss_vlb_step=0.347, train/loss_step=0.347]

Steps to Reproduce:

  1. Run trainer.py with a single 32GB GPU.
  2. Start the fine-tuning process.
  3. Error occurs shortly after the first epoch begins.

My Setup:

  • GPU: Single 32GB GPU

Solutions Tried:

  1. Reduced the batch size to minimize memory usage.

Thank you for your assistance!


Doubiiu commented Oct 29, 2024

Hi, what is your batch size (bs) and your config.yaml? And can you try with more GPUs?


xlnn commented Oct 29, 2024

> Hi, what is your batch size (bs) and your config.yaml? And can you try with more GPUs?

Hello, thank you for your response.

  • Batch Size: My batch size is set to 1.

  • Config File (config.yaml): Below is the content of my config file:

    model:
      pretrained_checkpoint: /home/cherry2025/DynamiCrafter/checkpoints/dynamicrafter_1024_v1/model.ckpt
      base_learning_rate: 1.0e-05
      scale_lr: False
      target: lvdm.models.ddpm3d.LatentVisualDiffusion
      params:
        rescale_betas_zero_snr: True
        parameterization: "v"
        linear_start: 0.00085
        linear_end: 0.012
        num_timesteps_cond: 1
        log_every_t: 200
        timesteps: 1000
        first_stage_key: video
        cond_stage_key: caption
        cond_stage_trainable: False
        image_proj_model_trainable: True
        conditioning_key: hybrid
        image_size: [72, 128]
        channels: 4
        scale_by_std: False
        scale_factor: 0.18215
        use_ema: False
        uncond_prob: 0.05
        uncond_type: 'empty_seq'
        rand_cond_frame: true
        use_dynamic_rescale: true
        base_scale: 0.3
        fps_condition_type: 'fps'
        perframe_ae: True
        unet_config:
          target: lvdm.modules.networks.openaimodel3d.UNetModel
          params:
            in_channels: 8
            out_channels: 4
            model_channels: 320
            attention_resolutions: [4, 2, 1]
            num_res_blocks: 2
            channel_mult: [1, 2, 4, 4]
            dropout: 0.1
            num_head_channels: 64
            transformer_depth: 1
            context_dim: 1024
            use_linear: true
            use_checkpoint: True
            temporal_conv: True
            temporal_attention: True
            temporal_selfatt_only: true
            use_relative_position: false
            use_causal_attention: False
            temporal_length: 16
            addition_attention: true
            image_cross_attention: true
            default_fs: 10
            fs_condition: true
        first_stage_config:
          target: lvdm.models.autoencoder.AutoencoderKL
          params:
            embed_dim: 4
            monitor: val/rec_loss
            ddconfig:
              double_z: True
              z_channels: 4
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult: [1, 2, 4, 4]
              num_res_blocks: 2
              attn_resolutions: []
              dropout: 0.0
            lossconfig:
              target: torch.nn.Identity
        cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
          params:
            freeze: true
            layer: "penultimate"
        img_cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
          params:
            freeze: true
        image_proj_stage_config:
          target: lvdm.modules.encoders.resampler.Resampler
          params:
            dim: 1024
            depth: 4
            dim_head: 64
            heads: 12
            num_queries: 16
            embedding_dim: 1280
            output_dim: 1024
            ff_mult: 4
            video_length: 16
    data:
      target: utils_data.DataModuleFromConfig
      params:
        batch_size: 1
        num_workers: 2
        wrap: false
        train:
          target: lvdm.data.webvid.WebVid
          params:
            data_dir: "/home/cherry2025/DynamiCrafter/train_data/img1"
            meta_path: "/home/cherry2025/DynamiCrafter/train_data/webvid10m_mini_80k.csv"
            video_length: 16
            frame_stride: 6
            load_raw_resolution: true
            resolution: [1024]
            spatial_transform: resize_center_crop
            random_fs: true
    lightning:
      precision: 16
      trainer:
        benchmark: True
        accumulate_grad_batches: 2
        max_steps: 100000
        log_every_n_steps: 50
        val_check_interval: 0.5
        gradient_clip_algorithm: 'norm'
        gradient_clip_val: 0.5
      callbacks:
        model_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            every_n_train_steps: 9000
            filename: "{epoch}-{step}"
            save_weights_only: True
        metrics_over_trainsteps_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            filename: '{epoch}-{step}'
            save_weights_only: True
            every_n_train_steps: 10000
        batch_logger:
          target: callbacks.ImageLogger
          params:
            batch_frequency: 500
            to_local: False
            max_images: 8
            log_images_kwargs:
              ddim_steps: 50
              unconditional_guidance_scale: 7.5
              timestep_spacing: uniform_trailing
              guidance_rescale: 0.7
  • Dataset: My training dataset consists of 80,000 images.

  • GPU Setup: Currently, I have access to a server with three 24GB GPUs.

Would this configuration be sufficient for the training process with this dataset size?

Thank you for your assistance!


Doubiiu commented Oct 29, 2024

Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.


xlnn commented Oct 30, 2024

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.

Thank you!


xlnn commented Oct 30, 2024

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be too much for a single 32GB GPU... Multiple 32GB GPUs should be feasible.

Hi,

I'm encountering a CUDA out of memory error while fine-tuning my model, even though I'm using 4 GPUs, each with 32GB of memory. Here’s the error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 1; 31.73 GiB total capacity; 29.01 GiB already allocated; 102.19 MiB free; 31.14 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "./main/trainer.py", line 396, in <module>
    trainer.fit(model, data)

Configuration:

  • Using 4 GPUs (each with 32GB memory).
  • Running a fine-tuning task with PyTorch Lightning.

What I’ve Tried:

  1. Reduced batch_size: lowered the batch size to 1 to minimize memory usage.

Despite these efforts, the error persists. Any insights into why this might be happening, or additional suggestions to troubleshoot?
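
One knob the trainer excerpt above already special-cases is a DeepSpeed strategy: the training loop checks lightning_config['strategy'] for names starting with 'deepspeed', and a plain string strategy is handed to the Trainer unchanged. A minimal sketch of the lightning section using PyTorch Lightning's built-in deepspeed_stage_2 strategy name (whether ZeRO stage 2 actually fits this model into 4x32GB is an untested assumption):

lightning:
  strategy: deepspeed_stage_2
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2

ZeRO stage 2 shards optimizer states and gradients across the GPUs, which is where much of the fine-tuning memory usually goes.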


Doubiiu commented Oct 30, 2024

Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?


xlnn commented Oct 31, 2024

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hi, I am trying that now.


xlnn commented Oct 31, 2024

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hello. The issue persists; what is going on?
Thank you!
