[Bug] Silent GPU failures, works with --debug #1585

Open

anuragprat1k opened this issue Oct 4, 2024 · 0 comments
Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda-12.1',
 'GCC': 'gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-40GB',
 'MMEngine': '0.10.5',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.10.0',
 'PyTorch': '2.4.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.4.2 (Git Hash '
                              '1137e04ec0b5251ca2b4400a4fd3c667ce843d67)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 90.1  (built against CUDA 12.4)\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=9.1.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) '
           '[GCC 12.4.0]',
 'TorchVision': '0.19.0+cu121',
 'lmdeploy': '0.6.1',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.3.3+89abcba',
 'sys.platform': 'linux',
 'transformers': '4.45.1'}

Reproduces the problem - code/configuration sample

I am using the following config, saved as eval_qwen_instruct.py:

from opencompass.models import VLLMwithChatTemplate
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from opencompass.configs.summarizers.leaderboard import summarizer

# Collect every *_datasets list pulled in via read_base above.
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'], [])

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2.5-7b-instruct-vllm',
        path='Qwen/Qwen2.5-7B-Instruct',
        # Forwarded to the vLLM engine.
        model_kwargs=dict(tensor_parallel_size=1, gpu_memory_utilization=0.6),
        max_out_len=4096,
        max_seq_len=4096,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1),
    )
]

work_dir = 'outputs/debug/qwen_2_5_7b_instruct'

Reproduces the problem - command or script

The config works in debug mode but fails in normal mode.

opencompass eval_qwen_instruct.py -a vllm -m infer # fails, see error below

opencompass eval_qwen_instruct.py -a vllm -m infer --debug # runs successfully, but is very slow
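Since the worker dies without any Python traceback, one generic way to surface one (a standard CPython facility, not anything OpenCompass-specific, and it only helps if the crash happens inside the Python worker process) is to enable faulthandler via the PYTHONFAULTHANDLER environment variable when relaunching:

PYTHONFAULTHANDLER=1 opencompass eval_qwen_instruct.py -a vllm -m infer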

Reproduces the problem - error message

The eval fails silently in normal mode. Here's what the output logs look like.

$ opencompass eval_qwen_instruct.py -a vllm -m infer
10/04 21:30:50 - OpenCompass - INFO - Transforming qwen2.5-7b-instruct-vllm to vllm
10/04 21:30:50 - OpenCompass - WARNING - Unsupported model type <class 'opencompass.models.vllm_with_tf_above_v4_33.VLLMwithChatTemplate'>, will keep the original model.
10/04 21:30:50 - OpenCompass - INFO - Current exp folder: outputs/debug/qwen_2_5_7b_instruct/20241004_213050
10/04 21:30:50 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
10/04 21:30:50 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] on GPU 0                                              
  0%|                                                                               | 0/1 [00:00<?, ?it/s]10/04 21:31:27 - OpenCompass - ERROR - /root/opencompass/opencompass/runners/local.py - _launch - 228 - task OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] fail, see
outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:36<00:00, 36.54s/it]
10/04 21:31:27 - OpenCompass - ERROR - /opencompass/runners/base.py - summarize - 64 - OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] failed with code -11
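For context on the -11 above: CPython's subprocess machinery reports a child killed by a signal as the negated signal number, so -11 means the task process was terminated by signal 11, i.e. a segmentation fault. A quick sanity check of that mapping:

# A negative subprocess returncode in CPython is -<signal number>;
# signal 11 on Linux is SIGSEGV (segmentation fault).
import signal
print(signal.Signals(11).name)  # prints: SIGSEGV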

Here's what the output log at outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out looks like. Note that it ends cleanly after CUDA graph capture, with no error or traceback:

10/04 21:30:54 - OpenCompass - INFO - Task [qwen2.5-7b-instruct-vllm/gsm8k]
INFO 10-04 21:30:59 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-04 21:31:00 model_runner.py:1014] Starting to load model Qwen/Qwen2.5-7B-Instruct...
INFO 10-04 21:31:00 weight_utils.py:242] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.11s/it]

Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.12s/it]

Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:02,  2.05s/it]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

INFO 10-04 21:31:09 model_runner.py:1025] Loading model weights took 14.2487 GB
INFO 10-04 21:31:11 gpu_executor.py:122] # GPU blocks: 5400, # CPU blocks: 4681
INFO 10-04 21:31:15 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-04 21:31:15 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-04 21:31:25 model_runner.py:1456] Graph capturing finished in 11 secs.

Other information

Hi OpenCompass team, I'm curious what the best way to debug something like this is. Also, what is the correct way to set max_seq_len? I set it to 4096 in eval_qwen_instruct.py, but looking at the output log in outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out, the vLLM engine still reports max_seq_len=32768. What am I missing?
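For what it's worth, my working assumption (not confirmed against the OpenCompass source, so treat the forwarding behavior as a guess) is that OpenCompass's max_seq_len governs prompt handling on the OpenCompass side, while the vLLM engine derives its 32768 context length from the model config unless it is given vLLM's own max_model_len argument, which would have to travel through model_kwargs. If that's right, capping the engine would look like:

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2.5-7b-instruct-vllm',
        path='Qwen/Qwen2.5-7B-Instruct',
        # max_model_len is a real vLLM engine argument; that OpenCompass
        # forwards it untouched via model_kwargs is my assumption here.
        model_kwargs=dict(tensor_parallel_size=1,
                          gpu_memory_utilization=0.6,
                          max_model_len=4096),
        max_out_len=4096,
        max_seq_len=4096,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1),
    )
]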
