[Bug] Silent GPU failures, works with --debug #1585

Open

anuragprat1k opened this issue Oct 4, 2024 · 0 comments
Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda-12.1',
 'GCC': 'gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-40GB',
 'MMEngine': '0.10.5',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.10.0',
 'PyTorch': '2.4.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.4.2 (Git Hash '
                              '1137e04ec0b5251ca2b4400a4fd3c667ce843d67)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 90.1  (built against CUDA 12.4)\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=9.1.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) '
           '[GCC 12.4.0]',
 'TorchVision': '0.19.0+cu121',
 'lmdeploy': '0.6.1',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.3.3+89abcba',
 'sys.platform': 'linux',
 'transformers': '4.45.1'}

Reproduces the problem - code/configuration sample

I am using the following config, saved as eval_qwen_instruct.py:

from opencompass.models import VLLMwithChatTemplate
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from opencompass.configs.summarizers.leaderboard import summarizer

# Collect every *_datasets list pulled in via read_base above.
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'], [])

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2.5-7b-instruct-vllm',
        path='Qwen/Qwen2.5-7B-Instruct',
        # Forwarded to the vLLM engine.
        model_kwargs=dict(tensor_parallel_size=1, gpu_memory_utilization=0.6),
        max_out_len=4096,
        max_seq_len=4096,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1),
    )
]

work_dir = 'outputs/debug/qwen_2_5_7b_instruct'

Reproduces the problem - command or script

The config works in debug mode but fails in normal mode.

opencompass eval_qwen_instruct.py -a vllm -m infer # fails, see error below

opencompass eval_qwen_instruct.py -a vllm -m infer --debug # runs successfully, but is very slow
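Since the worker dies without any Python traceback, one generic way to surface one (a standard CPython facility, not anything OpenCompass-specific, and it only helps if the crash happens inside the Python worker process) is to enable faulthandler via the PYTHONFAULTHANDLER environment variable when relaunching:

PYTHONFAULTHANDLER=1 opencompass eval_qwen_instruct.py -a vllm -m infer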

Reproduces the problem - error message

The eval fails silently in normal mode. Here's what the output logs look like.

$ opencompass eval_qwen_instruct.py -a vllm -m infer
10/04 21:30:50 - OpenCompass - INFO - Transforming qwen2.5-7b-instruct-vllm to vllm
10/04 21:30:50 - OpenCompass - WARNING - Unsupported model type <class 'opencompass.models.vllm_with_tf_above_v4_33.VLLMwithChatTemplate'>, will keep the original model.
10/04 21:30:50 - OpenCompass - INFO - Current exp folder: outputs/debug/qwen_2_5_7b_instruct/20241004_213050
10/04 21:30:50 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
10/04 21:30:50 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] on GPU 0                                              
  0%|                                                                               | 0/1 [00:00<?, ?it/s]10/04 21:31:27 - OpenCompass - ERROR - /root/opencompass/opencompass/runners/local.py - _launch - 228 - task OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] fail, see
outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:36<00:00, 36.54s/it]
10/04 21:31:27 - OpenCompass - ERROR - /opencompass/runners/base.py - summarize - 64 - OpenICLInfer[qwen2.5-7b-instruct-vllm/gsm8k] failed with code -11
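For context on the -11 above: CPython's subprocess machinery reports a child killed by a signal as the negated signal number, so -11 means the task process was terminated by signal 11, i.e. a segmentation fault. A quick sanity check of that mapping:

# A negative subprocess returncode in CPython is -<signal number>;
# signal 11 on Linux is SIGSEGV (segmentation fault).
import signal
print(signal.Signals(11).name)  # prints: SIGSEGV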

Here's what the output log at outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out looks like. Note that it ends cleanly after CUDA graph capture, with no error or traceback:

10/04 21:30:54 - OpenCompass - INFO - Task [qwen2.5-7b-instruct-vllm/gsm8k]
INFO 10-04 21:30:59 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-04 21:31:00 model_runner.py:1014] Starting to load model Qwen/Qwen2.5-7B-Instruct...
INFO 10-04 21:31:00 weight_utils.py:242] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.11s/it]

Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.12s/it]

Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:02,  2.05s/it]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

INFO 10-04 21:31:09 model_runner.py:1025] Loading model weights took 14.2487 GB
INFO 10-04 21:31:11 gpu_executor.py:122] # GPU blocks: 5400, # CPU blocks: 4681
INFO 10-04 21:31:15 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-04 21:31:15 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-04 21:31:25 model_runner.py:1456] Graph capturing finished in 11 secs.

Other information

Hi OpenCompass team, I'm curious what the best way to debug something like this is. Also, what is the correct way to set max_seq_len? I set it to 4096 in eval_qwen_instruct.py, but looking at the output log in outputs/debug/qwen_2_5_7b_instruct/20241004_213050/logs/infer/qwen2.5-7b-instruct-vllm/gsm8k.out, the vLLM engine still reports max_seq_len=32768. What am I missing?
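For what it's worth, my working assumption (not confirmed against the OpenCompass source, so treat the forwarding behavior as a guess) is that OpenCompass's max_seq_len governs prompt handling on the OpenCompass side, while the vLLM engine derives its 32768 context length from the model config unless it is given vLLM's own max_model_len argument, which would have to travel through model_kwargs. If that's right, capping the engine would look like:

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2.5-7b-instruct-vllm',
        path='Qwen/Qwen2.5-7B-Instruct',
        # max_model_len is a real vLLM engine argument; that OpenCompass
        # forwards it untouched via model_kwargs is my assumption here.
        model_kwargs=dict(tensor_parallel_size=1,
                          gpu_memory_utilization=0.6,
                          max_model_len=4096),
        max_out_len=4096,
        max_seq_len=4096,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1),
    )
]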
