Ascend NPU 310P3 chip run error #2372

Open
lq0104 opened this issue Aug 22, 2024 · 4 comments

lq0104 commented Aug 22, 2024

I am trying to run whisper.cpp with the CANN backend on an Ascend 310P3 NPU. My CANN version is 8.0.

I compiled it with the following commands:
mkdir build
cd build
cmake .. -D GGML_CANN=on
make -j

and ran inference with:
./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin

The program failed to run. The error output is:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init_gpu: using CANN backend
whisper_init_state: kv self size = 18.87 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating CANN buffer from size 0.00 MiB to 14.48 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
whisper_init_state: compute buffer (conv) = 16.75 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating CANN buffer from size 0.00 MiB to 124.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
whisper_init_state: compute buffer (encode) = 131.94 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating CANN buffer from size 0.00 MiB to 3.43 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
whisper_init_state: compute buffer (cross) = 5.17 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating CANN buffer from size 0.00 MiB to 140.16 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 4.38 MiB
whisper_init_state: compute buffer (decode) = 153.13 MB

system_info: n_threads = 1 / 96 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 1

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 1 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

CANN error: EZ9999: Inner Error!
EZ9999: 2024-08-19-14:22:06.035.091 The error from device(6), serial number is 13, there is an aivec error, core id is 0, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1f55953f, mte error info: 0x91, ifu error info: 0x37b7ddcfefe80, ccu error info: 0x6b8e2406002b3b8e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
TraceBack (most recent call last):
The error from device(6), serial number is 13, there is an aivec error, core id is 1, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x16c7fa7b, mte error info: 0x91, ifu error info: 0xf597fa430f00, ccu error info: 0x6b8e2406000fcf8e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The error from device(6), serial number is 13, there is an aivec error, core id is 2, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1d0fc8f1, mte error info: 0x91, ifu error info: 0x60ebe2bbe800, ccu error info: 0x6b8e2406004cd98e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The error from device(6), serial number is 13, there is an aivec error, core id is 3, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1f1ef1eb, mte error info: 0x91, ifu error info: 0x2df5cebffff80, ccu error info: 0x6b8e2406004bbb8e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The error from device(6), serial number is 13, there is an aivec error, core id is 4, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1be2db77, mte error info: 0x91, ifu error info: 0x38754f9295d80, ccu error info: 0x6b8e2406007e538e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The error from device(6), serial number is 13, there is an aivec error, core id is 5, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1ffddffb, mte error info: 0x91, ifu error info: 0x33ab3f3ffee80, ccu error info: 0x6b8e2406005cf78e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The error from device(6), serial number is 13, there is an aivec error, core id is 6, error code = 0x10, dump info: pc start: 0x8001240005eb108, current: 0x1240005eb650, vec error info: 0x1bfe75e7, mte error info: 0x91, ifu error info: 0x35e0b72cebd80, ccu error info: 0x6b8e24060018e08e, cube error info: 0, biu error info: 0, aic error mask: 0x6de01200c0122c8, para base: 0x12c0c0340000, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:532]
The device(6), core list[0-6], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:586]
coreId( 0): 0x10 0x10 0x10 0x10 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:586]
coreId( 4): 0x10 0x10 0x10 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:600]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinic_kernel_task.cc][LINE:1220]
Aicore kernel execute failed, device_id=0, stream_id=17, report_stream_id=17, task_id=86, flip_num=0, fault kernel_name=ascendc_dup_by_rows_fp32_to_fp16_3, fault kernel info ext=none, program id=0, hash=6170300059213965033.[FUNC:GetError][FILE:stream.cc][LINE:1082]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

DEVICE[0] PID[272965]:
EXCEPTION STREAM:
Exception info:TGID=563630, model id=65535, stream id=17, stream phase=SCHEDULE
Message info[0]:RTS_HWTS: Vector core exception, slot_id=3, stream_id=17
Other info[0]:time=2024-08-19-14:21:56.320.060, function=process_hwts_error_exception, line=1320, error code=0x31
current device: 0, in function ggml_backend_cann_synchronize at /home/code/whisper.cpp.src/ggml/src/ggml-cann.cpp:1591
aclrtSynchronizeStream(cann_ctx->stream())
/home/code/whisper.cpp.src/ggml/src/ggml-cann.cpp:123: CANN error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 272965 is already traced by process 272863
ptrace: Operation not permitted.
No stack.
The program is not being run.

I am a beginner. Could you please advise on how to fix this issue? @hipudding @MengqingCao
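
When reading a log like the one above, the field that identifies the failing kernel is `fault kernel_name=`. The following is a minimal sketch of pulling that field out of a saved log. It assumes the program output was captured to a file; the temporary file here is only a stand-in that holds one line copied from the log above.

```shell
# Stand-in for a saved log file (hypothetical); it holds one line copied from the output above.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
Aicore kernel execute failed, device_id=0, stream_id=17, report_stream_id=17, task_id=86, flip_num=0, fault kernel_name=ascendc_dup_by_rows_fp32_to_fp16_3, fault kernel info ext=none, program id=0, hash=6170300059213965033.[FUNC:GetError][FILE:stream.cc][LINE:1082]
EOF

# Keep only the value of the "fault kernel_name=" field (everything up to the next comma).
kernel=$(grep -o 'fault kernel_name=[^,]*' "$log_file" | cut -d= -f2)
echo "$kernel"

rm -f "$log_file"
```

For this log the extracted name is `ascendc_dup_by_rows_fp32_to_fp16_3`, which suggests the failure is in a dup (copy/convert) kernel and not in the model loading steps that precede it.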

MengqingCao (Contributor) commented

@lq0104 Try setting SOC_TYPE to match your NPU and check whether that works:
https://github.com/ggerganov/whisper.cpp/blob/master/ggml/src/ggml-cann/kernels/CMakeLists.txt#L2
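
One way to try this without editing the file is to pass the value at configure time. This is a sketch under an assumption not confirmed in this thread: that the linked CMakeLists.txt only applies its default when `SOC_TYPE` is not already set, so a `-D` definition on the command line takes effect. If that does not hold for your checkout, edit the `set` line in that file directly. The value `Ascend310P3` is the one used later in this thread; check it against what `npu-smi info` reports for your device.

```shell
# Assumption: the kernels CMakeLists.txt keeps a SOC_TYPE that was passed with -D.
SOC_TYPE=Ascend310P3   # adjust to match your NPU

if [ -f CMakeLists.txt ]; then
    # Configure into a build directory with the CANN backend enabled, then build.
    cmake -S . -B build -DGGML_CANN=on -DSOC_TYPE="$SOC_TYPE"
    cmake --build build -j
else
    echo "run this from the root of the whisper.cpp checkout" >&2
fi
```

Starting from an empty build directory avoids picking up kernels that were already compiled for a different SoC.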

lq0104 (Author) commented Aug 22, 2024

OK, I will try it, thank you

lq0104 (Author) commented Aug 23, 2024

When I set SOC_TYPE to Ascend310P3, I ran into the following issue while compiling:

/usr/local/Ascend/ascend-toolkit/latest/tools/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:805:5: warning: 'DataCopyPadUB2GMImpl' is deprecated: NOTICE: DataCopyPad is not deprecated. Currently, DataCopyPad is an unsupported API on Ascend310p or Ascend610. Please check your code! [-Wdeprecated-declarations]
DataCopyPadUB2GMImpl((__gm__ T*)dstGlobal.GetPhyAddr(), (__ubuf__ T*)srcLocal.GetPhyAddr(), dataCopyParams);
^
/home/code/whisper.cpp.src/ggml/src/ggml-cann/kernels/dup.cpp:70:9: note: in instantiation of function template specialization 'AscendC::DataCopyPad' requested here
DataCopyPad(dst_gm, dst_local, dataCopyParams);
^
/home/code/whisper.cpp.src/ggml/src/ggml-cann/kernels/dup.cpp:88:9: note: in instantiation of member function 'DupByRows<float, float>::copy_out' requested here
copy_out();
^
/home/code/whisper.cpp.src/ggml/src/ggml-cann/kernels/dup.cpp:175:8: note: in instantiation of member function 'DupByRows<float, float>::dup' requested here
op.dup();
^
/usr/local/Ascend/ascend-toolkit/latest/tools/tikcpp/tikcfw/impl/dav_m200/kernel_operator_data_copy_impl.h:1244:3: note: 'DataCopyPadUB2GMImpl' has been explicitly marked deprecated here
[[deprecated("NOTICE: DataCopyPad is not deprecated. Currently, DataCopyPad is an unsupported API on Ascend310p "

It seems that the Ascend310P currently does not support the DataCopyPad function.

hipudding (Contributor) commented Aug 26, 2024

@lq0104 I think some function definitions differ between SoCs. Someone else ran into this when using the 910A. You can look for another function to replace DataCopyPad.
