Onnxruntime-gpu 1.8.0 killed the process on cpu device #3366

Open
zaobao opened this issue Jul 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@zaobao

zaobao commented Jul 29, 2024

Environment Info

Container: Docker with NO GPU
OS: AlmaLinux
CUDA installed: 12.2
Cudnn installed: 8.9.0
djl version: 0.29.0
onnxruntime_gpu version: 1.8.0

Error Message

[root@r100048367-91051506-l5wvj powerop]# cat /tmp/hs_err_pid1062.log | more
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f6be8b25d12, pid=1062, tid=0x00007f6ddfdff640
#
# JRE version: OpenJDK Runtime Environment (8.0_302-b08) (build 1.8.0_302-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.302-b08 mixed mode linux-amd64 )
# Problematic frame:
# C  [libonnxruntime_providers_cuda.so+0x1a4d12]
#
# Core dump written. Default location: //core or core.1062
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f6ef6394000):  JavaThread "igniteThread" daemon [_thread_in_native, id=1579, stack(0x00007f6ddfdc0000,0x00007f6ddfe00000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Registers:
RAX=0x00007f6c07e18828, RBX=0x00007f6ddfdfc570, RCX=0x0000000000000006, RDX=0x0000000000000000
RSP=0x00007f6ddfdfc550, RBP=0x00007f6ddfdfc650, RSI=0x0000000000000000, RDI=0x00007f6ddfdfc570
R8 =0x00007f6ddd6256a0, R9 =0x00007f6ddd618db8, R10=0x0000000000000000, R11=0x00007f6ddd625700
R12=0x00007f6d5c686a80, R13=0x00007f6ddfdfc570, R14=0x00007f6c05eccc78, R15=0x0000000000000000
RIP=0x00007f6be8b25d12, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f6ddfdfc550)
0x00007f6ddfdfc550:   00007f6ddfdfc570 a662eca985aa6800
0x00007f6ddfdfc560:   00007f6ddfdfc590 00007f6be8ae3708
0x00007f6ddfdfc570:   000000770000007c 0000005d0000006e
0x00007f6ddfdfc580:   0000000000000000 0000000001180470
0x00007f6ddfdfc590:   00007f6ddfdfc5a0 0000000000000000
0x00007f6ddfdfc5a0:   00007f6d5ee78b00 00007f79a8ffa838
0x00007f6ddfdfc5b0:   0000000000000000 00007f79a8eb00fe
0x00007f6ddfdfc5c0:   0000000000000000 0000000000000000
0x00007f6ddfdfc5d0:   0000000000000020 00007f79a8ffa838
0x00007f6ddfdfc5e0:   00007f6d5ca51370 00007f6c07e19cd9
0x00007f6ddfdfc5f0:   00007f79a8ffbee8 00007f79a8e577a2
0x00007f6ddfdfc600:   0000000000000040 00007f6ddd4beda0
0x00007f6ddfdfc610:   00007f6c05eccc80 a662eca985aa6800
0x00007f6ddfdfc620:   00007f6ddd4beda0 00007f6ddfdfc650
0x00007f6ddfdfc630:   00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc640:   00007f6c05eccc78 00007f6be8a1d762
0x00007f6ddfdfc650:   0000000000011c30 0000000000000470
0x00007f6ddfdfc660:   000004a0000011c1 0000000000000002
0x00007f6ddfdfc670:   0000000000000011 000000000000008e
0x00007f6ddfdfc680:   000000790000007c 000000e90000007f
0x00007f6ddfdfc690:   00007f6d5ca3edb0 ffffffffffffffb8
0x00007f6ddfdfc6a0:   0000000000011c00 00007f6dc8000020
0x00007f6ddfdfc6b0:   00007ffc74582560 00007f6bbfe70470
0x00007f6ddfdfc6c0:   00007f6bc59ae680 a662eca985aa6800
0x00007f6ddfdfc6d0:   00007f6bc59ae680 00007f6c05ecc318
0x00007f6ddfdfc6e0:   0000000000000036 00007ffc745823a8
0x00007f6ddfdfc6f0:   00007ffc74582560 00007f6c05eccc78
0x00007f6ddfdfc700:   0000000000000000 00007f79a95cb1ee
0x00007f6ddfdfc710:   fffffffffffffff8 0000000000000036
0x00007f6ddfdfc720:   00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc730:   00007f6d5c6de6c0 00007f79a95cb2dc
0x00007f6ddfdfc740:   00007ffc745823a8 00007f6ddfdfca40

Instructions: (pc=0x00007f6be8b25d12)
0x00007f6be8b25cf2:   89 fb 48 83 ec 10 64 48 8b 04 25 28 00 00 00 48
0x00007f6be8b25d02:   89 44 24 08 31 c0 48 8d 05 19 2b 2f 1f 48 8b 30
0x00007f6be8b25d12:   48 8b 06 ff 50 30 48 8b 54 24 08 64 48 33 14 25
0x00007f6be8b25d22:   28 00 00 00 75 09 48 83 c4 10 48 89 d8 5b c3 e8

Register to memory mapping:

RAX=0x00007f6c07e18828: <offset 0x1f497828> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
RBX=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
RCX=0x0000000000000006 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007f6ddfdfc550 is pointing into the stack for thread: 0x00007f6ef6394000
RBP=0x00007f6ddfdfc650 is pointing into the stack for thread: 0x00007f6ef6394000
RSI=0x0000000000000000 is an unknown value
RDI=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R8 =0x00007f6ddd6256a0: <offset 0x2256a0> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R9 =0x00007f6ddd618db8: _ZTINSt6locale5facetE+0 in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R10=0x0000000000000000 is an unknown value
R11=0x00007f6ddd625700: <offset 0x225700> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R12=0x00007f6d5c686a80 is an unknown value
R13=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R14=0x00007f6c05eccc78: <offset 0x1d54bc78> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
R15=0x0000000000000000 is an unknown value


Stack: [0x00007f6ddfdc0000,0x00007f6ddfe00000],  sp=0x00007f6ddfdfc550,  free space=241k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libonnxruntime_providers_cuda.so+0x1a4d12]
C  0x0000000000000470

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA(JJI)V+0
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA(I)V+19
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA()V+2
j  ai.djl.onnxruntime.engine.OrtEngine.hasCapability(Ljava/lang/String;)Z+29
j  ai.djl.engine.Engine.defaultDevice()Lai/djl/Device;+10
j  ai.djl.ndarray.BaseNDManager.defaultDevice()Lai/djl/Device;+4
j  ai.djl.ndarray.BaseNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;)V+39
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;)V+3
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;Lai/djl/onnxruntime/engine/OrtNDManager$1;)V+4
j  ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>()V+15
j  ai.djl.onnxruntime.engine.OrtNDManager.<clinit>()V+4
v  ~StubRoutines::call_stub
j  ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(Lai/djl/Device;)Lai/djl/ndarray/NDManager;+0
j  ai.djl.onnxruntime.engine.OrtEngine.newModel(Ljava/lang/String;Lai/djl/Device;)Lai/djl/Model;+7
j  ai.djl.Model.newInstance(Ljava/lang/String;Lai/djl/Device;Ljava/lang/String;)Lai/djl/Model;+23
j  ai.djl.repository.zoo.BaseModelLoader.createModel(Ljava/nio/file/Path;Ljava/lang/String;Lai/djl/Device;Lai/djl/nn/Block;Ljava/util/Map;Ljava/lang/String;)Lai/djl/Model;+4
j  ai.djl.repository.zoo.BaseModelLoader.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+506
j  ai.djl.repository.zoo.Criteria.loadModel()Lai/djl/repository/zoo/ZooModel;+524

What have you tried to solve it?

I changed ai.djl.engine.Engine.java to check the GPU count before probing for the CUDA capability, and the problem no longer reproduces:

    public Device defaultDevice() {
        if (defaultDevice == null) {
            if (CudaUtils.getGpuCount() > 0 && hasCapability(StandardCapabilities.CUDA)) { // check gpu-count first
                defaultDevice = Device.gpu();
            } else {
                defaultDevice = Device.cpu();
            }
        }
        return defaultDevice;
    }
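
For illustration, the ordering used in the change above can be isolated into a standalone sketch. Because `&&` short-circuits in Java, the right-hand probe is never invoked when the GPU count is zero. `DeviceProbe`, `chooseDevice`, and the two supplier parameters are hypothetical names standing in for `CudaUtils.getGpuCount()` and `hasCapability(StandardCapabilities.CUDA)`; this is not DJL code.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

// Hypothetical standalone sketch, not part of DJL.
final class DeviceProbe {

    private DeviceProbe() {}

    /**
     * Returns "gpu" only if at least one GPU is reported and the CUDA probe succeeds.
     * The probe is skipped entirely when gpuCount reports zero, because && short-circuits.
     */
    static String chooseDevice(IntSupplier gpuCount, BooleanSupplier cudaProbe) {
        if (gpuCount.getAsInt() > 0 && cudaProbe.getAsBoolean()) {
            return "gpu";
        }
        return "cpu";
    }
}
```

With a GPU count of 0, `chooseDevice` returns `"cpu"` without calling the probe at all, which is presumably why the native `addCUDA` call seen in the stack trace is no longer reached on a CPU-only host.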
@zaobao zaobao added the bug Something isn't working label Jul 29, 2024
@frankfliu
Contributor

frankfliu commented Jul 29, 2024

Why are you using the onnxruntime_gpu dependency on a machine without a GPU?

@Justubborn

I have the same problem using onnxruntime 1.18.0.
Container: Docker with NO GPU
OS: openEuler
djl version: 0.29.0
onnxruntime_gpu version: 1.18.0

#
#  SIGSEGV (0xb) at pc=0x00007fb6285e1e3b, pid=885, tid=917
#
# JRE version: Java(TM) SE Runtime Environment (17.0.12+8) (build 17.0.12+8-LTS-286)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.12+8-LTS-286, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libstdc++.so.6+0xd3e3b]
Time: Fri Aug 30 03:13:24 2024 UTC elapsed time: 28.554432 seconds (0d 0h 0m 28s)

---------------  T H R E A D  ---------------

Current thread (0x00007fb590081400):  JavaThread "XNIO-1 task-2" [_thread_in_native, id=917, stack(0x00007fb63823a000,0x00007fb63833a000)]

Stack: [0x00007fb63823a000,0x00007fb63833a000],  sp=0x00007fb6383332d8,  free space=996k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libstdc++.so.6+0xd3e3b]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  jdk.internal.loader.NativeLibraries.load(Ljdk/internal/loader/NativeLibraries$NativeLibraryImpl;Ljava/lang/String;ZZZ)Z+0 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open()Z+61 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)Ljdk/internal/loader/NativeLibrary;+256 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+51 java.base@17.0.12
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+31 java.base@17.0.12
j  java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+61 java.base@17.0.12
j  java.lang.System.load(Ljava/lang/String;)V+7 java.base@17.0.12
j  ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(Ljava/lang/String;)V+39
j  ai.djl.pytorch.jni.LibUtils.loadLibTorch(Lai/djl/pytorch/jni/LibUtils$LibTorch;)V+548
j  ai.djl.pytorch.jni.LibUtils.loadLibrary()V+28
j  ai.djl.pytorch.engine.PtEngine.newInstance()Lai/djl/engine/Engine;+0
j  ai.djl.pytorch.engine.PtEngineProvider.getEngine()Lai/djl/engine/Engine;+17
j  ai.djl.engine.Engine.getEngine(Ljava/lang/String;)Lai/djl/engine/Engine;+45
j  ai.djl.engine.Engine.getInstance()Lai/djl/engine/Engine;+43
j  ai.djl.onnxruntime.engine.OrtEngine.getAlternativeEngine()Lai/djl/engine/Engine;+15
j  ai.djl.ndarray.BaseNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;)V+85
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;)V+3
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;Lai/djl/onnxruntime/engine/OrtNDManager$1;)V+4
j  ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>()V+15
j  ai.djl.onnxruntime.engine.OrtNDManager.<clinit>()V+4
v  ~StubRoutines::call_stub
j  ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(Lai/djl/Device;)Lai/djl/ndarray/NDManager;+0
j  ai.djl.onnxruntime.engine.OrtEngine.newModel(Ljava/lang/String;Lai/djl/Device;)Lai/djl/Model;+7
j  ai.djl.Model.newInstance(Ljava/lang/String;Lai/djl/Device;Ljava/lang/String;)Lai/djl/Model;+23
j  ai.djl.repository.zoo.BaseModelLoader.createModel(Ljava/nio/file/Path;Ljava/lang/String;Lai/djl/Device;Lai/djl/nn/Block;Ljava/util/Map;Ljava/lang/String;)Lai/djl/Model;+4
j  ai.djl.repository.zoo.BaseModelLoader.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+506
j  ai.djl.repository.zoo.Criteria.loadModel()Lai/djl/repository/zoo/ZooModel;+524
j  ai.djl.repository.zoo.ModelZoo.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+1
j  org.aoju.bus.ocr.toolkit.OcrV4Kit.runOcr(Ljava/io/InputStream;)Lorg/aoju/bus/ocr/entity/OcrResult;+50
j  cn.econta.tangor.service.OcrService.sync([B)Lorg/aoju/bus/ocr/entity/OcrResult;+10
j  cn.econta.tangor.spring.OcrController.jsonPpWorld(Ljava/lang/String;)Ljava/lang/Object;+15
v  ~StubRoutines::call_stub

@frankfliu
Contributor

ONNX Runtime ships as two separate jar files, a CPU one and a _gpu one. I don't think you can mismatch them with the hardware.

@Justubborn

Using only the ONNX CPU package together with PyTorch also crashes the JVM.
