Multiple failures in NONLTO, CLANG and ASAN Unit Tests and RelVals due to PluginNotFound #44821

Closed
aandvalenzuela opened this issue Apr 23, 2024 · 33 comments

Comments

@aandvalenzuela
Contributor

Hello,

There are multiple failures in NONLTO, CLANG and ASAN IBs (both in Unit Tests and RelVals) in the latest IBs (CMSSW_14_1_[FLAVOR]_X_2024-04-22-2300) reporting:

===== Test "testROCmTestDeviceAdditionModule" ====
----- Begin Fatal Exception 23-Apr-2024 12:18:10 CEST-----------------------
An exception of category 'PluginNotFound' occurred while
   [0] Initializing message logger
Exception Message:
Unable to find plugin 'SingleThreadMSPresence' because the category 'CMS EDM Framework Presence' has no known plugins
----- End Fatal Exception -------------------------------------------------

---> test testROCmTestDeviceAdditionModule had ERRORS
TestTime:0
^^^^ End Test testROCmTestDeviceAdditionModule ^^^^

There are other variants of the exception, for example:

  • CondCore/SiPixelPlugins:
===== Test "testPixelPayloadInspector" ====
terminate called after throwing an instance of 'cms::Exception'
  what():  An exception of category 'PluginNotFound' occurred.
Exception Message:
Unable to find plugin 'SiteLocalConfigService' because the category 'CMS EDM Framework Service' has no known plugins
  • CondCore/CondDB:
===== Test "testConditionDatabase_1" ====
> Connecting with db in sqlite_file:cms_conditions_1.db
ERROR: An exception of category 'PluginNotFound' occurred.
Exception Message:
Unable to find plugin 'COND/Services/RelationalAuthenticationService' because the category 'CoralService' has no known plugins

I am not sure if it is related, but we had a ROCm update yesterday in #44777 and ROCm device builds fine (see log).
However, a similar issue was reported in the past at cmssw#40680; it was related to a ROCm update, and the missing plugins were not properly registered in the .edmplugincache file.

Thanks,
Andrea

@cmsbuild
Contributor

cmsbuild commented Apr 23, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @aandvalenzuela.

@smuzaffar, @makortel, @rappoccio, @antoniovilela, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

The duplicate dictionary checker for those failing IBs says:

/data/cmsbld/jenkins/workspace/ib-run-qa/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/.edmplugincache: No such file or directory

@smuzaffar
Contributor

At the end of the build phase we run edmPluginRefresh, and it looks like there was a crash [a] when it was run:

[a] https://cmssdt.cern.ch/SDT/jenkins-artifacts/build-any-ib/CMSSW_14_1_NONLTO_X_2024-04-23-1100/el8_amd64_gcc12/177396/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/log

+ edmPluginRefresh /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Thread 5 (Thread 0x14afeb3d7700 (LWP 1966692) "edmPluginRefres"):
#0  0x000014b04fed645c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000014afebd3bfc1 in void std::condition_variable::wait<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#2  0x000014afebd2d4cb in progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#3  0x000014afebdab9c2 in void std::__invoke_impl<void, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(std::__invoke_other, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#4  0x000014afebdaafc1 in std::__invoke_result<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>::type std::__invoke<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#5  0x000014afebdaa428 in void std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#6  0x000014afebda951e in std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::operator()() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#7  0x000014afebda7ad4 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> > >::_M_run() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#8  0x000014b050562a73 in std::execute_native_thread_routine (__p=0x2acf1a0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#9  0x000014b04fed01ca in start_thread () from /lib64/libpthread.so.0
#10 0x000014b04fb3ce73 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x14afeb5d8700 (LWP 1966691) "edmPluginRefres"):
#0  0x000014b04fed645c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000014afebd3bfc1 in void std::condition_variable::wait<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#2  0x000014afebd2d4cb in progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#3  0x000014afebdab9c2 in void std::__invoke_impl<void, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(std::__invoke_other, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#4  0x000014afebdaafc1 in std::__invoke_result<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>::type std::__invoke<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#5  0x000014afebdaa428 in void std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#6  0x000014afebda951e in std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::operator()() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#7  0x000014afebda7ad4 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> > >::_M_run() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#8  0x000014b050562a73 in std::execute_native_thread_routine (__p=0x2acec90) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#9  0x000014b04fed01ca in start_thread () from /lib64/libpthread.so.0
#10 0x000014b04fb3ce73 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x14afeb7d9700 (LWP 1966690) "edmPluginRefres"):
#0  0x000014b04fed645c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000014afebd3bfc1 in void std::condition_variable::wait<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#2  0x000014afebd2d4cb in progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#3  0x000014afebdab9c2 in void std::__invoke_impl<void, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(std::__invoke_other, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#4  0x000014afebdaafc1 in std::__invoke_result<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>::type std::__invoke<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#5  0x000014afebdaa428 in void std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#6  0x000014afebda951e in std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::operator()() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#7  0x000014afebda7ad4 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> > >::_M_run() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#8  0x000014b050562a73 in std::execute_native_thread_routine (__p=0x2acecb0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#9  0x000014b04fed01ca in start_thread () from /lib64/libpthread.so.0
#10 0x000014b04fb3ce73 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x14afeb9da700 (LWP 1966689) "edmPluginRefres"):
#0  0x000014b04fed645c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000014afebd3bfc1 in void std::condition_variable::wait<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#2  0x000014afebd2d4cb in progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}::operator()() const () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#3  0x000014afebdab9c2 in void std::__invoke_impl<void, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(std::__invoke_other, progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#4  0x000014afebdaafc1 in std::__invoke_result<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>::type std::__invoke<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}>(progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}&&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#5  0x000014afebdaa428 in void std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#6  0x000014afebda951e in std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> >::operator()() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#7  0x000014afebda7ad4 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<progschj::ThreadPool::start_worker(unsigned long, std::unique_lock<std::mutex> const&)::{lambda()#1}> > >::_M_run() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libdip.so
#8  0x000014b050562a73 in std::execute_native_thread_routine (__p=0x2ace7a0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#9  0x000014b04fed01ca in start_thread () from /lib64/libpthread.so.0
#10 0x000014b04fb3ce73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x14b04fa02a80 (LWP 1966688) "edmPluginRefres"):
#0  0x000014b04fbfc6c2 in waitpid () from /lib64/libc.so.6
#1  0x000014b04fb5ece7 in do_system () from /lib64/libc.so.6
#2  0x000014b05108caed in TUnixSystem::StackTrace() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libCore.so
#3  0x000014b05108c4a4 in TUnixSystem::DispatchSignals(ESignals) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libCore.so
#4  <signal handler called>
#5  0x000014afec641a2b in std::filesystem::__cxx11::path::~path() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/librocprofiler-register.so.0
#6  0x000014afec641a4c in std::filesystem::__cxx11::path::~path() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/librocprofiler-register.so.0
#7  0x000014afb3d005a0 in DD4hep_Flavor::PluginService::v2::Details::Registry::initialize() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#8  0x000014b04fed7e67 in __pthread_once_slow () from /lib64/libpthread.so.0
#9  0x000014afb3cfbb3a in DD4hep_Flavor::PluginService::v2::Details::Registry::factories[abi:cxx11]() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#10 0x000014afb3cfc3a6 in DD4hep_Flavor::PluginService::v2::Details::Registry::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DD4hep_Flavor::PluginService::v2::Details::Registry::FactoryInfo) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#11 0x000014afb3d0781c in dd4hep_pluginmgr_add_factory_V2 () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#12 0x000014afb3d30180 in _GLOBAL__sub_I_DDTestVectorAlgo.cc () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/pluginDetectorDescriptionTestPlugins.so
#13 0x000014b0511d8f7a in call_init (l=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7ffd97e157c8, env=env@entry=0x2b5a500) at dl-init.c:72
#14 0x000014b0511d907a in call_init (env=0x2b5a500, argv=0x7ffd97e157c8, argc=2, l=<optimized out>) at dl-init.c:118
#15 _dl_init (main_map=0x9ac2130, argc=2, argv=0x7ffd97e157c8, env=0x2b5a500) at dl-init.c:119
#16 0x000014b04fc6be2c in _dl_catch_exception () from /lib64/libc.so.6
#17 0x000014b0511e078e in dl_open_worker (a=0x7ffd97e145a0) at dl-open.c:813
#18 dl_open_worker (a=0x7ffd97e145a0) at dl-open.c:776
#19 0x000014b04fc6bdd4 in _dl_catch_exception () from /lib64/libc.so.6
#20 0x000014b0511e09e1 in _dl_open (file=0x9942c20 "/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/pluginDetectorD"..., mode=-2147483391, caller_dlopen=0x14b0513f3003 <edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&)+115>, nsid=<optimized out>, argc=2, argv=<optimized out>, env=0x2b5a500) at dl-open.c:895
#21 0x000014b0508b3f8a in dlopen_doit () from /lib64/libdl.so.2
#22 0x000014b04fc6bdd4 in _dl_catch_exception () from /lib64/libc.so.6
#23 0x000014b04fc6be93 in _dl_catch_error () from /lib64/libc.so.6
#24 0x000014b0508b452e in _dlerror_run () from /lib64/libdl.so.2
#25 0x000014b0508b402a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#26 0x000014b0513f3003 in edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/libFWCorePluginManager.so
#27 0x000000000040909c in main ()
===========================================================


The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
preferably using the command (.forum bug) in the ROOT prompt.
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs or (preferably) using the command (.gh bug) in
the ROOT prompt. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x000014afec641a2b in std::filesystem::__cxx11::path::~path() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/librocprofiler-register.so.0
#6  0x000014afec641a4c in std::filesystem::__cxx11::path::~path() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/librocprofiler-register.so.0
#7  0x000014afb3d005a0 in DD4hep_Flavor::PluginService::v2::Details::Registry::initialize() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#8  0x000014b04fed7e67 in __pthread_once_slow () from /lib64/libpthread.so.0
#9  0x000014afb3cfbb3a in DD4hep_Flavor::PluginService::v2::Details::Registry::factories[abi:cxx11]() () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#10 0x000014afb3cfc3a6 in DD4hep_Flavor::PluginService::v2::Details::Registry::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DD4hep_Flavor::PluginService::v2::Details::Registry::FactoryInfo) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#11 0x000014afb3d0781c in dd4hep_pluginmgr_add_factory_V2 () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so
#12 0x000014afb3d30180 in _GLOBAL__sub_I_DDTestVectorAlgo.cc () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/pluginDetectorDescriptionTestPlugins.so
#13 0x000014b0511d8f7a in call_init (l=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7ffd97e157c8, env=env@entry=0x2b5a500) at dl-init.c:72
#14 0x000014b0511d907a in call_init (env=0x2b5a500, argv=0x7ffd97e157c8, argc=2, l=<optimized out>) at dl-init.c:118
#15 _dl_init (main_map=0x9ac2130, argc=2, argv=0x7ffd97e157c8, env=0x2b5a500) at dl-init.c:119
#16 0x000014b04fc6be2c in _dl_catch_exception () from /lib64/libc.so.6
#17 0x000014b0511e078e in dl_open_worker (a=0x7ffd97e145a0) at dl-open.c:813
#18 dl_open_worker (a=0x7ffd97e145a0) at dl-open.c:776
#19 0x000014b04fc6bdd4 in _dl_catch_exception () from /lib64/libc.so.6
#20 0x000014b0511e09e1 in _dl_open (file=0x9942c20 "/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/pluginDetectorD"..., mode=-2147483391, caller_dlopen=0x14b0513f3003 <edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&)+115>, nsid=<optimized out>, argc=2, argv=<optimized out>, env=0x2b5a500) at dl-open.c:895
#21 0x000014b0508b3f8a in dlopen_doit () from /lib64/libdl.so.2
#22 0x000014b04fc6bdd4 in _dl_catch_exception () from /lib64/libc.so.6
#23 0x000014b04fc6be93 in _dl_catch_error () from /lib64/libc.so.6
#24 0x000014b0508b452e in _dlerror_run () from /lib64/libdl.so.2
#25 0x000014b0508b402a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#26 0x000014b0513f3003 in edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&) () from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/c1b2a790a03448edf0bfd113d211f5a4/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_NONLTO_X_2024-04-23-1100/lib/el8_amd64_gcc12/libFWCorePluginManager.so
#27 0x000000000040909c in main ()
===========================================================

Error while processing.

@smuzaffar
Contributor

https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/scram-project-build.file#L228 is where we run edmPluginRefresh. Although there was a crash, edmPluginRefresh did not exit with a non-zero code, which is why the build process did not stop.
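
One way a build script could catch this situation, sketched under assumptions: since the crashed run left no .edmplugincache behind (see the duplicate dictionary checker message earlier in this thread), a check for that file after the refresh step would stop the build even when the tool itself exits with 0. The function name `require_plugin_cache` is hypothetical and is not part of scram-project-build.file.

```shell
# Hypothetical guard, not taken from cmsdist: fail if the plugin cache
# file is missing or empty after the refresh step.
require_plugin_cache() {
  libdir=$1
  cache="$libdir/.edmplugincache"
  if [ ! -s "$cache" ]; then
    echo "ERROR: $cache is missing or empty" >&2
    return 1
  fi
  return 0
}

# Usage sketch (the library directory is supplied by the caller):
#   edmPluginRefresh "$libdir" && require_plugin_cache "$libdir" || exit 1
```

The check is deliberately independent of the tool's exit code, so it would still help if a future crash is again reported as success.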

@smuzaffar
Contributor

smuzaffar commented Apr 23, 2024

one can reproduce the crash on cmsdev4X nodes by starting cmssw-el8 and then

> scram p CMSSW_14_1_NONLTO_X_2024-04-23-1100
> cd CMSSW_14_1_NONLTO_X_2024-04-23-1100
> cmsenv
> edmPluginRefresh $CMSSW_RELEASE_BASE/lib/el8_amd64_gcc12
#OR
> rsync -a $CMSSW_RELEASE_BASE/lib/ $CMSSW_BASE/lib/
> edmPluginRefresh $CMSSW_BASE/lib/el8_amd64_gcc12

@smuzaffar
Contributor

Note that edmPluginRefresh loads/checks 2000 libs at a time. Changing it to 200 allowed me to run it. Maybe there are libs which do not like to be loaded in the same process?
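
The effect of the batch size can be illustrated with a small sketch. It assumes a hypothetical per-batch command that is started once per group of library names, so that a smaller batch means fewer libraries share one process. `run_in_batches` is an illustrative helper; how edmPluginRefresh splits its work internally is not shown in this thread.

```shell
# Illustrative helper: read whitespace-free names (one per line) from stdin
# and run the given command once per batch of at most N names.
# Each batch is handled by a separate child process started by xargs.
run_in_batches() {
  size=$1
  shift
  xargs -n "$size" "$@"
}

# Example with `echo` standing in for the real per-batch command:
#   printf '%s\n' libA.so libB.so libC.so | run_in_batches 2 echo
```

With a batch size of 2, the example starts the command twice: once for the first two names and once for the remaining one.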

@Dr15Jones
Contributor

I wonder if we have a 'one definition rule' violation. Maybe valgrind could spot a problem?

@makortel
Contributor

Running valgrind only showed an "Invalid read of size 8" in std::filesystem::__cxx11::path::~path(), with the same stack trace as in #44821 (comment).

But that destructor comes from .../CMSSW_14_1_NONLTO_X_2024-04-23-1100/external/el8_amd64_gcc12/lib/librocprofiler-register.so.0 instead of libstdc++! The IB where all of this started did include an update in ROCm.

@makortel
Contributor

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Apr 23, 2024

(jumping into the rabbit hole with @Dr15Jones) So librocprofiler-register.so.0 has

0000000000074a20 W std::filesystem::__cxx11::path::~path()

Our libstdc++.so does not have any symbol for ~path(). The libFWCorePluginManager.so has

0000000000020ff0 t std::filesystem::__cxx11::path::~path()

and seems to be the only CMSSW shared object having the ~path() symbol.
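
The type letters in the `nm` output above matter for how `~path()` can be bound at run time: per the GNU `nm` documentation, `W` marks a weak symbol and a lowercase `t` marks a local symbol in the text section, which other libraries cannot bind to. The following sketch only labels lines in that `address type name` format; the helper name `classify_nm_symbols` is hypothetical.

```shell
# Label nm output lines of the form "<address> <type> <name>" read from stdin.
# Assumes defined symbols, i.e. lines that carry an address column.
classify_nm_symbols() {
  while read -r addr type name; do
    case $type in
      W) kind="weak defined" ;;   # weak symbol with a definition
      T) kind="global text" ;;    # code symbol with global binding
      t) kind="local text" ;;     # code symbol local to this object
      *) kind="other ($type)" ;;
    esac
    printf '%s: %s\n' "$kind" "$name"
  done
}

# Usage sketch (library path is a placeholder):
#   nm -C --defined-only path/to/lib.so | grep '::path::~path' | classify_nm_symbols
```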

One thing to note on the ROCm setup is that (as far as I can tell) we are taking the binaries from AMD's RHEL8 RPMs. I would assume those were built with the system GCC against the system libstdc++, which seems to be version 8 (or at least lxplus8 has GCC 8.5). Version 8 was the first GCC to include std::filesystem, and it seems to have shipped it as a static library (libstdc++fs.a).

@makortel
Contributor

makortel commented Apr 23, 2024

The ROCm libraries get loaded by edmPluginRefresh as part of loading pluginBeamSpotDeviceProducerROCmAsync.so and pluginCalibTrackerSiPixelESProducersPluginsPortableROCmAsync.so.

@makortel
Contributor

The libDD4hepGaudiPluginMgr.so has

0000000000035320 W std::filesystem::__cxx11::path::~path()

@makortel
Contributor

Disassembling things, the instructions of ~path() in librocprofiler-register.so.0 match the instructions in libstdc++fs.a from GCC 8. The instructions in libFWCorePluginManager.so match the instructions in libDD4hepGaudiPluginMgr.so. The instructions in the GCC 8 librocprofiler-register.so.0/libstdc++fs.a are (very) different from the instructions in the GCC 12 libFWCorePluginManager.so/libDD4hepGaudiPluginMgr.so.

It seems like we have an ODR violation from trying to mix libraries that were built with (very) different versions of libstdc++, and thus if we need to keep the rocprofiler, we'd have to build it ourselves.

@fwyzard
Contributor

fwyzard commented Apr 23, 2024

I'm not particularly interested in keeping rocprofiler (and in fact we did not have it until now).

Unfortunately it seems to be a dependency of hipcc (via rocminfo) and various HIP libraries:

  • libamdhip64.so
  • libhiprtc.so
  • libhsa-runtime64.so

@smuzaffar
Contributor

I have opened cms-sw/cmsdist#9153 and #44824 to revert the ROCm update.

@makortel
Contributor

Adding here cms-sw/cmsdist#9143 (comment)

all the IBs with this failure are non-lto ( https://github.com/search?q=repo%3Acms-sw%2Fcms-bot%20BUILD_OPTS%3Dno-lto&type=code )

The trend continued: in CMSSW_14_1_X_2024-04-23-2300 the NONLTO and CLANG IBs failed, but none of the others.

@makortel
Contributor

https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/scram-project-build.file#L228 is where we run edmPluginRefresh. Although there was a crash, edmPluginRefresh did not exit with a non-zero code, which is why the build process did not stop.

#44838 fixes edmPluginRefresh to return a non-zero exit code if the child process fails.

@smuzaffar
Contributor

Thanks @makortel, I have tested it for NONLTO and confirm that edmPluginRefresh exits with a non-zero code:

> edmPluginRefresh ./lib/el8_amd64_gcc12
 *** Break *** segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
...
...
GLIBC_2.2.5 () from /lib64/libdl.so.2
#26 0x00007fde4779a003 in edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&) () from /build/muz/plugin/CMSSW_14_1_NONLTO_X_2024-04-23-2300/lib/el8_amd64_gcc12/libFWCorePluginManager.so
#27 0x00000000004090ac in main ()
===========================================================

Error while processing: 139
> echo $?
139

@smuzaffar
Contributor

smuzaffar commented Apr 24, 2024

The libDD4hepGaudiPluginMgr.so has

0000000000035320 W std::filesystem::__cxx11::path::~path()

For LTO builds (where dd4hep is also built with LTO flags), the libDD4hepGaudiPluginMgr.so library does not contain this symbol. It only has

Singularity> nm -D external/el8_amd64_gcc12/lib/libDD4hepGaudiPluginMgr.so | c++filt | grep ::path::
                 U std::filesystem::__cxx11::path::_M_find_extension() const@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::_List::_Impl_deleter::operator()(std::filesystem::__cxx11::path::_List::_Impl*) const@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::_List::end() const@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::compare(std::filesystem::__cxx11::path const&) const@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::_M_split_cmpts()@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::_List::_List(std::filesystem::__cxx11::path::_List const&)@GLIBCXX_3.4.26
                 U std::filesystem::__cxx11::path::_List::_List()@GLIBCXX_3.4.26

So maybe that is why LTO-enabled IBs are not failing.

@fwyzard
Contributor

fwyzard commented May 30, 2024

I got a suggestion for a possible workaround from an AMD expert: can we LD_PRELOAD the correct C++ library?

@makortel
Contributor

I got a suggestion for a possible workaround from an AMD expert: can we LD_PRELOAD the correct C++ library?

As far as I can tell, GCC 12 does not provide any shared object that would provide std::filesystem::path::~path() (poking the headers I see it is defined inline as = default). I guess we could try to create our own such shared object, but even then the setup sounds brittle to me (like how would we figure out what functions to include there). I'd also expect all functionality in rocprofiler-register.so that depends on (out of line) std::filesystem::path functionality to be broken.

@fwyzard
Contributor

fwyzard commented May 30, 2024 via email

@fwyzard
Contributor

fwyzard commented Oct 3, 2024

As a temporary workaround it might be enough to build a stub library to replace librocprofiler-register.so.

This seems to work to build CMSSW with ROCm 6.1.2:

rocprofiler-register.cc

#include <rocprofiler-register/rocprofiler-register.h>

extern "C" {

  rocprofiler_register_error_code_t
  rocprofiler_register_library_api_table(
    const char*                                 lib_name,
    rocprofiler_register_import_func_t          import_func,
    uint32_t                                    lib_version,
    void**                                      api_tables,
    uint64_t                                    api_table_length,
    rocprofiler_register_library_indentifier_t* register_id)
    ROCPROFILER_REGISTER_PUBLIC_API
  {
    return ROCP_REG_SUCCESS;
  }

}

Makefile

librocprofiler-register.so.0.3.0: rocprofiler-register.cc
        g++ rocprofiler-register.cc -std=c++17 -O2 -I /opt/rocm-6.1.2/include -shared -o librocprofiler-register.so.0.3.0
        rm -f ../external/el8_amd64_gcc12/lib/librocprofiler-register.so*
        cp librocprofiler-register.so.0.3.0 ../external/el8_amd64_gcc12/lib/
        ln -s librocprofiler-register.so.0.3.0 ../external/el8_amd64_gcc12/lib/librocprofiler-register.so.0
        ln -s librocprofiler-register.so.0 ../external/el8_amd64_gcc12/lib/librocprofiler-register.so

@makortel @smuzaffar do you think this is worth trying?

The only alternative I can think of is to build the whole ROCm stack from the sources - but I haven't found any actual instructions to do it :-/
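
One way to sanity-check a stub like the one above before swapping it in is to load it and look up the single entry point it is supposed to provide. The sketch below assumes a Linux system with `dlopen`/`dlsym`; the helper name `exports_symbol` is hypothetical and not part of ROCm or CMSSW.

```cpp
#include <dlfcn.h>

// Return true if `library` can be loaded and exports `symbol`.
// Passing nullptr as `library` searches the running program and the
// libraries it was started with.
// Note: a symbol whose address is null would be reported as missing,
// which is fine for function entry points.
bool exports_symbol(const char* library, const char* symbol) {
  void* handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) return false;  // library not found or not loadable
  const bool found = dlsym(handle, symbol) != nullptr;
  dlclose(handle);
  return found;
}
```

For the stub, a call such as `exports_symbol("librocprofiler-register.so.0", "rocprofiler_register_library_api_table")` should return true when the library is on the loader's search path. On older glibc versions the program may need to be linked with `-ldl`.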

@fwyzard
Contributor

fwyzard commented Oct 3, 2024

Note: one reason to upgrade ROCm is that the current version of the kernel drivers, 6.2.x, is only compatible with ROCm 6.0.x and newer.

Anecdotally, running with ROCm 5.6.x on the 6.2.x driver frequently hangs :-(

@makortel
Contributor

makortel commented Oct 4, 2024

do you think this is worth trying?

The only alternative I can think of is to build the whole ROCm stack from the sources - but I haven't found any actual instructions to do it :-/

Sounds to me like trying out our own stub library could be less painful than figuring out how to build the whole ROCm stack from the sources. Of course only time will tell how painful the maintenance of the stub library would be.

@makortel
Contributor

Since the problem itself was worked around by downgrading ROCm, how about we close this issue (which is mostly about the problem), and continue the ROCm discussion either here or in another issue?

@fwyzard
Contributor

fwyzard commented Oct 30, 2024

OK for me.

@fwyzard
Contributor

fwyzard commented Oct 30, 2024

+1

@fwyzard
Contributor

fwyzard commented Oct 30, 2024

please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@smuzaffar
Contributor

I have opened cms-sw/cmsdist#9493
