Problem detecting ROCm version #2

Open
Rombur opened this issue Feb 24, 2021 · 15 comments

Comments

@Rombur

Rombur commented Feb 24, 2021

I get the following error when trying to configure DAGEE:

CMake Error at cmakeUtils/rocmVersion.cmake:10 (list):
  list GET given empty list
Call Stack (most recent call first):
  CMakeLists.txt:26 (include)

The problem is that this line https://github.com/AMDResearch/DAGEE/blob/master/cmakeUtils/rocmVersion.cmake#L8 assumes that the file is in /opt/rocm/. This is not the case on the cluster I am working on: we have multiple versions of ROCm installed in /opt/rocm-XXX. Changing the path in cmakeUtils/rocmVersion.cmake fixes the problem.
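
To illustrate the failure mode, here is a shell analogue of a version lookup that takes the install root as a parameter instead of hard-wiring /opt/rocm. It is only a sketch, not the code in rocmVersion.cmake. It assumes the version is stored in a file named `.info/version` under the install root (an assumption about the ROCm layout), and the function name `rocm_version` is hypothetical.

```shell
# Print the ROCm version recorded under a given install root.
# Assumption: the version lives in <root>/.info/version, e.g. "4.1.0-26".
rocm_version() {
    root="${1:-/opt/rocm}"              # fall back to the default location
    version_file="$root/.info/version"
    if [ ! -r "$version_file" ]; then
        # Fail loudly instead of handing an empty value to later steps.
        echo "error: cannot read $version_file; pass the ROCm root explicitly" >&2
        return 1
    fi
    # Keep the first line only and drop any "-<build>" suffix.
    head -n 1 "$version_file" | cut -d- -f1
}
```

Calling `rocm_version /opt/rocm-4.1.0` would then report the version for that tree, while a missing file gives a clear error rather than an empty list.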

@amberhassaan
Contributor

Hi, sorry for the delay. I'll make some changes so that you can specify -DROCM_ROOT on the command line. Will that work for you? Does your ROCm installation directory have a .info directory in it?

One thing that ROCm .deb packages do recently (not sure which one in particular, probably rocm-dkms) is that they soft link /opt/rocm to the latest version, such as /opt/rocm-4.0. This obviously doesn't happen in your installation process.

@Rombur
Author

Rombur commented Feb 26, 2021

> I'll make some changes so that you can specify -DROCM_ROOT on the command line. Will that work for you?

Yes, that works.

> Does your ROCm installation directory have a .info directory in it?

Yes, it does.

> One thing that ROCm .deb packages do recently (not sure which one in particular, probably rocm-dkms) is that they soft link /opt/rocm to the latest version, such as /opt/rocm-4.0. This obviously doesn't happen in your installation process.

There are several versions of ROCm installed on the cluster, so you need to load a module to get the version you want. That's why /opt/rocm does not exist.
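
A minimal shell sketch of how a script could resolve the ROCm root on such a system without assuming /opt/rocm exists. It assumes the loaded module exports `ROCM_PATH`; the function name `resolve_rocm_root` and the `ROCM_DEFAULT_ROOT` override are hypothetical and exist only for illustration.

```shell
# Resolve the ROCm install root without assuming /opt/rocm exists.
# Order: explicit argument, then $ROCM_PATH (assumed to be exported by the
# module system), then the default location.
resolve_rocm_root() {
    default_root="${ROCM_DEFAULT_ROOT:-/opt/rocm}"   # hypothetical override
    for candidate in "$1" "$ROCM_PATH" "$default_root"; do
        # Skip empty entries and paths that are not directories.
        if [ -n "$candidate" ] && [ -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "error: ROCm root not found; load a module or pass a path" >&2
    return 1
}
```

The result could be fed to the configure step, for example `cmake -DROCM_ROOT="$(resolve_rocm_root)" ../`.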

@amberhassaan
Contributor

amberhassaan commented Mar 3, 2021

Please try the latest master branch and run cmake with -DROCM_ROOT=<path-to-rocm-dir>
Let me know if you come across issues. Thanks.

@amberhassaan
Contributor

Also, build instructions have changed a little. You can try -DATMI_SRC=<path-to-atmi-repo> to build ATMI automatically with DAGEE.

@Rombur
Author

Rombur commented Mar 3, 2021

> Please try the latest master branch and run cmake with -DROCM_ROOT=<path-to-rocm-dir>

This works: I can configure the code. However, this new line breaks at compile time. I don't have a .bc file. All I have is this:

ls /opt/rocm-4.1.0/atmi/lib
libatmi_runtime.so  libatmi_runtime.so.0  libatmi_runtime.so.0.7.40100

I guess I need to build my own ATMI.

@amberhassaan
Contributor

As I mentioned, the ATMI build is now automated via DAGEE's cmake setup, so please try the -DATMI_SRC option with cmake.

@Rombur
Author

Rombur commented Mar 3, 2021

> As I mentioned, the ATMI build is now automated via DAGEE's cmake setup, so please try the -DATMI_SRC option with cmake.

That doesn't work. Here is my configuration line:

cmake -Wno-dev \
  -D CMAKE_CXX_COMPILER=hipcc \
  -D HSA_ROOT="${HSA_PATH};${ROCM_PATH}/lib64" \
  -D ATMI_SRC=/home/users/turcksin/atmi \
  -D AMD_LLVM=${ROCM_PATH}/llvm \
  -D ROCM_ROOT=${ROCM_PATH} \
  -D GFX_VER=908 \
  -D CMAKE_INSTALL_PREFIX=/home/users/turcksin/DAGEE/install \
../

and here is the error:

-- The C compiler identification is GNU 8.3.1
-- The CXX compiler identification is Clang 12.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm-4.1.0/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HSA_INCLUDE_DIRS: /opt/rocm-4.1.0/hsa/include
-- HSA_LIBRARIES: /opt/rocm-4.1.0/hsa/lib/libhsa-runtime64.so;/opt/rocm-4.1.0/lib64/libhsakmt.so
-- Found HSA: /opt/rocm-4.1.0/hsa/lib/libhsa-runtime64.so;/opt/rocm-4.1.0/lib64/libhsakmt.so
-- ROCM_DEVICE_LIBS set to /opt/rocm-4.1.0/amdgcn/bitcode. To override, run cmake with -DROCM_DEVICE_LIBS=BLAH
-- ATMI_ROOT set to /opt/rocm-4.1.0/atmi. To override, run cmake with -DATMI_ROOT=BLAH
-- Could NOT find ROCM (missing: ROCM_INCLUDE_DIRS)
-- ATMI: Not building ATMI Runtime: libhsa-runtime64 not found
-- ATMI: Not building ATMI C Extension. Use -DATMI_C_EXTENSION=on in your cmake options to enable.
-- Found LLVM 12.0.0git. Configure: /opt/rocm-4.1.0/llvm/lib/cmake/llvm/LLVMConfig.cmake
-- Found ATMI_DE_DEP_LIB: /opt/rocm-4.1.0/amdgcn
-- ATMI: Preparing to build device_runtime with GFX_VER=908
CMake Error at /home/users/turcksin/atmi/src/device_runtime/CMakeLists.txt:124 (target_sources):
  Cannot specify sources for target "atmi_runtime" which is not built by this
  project.


-- ATMI: build device_runtime with GFXNUM=unknown
fatal: No names found, cannot describe anything.
Using CPACK_DEBIAN_PACKAGE_RELEASE local
Using CPACK_RPM_PACKAGE_RELEASE local
RESULT_VARIABLE 0 OUTPUT_VARIABLE: .el8
CPACK_RPM_PACKAGE_RELEASE: local%{?dist}
-- DAGEE_ROOT set to DAGEE-lib. To override, run cmake with -DDAGEE_ROOT=BLAH
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Install Doxygen to enable 'make doc' to generate documentation
-- Configuring incomplete, errors occurred!
See also "/home/users/coe0132/DAGEE/build/CMakeFiles/CMakeOutput.log".

It looks like the variables are not passed correctly to ATMI. The first part of the configuration finds libhsa-runtime64.so, but then I get the error "libhsa-runtime64 not found".

@amberhassaan
Contributor

OK, I see what's going on. ATMI is looking for libhsa-runtime but it doesn't know where rocm is installed. It looks for ROCM_DIR (which defaults to /opt/rocm). See https://github.com/AMDResearch/atmi/blob/15ab2af651a6a394d37e080bfee3735fcaeb6d7b/src/CMakeLists.txt#L42

As a quick fix, can you try adding -DROCM_DIR on the command line (with the same value as ROCM_ROOT)? I'm just trying to see if ATMI will pick it up.
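
A small sketch of that quick fix: generate both cache arguments from a single path so the two values cannot drift apart. The helper name `rocm_cmake_args` is hypothetical; only the two variable names come from this thread.

```shell
# Print cmake cache arguments that point DAGEE (ROCM_ROOT) and ATMI
# (ROCM_DIR) at the same ROCm tree, one argument per line.
rocm_cmake_args() {
    root="$1"
    if [ -z "$root" ]; then
        echo "usage: rocm_cmake_args <rocm-root>" >&2
        return 1
    fi
    printf '%s\n' "-DROCM_ROOT=$root" "-DROCM_DIR=$root"
}

# Example (not run here; relies on word splitting, so the path must not
# contain spaces):
#   cmake $(rocm_cmake_args "$ROCM_PATH") ../
```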

@Rombur
Author

Rombur commented Mar 4, 2021

It still doesn't work, and the error message is not helping:

-- The C compiler identification is GNU 8.3.1
-- The CXX compiler identification is Clang 12.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm-4.1.0/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HSA_INCLUDE_DIRS: /opt/rocm-4.1.0/hsa/include
-- HSA_LIBRARIES: /opt/rocm-4.1.0/hsa/lib/libhsa-runtime64.so;/opt/rocm-4.1.0/lib64/libhsakmt.so
-- Found HSA: /opt/rocm-4.1.0/hsa/lib/libhsa-runtime64.so;/opt/rocm-4.1.0/lib64/libhsakmt.so
-- ROCM_DEVICE_LIBS set to /opt/rocm-4.1.0/amdgcn/bitcode. To override, run cmake with -DROCM_DEVICE_LIBS=BLAH
-- ATMI_ROOT set to /opt/rocm-4.1.0/atmi. To override, run cmake with -DATMI_ROOT=BLAH
-- Found ROCM: /opt/rocm-4.1.0/lib/libhsa-runtime64.so;/opt/rocm-4.1.0/lib64/libhsakmt.so
-- ATMI: Preparing to build runtime/core
-- Found LibElf: /usr/lib64/libelf.so
fatal: No names found, cannot describe anything.
-- ATMI: Not building ATMI C Extension. Use -DATMI_C_EXTENSION=on in your cmake options to enable.
-- Found LLVM 12.0.0git. Configure: /opt/rocm-4.1.0/llvm/lib/cmake/llvm/LLVMConfig.cmake
-- Found ATMI_DE_DEP_LIB: /opt/rocm-4.1.0/amdgcn
-- ATMI: Preparing to build device_runtime with GFX_VER=908
-- ATMI: build device_runtime with GFXNUM=unknown
fatal: No names found, cannot describe anything.
Using CPACK_DEBIAN_PACKAGE_RELEASE local
Using CPACK_RPM_PACKAGE_RELEASE local
RESULT_VARIABLE 0 OUTPUT_VARIABLE: .el8
CPACK_RPM_PACKAGE_RELEASE: local%{?dist}
-- DAGEE_ROOT set to DAGEE-lib. To override, run cmake with -DDAGEE_ROOT=BLAH
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Install Doxygen to enable 'make doc' to generate documentation
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
amd_comgr_INCLUDE_DIRS
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core
   used as include directory in directory /home/users/turcksin/atmi/src/runtime/core

-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.

@amberhassaan
Contributor

OK, looks like we made some progress. I'll try to replicate your setup and hopefully that'll expose more problems. Our scripts do rely on stuff being in /opt/rocm, hence the problems being exposed when that's no longer true.

Can you please confirm if you have the comgr library? It has two pieces: ROCM_ROOT/include/amd_comgr.h and ROCM_ROOT/lib/libamd_comgr.so.

Thanks.

@Rombur
Author

Rombur commented Mar 4, 2021

> Can you please confirm if you have the comgr library? It has two pieces: ROCM_ROOT/include/amd_comgr.h and ROCM_ROOT/lib/libamd_comgr.so.

I have the library, but the shared library is in lib64 (ROCM_ROOT/lib64/libamd_comgr.so). It is a little bit strange because amd_comgr and hsakmt are in lib64 but everything else is in lib.

I don't know if that helps, but the OS is Red Hat 8.2.

@amberhassaan
Contributor

That definitely helps. It turns out our cmake module scripts don't look in ROCM_ROOT/lib64 because it doesn't exist on Ubuntu.
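
As a rough illustration of the more flexible behavior, the lookup could check both suffixes under the root. This is a shell sketch under that assumption, not the DAGEE or ATMI cmake code, and `find_rocm_lib` is a hypothetical helper name.

```shell
# Locate a shared library under a ROCm root, checking lib and then lib64.
find_rocm_lib() {
    root="$1"
    name="$2"                           # e.g. libamd_comgr.so
    for suffix in lib lib64; do
        if [ -e "$root/$suffix/$name" ]; then
            echo "$root/$suffix/$name"
            return 0
        fi
    done
    echo "error: $name not found under $root/lib or $root/lib64" >&2
    return 1
}
```

With the layout described above, `find_rocm_lib "$ROCM_PATH" libamd_comgr.so` would resolve to the lib64 copy, while libraries that live in lib are still found first.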

Stepping back a bit, you might be able to live with not compiling the atmiDenq target. Can you please try commenting out

addDageeTarget(atmiDenq atmiDenq.cpp)
and L17. Then run cmake for DAGEE with:

cmake -DROCM_ROOT=<rocm-root> -DATMI_ROOT=<rocm-root>/atmi <path-to-dagee-src>

You might still run into libcomgr and libhsakmt linking issues, but perhaps you can resolve those with LD_LIBRARY_PATH or LD_PRELOAD. Worth a shot, I think. In the meantime, I'll work on making our scripts more flexible.
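
For the LD_LIBRARY_PATH route, a minimal sketch of adding the lib64 directory to the search path without creating duplicate entries. The helper name `prepend_ld_path` is hypothetical, and the path in the usage note is just the one from this thread.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;      # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}
```

For example, `prepend_ld_path /opt/rocm-4.1.0/lib64` before running the built binaries would let the dynamic loader find libamd_comgr.so and libhsakmt.so there.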

@ashwinma
Collaborator

We are working on fixing the plumbing of cmake vars HSA_ROOT/ROCM_ROOT/ROCM_DIR from DAGEE->ATMI. Clearly there were a few bugs where we were assuming default ROCm paths. Thanks for your patience.

However, some dependent ROCm libraries (like comgr) may still ship cmake files that hardcode the lib path suffix rather than lib64. So, even if we fix the cmake var plumbing from DAGEE->ATMI, your problem may still persist. For example:

$ cat /opt/rocm/lib/cmake/amd_comgr/amd_comgr-config.cmake

# Derive absolute install prefix from config file path.
get_filename_component(AMD_COMGR_PREFIX "${CMAKE_CURRENT_LIST_FILE}" PATH)
get_filename_component(AMD_COMGR_PREFIX "${AMD_COMGR_PREFIX}" PATH)
get_filename_component(AMD_COMGR_PREFIX "${AMD_COMGR_PREFIX}" PATH)
get_filename_component(AMD_COMGR_PREFIX "${AMD_COMGR_PREFIX}" PATH)

include("${AMD_COMGR_PREFIX}/lib/cmake/amd_comgr/amd_comgr-targets.cmake")
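
To see why this matters, the four `get_filename_component(... PATH)` calls can be mimicked with `dirname`. The sketch below assumes, hypothetically, that the config file sits under lib64 on an affected system; `comgr_targets_path` is an illustrative name.

```shell
# Strip four path components (like the four get_filename_component calls),
# then append the hardcoded "lib" segment used by the include line.
comgr_targets_path() {
    prefix="$1"                         # path to amd_comgr-config.cmake
    for i in 1 2 3 4; do
        prefix=$(dirname "$prefix")
    done
    echo "$prefix/lib/cmake/amd_comgr/amd_comgr-targets.cmake"
}

# Hypothetical lib64 layout: the derived path points into lib, not lib64.
comgr_targets_path /opt/rocm-4.1.0/lib64/cmake/amd_comgr/amd_comgr-config.cmake
# prints /opt/rocm-4.1.0/lib/cmake/amd_comgr/amd_comgr-targets.cmake
```

The derived prefix is correct, but the targets file is then looked up under lib, which would not exist if the package was installed under lib64.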

@Rombur
Author

Rombur commented Apr 28, 2021

Today I updated DAGEE and the initial ROCm version problem reappeared. It looks like something went wrong when you applied my PR https://github.com/AMDResearch/DAGEE/pull/3/files. My PR only adds inline, but somehow the corresponding commit 5b5d333 contains many more changes, including reverting ${ROCM_ROOT} to /opt/rocm.

@amberhassaan
Contributor

Sorry about that. I must have messed up the versioning in the internal repo's submodules. Please check: 50a9217
