SM90_TMA_REDUCE_ADD is bulk_group but not mbarrier? #5

Open
ziyuhuang123 opened this issue Sep 19, 2024 · 0 comments
@ziyuhuang123

struct SM90_TMA_REDUCE_ADD_2D
{
  CUTE_HOST_DEVICE static void
  copy(void const* const desc_ptr,
       void const* const smem_ptr,
       int32_t const& crd0, int32_t const& crd1)
  {
#if defined(CUTE_ARCH_TMA_SM90_ENABLED)
    uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
    uint32_t smem_int_ptr  = cast_smem_ptr_to_uint(smem_ptr);
    asm volatile (
      "cp.reduce.async.bulk.tensor.2d.global.shared::cta.add.bulk_group [%0, {%2, %3}], [%1];"
      :
      : "l"(gmem_int_desc), "r"(smem_int_ptr),
        "r"(crd0), "r"(crd1)
      : "memory");
#else
    CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED.");
#endif
  }
};

I noticed from the PTX that the reduce copy completes through a bulk async group (bulk_group), not through an mbarrier. Does that mean its usage differs significantly from TMA_STORE, so that it is not enough to simply replace SM90_TMA_STORE with SM90_TMA_REDUCE_ADD? My current understanding is sketched below.
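
For context, here is a minimal sketch of how I currently assume a bulk_group-completed TMA copy is drained, modeled on the SM90_TMA_STORE path in CuTe. The helpers cute::tma_store_fence(), cute::tma_store_arrive() and cute::tma_store_wait<N>() (which wrap fence.proxy.async.shared::cta, cp.async.bulk.commit_group and cp.async.bulk.wait_group.read) come from cute/arch/copy_sm90_tma.hpp; the wrapper function reduce_add_tile and its arguments are hypothetical.

#include <cute/arch/copy_sm90_tma.hpp>

// Hypothetical wrapper: issue a 2D TMA reduce-add from smem to gmem and
// wait for it the same way a bulk_group TMA store is waited for.
__device__ void reduce_add_tile(void const* tma_desc,   // TMA descriptor
                                void const* smem_tile,  // source tile in smem
                                int32_t crd0, int32_t crd1)
{
  // Make the preceding smem writes visible to the async (TMA) proxy.
  cute::tma_store_fence();                 // fence.proxy.async.shared::cta

  // Issue the reduction copy; like SM90_TMA_STORE it is tracked by a
  // bulk async group, not by an mbarrier.
  cute::SM90_TMA_REDUCE_ADD_2D::copy(tma_desc, smem_tile, crd0, crd1);

  // Commit all previously issued bulk async operations into one group...
  cute::tma_store_arrive();                // cp.async.bulk.commit_group

  // ...and block until at most 0 groups are still reading smem, so the
  // tile can safely be reused afterwards.
  cute::tma_store_wait<0>();               // cp.async.bulk.wait_group.read 0
}

If that is right, the synchronization would actually be the same as for SM90_TMA_STORE (which also completes via bulk_group), and only the copy instruction itself changes, but I would like to confirm.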
