Pipelining TMA async memcpy #1789

rickyyx · 2024-05-29T17:14:44Z

Hey, I was trying to speed up my kernel, which involves a memcpy from global memory plus some computation. Looking at the official documentation on async memcpy, I see there are multiple ways of doing it:

pipelining with memcpy_async doc
using the TMA related memcpy_async doc

Pipelined memcpy_async seems to be natively supported, since memcpy_async from cuda/pipeline takes a pipeline instance. However, there seems to be no similar API in the TMA group of APIs.

Is it possible to speed up the TMA async memcpy with a pipeline using the current interfaces and APIs?

Thanks

ahendriksen · 2024-06-12T20:39:39Z

ahendriksen
Jun 12, 2024
Collaborator

Hi @rickyyx,

Good question!

Pipelined memcpy_async seems to be natively supported, since memcpy_async from cuda/pipeline takes a pipeline instance. However, there seems to be no similar API in the TMA group of APIs.

This is correct.

Is it possible to speed up the TMA async memcpy with a pipeline using the current interfaces and APIs?

Unfortunately, not (yet).

What is your use case? Does the pipeline example using memcpy_async from the CUDA blog speed up your kernel?
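For readers who want to try the non-TMA route discussed in this thread, here is a minimal sketch of the pipelined memcpy_async pattern from cuda/pipeline. It is not the code from the blog post or from anyone in this thread. The kernel name `double_pipelined`, the constants `kStages` and `kThreads`, the single-block launch, and the "multiply by two" compute step are all hypothetical choices made for illustration. It assumes a GPU and toolkit that support `cuda::pipeline` (compute capability 7.0 or newer, as far as I know, with hardware acceleration on 8.0 and newer).

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kStages  = 2;    // hypothetical: two-stage (double-buffered) pipeline
constexpr int kThreads = 128;  // hypothetical: one int per thread per batch

// Hypothetical kernel: doubles every element, overlapping the copy of the
// next batch with the computation on the current one.
__global__ void double_pipelined(int* out, const int* in, int num_batches) {
    auto block = cg::this_thread_block();

    __shared__ int stage_buf[kStages][kThreads];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);  // all threads produce and consume

    for (int compute = 0, fetch = 0; compute < num_batches; ++compute) {
        // Keep up to kStages batches in flight ahead of the compute step.
        for (; fetch < num_batches && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, stage_buf[fetch % kStages],
                               in + fetch * kThreads, sizeof(int) * kThreads, pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();  // the oldest committed batch is now in shared memory
        int t = block.thread_rank();
        out[compute * kThreads + t] = 2 * stage_buf[compute % kStages][t];
        pipe.consumer_release();  // free this stage for the next fetch
    }
}

int main() {
    const int num_batches = 8;
    const int n = num_batches * kThreads;

    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    double_pipelined<<<1, kThreads>>>(d_out, d_in, num_batches);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Verify on the host that every element was doubled.
    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == 2 * h_in[i]);
    std::printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Whether this helps depends on how much compute there is per batch to hide the copy latency behind. With a trivial compute step like the one above, the benefit is likely small.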
