NBSNEBIUS-258: Support Zero Copy for RDMA Data Path on Disk Agent (#1324) #1617
The Disk Agent currently copies data buffers multiple times for READ/WRITE
requests received over the RDMA transport.
For WRITE requests, the RDMA data buffer is first copied into the memory of
TWriteBlocksRequest and then into a disk block-aligned buffer allocated by
Storage.
For READ requests, disk data is first read into a disk block-aligned buffer
allocated by Storage and then copied into the TReadBlocksResponse message. This
message is then serialized into the RDMA buffer.
To avoid these expensive copies and maintain compatibility with older clients,
we introduce the RDMA_PROTO_FLAG_RDATA flag, which signals the data layout
relative to the allocated RDMA buffer.
Previously, the data layout was:
With this layout, the data offset in memory is not necessarily aligned to 512
or 4096 bytes, even though the RDMA buffer is allocated in 4096-byte chunks.
Libaio requires block-aligned memory buffers for I/O on the underlying block
device opened with O_DIRECT, so a different data layout is needed.
With RDMA_PROTO_FLAG_RDATA, the data layout becomes:
Since the Data buffer size is a multiple of the device block size (512 or 4096
bytes) and the RDMA buffer size is a multiple of 4096 bytes, the data offset in
memory is aligned to the device block size, which allows the Data buffer to be
used with libaio directly.
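The alignment argument can be checked with a little arithmetic. The sketch
below is illustrative only and is not taken from the change itself: the names
RdmaChunkSize, RoundUpToChunk, TailDataOffset and IsAligned are hypothetical,
and it assumes that the data occupies the tail of the RDMA buffer and that the
buffer base address is itself 4096-byte aligned.

```cpp
#include <cstddef>

// Hypothetical constant: the RDMA buffer is allocated in 4096-byte chunks.
constexpr std::size_t RdmaChunkSize = 4096;

// Rounds a size up to a whole number of RDMA chunks.
constexpr std::size_t RoundUpToChunk(std::size_t size)
{
    return (size + RdmaChunkSize - 1) / RdmaChunkSize * RdmaChunkSize;
}

// Offset of the data inside the buffer, ASSUMING the data is placed at the
// very end of the buffer (an assumption of this sketch).
constexpr std::size_t TailDataOffset(std::size_t bufferSize, std::size_t dataSize)
{
    return bufferSize - dataSize;
}

// True if value is a multiple of alignment.
constexpr bool IsAligned(std::size_t value, std::size_t alignment)
{
    return value % alignment == 0;
}

// Example: 100 bytes of metadata (hypothetical size) and 3 blocks of 512 bytes.
constexpr std::size_t MetaSize = 100;
constexpr std::size_t DataSize = 3 * 512;
constexpr std::size_t BufferSize = RoundUpToChunk(MetaSize + DataSize);

// Data placed right after the metadata starts at an unaligned offset.
static_assert(!IsAligned(MetaSize, 512), "offset 100 is not block-aligned");

// Data placed at the tail of the buffer starts at a block-aligned offset.
static_assert(IsAligned(TailDataOffset(BufferSize, DataSize), 512),
              "tail offset is block-aligned");
```

Under these assumptions the tail offset is always a multiple of the block
size, because both the buffer size and the data size are.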