Skip to content

Commit

Permalink
NBSNEBIUS-65: add rdma documentation (#116)
Browse files Browse the repository at this point in the history
  • Loading branch information
budevg authored Jan 11, 2024
1 parent f3f435e commit 8d611c9
Showing 1 changed file with 86 additions and 0 deletions.
86 changes: 86 additions & 0 deletions doc/rdma/architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# RDMA architecture
## [RDMA library](../../cloud/blockstore/libs/rdma)

RDMA library provides interface to communicate
through the RDMA transport. The interface is composed of two entities: `Client`
and `Server`. It allows multiple Clients to connect to a single Server.

### [Client](../../cloud/blockstore/libs/rdma/iface/client.h)

Single client object (created using `NRdma::CreateClient`) can connect to multiple
servers using `StartEndpoint`. After successful connection `Endpoint` can be
used to allocate request (`AllocateRequest`) and sending it (`SendRequest`).

Each endpoint pre-allocates `QueueDepth` buffers for request/response
metadata. Additional buffers for request/response data is allocated on demand
using `SendBuffers` or `RecvBuffers` buffer pools kept for each endpoint.

Calling `AllocateRequest` on endpoint will allocate enough buffers for the
request data transfer. After that calling `SendRequest` will send the request
using rdma primitives to the server. After the server responds to the request,
the library will invoke client handler callback which was provided during the
call to the `AllocateRequest`.

### [Server](../../cloud/blockstore/libs/rdma/iface/server.h)

The server object (created using `NRdma::CreateServer`) will start listening for
client connections after invoking `StartEndpoint`. When client connects to the
server new `ServerSession` will be created and the server will start receiving
requests from client. The handler provided in the `StartEndpoint` call will be
invoked on new requests with `in`/`out` buffers which can be used to read/write
data. The handler then invokes `SendResponse` to send response data written to
the `out` buffer. It can signal error by invoking endpoints `SendError` method.

### [RDMA transport](../../cloud/blockstore/libs/rdma/impl)

When `SendRequest` is invoked on the client, the client will send message
consisting of `address`, `length` and `rkey` of the `in`/`out` buffers and the
`request id` to the server. This message will be sent using `IBV_WR_SEND` rdma
verb.

The message will be received on the server side. If the message `in` buffer size
is not zero the server will use `IBV_WR_RDMA_READ` rdma verb to read the
data. If the `out` buffer size is not zero the server will allocate buffer on
the server side for the response. Next the handler will be invoked with the
`in`/`out` buffers which it can use to process request and write response. After
the handler completes, if there is `out` buffer with non zero size, it will be
written using `IBV_WR_RDMA_WRITE` rdma verb. Next the response message will be
sent to the client using `IBV_WR_SEND` rdma verb.

The client will receive the response message and will invoke request handler
signaling completion of request.

## Partition [part_nonrepl_rdma_actor](../../cloud/blockstore/libs/storage/partition_nonrepl/part_nonrepl_rdma_actor.h)

Non replicated partition rdma actor is used to send read/write requests to disk
agent over rdma. Single `Read/WriteBlocksRequest` will be partition into
multiple requests for specific devices. For each device, the rdma actor will use
rdma lib client to send serialized `Read/WriteDeviceBlocksRequest` protobuf. It
will get response and deserialize it into `Read/WriteDeviceBlocksResponse`.
After all device requests will be completed the single `Read/WriteBlocksRequest`
will be completed.

For write request we will copy buffers once from the `WriteBlocksRequest` to the
`WriteDeviceBlocksRequest` when we serialize the `WriteDeviceBlocksRequest`
before we send it.

For read request we are going to copy the buffer once we receive response from
the `ReadDeviceBlocksResponse` to the `ReadBlocksResponse`.

## Disk agent [rdma_target](../../cloud/blockstore/libs/storage/disk_agent/rdma_target.h)

Disk agent is using rdma\_target to handle read/write requests sent to it using
rdma lib. It initializes rdma server and starts endpoint to listen for
connections. Once connection is established the rdma_target will handle
`Read/WriteDeviceBlocksRequest`.

For `WriteDeviceBlocksRequest`, it constructs `WriteBlocksRequest` using the
same data buffers and writes data using `StorageAdaptor`.

For `ReadDeviceBlocksRequest`, it uses `StorageAdaptor` to read data into
`ReadBlocksRequest` and then converts response to `ReadDeviceBlocksRequest`.

There is extra buffer copy inside the `StorageAdaptor`. We use aio for
read/write and it require page_size aligned buffers. When we call
`device->Read/WriteBlocks`, the device will allocate aligned buffer, and copy
data to/from it.

0 comments on commit 8d611c9

Please sign in to comment.