Multi-core Controller and Rackscale State Cleanup #326
This PR simplifies the RPC library code, uses fine-grained locks (mutexes) in the controller and client data structures, and makes the controller multi-threaded.
Previously, even single-core RPCs suffered latency problems because the controller was single-core. Even when using the shmem RPC transport between the client and the controller, the controller had to poll the ethernet interface between RPC calls (since it uses TCP to communicate with DCM), which reduced RPC throughput.
Now, the controller uses n + 1 cores, where n is the number of clients. The extra core polls the ethernet interface and checks for RPCs sent from DCM. To make this work, I had to re-architect how resource responses are received from the DCM scheduler.
Previously:
The controller held a global DCM client and RPC server (and DCM likewise has an RPC server and an RPC client).
The controller's DCM client was used to send resource allocation and release requests. Release requests are handled immediately, but allocation requests return with just an allocation ID.
When the DCM scheduler then actually solves the allocation request, it sends the assigned node as an RPC to the controller's RPC server. In the single-threaded controller, it could safely be assumed that the RPC server would receive responses only for a specific set of allocation requests, since there could only be one outstanding request at a time. Once the controller became multi-core, this was no longer true.
Now:
Core 0 on the controller has a local copy of the DCM RPC server, while the DCM RPC client is still global. When a client needs to make a request of DCM, it sends an RPC to the controller, which is handled by core n, where n corresponds to the client number. On response, the client then inserts the allocation IDs into a hashmap with a pointer to an atomic bool set to false. The client then polls on the atomic bool (as opposed to the hash table).
Core 0 of the controller loops, polling the ethernet interface and checking for RPCs from DCM. When it receives an allocation assignment, it updates the hashmap entry for the allocation ID with the assigned node ID, and then sets the atomic bool to true. In this way, the controller can handle multiple asynchronous resource allocation requests to DCM at a time.
Even for a single core, this results in a 2x improvement in RPC throughput.
General global state in rackscale, to be reviewed for scalability/suitability, is in the following files: