
Starvation scenario in a particular setup of pools and execution streams #277

Open
mdorier opened this issue Jul 30, 2024 · 5 comments

@mdorier
Contributor

mdorier commented Jul 30, 2024

I have narrowed down a deadlock issue to the reproducer shown hereafter.

The scenario is as follows: I initialize margo with 2 pools ("__primary__" and "rpc"), and one execution stream ("__primary__") associated with these two pools in that order.

The process then sends an RPC to itself, and hangs indefinitely. Adding print statements shows that the last call issued is margo_forward. The RPC handler doesn't execute.

Attaching to the process with GDB shows that it is blocked in HG_Progress.

I initially thought it was because once we send the RPC there are no runnable ULTs in the progress pool apart from the progress ULT, so HG_Progress would block, but actually @carns confirmed that it would only block until a certain timeout (by default 100ms).

So what I think actually happens is that the progress ULT is never "blocked" from an Argobots perspective: it hard-blocks periodically inside HG_Progress, but never on an Argobots mutex, for example, so technically the progress pool is never empty from the point of view of the xstream's scheduler. As a consequence, the scheduler never attempts to pull ULTs from the "rpc" pool, causing starvation.

Note that this would not be specific to running RPC ULTs: whatever ULT is submitted to the "rpc" pool will never run in this setup.
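To make the starvation mechanism concrete, here is a rough sketch of the main loop of a priority-based scheduler in the spirit of Argobots' basic_wait (this is an illustration, not the actual Argobots implementation): after running a unit, the loop always restarts from the highest-priority pool, so as long as that pool keeps handing back a runnable ULT (here, the progress ULT), the lower-priority pools are never consulted.

#include <stdlib.h>
#include <abt.h>

/* Illustrative only: a simplified priority scheduler loop, not the real
 * basic_wait code. pools[0] has the highest priority. */
static void sketch_sched_run(ABT_sched sched)
{
    int num_pools = 0;
    ABT_sched_get_num_pools(sched, &num_pools);
    ABT_pool* pools = (ABT_pool*)malloc(num_pools * sizeof(ABT_pool));
    ABT_sched_get_pools(sched, num_pools, 0, pools);

    while (1) {
        /* Always scan the pools in priority order, starting from pools[0]. */
        for (int i = 0; i < num_pools; i++) {
            ABT_unit unit = ABT_UNIT_NULL;
            ABT_pool_pop(pools[i], &unit);
            if (unit != ABT_UNIT_NULL) {
                ABT_xstream_run_unit(unit, pools[i]);
                /* Restart from pools[0]: if the progress ULT is always
                 * runnable there, pools[1..] are never reached. */
                break;
            }
        }
        ABT_bool stop = ABT_FALSE;
        ABT_sched_has_to_stop(sched, &stop);
        if (stop == ABT_TRUE) break;
        ABT_xstream_check_events(sched);
    }
    free(pools);
}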

I wonder whether this is something we should try to fix, though I have no idea how we would do it. The configuration seems simple enough and pretty reasonable: it effectively means "prioritize network progress, and execute ULTs in the other pool only if there is really nothing else to do, including no network progress to make". This could be useful for genuinely low-priority ULTs such as periodic diagnostics, etc.

#include <margo.h>

static void my_rpc(hg_handle_t h);
static DECLARE_MARGO_RPC_HANDLER(my_rpc)

int main(int argc, char** argv) {

    const char* config = "{\n"
        "\"argobots\": {"
            "\"pools\": ["
                "{\"name\":\"__primary__\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"rpc\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"}"
            "],"
            "\"xstreams\": ["
                "{\"name\":\"__primary__\","
                 "\"scheduler\":{"
                     "\"pools\":[\"__primary__\",\"rpc\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "}"
            "]"
        "},"
        "\"progress_pool\":\"__primary__\","
        "\"rpc_pool\":\"rpc\""
    "}";

    struct margo_init_info info = MARGO_INIT_INFO_INITIALIZER;
    info.json_config = config;
    margo_instance_id mid = margo_init_ext("na+sm", MARGO_SERVER_MODE, &info);

    hg_id_t rpc_id = MARGO_REGISTER(mid, "my_rpc", void, void, my_rpc);

    hg_addr_t addr = HG_ADDR_NULL;
    margo_addr_self(mid, &addr);

    hg_handle_t handle = HG_HANDLE_NULL;
    margo_create(mid, addr, rpc_id, &handle);

    margo_forward(handle, NULL); /* hangs here: the RPC handler never runs */

    margo_destroy(handle);

    margo_addr_free(mid, addr);
    margo_finalize(mid);
    return 0;
}

static void my_rpc(hg_handle_t h)
{
    margo_respond(h, NULL);
}
static DEFINE_MARGO_RPC_HANDLER(my_rpc)
@mdorier
Contributor Author

mdorier commented Jul 30, 2024

Possible solutions:

  • Replace "progress_pool" with "progress_xstream" in the JSON configuration. Whatever xstream is referenced, its last pool would be used for the progress ULT. The downsides of this solution are (1) it is an API-breaking change to the configuration format and (2) we cannot actually guarantee that the last pool of the progress xstream isn't used by another xstream as a non-last pool. So this solution actually sucks.
  • Detect, at initialization time (and when dynamically creating new xstreams), whether the progress pool is used by an xstream without being listed last. If so, return an error. Simple (see the sketch after this list).
  • Make Margo versions of the Argobots schedulers (basic, basic_wait, and randws) that can detect the progress ULT and look in lower-priority pools before actually running it. This should be paired with a more elaborate mechanism to properly set the timeout passed to HG_Progress. Complicated; I think the second solution is better.
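A minimal sketch of the check in the second option (check_progress_pool_is_last is a hypothetical helper, not part of Margo's API): for each xstream's scheduler pool list, the progress pool may only appear as the last entry.

#include <string.h>

/* Hypothetical helper: returns 0 if the configuration is acceptable,
 * -1 if the progress pool appears in an xstream's pool list at any
 * position other than the last one. */
static int check_progress_pool_is_last(const char*  progress_pool,
                                       const char** xstream_pools,
                                       int          num_pools)
{
    for (int i = 0; i < num_pools; i++) {
        if (strcmp(xstream_pools[i], progress_pool) == 0
            && i != num_pools - 1) {
            /* Any pool listed after the progress pool would be starved. */
            return -1;
        }
    }
    return 0;
}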

@carns
Member

carns commented Jul 30, 2024

I agree with the assessment and also think the 2nd option would be the best. I don't think it's ever going to be a "good" configuration to set up a pool so that it can only execute work when work stealing occurs from the network progress loop; it's not intuitive that this would guarantee progress. Not worth the development time for a complex solution at any rate.

This is a good catch in terms of figuring out what configurations we want to allow.

@mdorier
Contributor Author

mdorier commented Jul 30, 2024

Right now I'm testing a configuration that should be working fine, yet it's deadlocking as well. I looked at the Argobots implementation of schedulers and found this:

https://github.com/pmodels/argobots/blob/main/src/sched/basic_wait.c#L217

This function is called to re-order the pools so that private pools come first and pools with broader access come after. All my pools are "mpmc", so you would think the order doesn't matter... except that qsort is NOT stable, so right there the pools may change order. I'll try to write a reproducer to see whether that's the problem in my case.
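To illustrate the concern (using a made-up pool_desc type, not the real Argobots structures): when elements compare equal, the C standard gives qsort no stability guarantee, so the resulting order of the pools, and therefore their priority, is unspecified.

#include <stdlib.h>

struct pool_desc { const char* name; int access; };

/* Comparator in the spirit of the one linked above: compare by access
 * mode only. Two "mpmc" pools compare equal. */
static int cmp_access(const void* a, const void* b)
{
    const struct pool_desc* pa = a;
    const struct pool_desc* pb = b;
    return pa->access - pb->access;
}

int main(void)
{
    struct pool_desc pools[] = { { "__primary__", 2 }, { "rpc", 2 } };
    qsort(pools, 2, sizeof(pools[0]), cmp_access);
    /* pools[0] may now be either "__primary__" or "rpc": qsort is not
     * required to preserve the original order of equal elements. */
    return 0;
}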

@mdorier
Contributor Author

mdorier commented Jul 30, 2024

@carns I have another scenario that deadlocks, even though the progress pool is the last one of its execution stream. It is caused by the fact that we don't account for the execution stream having other pools it could pull from; not only that, but we never give it an opportunity to context-switch out of the progress loop.

#include <stdio.h>
#include <margo.h>

void my_func(void* arg) {
    (void)arg;
    fprintf(stderr, "in ULT\n");
}

int main(int argc, char** argv) {

    const char* config = "{\n"
        "\"argobots\": {"
            "\"pools\": ["
                "{\"name\":\"__primary__\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"p1\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"p2\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"}"
            "],"
            "\"xstreams\": ["
                "{\"name\":\"__primary__\","
                 "\"scheduler\":{"
                     "\"pools\":[\"__primary__\",\"p1\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "},"
                "{\"name\":\"es1\","
                 "\"scheduler\":{"
                     "\"pools\":[\"p2\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "}"
            "]"
        "},"
        "\"progress_pool\":\"p1\","
        "\"rpc_pool\":\"p1\""
    "}";

    struct margo_init_info info = MARGO_INIT_INFO_INITIALIZER;
    info.json_config = config;
    margo_instance_id mid = margo_init_ext("na+sm", MARGO_SERVER_MODE, &info);

    struct margo_pool_info p2 = {0};
    margo_find_pool_by_name(mid, "p2", &p2);

    ABT_thread ult;
    fprintf(stderr, "Before creating ULT\n");
    ABT_thread_create(p2.pool, my_func, NULL, ABT_THREAD_ATTR_NULL, &ult);
    fprintf(stderr, "Before joining\n");
    ABT_thread_join(ult); /* blocks indefinitely */
    fprintf(stderr, "After joining\n");
    ABT_thread_free(&ult);

    margo_finalize(mid);
    return 0;
}

This time we have 2 xstreams:

  • __primary__ xstream uses pools __primary__ and p1
  • es1 xstream uses pool p2

The progress pool is set to p1, which is the last pool of the primary xstream. In other words, this configuration says "execute whatever is in the __primary__ pool first, and do network progress only if there is nothing left to run in the __primary__ pool".

Then the main function, which runs on the primary pool, creates a ULT and submits it to pool p2. p2 is associated with es1, which only has p2 to pull from, so no problem there: the ULT is created and will run. However, the ABT_thread_join call blocks indefinitely.

My assumption is that it's again a scheduler not getting back to pulling from its first pool. When the main ULT calls join, it context-switches to the primary ES's scheduler. The scheduler sees that all the ULTs on its first pool are blocked, so it switches to taking from its second pool, p1, and runs the progress loop ULT. From there, it isn't given a chance to check the primary pool again.

At the first check in Margo's progress loop we get a size of 0, so we don't yield on the next line; at the second check, pending is 0 and size is 0, so we don't enter the condition and don't yield there either. In other words, we never give the progress loop an opportunity to yield back to the scheduler.

I think we need to yield at some point no matter what, otherwise in any configuration in which the progress pool shares its ES with another pool, we could starve that other pool for as long as Margo has nothing more to do than run the progress loop.
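As a rough sketch of that idea (this is not Margo's actual progress function; hg_context and shutdown_flag are placeholders), the progress loop would yield on every iteration regardless of the pool-size and pending checks:

#include <abt.h>
#include <mercury.h>

/* Sketch only: always yield back to the scheduler after each progress
 * iteration, so that other pools sharing the same xstream get a turn. */
static void sketch_progress_loop(hg_context_t* hg_context, const int* shutdown_flag)
{
    while (!*shutdown_flag) {
        unsigned int count = 0;
        HG_Trigger(hg_context, 0, 1, &count); /* run completed callbacks, no blocking */
        HG_Progress(hg_context, 100);         /* hard-block in Mercury for at most 100 ms */
        ABT_thread_yield();                   /* unconditional yield back to the scheduler */
    }
}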

@mdorier
Contributor Author

mdorier commented Jul 31, 2024

The code above is fixed by #278
