
Starvation scenario in a particular setup of pools and execution streams #277

Open
mdorier opened this issue Jul 30, 2024 · 5 comments

@mdorier
Contributor

mdorier commented Jul 30, 2024

I have narrowed down a deadlock issue to the reproducer shown hereafter.

The scenario is as follows: I initialize margo with 2 pools ("__primary__" and "rpc"), and one execution stream ("__primary__") associated with these two pools in that order.

The process then sends an RPC to itself, and hangs indefinitely. Adding print statements shows that the last call issued is margo_forward. The RPC handler doesn't execute.

Attaching to the process with GDB shows that it is blocked in HG_Progress.

I initially thought it was because once we send the RPC there are no runnable ULTs in the progress pool apart from the progress ULT, so HG_Progress would block, but actually @carns confirmed that it would only block until a certain timeout (by default 100ms).

So what I think actually happens is that the progress ULT is never "blocked" from an Argobots perspective: it hard-blocks periodically inside HG_Progress, but never on an Argobots mutex, for example, so technically the progress pool is never empty from the point of view of the xstream's scheduler. As a consequence, the scheduler never attempts to pull ULTs from the "rpc" pool, causing starvation.

Note that this would not be specific to running RPC ULTs: whatever ULT is submitted to the "rpc" pool will never run in this setup.
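To make the starvation mechanism concrete, here is a rough sketch of the main loop of a priority-based scheduler in the spirit of Argobots' basic_wait (this is an illustration, not the actual Argobots implementation): after running a unit, the loop always restarts from the highest-priority pool, so as long as that pool keeps handing back a runnable ULT (here, the progress ULT), the lower-priority pools are never consulted.

#include <stdlib.h>
#include <abt.h>

/* Illustrative only: a simplified priority scheduler loop, not the real
 * basic_wait code. pools[0] has the highest priority. */
static void sketch_sched_run(ABT_sched sched)
{
    int num_pools = 0;
    ABT_sched_get_num_pools(sched, &num_pools);
    ABT_pool* pools = (ABT_pool*)malloc(num_pools * sizeof(ABT_pool));
    ABT_sched_get_pools(sched, num_pools, 0, pools);

    while (1) {
        /* Always scan the pools in priority order, starting from pools[0]. */
        for (int i = 0; i < num_pools; i++) {
            ABT_unit unit = ABT_UNIT_NULL;
            ABT_pool_pop(pools[i], &unit);
            if (unit != ABT_UNIT_NULL) {
                ABT_xstream_run_unit(unit, pools[i]);
                /* Restart from pools[0]: if the progress ULT is always
                 * runnable there, pools[1..] are never reached. */
                break;
            }
        }
        ABT_bool stop = ABT_FALSE;
        ABT_sched_has_to_stop(sched, &stop);
        if (stop == ABT_TRUE) break;
        ABT_xstream_check_events(sched);
    }
    free(pools);
}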

I wonder whether this is something we should try to fix, though I have no idea how we would do it. The configuration seems simple enough and pretty reasonable: it effectively means "prioritize network progress, and execute ULTs in the other pool only if there is really nothing else to do, including no network progress to make". This could be useful for genuinely low-priority ULTs such as periodic diagnostics, etc.

#include <margo.h>

static void my_rpc(hg_handle_t h);
static DECLARE_MARGO_RPC_HANDLER(my_rpc)

int main(int argc, char** argv) {

    const char* config = "{\n"
        "\"argobots\": {"
            "\"pools\": ["
                "{\"name\":\"__primary__\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"rpc\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"}"
            "],"
            "\"xstreams\": ["
                "{\"name\":\"__primary__\","
                 "\"scheduler\":{"
                     "\"pools\":[\"__primary__\",\"rpc\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "}"
            "]"
        "},"
        "\"progress_pool\":\"__primary__\","
        "\"rpc_pool\":\"rpc\""
    "}";

    struct margo_init_info info = MARGO_INIT_INFO_INITIALIZER;
    info.json_config = config;
    margo_instance_id mid = margo_init_ext("na+sm", MARGO_SERVER_MODE, &info);

    hg_id_t rpc_id = MARGO_REGISTER(mid, "my_rpc", void, void, my_rpc);

    hg_addr_t addr = HG_ADDR_NULL;
    margo_addr_self(mid, &addr);

    hg_handle_t handle = HG_HANDLE_NULL;
    margo_create(mid, addr, rpc_id, &handle);

    margo_forward(handle, NULL); /* hangs here: the RPC handler never runs */

    margo_destroy(handle);

    margo_addr_free(mid, addr);
    margo_finalize(mid);
    return 0;
}

static void my_rpc(hg_handle_t h)
{
    margo_respond(h, NULL);
}
static DEFINE_MARGO_RPC_HANDLER(my_rpc)
@mdorier
Contributor Author

mdorier commented Jul 30, 2024

Possible solutions:

  • Replace "progress_pool" with "progress_xstream" in the JSON configuration. Whatever xstream is referenced, its last pool would be used for the progress ULT. The downsides of this solution are (1) it is an API-breaking change to the configuration format and (2) we cannot actually guarantee that the last pool of the progress xstream isn't used by another xstream as a non-last pool. So this solution actually sucks.
  • Detect, at initialization time (and when dynamically creating new xstreams), whether the progress pool is used by an xstream without being listed last. If so, return an error. Simple (see the sketch after this list).
  • Make Margo versions of the Argobots schedulers (basic, basic_wait, and randws) that can detect the progress ULT and look in lower-priority pools before actually running it. This should be paired with a more elaborate mechanism to properly set the timeout passed to HG_Progress. Complicated; I think the second solution is better.
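A minimal sketch of the check in the second option (check_progress_pool_is_last is a hypothetical helper, not part of Margo's API): for each xstream's scheduler pool list, the progress pool may only appear as the last entry.

#include <string.h>

/* Hypothetical helper: returns 0 if the configuration is acceptable,
 * -1 if the progress pool appears in an xstream's pool list at any
 * position other than the last one. */
static int check_progress_pool_is_last(const char*  progress_pool,
                                       const char** xstream_pools,
                                       int          num_pools)
{
    for (int i = 0; i < num_pools; i++) {
        if (strcmp(xstream_pools[i], progress_pool) == 0
            && i != num_pools - 1) {
            /* Any pool listed after the progress pool would be starved. */
            return -1;
        }
    }
    return 0;
}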

@carns
Member

carns commented Jul 30, 2024

I agree with the assessment and also think the 2nd option would be the best. I don't think it's ever going to be a "good" configuration to set up a pool so that it can only execute work when work stealing occurs from the network progress loop; it's not intuitive that this would guarantee progress. Not worth the development time for a complex solution at any rate.

This is a good catch in terms of figuring out what configurations we want to allow.

@mdorier
Contributor Author

mdorier commented Jul 30, 2024

Right now I'm testing a configuration that should be working fine, yet it's deadlocking as well. I looked at the Argobots implementation of schedulers and found this:

https://github.com/pmodels/argobots/blob/main/src/sched/basic_wait.c#L217

This function is called to re-order the pools so that private pools come first and pools with broader access come after. All my pools are "mpmc", so you would think the order doesn't matter... except that qsort is NOT stable, so right there the pools may change order. I'll try to write a reproducer to see whether that's the problem in my case.
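To illustrate the concern (using a made-up pool_desc type, not the real Argobots structures): when elements compare equal, the C standard gives qsort no stability guarantee, so the resulting order of the pools, and therefore their priority, is unspecified.

#include <stdlib.h>

struct pool_desc { const char* name; int access; };

/* Comparator in the spirit of the one linked above: compare by access
 * mode only. Two "mpmc" pools compare equal. */
static int cmp_access(const void* a, const void* b)
{
    const struct pool_desc* pa = a;
    const struct pool_desc* pb = b;
    return pa->access - pb->access;
}

int main(void)
{
    struct pool_desc pools[] = { { "__primary__", 2 }, { "rpc", 2 } };
    qsort(pools, 2, sizeof(pools[0]), cmp_access);
    /* pools[0] may now be either "__primary__" or "rpc": qsort is not
     * required to preserve the original order of equal elements. */
    return 0;
}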

@mdorier
Contributor Author

mdorier commented Jul 30, 2024

@carns I have another scenario that deadlocks, even though the progress pool is the last one of its execution stream. It is caused by the fact that we don't account for the execution stream having other pools it could pull from; not only that, but we never give it an opportunity to context-switch out of the progress loop.

#include <stdio.h>
#include <margo.h>

void my_func(void* arg) {
    (void)arg;
    fprintf(stderr, "in ULT\n");
}

int main(int argc, char** argv) {

    const char* config = "{\n"
        "\"argobots\": {"
            "\"pools\": ["
                "{\"name\":\"__primary__\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"p1\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"},"
                "{\"name\":\"p2\",\"access\":\"mpmc\",\"kind\":\"fifo_wait\"}"
            "],"
            "\"xstreams\": ["
                "{\"name\":\"__primary__\","
                 "\"scheduler\":{"
                     "\"pools\":[\"__primary__\",\"p1\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "},"
                "{\"name\":\"es1\","
                 "\"scheduler\":{"
                     "\"pools\":[\"p2\"],"
                     "\"type\":\"basic_wait\""
                   "}"
                "}"
            "]"
        "},"
        "\"progress_pool\":\"p1\","
        "\"rpc_pool\":\"p1\""
    "}";

    struct margo_init_info info = MARGO_INIT_INFO_INITIALIZER;
    info.json_config = config;
    margo_instance_id mid = margo_init_ext("na+sm", MARGO_SERVER_MODE, &info);

    struct margo_pool_info p2 = {0};
    margo_find_pool_by_name(mid, "p2", &p2);

    ABT_thread ult;
    fprintf(stderr, "Before creating ULT\n");
    ABT_thread_create(p2.pool, my_func, NULL, ABT_THREAD_ATTR_NULL, &ult);
    fprintf(stderr, "Before joining\n");
    ABT_thread_join(ult); /* blocks indefinitely */
    fprintf(stderr, "After joining\n");
    ABT_thread_free(&ult);

    margo_finalize(mid);
    return 0;
}

This time we have 2 xstreams:

  • __primary__ xstream uses pools __primary__ and p1
  • es1 xstream uses pool p2

The progress pool is set to p1, which is the last pool of the primary xstream. In other words, this configuration says "execute whatever is in the __primary__ pool first, and do network progress only if there is nothing left to run in the __primary__ pool".

Then the main function, which runs on the primary pool, creates a ULT and submits it to pool p2. p2 is associated with es1, which only has p2 to pull from, so no problem there: the ULT is created and will run. However, the ABT_thread_join call blocks indefinitely.

My assumption is that it's again a scheduler not getting back to pulling from its first pool. When the main ULT calls join, it context-switches to the primary ES's scheduler. The scheduler sees that all the ULTs on its first pool are blocked, so it switches to taking from its second pool, p1, and runs the progress loop ULT. From there, it isn't given a chance to check the primary pool again.

At the first check in Margo's progress loop we get a size of 0, so we don't yield on the next line; at the second check, pending is 0 and size is 0, so we don't enter the condition and don't yield there either. In other words, we never give the progress loop an opportunity to yield back to the scheduler.

I think we need to yield at some point no matter what, otherwise in any configuration in which the progress pool shares its ES with another pool, we could starve that other pool for as long as Margo has nothing more to do than run the progress loop.
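As a rough sketch of that idea (this is not Margo's actual progress function; hg_context and shutdown_flag are placeholders), the progress loop would yield on every iteration regardless of the pool-size and pending checks:

#include <abt.h>
#include <mercury.h>

/* Sketch only: always yield back to the scheduler after each progress
 * iteration, so that other pools sharing the same xstream get a turn. */
static void sketch_progress_loop(hg_context_t* hg_context, const int* shutdown_flag)
{
    while (!*shutdown_flag) {
        unsigned int count = 0;
        HG_Trigger(hg_context, 0, 1, &count); /* run completed callbacks, no blocking */
        HG_Progress(hg_context, 100);         /* hard-block in Mercury for at most 100 ms */
        ABT_thread_yield();                   /* unconditional yield back to the scheduler */
    }
}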

@mdorier
Contributor Author

mdorier commented Jul 31, 2024

The code above is fixed by #278
