-
@btovar Do you have a transaction log of the NDCMS run? Maybe it can give me some insights into the workflow.
-
Yes, sending by other means.
-
Also, I was thinking about your idea of sending N tasks and killing those that exhaust their resources. The problem with that approach is that the resources are not really partitioned in the machine. Say I have tasks that use 2 cores, but I don't know that, and I have a worker with 24 cores. If I send 24 tasks to the worker, then the resource monitor is going to tell me that the tasks used less than 2 cores, and if I'm really unlucky, it's going to tell me they are fine using 1 core. Any stats from that run are suspect, because the wall time is going to be at least twice what it should be (and even more given all the context switches required).
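To make the numbers concrete, here is a minimal sketch of why the monitored usage is misleading on an oversubscribed worker. The figures are hypothetical, not measurements from the NDCMS run:

```python
# Hypothetical numbers: 24 tasks that each really need 2 cores,
# all packed onto a single 24-core worker at once.

worker_cores = 24
tasks = 24
cores_needed_per_task = 2      # true requirement, unknown to the scheduler

# With every task running at once, each task only gets its fair share:
effective_cores_per_task = worker_cores / tasks          # 1.0 core

# The resource monitor reports what the task *used*, not what it needed,
# so it sees ~1 core per task and the measurement looks "fine".
reported_cores = min(cores_needed_per_task, effective_cores_per_task)

# Wall time inflates by at least the oversubscription factor
# (ignoring the extra cost of the context switches):
slowdown = cores_needed_per_task / effective_cores_per_task

print(f"reported cores per task: {reported_cores}")    # 1.0
print(f"wall-time inflation:     >= {slowdown}x")       # >= 2.0x
```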
-
I just found this paper. Basically, the worker keeps the maximum resource consumption of a task over a time window (the most recent 5 minutes, for example) and readjusts that task's resource limit on the fly. Sounds pretty promising; we'll only have to be careful about when spikes happen.
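As a rough illustration of that idea (not the paper's actual algorithm), a sliding-window peak tracker could look like the sketch below; the window length and headroom factor are assumptions:

```python
import time
from collections import deque

class SlidingPeak:
    """Track peak resource usage over a sliding time window."""

    def __init__(self, window_seconds=300):          # e.g. most recent 5 minutes
        self.window = window_seconds
        self.samples = deque()                       # (timestamp, usage) pairs

    def record(self, usage, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, usage))
        # Drop samples that fell out of the window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def peak(self):
        return max((u for _, u in self.samples), default=0)

# Example: readjust a memory limit on the fly, with 25% headroom for spikes.
mem = SlidingPeak(window_seconds=300)
for t, usage_mb in [(0, 800), (60, 1200), (120, 950)]:   # hypothetical samples
    mem.record(usage_mb, now=t)
new_limit_mb = int(mem.peak() * 1.25)                    # 1500
print(new_limit_mb)
```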
-
**Motivation**
Tasks in topcoffea may run for 20 seconds or for more than 20 minutes. Tasks are usually submitted clustered together: for example, in the default NDCMS run that processes all the needed data, all the short-running tasks are submitted first, and then all the long-running tasks. This causes runs to finish slowly as the long tail of tasks completes.
**Solutions that are not likely to work**
One immediate thought may be to separate these tasks into two categories. However, in this particular case, such a change does not improve performance, and is in fact counterproductive, because:
Another solution would be to randomize the waiting list of tasks. This helps pack the tasks better (short- and long-running together), but reduces throughput, as some long-running tasks will occupy whole workers from the start.
**Proposed solution**
When a task is running using the whole worker, it may receive an updated maximum resource allocation to use. This update is constructed from tasks that finished after the target task was submitted. The update also uses a uniform partition of the worker's resources (i.e., allocating half the cores means allocating half the memory). With this change, a resource allocation can always be revised down, and revised up only if the worker has room for it. Thus, more tasks can be fitted into a worker as soon as we know something about the size of the tasks.
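A minimal sketch of what such an update could look like, assuming a uniform partition of cores, memory, and disk; the function names and worker numbers are illustrative and not the existing Work Queue API:

```python
def uniform_allocation(worker, cores):
    """Scale memory and disk proportionally to the requested cores."""
    fraction = cores / worker["cores"]
    return {
        "cores":  cores,
        "memory": int(worker["memory"] * fraction),
        "disk":   int(worker["disk"]   * fraction),
    }

def revised_allocation(current, proposed, cores_free):
    """Revise down unconditionally; revise up only if the worker has room."""
    if proposed["cores"] <= current["cores"]:
        return proposed
    if proposed["cores"] - current["cores"] <= cores_free:
        return proposed
    return current

worker = {"cores": 24, "memory": 48000, "disk": 200000}   # MB, hypothetical
current  = uniform_allocation(worker, 24)   # task started with the whole worker
proposed = uniform_allocation(worker, 4)    # finished tasks suggest 4 cores is enough
print(revised_allocation(current, proposed, cores_free=0))
# -> {'cores': 4, 'memory': 8000, 'disk': 33333}
```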
For the particular case of topEFT, this change by itself won't help. It must be combined with some other strategy that allows for better packing, such as randomizing the waiting list.
**Necessary changes**
`resource_monitor`: allow for resource limit updates while a task runs (via files? sockets?).
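One possible shape of the file-based variant, purely as an illustration; the file name, JSON format, and polling scheme are assumptions, not the existing resource_monitor interface:

```python
import json
import os

LIMITS_FILE = "limits.json"          # hypothetical path shared with the monitor

def write_limits(cores, memory_mb, disk_mb):
    """Worker side: publish a new limit atomically via rename."""
    tmp = LIMITS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"cores": cores, "memory": memory_mb, "disk": disk_mb}, f)
    os.replace(tmp, LIMITS_FILE)     # atomic on POSIX, so readers never see a partial file

def read_limits(last_mtime):
    """Monitor side: re-read the file only when its modification time changes."""
    try:
        mtime = os.path.getmtime(LIMITS_FILE)
    except FileNotFoundError:
        return None, last_mtime
    if mtime == last_mtime:
        return None, last_mtime
    with open(LIMITS_FILE) as f:
        return json.load(f), mtime
```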