Question about Multi-Processing Training #129

Closed
Shua-Kang opened this issue Sep 17, 2024 · 4 comments

@Shua-Kang

Hi, thank you for your great work!

I would like to know whether BenchMARL supports multi-CPU/GPU training. Similar libraries such as MARLlib use Ray for parallel training, but I did not find such support in BenchMARL. If not, is there any plan to implement multi-process training in the future?

@matteobettini
Collaborator

Hello,

Thanks for reaching out!

When you say multiple devices for training, do you mean for collection or for the actual gradient updates?

When it comes to collection, it should be straightforward to allow collecting from multiple processes for non-vectorized environments. It might be as simple as changing the SerialEnv in this line

SerialEnv(self.config.n_envs_per_worker(self.on_policy), env_func),
to ParallelEnv. Of course, this is not very useful for vectorized environments like VMAS, since for those it is faster to use a huge batch size on one GPU.
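For illustration, here is a minimal sketch of that swap, assuming torchrl's batched-environment API (the GymEnv factory below is just a stand-in for any non-vectorized environment):

```python
# Minimal sketch of the suggested change, assuming torchrl's batched-env API.
# The GymEnv factory is only a stand-in for any non-vectorized environment.
from torchrl.envs import GymEnv, ParallelEnv, SerialEnv

def make_env():
    return GymEnv("Pendulum-v1")

n_envs = 10

# Current behaviour: sub-environments are stepped sequentially in one process.
env = SerialEnv(n_envs, make_env)

# Multi-process alternative: each sub-environment runs in its own worker process.
env = ParallelEnv(n_envs, make_env)
```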

When it comes to training, I have never thought about it. The way I remember it, RLlib does it with multiple collection workers that feed into one trainer. We could envision splitting the gradient update across different devices, but until now I have not seen a use case for it. Happy to think about it; how does RLlib do it?

@Shua-Kang
Author

Thank you for your reply!

I'm referring to both aspects. It seems that to adjust this setting from the command-line arguments, I can modify this line:

on_policy_n_envs_per_worker: 10

by using --experiment.on_policy_n_envs_per_worker=20.
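If I understand correctly, the same setting could also be changed from the Python API; a hedged sketch, assuming ExperimentConfig exposes the YAML key as an attribute of the loaded config:

```python
# Hedged sketch: assumes BenchMARL's ExperimentConfig exposes the YAML key
# on_policy_n_envs_per_worker as an attribute of the loaded config.
from benchmarl.experiment import ExperimentConfig

experiment_config = ExperimentConfig.get_from_yaml()  # load the YAML defaults
experiment_config.on_policy_n_envs_per_worker = 20    # same effect as the CLI override
```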

I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training; I just saw that MARLlib lets you set the number of GPUs to use.
https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30

It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely the data collection process.

@matteobettini
Collaborator

matteobettini commented Sep 18, 2024

I see

> by using --experiment.on_policy_n_envs_per_worker=20.

Yes, by changing that you can use more workers for collection.
With VMAS and other vectorized envs, these will be the environments in the batch.
With normal environments, these will be workers that collect serially in the same process.

To enable multi-process collection, we just have to allow users to change SerialEnv in the snippet I linked above to ParallelEnv. This is a change we can make, and it has been on the to-do list (#94) for a while.
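Purely as an illustration of how this could be exposed (the use_parallel_collection flag below is invented, not an existing BenchMARL option):

```python
# Hypothetical illustration only: use_parallel_collection is an invented flag,
# not an existing BenchMARL option (the real change is tracked in #94).
from torchrl.envs import EnvBase, ParallelEnv, SerialEnv

def make_batched_env(n_envs: int, env_func, use_parallel_collection: bool) -> EnvBase:
    # Pick the multi-process class when requested, otherwise keep today's behaviour.
    env_cls = ParallelEnv if use_parallel_collection else SerialEnv
    return env_cls(n_envs, env_func)
```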

> I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training; I just saw that MARLlib lets you set the number of GPUs to use. https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30
>
> It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely the data collection process.

num_gpus in Ray dictates the total number of GPUs (for training and collection). It is unclear to me what Ray does when multiple GPUs are allocated just to training.

Here https://docs.ray.io/en/latest/rllib/rllib-training.html#specifying-resources it says

> num_gpus – Number of GPUs to allocate to the algorithm process. Note that not all algorithms can take advantage of GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).

and here https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners.html it says

> num_gpus_per_learner – Number of GPUs allocated per Learner worker. If num_learners=0, any value greater than 0 runs the training on a single GPU on the main process, while a value of 0 runs the training on main process CPUs. If num_gpus_per_learner is > 0, then you shouldn't change num_cpus_per_learner (from its default value of 1).

If you ask me to interpret this, I would say that Ray only ever uses at most 1 GPU for training, but I might be wrong.
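For reference, a hedged sketch of how those knobs appear on RLlib's config object, based only on the docs quoted above (I have not verified what happens with more than one learner GPU):

```python
# Hedged sketch based on the RLlib docs quoted above; how PPO behaves with more
# than one learner GPU is exactly the open question discussed here.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Learner (training) resources, per the AlgorithmConfig.learners docs:
    # two learner workers, each allocated one GPU.
    .learners(num_learners=2, num_gpus_per_learner=1)
)
```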

Anyway, for BenchMARL, I do not envision multi-process training as of yet, but collecting in multiple processes is definitely possible and will be implemented.

@Shua-Kang
Author

Thank you for your reply and explanation!

I am currently designing a new multi-agent environment. I am integrating it with PettingZoo and then testing different algorithms in BenchMARL. After my environment is finished, I think I will also contribute it to BenchMARL. Hope that won't take much time. :D

Thank you again for this great work. I have seen that a lot of other multi-agent libraries are no longer maintained; I believe more and more people will benefit from your work.
