Question about Multi-Processing Training #129

Closed
Shua-Kang opened this issue Sep 17, 2024 · 4 comments

@Shua-Kang

Hi, thank you for your great work!

I would like to know whether BenchMARL supports multi-CPU/GPU training. Similar libraries such as MARLlib use Ray for parallel training, but I did not find such support in BenchMARL. If not, is there any plan to implement multi-process training in the future?

@matteobettini
Collaborator

Hello,

Thanks for reaching out!

When you say multiple devices for training, do you mean for collection or for the actual gradient updates?

When it comes to collection, it should be straightforward to allow collecting from multiple processes for non-vectorized environments. It might be as simple as changing the SerialEnv in this line

SerialEnv(self.config.n_envs_per_worker(self.on_policy), env_func),
to ParallelEnv. Of course, this is not very useful for vectorized environments like VMAS, since for those it is faster to use a huge batch size on one GPU.
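For illustration, here is a minimal sketch of that swap, assuming torchrl's batched-environment API (the GymEnv factory below is just a stand-in for any non-vectorized environment):

```python
# Minimal sketch of the suggested change, assuming torchrl's batched-env API.
# The GymEnv factory is only a stand-in for any non-vectorized environment.
from torchrl.envs import GymEnv, ParallelEnv, SerialEnv

def make_env():
    return GymEnv("Pendulum-v1")

n_envs = 10

# Current behaviour: sub-environments are stepped sequentially in one process.
env = SerialEnv(n_envs, make_env)

# Multi-process alternative: each sub-environment runs in its own worker process.
env = ParallelEnv(n_envs, make_env)
```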

When it comes to training, I have never thought about it. The way I remember it, RLlib does it with multiple collection workers that feed into one trainer. We could envision splitting the gradient update across different devices, but until now I have not seen a use case for it. Happy to think about it; how does RLlib do it?

@Shua-Kang
Author

Thank you for your reply!

I'm referring to both aspects. It seems that to adjust this setting from the command-line arguments, I can modify this line:

on_policy_n_envs_per_worker: 10

by using --experiment.on_policy_n_envs_per_worker=20.
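If I understand correctly, the same setting could also be changed from the Python API; a hedged sketch, assuming ExperimentConfig exposes the YAML key as an attribute of the loaded config:

```python
# Hedged sketch: assumes BenchMARL's ExperimentConfig exposes the YAML key
# on_policy_n_envs_per_worker as an attribute of the loaded config.
from benchmarl.experiment import ExperimentConfig

experiment_config = ExperimentConfig.get_from_yaml()  # load the YAML defaults
experiment_config.on_policy_n_envs_per_worker = 20    # same effect as the CLI override
```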

I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training; I just saw that MARLlib lets you set the number of GPUs to use.
https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30

It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely the data collection process.

@matteobettini
Collaborator

matteobettini commented Sep 18, 2024

I see

> by using --experiment.on_policy_n_envs_per_worker=20.

Yes, by changing that you can use more workers for collection.
With VMAS and other vectorized envs, these will be the environments in the batch.
With normal environments, these will be workers that collect serially in the same process.

To enable multi-process collection, we just have to allow users to change SerialEnv in the snippet I linked above to ParallelEnv. This is a change we can make, and it has been on the to-do list (#94) for a while.
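Purely as an illustration of how this could be exposed (the use_parallel_collection flag below is invented, not an existing BenchMARL option):

```python
# Hypothetical illustration only: use_parallel_collection is an invented flag,
# not an existing BenchMARL option (the real change is tracked in #94).
from torchrl.envs import EnvBase, ParallelEnv, SerialEnv

def make_batched_env(n_envs: int, env_func, use_parallel_collection: bool) -> EnvBase:
    # Pick the multi-process class when requested, otherwise keep today's behaviour.
    env_cls = ParallelEnv if use_parallel_collection else SerialEnv
    return env_cls(n_envs, env_func)
```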

> I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training; I just saw that MARLlib lets you set the number of GPUs to use. https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30
>
> It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely the data collection process.

num_gpus in Ray dictates the total number of GPUs (for training and collection). It is unclear to me what Ray does when multiple GPUs are allocated just to training.

Here https://docs.ray.io/en/latest/rllib/rllib-training.html#specifying-resources it says

> num_gpus – Number of GPUs to allocate to the algorithm process. Note that not all algorithms can take advantage of GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).

and here https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners.html it says

> num_gpus_per_learner – Number of GPUs allocated per Learner worker. If num_learners=0, any value greater than 0 runs the training on a single GPU on the main process, while a value of 0 runs the training on main process CPUs. If num_gpus_per_learner is > 0, then you shouldn't change num_cpus_per_learner (from its default value of 1).

If you ask me to interpret this, I would say that Ray only ever uses at most 1 GPU for training, but I might be wrong.
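For reference, a hedged sketch of how those knobs appear on RLlib's config object, based only on the docs quoted above (I have not verified what happens with more than one learner GPU):

```python
# Hedged sketch based on the RLlib docs quoted above; how PPO behaves with more
# than one learner GPU is exactly the open question discussed here.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Learner (training) resources, per the AlgorithmConfig.learners docs:
    # two learner workers, each allocated one GPU.
    .learners(num_learners=2, num_gpus_per_learner=1)
)
```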

Anyway, for BenchMARL, I do not envision multi-process training as of yet, but collecting in multiple processes is definitely possible and will be implemented.

@Shua-Kang
Author

Thank you for your reply and explanation!

I am currently designing a new multi-agent environment. I am integrating it with PettingZoo and then testing different algorithms in BenchMARL. After my environment is finished, I think I will also contribute it to BenchMARL. Hope that won't take much time. :D

Thank you again for this great work. I have seen that a lot of other multi-agent libraries are no longer maintained; I believe more and more people will benefit from your work.
