Proximal Policy Optimization with Guided Random Search

tldr; Proximal Policy Optimization (PPO) followed by guided search using the action probabilities of the PPO-Model

Authors: Nicolas Fröhlich, Robin Schmöcker, Yannik Mahlau

Publication: Not Available

Results: Geometric Mean: 1.070, results

CompilerGym version: 0.2.1

Is the approach Open Source?: Yes

The source code is available as Open-Source: https://github.com/xtremey/ppo_compiler_gym/tree/72c08d86e4c96ed1e4579e5a00b9377e556609ab

Did you modify the CompilerGym source code?: No (apart from state space wrappers)

What parameters does the approach have?

Hyperparameter	Value
Number of Epochs	80
Epsilon Clip	0.1
Mean Square Error Factor	0.5
Entropy Factor	0.01
Learning Rate	0.0005
Trajectories until Update	20
Hidden Layer Size	128
Activation Function	TanH
Number of Layers	4
Shared Parameter Layers	First 3
Optimizer	Adam

What range of values were considered for the above parameters?

We experimented a little bit with the hyperparameters of PPO, but the results did not change drastically. Therefore, we did not perform any Hyperparameter Optimization.

Is the policy deterministic?: No

Description

Our Model uses the Proximal Policy Optimization (PPO) Architecture: https://arxiv.org/abs/1707.06347

Training

We used a wrapper to extend the state space such that the number of remaining steps in an additional entry in the state space (as a number, not one hot encoded). During training, we limited the number of steps per episode to 200. We trained the PPO Model on all benchmarks except ghostscript. Therefore, the model is not tested on unseen programs, except ghostscript. The performance of ghostscript indicates potential for generalization to unseen programs as it is better than the baseline compiler and most other techniques on the leaderboard.

We trained for 10 hours on a consumer computer and evaluated performance every half hour using only cpu in a docker on windows. The peak performance was reached after 7 hours. The exact specification is: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz 3.70 GHz, 32GB RAM

Guided Search

In a second step we use the action probabilities of the model to perform a guided random search (also for 200 steps). We limited the search time to one minute for each environment.

In a third step we optimized the best trajectory found during random search by taking 500 additional steps using the models action probabilities. This did not yield improvement for all environments, but sometimes improved solution a little with basically no computational cost. Therefore, the maximum possible length of a trajectory is 700. However, most trajectories are much shorter.

We excluded the ghostscript benchmark during training since it took a lot of computation and presented itself as a bottleneck. Additionally, we excluded the random search and additional steps for this benchmark since it did not yield any improvement and drastically increased the mean walltime.

Reproduction

The following commands can be used to train the PPO Model. To exactly reproduce our experiment, ghostscript has to be excluded from the benchmark file.

benchmarks = []
f = open("cbench-v1.txt", "r")
for line in f:
    benchmarks.append(line.strip())
f.close()

env = llvm_wrapper(benchmarks, max_episode_steps=200, steps_in_observation=True)

ppo_training = PPO(env)
ppo_training.train(log_progress=True, training_time=60 * 60 * 10, progress_log_rate=60 * 30)
Evaluation.evaluate(benchmarks, "default", additional_steps_for_max=500, max_trials_per_benchmark=100000,
                    max_time_per_benchmark=60 * 1)

Credit

Credit to nikhilbarhate99 (https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py). Parts of the rollout buffer and the update method are taken from this repo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Proximal Policy Optimization with Guided Random Search

Description

Training

Guided Search

Reproduction

Credit

Files

README.md

Latest commit

History

README.md

File metadata and controls

Proximal Policy Optimization with Guided Random Search

Description

Training

Guided Search

Reproduction

Credit