ddpg + her #7

stefanwanckel · 2021-04-13T12:46:39Z

I was wondering whether you have tried to train a model with ddpg + her.
I had some success training sac + her but with ddpg my arm "folds" itself by eventually setting q to joint limits.

If you have, maybe you could share some thoughts on it. Thanks in advance.

stefanwanckel · 2021-04-13T12:53:41Z

*Addendum: Even with only ddpg, the robot arm moves into said position and it is very difficult for him to change its position from there. I also implemented ddpg from scratch, corroborated it with "pendulum-v0" gym sample environment and tried it with my robot environment, but the result was similar. After around 10 optimization steps my cumulative reward begins to (slowly) diverge towards the negative.
Any insight would be much appreciated.

PierreExeter · 2021-04-13T19:34:38Z

Hi Stefan,

I did a quick check and I didn't encounter any problem in training a model with DDPG.
I'm not sure what you mean by "fold", can you attach a screenshot to illustrate?

Can you also give more details on your training environment (i.e. observation shape, reward function, action shape, fixed / random goal, fixed / moving goal, action space) and DDPG hyperparameters?

stefanwanckel · 2021-04-14T09:18:14Z

By folding I mean that the robot goes into a configuration where all joint angles are at max or at min and it cant get out of that configuration.

Here is a picture:

I think it is best to give you a link to my repository.
https://github.com/stefanwanckel/DRL/tree/main/Tryhard

I adapted my GymEnv structure from your repository, so it is pretty similar. I stopped tracking your repository though, so I wills stick with an older version.
I am using the train.py function provided by stable-baselines3-zoo to trian my models.

The init for the environment on the picture are look like this:

id='ur5e_reacher-v5',
entry_point='ur5e_env.envs.ur5e_env:Ur5eEnv',
max_episode_steps=2000,
kwargs={
'random_position' : False,
'random_orientation': False,
'moving_target': False,
'target_type': "sphere",
'goal_oriented' : True,
'obs_type' : 1,
'reward_type' : 13,
'action_type' : 1,
'joint_limits' : "small",
'action_min': [-1, -1, -1, -1, -1, -1],
'action_max': [1, 1, 1, 1, 1, 1],
'alpha_reward': 0.1,
'reward_coeff': 1,
'action_scale': 1,
'eps' : 0.1,
'sim_rep' : 5,
'action_mode' : "force"

The hyperparams look like this:

ur5e_reacher-v5:
n_timesteps: !!float 1e7
policy: 'MlpPolicy'
model_class: 'ddpg'
n_sampled_goal: 4
goal_selection_strategy: 'future'
buffer_size: 1000000
batch_size: 128
gamma: 0.95
learning_rate: !!float 1e-3
noise_type: 'normal'
noise_std: 0.2
policy_kwargs: "dict(net_arch=[512, 512, 512])"
online_sampling: True
#max_episode_length: 100

PierreExeter · 2021-04-15T14:39:29Z

I don't have time for a case by case troubleshooting but I can suggest a few things:

check that you can train with DDPG and the most simple env in rl_reach (fixed target) and adapt your custom environment from a working case
use a dense reward function
plot some useful metrics during evaluation, such as the reward, action, joint position, pybullet action (see https://github.com/PierreExeter/rl_reach/blob/master/code/scripts/plot_episode_eval_log.py)
plot the reward vs timesteps and check that it is increasing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ddpg + her #7

ddpg + her #7

stefanwanckel commented Apr 13, 2021

stefanwanckel commented Apr 13, 2021

PierreExeter commented Apr 13, 2021

stefanwanckel commented Apr 14, 2021

PierreExeter commented Apr 15, 2021

ddpg + her #7

ddpg + her #7

Comments

stefanwanckel commented Apr 13, 2021

stefanwanckel commented Apr 13, 2021

PierreExeter commented Apr 13, 2021

stefanwanckel commented Apr 14, 2021

PierreExeter commented Apr 15, 2021