
Is the implementation of final rewards correct? #5

Open
nikhilrayaprolu opened this issue Jan 29, 2022 · 1 comment

Comments


nikhilrayaprolu commented Jan 29, 2022

In the original implementation, the final rewards are supposed to replace the reward at the end of each episode in the replay buffer.

https://github.com/google-research/google-research/blob/901524f4d4ab15ef9d2f5165148347d0f26b32c2/social_rl/adversarial_env/agent_train_package.py#L260-L264

In this PyTorch implementation, however, only the reward at the very last step of the rollout is replaced with the final return.

paired/algos/storage.py

Lines 201 to 202 in c836e86

```python
def replace_final_return(self, returns):
    self.rewards[-1] = returns
```

Did I misunderstand anything in the code?
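To make the difference concrete, here is a minimal sketch of the two behaviours being compared. The function names are mine, and this is my reading of the two codebases, not code from either repository: the TF version (as described above) overwrites the terminal-step reward of every episode stored in the buffer, whereas `replace_final_return` in the PyTorch version overwrites only the last reward slot of the whole rollout.

```python
def replace_episode_end_rewards(rewards, dones, final_returns):
    """Sketch of the TF behaviour as I understand it: write each
    episode's final reward into the reward slot at that episode's
    terminal step, for every episode in the buffer."""
    out = list(rewards)
    episode_ends = [i for i, d in enumerate(dones) if d]
    for i, ret in zip(episode_ends, final_returns):
        out[i] = ret
    return out


def replace_final_return_only(rewards, final_return):
    """Sketch of what the PyTorch replace_final_return appears to do:
    overwrite only the very last reward slot of the rollout."""
    out = list(rewards)
    out[-1] = final_return
    return out
```

With two episodes packed into one buffer (e.g. `dones = [0, 0, 1, 0, 0, 1]`), the first function would replace the rewards at indices 2 and 5, while the second touches only index 5; any earlier episode boundaries keep their original rewards.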

@matteobettini

I have a question related to this, but also regarding the original implementation.

In particular, in Algorithm 1 from the original paper

(screenshot of Algorithm 1 from the paper omitted)

I am not able to make sense of the last three lines.

$R(\tau)$ is not defined anywhere else in the paper. What is it supposed to mean?

  • Is it setting the final reward of all three agents to the scalar regret? If so, that only makes sense to me for the environment designer.
  • Is the original $r_t$ used for training the protagonist and antagonist?
  • Is it overwriting the rewards at every timestep of the trajectories with a scalar?

If anyone can provide some clarification on these last three lines, I would be immensely grateful.
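Not an authoritative answer, but for context: the PAIRED paper defines the regret estimate that the adversary is trained to maximize as the maximum antagonist episode return minus the protagonist's mean episode return on the generated environment. A minimal sketch of that computation (the function name and argument names are mine):

```python
def estimate_regret(antagonist_returns, protagonist_returns):
    """Sketch of PAIRED's flexible regret estimate: the best antagonist
    episode return minus the protagonist's average episode return on
    the same adversary-generated environment."""
    best_antagonist = max(antagonist_returns)
    mean_protagonist = sum(protagonist_returns) / len(protagonist_returns)
    return best_antagonist - mean_protagonist
```

How that scalar is then fed back as $R(\tau)$ for each of the three agents is exactly the ambiguity raised above, so I will not speculate on it in code.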
