The update method in the UCB algorithm is inconsistent with the paper and code #180

Open
kerala21 opened this issue Mar 31, 2024 · 2 comments

Comments

@kerala21

In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p).

Uploading 2024331203750.jpg…

The project's update code is:

def update(self, chosen, scores):

    for i, score in zip(chosen, scores):
        self.counts[i] += self.num_samples
        self.scores[i] += score * self.num_samples

The two do not match.

@donglixp
Contributor

The jpg file is unavailable.

@donglixp donglixp reopened this May 10, 2024
@hideaki-j

I was also a bit confused by that part. As I understand it, the r/N in the paper seems to be a typo, and the update should be Q + (r - Q)/N. The estimated score Q is a running average, so each update should move Q toward the observed reward r by 1/N of the difference between r and the current Q.

If so, Q + (r - Q)/N can be rewritten as:

((N - 1)Q + r)/N

Since the previous Q is the average of the first N - 1 rewards, this is the average of all N rewards obtained so far.
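
To spell out the rewrite above, here is a short derivation. The subscripts are my own notation, not the paper's: Q_{N-1} is the estimate before the update and r_N is the N-th observed reward.

```latex
\begin{aligned}
Q_N &= Q_{N-1} + \frac{r_N - Q_{N-1}}{N} \\
    &= \frac{N\,Q_{N-1} + r_N - Q_{N-1}}{N} \\
    &= \frac{(N-1)\,Q_{N-1} + r_N}{N} \\
    &= \frac{1}{N}\sum_{k=1}^{N} r_k
    \qquad \text{since } (N-1)\,Q_{N-1} = \sum_{k=1}^{N-1} r_k .
\end{aligned}
```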

self.scores[i] stores the running sum of all scores (rewards) so far, with each score weighted by num_samples. get_scores() then divides that sum by the counts to get the average when calculating ucb_scores.
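
As a rough check of this reading, here is a minimal sketch comparing the two bookkeeping styles. It is not the project's code: `incremental_mean` and `sum_and_count_mean` are hypothetical helpers, and I am assuming each score is weighted by the same `num_samples` on every update, as in the `update` snippet quoted above.

```python
# Hypothetical sketch, not the project's implementation.

def incremental_mean(rewards):
    """Apply Q <- Q + (r - Q) / N once per observed reward."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


def sum_and_count_mean(rewards, num_samples=1):
    """Accumulate a weighted sum and a count, then divide once at the end.

    Mirrors the shape of the quoted update(): both the count and the sum
    grow by a factor of num_samples (assumed constant across updates).
    """
    total, count = 0.0, 0
    for r in rewards:
        count += num_samples
        total += r * num_samples
    return total / count


rewards = [0.2, 0.9, 0.4, 0.7]
q_incremental = incremental_mean(rewards)
q_batched = sum_and_count_mean(rewards, num_samples=8)

# The two estimates agree up to floating-point error.
print(abs(q_incremental - q_batched) < 1e-9)
```

With a constant num_samples the weight cancels in the division, so both approaches give the plain average of the rewards. If num_samples varied between updates, the sum-and-count version would instead give a weighted average.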
