Tutorial for using data valuation to select examples for in-context learning #608

AnesBenmerzoug opened this issue Jul 1, 2024 · 0 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

AnesBenmerzoug commented Jul 1, 2024

We should create a tutorial showing how to use data valuation to select examples for in-context learning.
This should be similar to the approach in the paper "Data Curation Alone Can Stabilize In-context Learning". There, the data value of a specific example is defined as the average validation-set accuracy over all sampled prompts in which that example appears:

$$ s_{\text{ca}}(i) = \mathbb{E}_{\mathcal{Z} \sim D_{\text{ICL}}} \left[ \text{Acc}(\mathcal{Z}) \mid (x_i, y_i) \in \mathcal{Z} \right] $$

where $i$ is the example's index, $(x_i, y_i)$ is the example's input-output pair, and $\mathcal{Z}$ is the set of examples in the prompt, drawn from the prompt distribution $D_{\text{ICL}}$.
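The conditional expectation above can be estimated by Monte Carlo: sample random prompts that are forced to contain example $i$ and average the resulting accuracies. A minimal sketch, where `accuracy` is a hypothetical stand-in for running the model on the validation set with the given in-context examples:

```python
import random
from statistics import mean

def accuracy(prompt_examples):
    """Hypothetical stub: in practice this would build a prompt from the
    selected examples, query the model on the validation set, and return
    accuracy. Here we pretend prompts containing example 0 do better."""
    return 0.5 + 0.1 * (0 in prompt_examples)

def s_ca(i, n_examples, k, n_trials=1000, seed=0):
    """Monte Carlo estimate of s_ca(i): the average accuracy over random
    size-k prompts conditioned on containing example i."""
    rng = random.Random(seed)
    others = [j for j in range(n_examples) if j != i]
    accs = []
    for _ in range(n_trials):
        # Sample k-1 other examples and force (x_i, y_i) into the prompt.
        subset = rng.sample(others, k - 1) + [i]
        accs.append(accuracy(subset))
    return mean(accs)
```

With the stub above, example 0 should receive a higher score than any other example, since only its presence improves the (fake) accuracy.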

The authors show (see Appendix A.1 of the paper) that the Data Shapley value is proportional to this score.

With pyDVL, we could instead compute Shapley or Banzhaf values directly.
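For orientation, here is a from-scratch sketch of permutation-sampling Monte Carlo Shapley (this is not pyDVL's API; `utility` would be the validation accuracy of the model prompted with a given subset of examples):

```python
import random

def shapley_values(utility, n, n_permutations=200, seed=0):
    """Monte Carlo Shapley via random permutations: each example's value is
    its average marginal contribution to the utility of the coalition of
    examples preceding it in the permutation."""
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        coalition = set()
        prev_u = utility(coalition)
        for i in perm:
            coalition.add(i)
            u = utility(coalition)
            values[i] += u - prev_u
            prev_u = u
    return [v / n_permutations for v in values]
```

As a sanity check, for an additive utility the Shapley value of each player is exactly its own weight, regardless of the number of permutations.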

Here are some of the considerations we have to take into account:

  • We cannot put all available examples in the prompt due to context-length limitations. This means we should probably create a new sampler class, or a post-processor, that filters the generated samples to remove subsets larger than the limit.
  • We should consider the order in which examples appear in the prompt; accounting for ordering could make the computation scale much worse.
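The filtering idea in the first bullet could look roughly like the following post-processor sketch. All names here are hypothetical; `count_tokens` stands in for a real tokenizer, and in practice this would hook into pyDVL's sampling machinery rather than operate on plain lists:

```python
def filter_by_prompt_budget(subsets, examples, max_tokens, count_tokens):
    """Hypothetical post-processor: drop sampled subsets whose rendered
    prompt would exceed the model's context budget."""
    kept = []
    for subset in subsets:
        total = sum(count_tokens(examples[i]) for i in subset)
        if total <= max_tokens:
            kept.append(subset)
    return kept
```

A token-budget filter is likely preferable to a fixed cap on subset cardinality, since examples vary in length.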
@AnesBenmerzoug AnesBenmerzoug added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 1, 2024
@AnesBenmerzoug AnesBenmerzoug added this to the v0.11.0 milestone Jul 1, 2024