add surprise similarity score #2287
base: master
Conversation
Hello! I've resolved some of the merge conflicts to play around with this manually. I've also added a debugging commit to add the surprise similarity to
As you can see here, the surprise similarity seems to result in lower Spearman correlation scores. This should mean that embeddings compared via the surprise similarity correspond less well to the true semantic similarities. In other words, for these experiments it does not make sense to use the surprise similarity over cosine similarity. Please do let me know if I made a mistake with my implementation in (we can revert the debugging commit if we ever choose to move forward with this PR). cc @mlschillo, this might also interest you!
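For context, a minimal sketch of the kind of comparison described above. This is not the PR's actual evaluation code: the model name, the toy sentence pairs, and the `pairwise_surprise` placeholder are all illustrative, and the placeholder would need to be swapped for this PR's surprise scorer.

```python
# Sketch: compare cosine similarity and an alternative pairwise similarity
# by their Spearman correlation with gold STS-style scores.
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Toy stand-in for an STS test split (sentence pairs + gold similarity scores).
sentences1 = ["A man is playing a guitar.", "Obama greets the press.", "A dog runs in a field."]
sentences2 = ["Someone plays a guitar.", "Obama vetoes the healthcare bill.", "A cat sleeps indoors."]
gold_scores = [4.8, 1.5, 0.5]

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

def pairwise_surprise(e1, e2):
    # Placeholder: swap in this PR's surprise similarity here.
    return cos_sim(e1, e2).diagonal().cpu().numpy()

cosine_scores = cos_sim(emb1, emb2).diagonal().cpu().numpy()
surprise_scores = pairwise_surprise(emb1, emb2)

cos_corr, _ = spearmanr(gold_scores, cosine_scores)
sur_corr, _ = spearmanr(gold_scores, surprise_scores)
print(f"Spearman (cosine):   {cos_corr:.4f}")
print(f"Spearman (surprise): {sur_corr:.4f}")
```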
I can also implement this on top of the
Hi @tomaarsen! It's great to see some interest in this PR, thank you. I was surprised by your benchmark results and did a small extension to include The fact that for Maybe @mlschillo can comment on whether this is expected and if it is worthwhile to do a classification-type benchmark. A classification-type benchmark would be closer to the experiments in the paper than the current benchmark, so if you have time, @tomaarsen, that would be interesting imo.
Indeed. Something interesting that I noticed was that the mean of the This is perhaps somewhat indicative of the evaluation set. See here some samples from the training set when searching for "obama": https://huggingface.co/datasets/mteb/stsbenchmark-sts/viewer/default/train?q=obama&row=5726 In other words, this might explain why the mean surprise score of all pairs is 0.94 and not something near 0.5. In short, the surprise score might make pairs on the same subject/topic but with different semantics have a higher similarity score, which may cause poor performance on hard negative pairs.
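To make that calibration check concrete, a tiny follow-up sketch that reuses the embeddings and the hypothetical `pairwise_surprise` placeholder from the earlier snippet: if most evaluation pairs are semantically unrelated, a score intended to land near 0.5 for unrelated pairs should not average around 0.94.

```python
import numpy as np

# Sanity check on score calibration: the mean pairwise score over the
# evaluation pairs and the share of pairs scored above 0.9. Reuses
# `emb1`, `emb2` and the `pairwise_surprise` placeholder from the sketch above.
scores = np.asarray(pairwise_surprise(emb1, emb2))
print("mean pairwise score:     ", float(scores.mean()))
print("share of pairs above 0.9:", float((scores > 0.9).mean()))
```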
I can definitely have a look at trying to run a classification benchmark when I have a bit more time.
Hey @VMinB12 and @tomaarsen, thanks for the interest and discussion! I definitely agree with what @tomaarsen is pointing out here:
Perhaps the most important feature of the surprise score is its dependence on the ensemble, but this does mean that it will fail in some cases. It could be that the construction of this benchmark uses clusters of topics (e.g. kitchen tasks, transportation-related, music-related, news/current events) and the topic clustering hurts the score's ability to distinguish within a cluster (as in the Obama example). This might also explain why @VMinB12 is able to do better without normalizing. But I would think this would end up being more of a feature than a bug in classification tasks, so I also think it's worth the experiment. I also want to loop in @tbachlechner and @MCMartone in case they have more insightful comments.
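For reference, a classification-type benchmark along those lines could be sketched roughly as below. It is only an illustration of the idea (nearest-label scoring where the similarity function is the thing under test); the model, the toy label set and documents, and the `score_matrix` helper are assumptions, not anything from this PR or the paper.

```python
# Sketch of a classification-style benchmark: score each document against a
# set of candidate label descriptions and predict the best-scoring label.
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

labels = ["sports", "politics", "technology"]   # toy label set
docs = [
    "The striker scored twice in the final.",   # sports
    "Parliament passed the new budget bill.",   # politics
    "The chipmaker unveiled a faster GPU.",     # technology
]
gold = [0, 1, 2]

doc_emb = model.encode(docs, convert_to_tensor=True)
label_emb = model.encode(labels, convert_to_tensor=True)

def score_matrix(d_emb, l_emb):
    # Placeholder scorer: swap in this PR's surprise similarity (which would
    # use the document collection as its ensemble) to compare against cosine.
    return cos_sim(d_emb, l_emb).cpu().numpy()

preds = score_matrix(doc_emb, label_emb).argmax(axis=1)
print("accuracy:", float((preds == np.array(gold)).mean()))
```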
It would be great to have an understanding of how common this pitfall of the surprise score is. Perhaps this dataset is an outlier, or perhaps there is a general insight to be extracted here. Repeating @tomaarsen's current exercise on other datasets would be insightful.
I did some tests myself using
On both datasets the surprise score is outperformed by cosine similarity. Do you know of any other datasets that we could include here?
I've also done some more experiments (feel free to reproduce with Here are the results on the test set:
There are a few things to note here:
In this experiment, the surprise score is also not a good option, I'm afraid. Edit:
This PR implements the surprise similarity score from https://arxiv.org/pdf/2308.09765.pdf.
An implementation of the paper is also available at https://github.com/MeetElise/surprise-similarity; this PR implements a minimalistic version for easier integration with the sentence-transformers package.
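For readers skimming the thread, a rough sketch of the idea as I read it. The precise formulation is in the paper and the reference repository; this snippet is only a hedged empirical-CDF approximation, not the PR's code: instead of using the raw cosine similarity between a query and a candidate, the score reflects where that cosine similarity falls within the distribution of the candidate's cosine similarities to an ensemble of other texts.

```python
# Rough empirical-CDF sketch of a "surprise"-style score (an approximation of
# the idea, NOT the exact formulation from the paper or from this PR).
import numpy as np

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def surprise_scores(query_emb, cand_emb, ensemble_emb):
    """For each (query, candidate) pair, return the fraction of ensemble
    members whose cosine similarity to the candidate is below the query's
    cosine similarity to that candidate (an empirical CDF in [0, 1])."""
    qc = cosine_matrix(query_emb, cand_emb)      # (n_queries, n_candidates)
    ec = cosine_matrix(ensemble_emb, cand_emb)   # (n_ensemble, n_candidates)
    # Compare each query-candidate similarity against the ensemble column.
    return (qc[:, None, :] > ec[None, :, :]).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(2, 8))      # toy query embeddings
    c = rng.normal(size=(3, 8))      # toy candidate embeddings
    e = rng.normal(size=(50, 8))     # toy ensemble embeddings
    print(surprise_scores(q, c, e))  # values near 1 = "surprisingly similar"
```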
util.py has grown too large (590 lines) and would benefit from being refactored into separate modules. This is best done in a separate PR.