Computing p-MRR #1
Hi @jhyuklee! Thanks for the interest! You're correct that the MTEB and p-MRR code is not in the main repo but in the submodule. You can find the evaluation script/function at this line, which uses the MTEB format and calls the p-MRR scoring function.

For my reference, so I can make this repo easier to use: did you use this repository for evaluation? If so, did you run into any problems? If you have any pointers to what was unclear about p-MRR, I would also love to know, so that we can update the paper.

Hope that helps, and please let me know if there are any issues!
Hi @orionw, thanks for the pointer! Looking at the equation, I see two behaviors an instruction-following metric could capture: 1) documents that become irrelevant under the changed instruction should be demoted, and 2) documents that stay relevant should keep their high rank.

It seems like the equation captures 1) well, but I'm not sure how it captures 2). If I understood the equation correctly, it will penalize 2). Let me know if I'm getting this wrong.
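(For readers following the thread: the per-document equation under discussion has roughly the shape below. This is reconstructed from the exchange rather than quoted from the paper, so treat the exact branch conditions as an assumption. Here $R_{og}(d)$ and $R_{new}(d)$ are the 1-based ranks of a flipped document $d$ under the original and changed instruction, and $n$ is the number of flipped documents:)

$$
\text{p-MRR} = \frac{1}{n} \sum_{d=1}^{n}
\begin{cases}
1 - \dfrac{R_{og}(d)}{R_{new}(d)} & \text{if } R_{new}(d) > R_{og}(d) \\
\dfrac{R_{new}(d)}{R_{og}(d)} - 1 & \text{otherwise}
\end{cases}
$$

(Demoting a newly-irrelevant document scores positively, up to 1; promoting it scores negatively, down to -1; leaving it in place scores 0.)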
These are great questions, thanks! (1) is exactly what p-MRR is covering, correct. We included the regular MAP/nDCG score also to measure the score of the regular document ranking (including those relevant documents which didn't change relevance). Re: (2), it's true that p-MRR doesn't measure cases where the relevant document stays relevant, as we only focus on the documents which changed relevance. I don't see how it's penalizing the documents that stay relevant though, as we don't include them in the evaluation at all? We had some initial discussions on how to evaluate and what you're suggesting is an option we considered, and maybe should reconsider :)
You can ignore
This may already be answered above, but we only looked at documents which changed relevance, to measure instruction-following. So for every query we look at all documents which went from relevant -> irrelevant for that query. Then we compute the p-MRR score over those documents.
It's based on qrels that were > 0 in the qrels_og that became 0 in qrels_changed. There's a function here that created the diff. Does this make sense? Happy to explain more. And again, I definitely appreciate the feedback and the insightful comments!
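(A minimal sketch of that computation, assuming run results are stored as `{query_id: {doc_id: score}}` dicts; the function and variable names here are mine, not the repo's:)

```python
def p_mrr(run_og, run_changed, flipped):
    """Average paired rank change over documents whose qrel went
    from >0 (qrels_og) to 0 (qrels_changed).

    run_og / run_changed: {query_id: {doc_id: score}} under each instruction.
    flipped: {query_id: set of doc_ids that became irrelevant}.
    Positive = the model demoted the newly-irrelevant docs.
    """
    per_doc = []
    for qid, doc_ids in flipped.items():
        # 1-based rank of every doc when sorted by descending score
        rank_og = {d: i + 1 for i, d in enumerate(
            sorted(run_og[qid], key=run_og[qid].get, reverse=True))}
        rank_new = {d: i + 1 for i, d in enumerate(
            sorted(run_changed[qid], key=run_changed[qid].get, reverse=True))}
        for d in doc_ids:
            r_og, r_new = rank_og[d], rank_new[d]
            if r_new > r_og:   # demoted: positive score, bounded by 1
                per_doc.append(1 - r_og / r_new)
            else:              # promoted or unchanged: 0 or negative, bounded by -1
                per_doc.append(r_new / r_og - 1)
    return sum(per_doc) / len(per_doc)
```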
Thanks for the detailed answers. These are great and resolve all of my concerns. I also checked the qrels, comparing og vs changed, and now have a better understanding of what's going on. I think the paper can be improved, though, since saying "... we use the following equation applied to each relevant document per query" confused me by implying that this equation was used for all docs with qrels_og > 0. I have separate topics where I need your help (maybe these deserve a separate issue, but they're minor, so...)
Thanks for your help!
Ah, thanks for pointing that out, we will update the paper to make that clear :)
Yes, those details are in the READMEs on Hugging Face, in the weird preview-like thing that shows some of the config but not the actual README: e.g.
We have the averages in Table 2 in our paper, but agree we should add those to make it clear. Thanks!
We used what TREC originally decided, which is MAP@1000 and nDCG@5. I think you're right that I missed the @1000 in our paper, thanks for catching that! Since we don't use the full corpus for search (and only do re-ranking), the max we can do is 1000 anyway. And BTW, great work on Gecko (which I assume this is for)! I'm excited to use it!
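(If you want to reproduce those two metrics outside the fork, something like pytrec_eval works; the toy qrels/run dicts below are placeholders, not data from the benchmark:)

```python
import pytrec_eval

# Placeholder qrels and run in TREC dict format: {query_id: {doc_id: value}}
qrels = {"q1": {"d1": 1, "d2": 0}}
run = {"q1": {"d1": 2.3, "d2": 1.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map_cut.1000", "ndcg_cut.5"})
scores = evaluator.evaluate(run)
print(scores["q1"]["map_cut_1000"], scores["q1"]["ndcg_cut_5"])
```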
Thanks a lot for the detailed explanation. And yes, I'm testing it with Gecko and seeing some positive preliminary results :) To run the reranking experiments, I've done the following (sorry that I can't run the MTEB eval directly, since I need to rewrite it as generic Python code to run in my environment):
So for the second step, I found that 1) the top documents are often not 1k, and 2) some of the top document pids are not found in the corpus. Also, for the number of docs that satisfy qrels_og > 0 => qrels_changed = 0, I get a different stat from the paper with the following code:
Thanks for helping out!
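(For anyone reproducing the missing-pid check: a sketch along these lines should work; the `corpus` config name and `_id` column are assumptions about how the HF datasets are laid out, and `top_pids` stands in for your own first-stage retrieval output:)

```python
from datasets import load_dataset

dataset_name = "jhu-clsp/robust04-instructions"
corpus = load_dataset(dataset_name, "corpus")
split = list(corpus.keys())[0]          # whichever split the corpus config exposes
corpus_ids = set(corpus[split]["_id"])  # "_id" column name is an assumption

def missing_pids(top_pids):
    """top_pids: {query_id: [pid, ...]} from your own first-stage run."""
    return {qid: [p for p in pids if p not in corpus_ids]
            for qid, pids in top_pids.items()}
```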
Before I answer, I just wanted to say: it seems like API access to the new Gecko came out recently, and I'm happy to evaluate it myself in the same way I did OpenAI's and Cohere's, if that would be helpful? Selfishly, of course, evaluating the newest and greatest model is good for maintaining a benchmark :) Though I definitely appreciate your thoroughness, it's been very helpful! To answer your questions:
Yes, given the pooling strategy, not all queries have 1k documents, but most are close. Fair question, I should probably include that in the documentation and/or paper.
Thanks for pointing this out! It turns out I had filtered the corpus to exclude any documents which were not judged or in the qrels and my MTEB fork didn't double check and skipped them. For comparable results, feel free to skip them also. If we decide to put them back in, I can re-evaluate your model if need be.
I'm not sure I gave that stat in the paper; I think we just said the number of relevant docs, if I recall? Let me know where, so I can fix it if so - sorry about that! The numbers are 472 for news21, 722 for robust04, and 436 for core17. I used this code to calculate these numbers (I realized the code I sent and that you used expects MTEB-formatted qrels, and I wanted to use it directly from HF):

```python
from datasets import load_dataset
import pandas as pd

dataset_name = "jhu-clsp/core17-instructions"

# Load both qrels variants from the HF dataset
og_qrels = load_dataset(dataset_name, "qrels_og")["test"].to_pandas()
changed_qrels = load_dataset(dataset_name, "qrels_changed")["test"].to_pandas()

# Align the two qrels on (query, document) pairs and keep rows whose score changed
og_qrels = og_qrels.merge(changed_qrels, on=["query-id", "corpus-id"])
diff = og_qrels[og_qrels["score_x"] != og_qrels["score_y"]]
```
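(Run per dataset, `len(diff)` should then match the counts quoted above:)

```python
print(len(diff))  # expect 436 for core17-instructions, 472 for news21, 722 for robust04
```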
Talking with my co-authors, we're gonna fix the missing unjudged passages and upload a new version of the datasets. It should be up in a few days, and I'll update the paper with the new results. FWIW, Core17 has zero missing and News21 has < 1% missing, so your results should be about the same for those two datasets (Robust has 12% missing). Thanks again for pointing this out! I'll be sure to comment here when I push the changes to Huggingface, and I'll also give the new results for other models, for comparison. Of course, feel free to skip them in the meantime, I know you're busy on a lot of projects.
Thanks a lot! The missing pid stat seems to be the same from my end, thank you for checking. FYI, I just finalized the full eval of Gecko on FollowIR, and I will send the results to you through email. I can rerun them when the dataset is updated. In the meantime, you can also run it from your end (https://js.langchain.com/docs/modules/data_connection/experimental/multimodal_embeddings/google_vertex_ai seems to be the right one for your case).
Hi @jhyuklee, thanks again for pointing out those issues in the dataset. They are resolved and updated now. I've also integrated FollowIR fully into MTEB. I updated the preprint on arXiv with the new results, including your great work with Gecko (which comes out as the best bi-encoder for instructions and ties for best with OpenAI on standard retrieval metrics). You can also view the instruction-following results on the official leaderboard. Let me know if you have any other questions!
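(With the MTEB integration, running the tasks should look roughly like the usual MTEB flow; the task names and the model below are my assumptions, not details from this thread:)

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumed registry names for the three FollowIR instruction-retrieval tasks
tasks = ["Core17InstructionRetrieval", "News21InstructionRetrieval",
         "Robust04InstructionRetrieval"]
model = SentenceTransformer("intfloat/e5-base-v2")  # any dense encoder as a stand-in
MTEB(tasks=tasks).run(model, output_folder="results")
```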
Hi, thanks for the great work.
I'd like to compute the p-MRR metric from the paper, but I'm not sure if it's implemented in this repo. There are some unclear parts, so it would be good to see the code.
FYI, I already have a ranked document list for each query given the og and changed instructions.