Computing p-MRR #1

jhyuklee opened this issue Apr 3, 2024 · 10 comments
@jhyuklee commented Apr 3, 2024

Hi, thanks for the great work.

I'd like to compute p-MRR from the paper, but I'm not sure whether it's implemented in this repo. There are some unclear parts, so it would be good to see the code.

FYI, I already have a ranked document list for each query given the og and changed instructions.

@orionw (Owner) commented Apr 4, 2024

Hi @jhyuklee! Thanks for the interest!

You're correct that the MTEB and p-MRR code is not in the main repo but in the submodule. You can find the evaluation script/function at this line which uses the MTEB format and calls the rank_score method.

For my reference, so I can make this repo easier to use: did you use this repository for evaluation, and if so, did you run into any problems? If you have any pointers to what was unclear about p-MRR, I'd also love to know so that we can update the paper.

Hope that helps and please let me know if there are any issues!

@jhyuklee (Author) commented Apr 4, 2024

Hi @orionw, thanks for the pointer! After looking at rank_score, Equation 1 in the paper seems clear to me now. I guess that since the new instruction adds more constraints, the new relevant documents are a subset of the original relevant documents. That means there could be two scenarios:

  1. an og gold document that becomes irrelevant: ideally, its new rank should become larger (lower MRR)
  2. an og gold document that stays relevant: due to 1, ideally, its new rank should become smaller (higher MRR)

It seems like the equation captures 1 well, but I'm not sure how it captures 2. If I understood the equation correctly, it will penalize 2. Let me know if I'm getting this wrong.

  • I wonder about the difference between rank_score and pointwise_score, as I'm not sure how the og_score (and new_score) values are computed.
  • I'd also like to know how you define the "relevant" document when computing p-MRR. Is it based on the qrels_og/test.tsv where score > 0?
  • FYI, I'm using a custom Colab to eval the model, so I'm not using this repo directly.

@orionw (Owner) commented Apr 4, 2024

These are great questions, thanks! (1) is exactly what p-MRR covers, correct. We also included the regular MAP/nDCG scores to measure the quality of the overall document ranking (including the relevant documents whose relevance didn't change).

Re: (2), it's true that p-MRR doesn't measure cases where a relevant document stays relevant, since we only focus on the documents whose relevance changed. I don't see how it penalizes the documents that stay relevant, though, as we don't include them in the evaluation at all. We had some initial discussions on how to evaluate, and what you're suggesting is an option we considered, and maybe should reconsider :)

I wonder about the difference between rank_score and pointwise_score

You can ignore pointwise_score. We had initial plans to include a pointwise metric (measuring the change in individual document scores) but decided not to, since individual document scores are known to be incomparable across models (and are unbounded for some models like BM25 and SPLADE). It's not included in the paper results and I just need to delete it from the repo.

I'd also like to know how you define the "relevant" document when computing p-MRR

This may already be answered above, but we only looked at documents whose relevance changed, to measure instruction following. So for every query we look at all documents that went from relevant to irrelevant for that query. Then we compute rank_score on each of those, average by query, and then average overall. We should make this clearer in the paper, thanks for the feedback!
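In rough Python, the aggregation looks something like the sketch below. Note that the per-document rank_score here is just a stand-in (it uses the change in reciprocal rank, so a newly-irrelevant document that drops in rank gets a positive score); the exact formula is Equation 1 in the paper / the rank_score method in the submodule.

def rank_score(og_rank: int, new_rank: int) -> float:
    # Stand-in per-document score (NOT the exact Equation 1): positive when the
    # newly-irrelevant document is ranked lower under the changed instruction.
    return (1.0 / og_rank) - (1.0 / new_rank)

def p_mrr(newly_irrelevant, og_ranks, new_ranks):
    # newly_irrelevant: qid -> [doc_id] for docs that went relevant -> irrelevant
    # og_ranks / new_ranks: qid -> {doc_id: rank} under the original / changed instruction
    per_query = []
    for qid, doc_ids in newly_irrelevant.items():
        scores = [rank_score(og_ranks[qid][d], new_ranks[qid][d]) for d in doc_ids]
        if scores:
            per_query.append(sum(scores) / len(scores))  # average within a query
    return sum(per_query) / len(per_query)               # then average over queries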

Is it based on the qrels_og/test.tsv where score > 0?

It's based on qrels that were > 0 in qrels_og and became 0 in qrels_changed. There's a function here that created the diff.


Does this make sense? Happy to explain more. And again, I definitely appreciate the feedback and the insightful comments!

@jhyuklee (Author) commented Apr 4, 2024

Thanks for the detailed answers. These are great and resolve all of my concerns. I also checked the qrels, comparing og vs. changed, and now have a better understanding of what's going on. I think the paper could still be improved, though, since the phrasing "... we use the following equation applied to each relevant document per query" confused me by implying that the equation was applied to all docs with qrels_og > 0.

I have a few separate topics where I need your help (maybe these deserve a separate issue, but they're minor, so):

  • can you specify the number of passages in each corpus? I was using corpus.jsonl for each dataset and wanted to double-check its size (e.g. 30097 for news-21).
  • it would be nice to show stats on how many docs satisfy qrels_og > 0 => qrels_changed = 0 (the ones used for p-MRR).
  • for MAP, did you use MAP@k (like MAP@1000) or did you compute it over all the docs like 30k for news-21?

Thanks for your help!

@orionw (Owner) commented Apr 5, 2024

Ah, thanks for pointing that out, we will update the paper to make that clear :)

can you specify the number of passages in each corpus? I was using corpus.jsonl for each dataset and wanted to double-check its size (e.g. 30097 for news-21).

Yes, those details are in the READMEs on Hugging Face, in the preview-like config section (which shows some of the config but not the actual readme), e.g. num_examples for corpus, qrels, etc. in news21 and the others.
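If it's easier, you can also count them programmatically from the Hugging Face datasets; a quick sketch (double-check the config/split names against the dataset viewer):

from datasets import load_dataset

ds = load_dataset("jhu-clsp/news21-instructions", "corpus")
# the total across splits should match the num_examples listed in the README
print({split: len(ds[split]) for split in ds})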

it would be nice to show stats on how many docs satisfy qrels_og > 0 => qrels_changed = 0 (the ones used for p-MRR).

We have the averages in Table 2 in our paper, but agree we should add those to make it clear. Thanks!

for MAP, did you use MAP@k (like MAP@1000) or did you compute it over all the docs like 30k for news-21?

We used what TREC originally decided, which is MAP@1000 and nDCG@5. I think you're right that I missed the @1000 in our paper, thanks for catching that! Since we don't use the full corpus for search (we only do re-ranking), the max we can do is 1000 anyway.
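If you're computing these yourself, pytrec_eval (which the MTEB eval also uses) supports those cutoffs directly; a minimal sketch with toy qrels/run dicts:

import pytrec_eval

# qrels/run are nested dicts of {qid: {doc_id: score}}; toy values just to show the measure names
qrels = {"q1": {"d1": 1, "d2": 0}}
run = {"q1": {"d1": 12.3, "d2": 9.8, "d3": 4.2}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map_cut.1000", "ndcg_cut.5"})
scores = evaluator.evaluate(run)
print(scores["q1"]["map_cut_1000"], scores["q1"]["ndcg_cut_5"])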


And BTW, great work on Gecko (which I assume this is for)! I'm excited to use it!

@jhyuklee (Author) commented Apr 9, 2024

Thanks a lot for the detailed explanation. And yes, I'm testing it with Gecko and seeing some positive preliminary results :)

To run the reranking experiments, I've done the following (sorry that I cannot run the MTEB eval, since I would need to rewrite it as generic Python code to run in my environment):

  1. Get all the qid => [pid] mappings from top_ranked.jsonl, the queries from queries.jsonl, and the corpus from corpus.jsonl
  2. Get the model's ranked list for each qid => [pid]. For each pid, the actual (title/text) was fetched by looking up the pid in corpus.jsonl (rough sketch of the loading below)
  3. Eval MAP@1000 or nDCG@5
  4. Eval p-MRR
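A rough sketch of my loading for steps 1-2 (the field names are my reading of the jsonl files, so treat them as assumptions):

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# corpus/queries follow the usual _id/title/text layout; top_ranked maps a qid to its candidate pids
corpus = {d["_id"]: d for d in load_jsonl("corpus.jsonl")}
queries = {d["_id"]: d["text"] for d in load_jsonl("queries.jsonl")}

candidates = {}  # qid -> list of (pid, title, text) to rerank
for row in load_jsonl("top_ranked.jsonl"):
    qid, pids = row["qid"], row["pids"]
    # some pids may not appear in corpus.jsonl (see below), so skip them for now
    candidates[qid] = [(pid, corpus[pid].get("title", ""), corpus[pid]["text"])
                       for pid in pids if pid in corpus]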

So for the second step, I found that 1) the top document lists are often not 1k, and 2) some of the top document pids are not found in corpus.jsonl. What would be the cause of this, and are these okay?

And also for the # of docs that satisfy qrels_og > 0 => qrels_changed = 0, I get a different stat from the paper with the following code:

def create_qrel_diff(og_qrels, changed_qrels):
    # collect, per query, the doc_ids whose relevance changed between og and changed qrels
    newly_irrelevant_qrels = {}
    for qid in og_qrels:
        newly_irrelevant_qrels[qid] = []
        for doc_id in og_qrels[qid]:
            if changed_qrels[qid][doc_id] != og_qrels[qid][doc_id]:
                newly_irrelevant_qrels[qid].append(doc_id)
    return newly_irrelevant_qrels


qrels_changed = create_qrel_diff(qrels_og, qrels_new)
# 722 for robust04
print(f'\nqrels_changed = {sum(len(qq) for qq in qrels_changed.values())}')

Thanks for helping out!

@orionw (Owner) commented Apr 10, 2024

Before I answer, I just wanted to say: it seems like API access to the new Gecko was released recently, and I'm happy to evaluate it myself in the same way I did OpenAI's and Cohere's, if that's helpful. Selfishly, of course, evaluating the newest and greatest model is good for maintaining a benchmark :) Though I definitely appreciate your thoroughness, it's been very helpful!

To answer your questions:

So for the second step, I found that 1) the top document lists are often not 1k,

Yes, given the pooling strategy, not all queries have 1k documents, but most are close. Fair question, I should probably mention that in the documentation and/or paper.
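A quick way to check the spread per query (same assumed top_ranked.jsonl layout as in your sketch above):

import json

with open("top_ranked.jsonl") as f:
    sizes = [len(json.loads(line)["pids"]) for line in f]  # "pids" field name assumed
print(min(sizes), max(sizes), sum(sizes) / len(sizes))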

and 2) some of the top document pids are not found in corpus.jsonl. What would be the cause of this, and are these okay?

Thanks for pointing this out! It turns out I had filtered the corpus to exclude any documents that were not judged (i.e., not in the qrels), and my MTEB fork didn't double-check and just skipped them.

For comparable results, feel free to skip them also. If we decide to put them back in, I can re-evaluate your model if need be.

And also for the # of docs that satisfy qrels_og > 0 => qrels_changed = 0, I get a different stat from the paper with the following code:

I'm not sure I gave that stat in the paper; I think we just said the number of relevant docs, if I recall? Let me know where so I can fix it if so - sorry about that! The numbers are 472 for news21, 722 for robust04, and 436 for core17.

I used this code to calculate these numbers (I realized the code I sent, which you used, expects MTEB-formatted qrels, and I wanted to compute the diff directly from HF):

from datasets import load_dataset
import pandas as pd

dataset_name = "jhu-clsp/core17-instructions"
og_qrels = load_dataset(dataset_name, "qrels_og")["test"].to_pandas()
changed_qrels = load_dataset(dataset_name, "qrels_changed")["test"].to_pandas()
# align og and changed judgments on (query-id, corpus-id), then keep the rows whose score changed
og_qrels = og_qrels.merge(changed_qrels, on=["query-id", "corpus-id"])
diff = og_qrels[og_qrels["score_x"] != og_qrels["score_y"]]
print(len(diff))  # 436 for core17

@orionw (Owner) commented Apr 10, 2024

After talking with my co-authors, we're going to fix the missing unjudged passages and upload a new version of the datasets. It should be up in a few days, and I'll update the paper with the new results. FWIW, Core17 has zero missing and News21 has < 1% missing, so your results should be about the same for those two datasets (Robust04 has 12% missing).

Thanks again for pointing this out! I'll be sure to comment here when I push the changes to Huggingface and also give the new results for other models, for comparison. Of course feel free to skip them in the meantime, I know you're busy on a lot of projects.

@jhyuklee (Author) commented Apr 10, 2024

Thanks a lot! For the # of docs that satisfy qrels_og > 0 => qrels_changed = 0, I was referring to the stats in Table 2 of the paper (Rel. D/Q), but the stats you provided (472 for news21, 722 for robust04, and 436 for core17) match mine as well :)

Also the missing pid stat seems to be the same from my end. Thank you for checking them.

FYI, I just finalized the full eval of Gecko on FollowIR, and it would be nice if I could send the results to you over email. I can rerun them when the dataset is updated. In the meantime, you can also run it on your end (https://js.langchain.com/docs/modules/data_connection/experimental/multimodal_embeddings/google_vertex_ai seems to be the right one for your case).

@orionw (Owner) commented May 8, 2024

Hi @jhyuklee, thanks again for pointing out those issues in the dataset. They are resolved and updated now. I've also integrated FollowIR fully into mteb, so it can be evaluated as normal now. I double-checked its eval setup, and it uses pytrec_eval as well -- I'm pretty confident it is correct.
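For reference, running it through mteb should now just be the standard flow; roughly something like the sketch below (double-check the exact task names with mteb.get_tasks(task_types=["InstructionRetrieval"]), and any SentenceTransformer-style model works):

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # any dense retriever, just as an example
tasks = mteb.get_tasks(tasks=["Core17InstructionRetrieval", "News21InstructionRetrieval", "Robust04InstructionRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")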

I updated the preprint on arXiv with the new results, including your great work with Gecko (which comes out as the best bi-encoder for instruction following and ties with OpenAI for the best standard retrieval metrics). You can also view the instruction-following results on the official leaderboard.

Let me know if you have any other questions!
