Should Searcher add a "normalize" argument? #1952

Open
dayuyang1999 opened this issue Jul 30, 2024 · 1 comment

dayuyang1999 commented Jul 30, 2024

Hi,

Suppose I use my own embedding model, such as bge-large-en-v1.5. Because the model is trained to optimize cosine similarity, the index should be created with the --l2-norm option, so that inner-product search over the stored vectors matches cosine similarity.
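To make the reasoning concrete, here is a small NumPy sketch (not Pyserini code) showing why normalization matters: once both the document and query vectors have unit L2 length, their inner product equals their cosine similarity. The helper names `l2_normalize` and `cosine_similarity` and the random data are illustrative assumptions.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length; all-zero rows stay zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Reference cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up embeddings standing in for encoder output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8)).astype(np.float32)
query = rng.normal(size=8).astype(np.float32)

# Inner product on raw vectors depends on vector length.
raw_scores = docs @ query

# Inner product on normalized vectors is the cosine similarity.
unit_docs = l2_normalize(docs)
unit_query = l2_normalize(query[None, :])[0]
normalized_scores = unit_docs @ unit_query

cosines = np.array([cosine_similarity(d, query) for d in docs])
```

If only the index side is normalized, the scores are cosine similarities scaled by the query norm; the ranking for a single query is unchanged, but scores are no longer comparable across queries.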

--l2-norm

However, FaissSearcher does not seem to offer an option for normalizing the query embedding at search time:

class FaissSearcher:
    """Simple Searcher for dense representation

    Parameters
    ----------
    index_dir : str
        Path to faiss index directory.
    """

    def __init__(self, index_dir: str, query_encoder: Union[QueryEncoder, str],
                 prebuilt_index_name: Optional[str] = None):
        requires_backends(self, "faiss")
        if not isinstance(query_encoder, str):
            self.query_encoder = query_encoder
        else:
            self.query_encoder = self._init_encoder_from_str(query_encoder)
        self.index, self.docids = self.load_index(index_dir)
        self.dimension = self.index.d
        self.num_docs = self.index.ntotal

        assert self.docids is None or self.num_docs == len(self.docids)
        if prebuilt_index_name:
            sparse_index = get_sparse_index(prebuilt_index_name)
            self.ssearcher = LuceneSearcher.from_prebuilt_index(sparse_index)

@MXueguang (Member)

Hi @dayuyang1999,
At search time, for L2-normalized vectors, we assume that the index was built with already-normalized vectors and that the query encoder produces normalized vectors. You can set l2-norm=true when you initialize the query encoder and then pass that query encoder to the searcher.
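
For an encoder that has no built-in normalization flag, one possible workaround is a thin wrapper that normalizes its output before the searcher sees it. This is only a sketch: `NormalizedQueryEncoder` and `ToyEncoder` are hypothetical names, not Pyserini classes, and it assumes the searcher obtains query vectors by calling `encode(query)` on whatever object it is given. The constructor quoted above does accept any non-string object as `query_encoder`.

```python
import numpy as np


class NormalizedQueryEncoder:
    """Hypothetical wrapper: L2-normalizes the wrapped encoder's output."""

    def __init__(self, encoder):
        self.encoder = encoder

    def encode(self, query, **kwargs):
        # Delegate to the real encoder, then rescale to unit length.
        embedding = np.asarray(self.encoder.encode(query, **kwargs),
                               dtype=np.float32)
        norm = np.linalg.norm(embedding, axis=-1, keepdims=True)
        return embedding / np.maximum(norm, 1e-12)


class ToyEncoder:
    """Stand-in for a real query encoder, used only to exercise the wrapper."""

    def encode(self, query, **kwargs):
        return np.array([len(query), 3.0, 4.0], dtype=np.float32)


wrapped = NormalizedQueryEncoder(ToyEncoder())
vec = wrapped.encode("what is faiss?")
```

The wrapped object would then be passed as `query_encoder` in place of the original encoder. Where the encoder class already exposes an l2-norm setting, as the reply describes, that setting is the simpler route.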
