Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring on the Anserini end to enable cleaner Pyserini bindings #2584

Merged
merged 9 commits into from
Sep 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions docs/fatjar-regressions/fatjar-regressions-v0.37.0.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,19 +36,20 @@ curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v2.1-doc/search?quer

The json results are the same as the output of the `-outputRerankerRequests` option in `SearchCollection`, described below for TREC 2024 RAG.
Use the `hits` parameter to specify the number of hits to return, e.g., `hits=1000` to return the top 1000 hits.
Switch to `msmarco-v2.1-doc-segmented` in the route to query the segmented docs instead.

Details of the built-in webapp and REST API can be found [here](../rest-api.md).

## TREC 2024 RAG

For the TREC 2024 RAG Track, we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants).
For the [TREC 2024 RAG Track](https://trec-rag.github.io/), we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants).

❗ Beware, you need lots of space to run these experiments.
The `msmarco-v2.1-doc` prebuilt index is 63 GB uncompressed.
The `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed.
Both indexes will be downloaded automatically.

This release of Anserini comes with the test topic for the TREC 2024 RAG track (`-topics rag24.test`).
This release of Anserini comes with bindings for the test topics for the TREC 2024 RAG track (`-topics rag24.test`).
To generate jsonl output containing the raw documents that can be reranked and further processed, use the `-outputRerankerRequests` option to specify an output file.
For example:

Expand All @@ -61,7 +62,7 @@ java -cp $ANSERINI_JAR io.anserini.search.SearchCollection \
-outputRerankerRequests $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl
```

And the output looks something like:
And the output looks something like (pipe through `jq` to pretty-print):

```bash
$ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl | jq
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonStringVector \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.tsv.gz \
-topicReader TsvString \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-bioasq.test.txt \
-encoder BgeBaseEn15 -hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-encoder BgeBaseEn15 -hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonStringVector \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.tsv.gz \
-topicReader TsvString \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-bioasq.test.txt \
-encoder BgeBaseEn15 -hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-encoder BgeBaseEn15 -hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
Expand Down
164 changes: 163 additions & 1 deletion src/main/java/io/anserini/index/IndexInfo.java

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/BaseSearchArgs.java
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@
/**
* This is the base class that holds common arguments for configuring searchers. Note that, explicitly, there are no
* arguments that are specific to the retrieval implementation (e.g., for HNSW searchers), and that there are no
* arguments that define queries and outputs (which are to be defined by sub-classes that may call the searcher in
* arguments that define queries and outputs (which are to be defined by subclasses that may call the searcher in
* different ways).
*/
public class BaseSearchArgs {
Expand Down
132 changes: 88 additions & 44 deletions src/main/java/io/anserini/search/FlatDenseSearcher.java
Original file line number Diff line number Diff line change
Expand Up @@ -129,78 +129,122 @@
}
}

public SortedMap<K, ScoredDoc[]> batch_search(List<K> qids, List<String> queries, int hits) {
/**
* Searches the collection in batch using multiple threads.
*
* @param queries list of queries
* @param qids list of unique query ids
* @param k number of hits
* @param threads number of threads
* @return a map of query id to search results
*/
public SortedMap<K, ScoredDoc[]> batch_search(List<String> queries, List<K> qids, int k, int threads) {
final SortedMap<K, ScoredDoc[]> results = new ConcurrentSkipListMap<>();
final ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(args.threads);
final AtomicInteger cnt = new AtomicInteger();

final long start = System.nanoTime();
assert qids.size() == queries.size();
for (int i=0; i<qids.size(); i++) {
K qid = qids.get(i);
String queryString = queries.get(i);

// This is the per-query execution, in parallel.
executor.execute(() -> {
try {
results.put(qid, search(qid, queryString, hits));
} catch (IOException e) {
throw new CompletionException(e);
}

int n = cnt.incrementAndGet();
if (n % 100 == 0) {
LOG.info(String.format("%d queries processed", n));
}
});
}

executor.shutdown();
try(ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads)) {
assert qids.size() == queries.size();
for (int i = 0; i < qids.size(); i++) {
K qid = qids.get(i);
String queryString = queries.get(i);

try {
// Wait for existing tasks to terminate.
while (!executor.awaitTermination(1, TimeUnit.MINUTES));
} catch (InterruptedException ie) {
// (Re-)Cancel if current thread also interrupted.
executor.shutdownNow();
// Preserve interrupt status.
Thread.currentThread().interrupt();
// This is the per-query execution, in parallel.
executor.execute(() -> {
try {
results.put(qid, search(qid, queryString, k));
} catch (IOException e) {
throw new CompletionException(e);

Check warning on line 157 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L156-L157

Added lines #L156 - L157 were not covered by tests
}

int n = cnt.incrementAndGet();
if (n % 100 == 0) {
LOG.info(String.format("%d queries processed", n));

Check warning on line 162 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L162

Added line #L162 was not covered by tests
}
});
}

executor.shutdown();

try {
// Wait for existing tasks to terminate.
while (!executor.awaitTermination(1, TimeUnit.MINUTES));
} catch (InterruptedException ie) {

Check warning on line 172 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L172

Added line #L172 was not covered by tests
// (Re-)Cancel if current thread also interrupted.
executor.shutdownNow();

Check warning on line 174 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L174

Added line #L174 was not covered by tests
// Preserve interrupt status.
Thread.currentThread().interrupt();

Check warning on line 176 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L176

Added line #L176 was not covered by tests
}
}
final long durationMillis = TimeUnit.MILLISECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS);

LOG.info(queries.size() + " queries processed in " +
DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss") +
LOG.info("{} queries processed in {}{}", queries.size(),
DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss"),
String.format(" = ~%.2f q/s", queries.size() / (durationMillis / 1000.0)));

return results;
}

public ScoredDoc[] search(float[] queryFloat, int hits) throws IOException {
return search(null, queryFloat, hits);
/**
* Searches the collection with a query vector.
*
* @param query query vector
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(float[] query, int k) throws IOException {
return search(null, query, k);

Check warning on line 197 in src/main/java/io/anserini/search/FlatDenseSearcher.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/io/anserini/search/FlatDenseSearcher.java#L197

Added line #L197 was not covered by tests
}

public ScoredDoc[] search(@Nullable K qid, float[] queryFloat, int hits) throws IOException {
KnnFloatVectorQuery query = new KnnFloatVectorQuery(Constants.VECTOR, queryFloat, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(query, hits, BREAK_SCORE_TIES_BY_DOCID, true);
/**
* Searches the collection with a query vector.
*
* @param qid query id
* @param query query vector
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(@Nullable K qid, float[] query, int k) throws IOException {
KnnFloatVectorQuery vectorQuery = new KnnFloatVectorQuery(Constants.VECTOR, query, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(vectorQuery, k, BREAK_SCORE_TIES_BY_DOCID, true);

return super.processLuceneTopDocs(qid, topDocs);
}

public ScoredDoc[] search(String queryString, int hits) throws IOException {
return search(null, queryString, hits);
/**
* Searches the collection with a string query that will be encoded by the underlying encoder.
*
* @param query query
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(String query, int k) throws IOException {
return search(null, query, k);
}

public ScoredDoc[] search(@Nullable K qid, String queryString, int hits) throws IOException {
/**
* Searches the collection with a string query that will be encoded by the underlying encoder.
*
* @param qid query id
* @param query query
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(@Nullable K qid, String query, int k) throws IOException {
if (encoder != null) {
try {
return search(qid, encoder.encode(queryString), hits);
return search(qid, encoder.encode(query), k);
} catch (OrtException e) {
throw new RuntimeException("Error encoding query.");
}
}

KnnFloatVectorQuery query = generator.buildQuery(Constants.VECTOR, queryString, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(query, hits, BREAK_SCORE_TIES_BY_DOCID, true);
KnnFloatVectorQuery vectorQuery = generator.buildQuery(Constants.VECTOR, query, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(vectorQuery, k, BREAK_SCORE_TIES_BY_DOCID, true);

return super.processLuceneTopDocs(qid, topDocs);
}
Expand Down
Loading
Loading