This page documents how to reproduce results from two "neural hype" papers, which questioned whether neural ranking models actually improve ad hoc retrieval effectiveness over well-tuned "competitive baselines" in limited-data scenarios:
- Jimmy Lin. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum, 52(2):40-51, 2018.
- Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. SIGIR 2019.
The "competitive baseline" referenced in the two above papers is BM25+RM3, with proper parameter tuning, on the test collection from the TREC 2004 Robust Track (Robust04). Scripts referenced on this page encode automated regressions that allow users to recreate and verify the results reported below.
The SIGIR Forum article references commit `2c8cd7a` (11/16/2018), the results of which changed slightly with an upgrade to Lucene 7.6 at commit `e71df7a` (12/18/2018). The SIGIR 2019 paper contains experiments performed after this upgrade. The Anserini upgrade to Lucene 8.0 at commit `75e36f9` (6/12/2019) broke the regression tests, which were later fixed at commit `64bae9c` (7/3/2019). In September 2023, the regression results were updated at commit `6e148c6` (9/16/2023). This commit patched effectiveness differences arising from two main sources: (1) the upgrade to Lucene 9 at commit `2725655` (8/2/2022) and (2) a fastutil upgrade/bug fix at #1975 that affected relevance feedback results. To our knowledge, this commit represents the latest state of the code at which the effectiveness figures encoded in our scripts can be successfully reproduced. See the summary in the "History" section below.
Retrieval models are tuned with respect to the following fold definitions:
- Folds for 2-fold cross-validation used in "paper 1"
- Folds for 5-fold cross-validation used in "paper 2"
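For orientation, here is a hypothetical sketch (written as a Python literal so it can carry comments) of the kind of mapping a fold-definition file encodes; the actual JSON layout in the repo may differ, and the topic ids shown are examples only:

```python
# Hypothetical sketch of a fold definition; the actual JSON layout may differ.
# Each fold lists the Robust04 topic ids held out for evaluation in that fold;
# parameters for a fold are tuned on the topics of the remaining fold(s).
folds = {
    "fold1": ["301", "302", "624"],  # example topic ids only
    "fold2": ["303", "412", "700"],
}
```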
Here are expected results for various retrieval models:
| AP                 | Paper 1 | Paper 2 |
|:-------------------|--------:|--------:|
| BM25 (default)     | 0.2531  | 0.2531  |
| BM25 (tuned)       | 0.2539  | 0.2531  |
| QL (default)       | 0.2467  | 0.2467  |
| QL (tuned)         | 0.2520  | 0.2499  |
| BM25+RM3 (default) | 0.2903  | 0.2903  |
| BM25+RM3 (tuned)   | 0.3043  | 0.3021  |
| BM25+Ax (default)  | 0.2896  | 0.2896  |
| BM25+Ax (tuned)    | 0.2940  | 0.2950  |
(Clarification, 2023/09): Note that these effectiveness figures are from our papers, which may not be what the code currently produces. See notes about differences in regression results above.
Before starting, modify the index path in `src/main/resources/fine_tuning/collections.yaml`. The tuning script goes through the `index_roots`, concatenates each with the collection's `index_path`, and takes the first match as the location of the index.
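As a rough sketch, the relevant entries might look like the following; the field names come from the description above, but the exact structure of `collections.yaml` and the paths shown are assumptions:

```yaml
# Hypothetical sketch; check the actual file for the exact structure.
index_roots:                    # candidate root directories, tried in order
  - /path/to/indexes
  - /alternate/indexes
collections:
  robust04:
    index_path: lucene-index.disk45   # appended to each root; first existing match wins
```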
Tuning BM25:
```bash
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```
The first command runs the parameter sweeps and prints general statistics. The second and third commands use a specific fold setting to perform cross-validation and print out model parameters.
Tuning QL (commands similarly organized):
```bash
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```
Tuning BM25+RM3 (commands similarly organized):
```bash
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```
Tuning BM25+Ax (commands similarly organized):
```bash
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```
Tuned parameter values for BM25+RM3:
- For the 2-fold cross-validation used in "paper 1", in terms of MAP
- For the 5-fold cross-validation used in "paper 2", in terms of MAP
To be clear, the parameters reported for each fold are tuned on the remaining folds and then applied to that (held-out) fold.
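To make the protocol concrete, here is a minimal sketch of per-fold parameter selection; this is not the repo's actual code, and `map_scores` is an assumed data structure holding the MAP of each parameter setting on each fold, as produced by the parameter sweeps:

```python
# Minimal sketch of the cross-validation protocol described above (assumed
# data layout): map_scores[params][fold] is the MAP of a parameter setting
# (e.g., the string "k1=0.9,b=0.4") on a given fold.

def cross_validate(map_scores: dict, folds: list) -> float:
    """Score each fold with the parameters that maximize MAP on the others."""
    per_fold = []
    for test_fold in folds:
        train = [f for f in folds if f != test_fold]
        # Select parameters using the training folds only.
        best = max(map_scores,
                   key=lambda p: sum(map_scores[p][f] for f in train) / len(train))
        per_fold.append(map_scores[best][test_fold])
    # The reported figure averages effectiveness over the held-out folds.
    return sum(per_fold) / len(per_fold)
```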
The following script will reconstruct the tuned runs for BM25+RM3:
```bash
python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
    --index indexes/lucene-index.disk45 \
    --folds src/main/resources/fine_tuning/robust04-paper1-folds.json \
    --params src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json \
    --output run.robust04.bm25+rm3.paper1.txt
```
Change `paper1` to `paper2` to reconstruct the runs using the folds from paper 2, as in the following command:
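```bash
python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
    --index indexes/lucene-index.disk45 \
    --folds src/main/resources/fine_tuning/robust04-paper2-folds.json \
    --params src/main/resources/fine_tuning/params/params.map.robust04-paper2-folds.bm25+rm3.json \
    --output run.robust04.bm25+rm3.paper2.txt
```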
To reconstruct runs from other retrieval models, use the parameter definitions in `src/main/resources/fine_tuning/params/`, plugging them into the above command as appropriate.
Note that applying `trec_eval` to these reconstructed runs might yield AP values that differ slightly from those reported above (by at most 0.0001). This difference arises from rounding when averaging across the folds.
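As a toy illustration of how such a discrepancy can arise (the numbers below are made up):

```python
# Toy illustration with made-up numbers: per-fold AP values rounded to four
# decimals before averaging can drift slightly from the unrounded average.
per_fold_ap = [0.30146, 0.29846]   # hypothetical unrounded per-fold AP
reported = sum(round(x, 4) for x in per_fold_ap) / len(per_fold_ap)
exact = sum(per_fold_ap) / len(per_fold_ap)
print(f"{reported:.5f} vs {exact:.5f}")   # 0.30000 vs 0.29996
```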
(Clarification, 2023/09): Note that the commands above reconstruct runs based on the tuned parameters from our papers. The effectiveness results may differ from those reported in our papers due to the regression differences described above.
The following documents the commits that have altered effectiveness figures:
- commit `6e148c6` (9/16/2023): regression experiments updated.
- commit `64bae9c` (7/3/2019): the regression experiments on this page were fixed.
- commit `75e36f9` (6/12/2019): the upgrade to Lucene 8.0 broke the regression experiments on this page.
- commit `407f308` (1/2/2019): added results for axiomatic semantic term matching.
- commit `e71df7a` (12/18/2018): upgrade to Lucene 7.6.
- commit `2c8cd7a` (11/16/2018): commit id referenced in the SIGIR Forum article.