Skip to content

MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering

License

Notifications You must be signed in to change notification settings

openai/mle-bench

Repository files navigation

MLE-bench

Code for the paper "MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering". We have released the code used to construct the dataset, the evaluation logic, as well as the agents we evaluated for this benchmark.

Setup

Some MLE-bench competition data is stored using Git-LFS. Once you have downloaded and installed LFS, run:

git lfs fetch --all
git lfs pull

You can install mlebench with pip:

pip install -e .

Pre-Commit Hooks (Optional)

If you're committing code, you can install the pre-commit hooks by running:

pre-commit install

Dataset

The MLE-bench dataset is a collection of 75 Kaggle competitions which we use to evaluate the ML engineering capabilities of AI systems.

Since Kaggle does not provide the held-out test set for each competition, we provide preparation scripts that split the publicly available training set into a new training and test set.

For each competition, we also provide grading scripts that can be used to evaluate the score of a submission.

We use the Kaggle API to download the raw datasets. Ensure that you have downloaded your Kaggle credentials (kaggle.json) and placed it in the ~/.kaggle/ directory (this is the default location where the Kaggle API looks for your credentials). To download and prepare the MLE-bench dataset, run the following, which will download and prepare the dataset in your system's default cache directory. Note, we've found this to take two days when running from scratch:

mlebench prepare --all

Alternatively, you can prepare the dataset for a specific competition by running:

mlebench prepare -c <competition-id>

Run mlebench prepare --help to see the list of available competitions.

Grading Submissions

Answers for competitions must be submitted in CSV format; the required format is described in each competition's description, or shown in a competition's sample submission file. You can grade multiple submissions by using the mlebench grade command. Given a JSONL file, where each line corresponds with a submission for one competition, mlebench grade will produce a grading report for each competition. The JSONL file must contain the following fields:

  • competition_id: the ID of the competition in our dataset.
  • submission_path: a .csv file with the predictions for the specified competition.

See more information by running mlebench grade --help.

You can also grade individual submissions using the mlebench grade-sample command. For example, to grade a submission for the Spaceship Titanic competition, you can run:

mlebench grade-sample <PATH_TO_SUBMISSION> spaceship-titanic

See more information by running mlebench grade-sample --help.

Environment

We provide a base Docker image mlebench-env which is the base environment for our agents. This base image contains:

  • Conda environment used to execute our agents. We optionally (default true) install Python packages in this environment which are commonly used across our agents. If you don't want to install these packages, set the INSTALL_HEAVY_DEPENDENCIES environment variable to false when building the image, by adding --build-arg INSTALL_HEAVY_DEPENDENCIES=false to the docker build command below
  • Instructions for agents to follow when creating their submission
  • Grading server for agents to use when checking that the structure of their submission is correct

Build this image by running:

docker build --platform=linux/amd64 -t mlebench-env -f environment/Dockerfile .

Agents

We purposefully designed our benchmark to not make any assumptions about the agent that produces submissions, so agents can more easily be evaluated on this benchmark. We evaluated three open-source agents; we discuss this procedure in agents/README.md.

Extras

We include additional features in the MLE-bench repository that may be useful for MLE-bench evaluation. These include a rule violation detector and a plagiarism detector. We refer readers to extras/README.md for more information.

Examples

We collect example usage of this library in the examples/ directory, see examples/README.md for more information.

Experiments

We place the code specific to the experiments from our publication of the benchmark in the experiments/ directory:

  • For instance, our competition splits are available in experiments/splits/.
  • For a completed set of runs from a given agent, you can use the provided experiments/make_submission.py script to compile its submission for grading.
  • We release our methodology for the "familiarity" experiments in experiments/familiarity/, see experiments/familiarity/README.md for more information.

Dev

Note, when running pytest locally, be sure to accept the competition rules otherwise the tests will fail.

Authors

Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry

Citation

Please cite using the following BibTeX entry:

@article{chan2024mle-bench,
  title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
  author={Jun Shern Chan and Neil Chowdhury and Oliver Jaffe and James Aung and Dane Sherburn and Evan Mays and Giulio Starace and Kevin Liu and Leon Maksin and Tejal Patwardhan and Lilian Weng and Aleksander Mądry},
  year={2024},
  eprint={2410.07095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07095}
}