feat: Add Weights and Biases support (EleutherAI#1339)
* add wandb as extra dependency

* wandb metrics logging

* refactor

* log samples as tables

* fix linter

* refactor: put in a class

* change dir

* add panels

* log eval as table

* improve tables logging

* improve reports logging

* precommit run

* ruff check

* handle importing reports api gracefully

* ruff

* compare results

* minor pre-commit fixes

* build comparison report

* ruff check

* log results as artifacts

* remove comparison script

* update dependency

* type annotate and docstring

* add example

* update readme

* fix typo

* teardown

* handle outside wandb run

* gracefully fail reports creation

* precommit checks

* add report url to summary

* use wandb  printer for better url stdout

* fix ruff

* handle N/A and groups

* fix eval table

* remove unused var

* update wandb version req + disable reports stdout

* remove reports feature to TODO

* add label to multi-choice question data

* log model predictions

* lints

* loglikelihood_rolling

* log eval result for groups

* log tables by group for better handling

* precommit

* choices column for multi-choice

* graciously fail wandb

* remove reports feature

* track system metrics + total eval time + stdout

---------

Co-authored-by: Lintang Sutawika <[email protected]>
2 people authored and nightingal3 committed May 2, 2024
1 parent 6be1045 commit ef5aadd
Showing 6 changed files with 582 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -16,3 +16,5 @@ temp
# IPython
profile_default/
ipython_config.py
wandb
examples/wandb
39 changes: 39 additions & 0 deletions README.md
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github
## Visualizing Results
You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.
### Zeno
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).

### Weights and Biases

With the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.

The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization (illustrated in the sketch below),
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task- and CLI-specific configs,
- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.
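
As an illustration of the Tables logging above, here is a minimal, hypothetical sketch of logging per-sample results with `wandb.Table`; the column names, sample values, and the `demo_task` key are invented for this example and are not the integration's exact schema:

```python
# Hypothetical sketch: logging per-sample eval results as a W&B Table.
# The columns, sample values, and "demo_task" key are illustrative only.
import wandb

run = wandb.init(project="lm-eval-harness-integration")

samples = [
    {"doc_id": 0, "target": "4", "prediction": "4", "correct": True},
    {"doc_id": 1, "target": "Paris", "prediction": "Rome", "correct": False},
]

# Build a table with one row per evaluated sample.
table = wandb.Table(columns=["doc_id", "target", "prediction", "correct"])
for s in samples:
    table.add_data(s["doc_id"], s["target"], s["prediction"], s["correct"])

run.log({"demo_task_eval_samples": table})
run.finish()
```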

First, you'll need to install the `lm_eval[wandb]` package extra: run `pip install lm_eval[wandb]`.

Authenticate your machine with your unique W&B token: visit https://wandb.ai/authorize to get one, then run `wandb login` in your command-line terminal.

Run the eval harness as usual, adding the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string.

```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```

In the stdout, you will find the link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).
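
For reference, the value passed to `--wandb_args` is just a list of `key=value` pairs. Below is a minimal sketch, not the harness's actual parser, of how such a string could be turned into keyword arguments for `wandb.init`:

```python
# Minimal sketch of parsing a comma-separated --wandb_args string into
# wandb.init kwargs. Illustrative only; the harness's own parsing may differ.
import wandb


def parse_wandb_args(arg_string: str) -> dict:
    """Turn 'project=lm-eval,job_type=eval' into {'project': 'lm-eval', 'job_type': 'eval'}."""
    kwargs = {}
    for pair in arg_string.split(","):
        if not pair:
            continue
        key, value = pair.split("=", 1)
        kwargs[key.strip()] = value.strip()
    return kwargs


run = wandb.init(**parse_wandb_args("project=lm-eval-harness-integration,job_type=eval"))
run.finish()
```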

## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
130 changes: 130 additions & 0 deletions examples/visualize-wandb.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
"> Note that if you are using command line you can simply authenticate your machine by doing `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
23 changes: 23 additions & 0 deletions lm_eval/__main__.py
@@ -11,6 +11,7 @@
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table

@@ -167,6 +168,11 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--wandb_args",
default="",
help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval",
)
parser.add_argument(
"--predict_only",
"-x",
@@ -195,6 +201,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
# we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args()

if args.wandb_args:
wandb_logger = WandbLogger(args)

eval_logger = utils.eval_logger
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
eval_logger.info(f"Verbosity set to {args.verbosity}")
@@ -309,6 +318,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))

# Add W&B logging
if args.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if args.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to Weights and Biases failed due to {e}")

if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)

@@ -334,6 +353,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if "groups" in results:
print(make_table(results, "groups"))

if args.wandb_args:
# Tear down wandb run once all the logging is done.
wandb_logger.run.finish()


if __name__ == "__main__":
cli_evaluate()