
feat: Add Weights and Biases support #1339

Merged: 57 commits into EleutherAI:main on Feb 22, 2024

Conversation

@ayulockin (Contributor) commented on Jan 23, 2024

In #359, @parambharat proposed adding support for W&B logging, but that was before the big refactor landed.

As a user of both lm-evaluation-harness and wandb, I have opened this PR to add support for W&B logging.

Functionalities

The integration provides the following functionality (a minimal sketch of the underlying W&B calls follows this list):

  • automatically log the evaluation results,
  • log the samples as W&B Tables for easy visualization,
  • log the results.json file as an artifact for version control,
  • log the <task_name>_eval_samples.json file if the samples are logged,
  • log task- and CLI-specific configs,
  • and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.
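
For readers unfamiliar with W&B, here is a minimal sketch of the kind of raw wandb calls the integration builds on. It is not the PR's actual implementation; the results and samples variables are placeholders standing in for harness outputs.

import json

import wandb

# Placeholder data standing in for lm-eval-harness outputs.
results = {"results": {"hellaswag": {"acc,none": 0.55}}}
samples = [{"doc_id": 0, "target": "A", "resps": ["A"]}]

run = wandb.init(project="lm-eval-harness-integration")

# Scalar metrics show up in the run summary and charts.
run.log({"hellaswag/acc": results["results"]["hellaswag"]["acc,none"]})

# Per-sample outputs go into a W&B Table for easy inspection.
table = wandb.Table(columns=["doc_id", "target", "response"])
for s in samples:
    table.add_data(s["doc_id"], s["target"], s["resps"][0])
run.log({"hellaswag_eval_samples": table})

# The raw results.json is versioned as an artifact.
with open("results.json", "w") as f:
    json.dump(results, f)
artifact = wandb.Artifact("evaluation-results", type="eval_results")
artifact.add_file("results.json")
run.log_artifact(artifact)

run.finish()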

Installation:

pip install lm_eval[wandb]

Run Eval Harness:

lm_eval \
    --model hf \
    --model_args pretrained=microsoft/phi-2,trust_remote_code=True \
    --tasks hellaswag,mmlu_abstract_algebra \
    --device cuda:0 \
    --batch_size 8 \
    --output_path output/phi-2 \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples

Example

Here's a W&B run page for an lm-eval-harness run on the hellaswag and mmlu_abstract_algebra tasks using the microsoft/phi-2 model.

Here's the automatically generated report: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/reports/-2024-02-09-12-16-01-wdp5ubxs-Evaluation-report--Vmlldzo2NjgzMDkz

@CLAassistant commented on Jan 23, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member) commented:

@ayulockin Thanks a ton for the overhaul! Would you be able to add a notebook to /examples/ demo'ing the features?

@ayulockin (Contributor, Author) commented:

Hey @StellaAthena, I will add a notebook today. Just finalising a few more features. Will fix failing CIs too.

@haileyschoelkopf (Collaborator) commented:

Thanks very much! The support generally looks great to me; it would just be preferable to move wandb.py outside of the api/ folder, as that folder largely contains abstract base classes / core building blocks. (Additionally, won't naming the file wandb.py risk causing circular-import issues by shadowing the wandb package name? Perhaps wandb_logger.py would be preferable.)

We might consider putting this into a new lm_eval/logging_utils.py and moving our current logger into that file as well, and/or allowing this file to later contain things like TensorBoard support or other third-party loggers if they are contributed or requested.

@ayulockin (Contributor, Author) commented:

Hey @haileyschoelkopf, sounds good. I was about to ask for a decision on the namespace/scope. I will move the logging code to lm_eval/logging_utils.py. I have also condensed the different functions into a single class for better state management.

@haileyschoelkopf (Collaborator) commented on Feb 1, 2024

@ayulockin just let me know if this is all ready to review! I see there have been more changes, so I'm not sure whether this is the final version yet.

@ayulockin (Contributor, Author) commented:

Hey @haileyschoelkopf, this is still a work in progress. There are a few more improvements I wanna push in, likely by EOD. I will ping you once it's ready. Thanks 😊

@haileyschoelkopf (Collaborator) commented:

No problem and no rush! Just wanted to make sure it wasn't blocked on me

@Luodian commented on Feb 2, 2024

Looking forward to it!

@ayulockin (Contributor, Author) commented:

Hey @haileyschoelkopf, after updating the branch (also tested from the main branch) I am getting this assertion error:

The command:

lm_eval --model hf --model_args pretrained=microsoft/phi-2,trust_remote_code=True --tasks hellaswag,mmlu_abstract_algebra --device cuda:0 --batch_size 32 --output_path output/phi-2 --limit 0.001 --log_samples

The error:

2024-02-03:04:37:01,971 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2024-02-03:04:37:02,434 INFO     [config.py:58] PyTorch version 2.1.2+cu118 available.
2024-02-03:04:37:04,261 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-03:04:37:04,262 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-03:04:37:10,640 WARNING  [__main__.py:174]  --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2024-02-03:04:37:10,641 WARNING  [__main__.py:224] File already exists at output/phi-2. Results will be overwritten.
2024-02-03:04:37:10,641 INFO     [__main__.py:238] Selected Tasks: ['hellaswag', 'mmlu_abstract_algebra']
2024-02-03:04:37:10,641 INFO     [__main__.py:239] Loading selected tasks...
2024-02-03:04:37:10,665 WARNING  [logging.py:61] Detected kernel version 4.19.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-02-03:04:37:10,665 INFO     [huggingface.py:148] Using device 'cuda:0'
Loading checkpoint shards: 100%|█████████████████████████████████| 2/2 [00:03<00:00,  1.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-02-03:04:37:15,082 INFO     [evaluator.py:139] get_task_dict has been updated to accept an optional argument, `task_manager`Read more here: https://github.com/EleutherAI/lm-evaluation-harness/blob/recursive-groups/docs/interface.md#external-library-usage
/opt/conda/envs/lm-eval/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for hellaswag contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/hellaswag
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
/opt/conda/envs/lm-eval/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for hails/mmlu_no_train contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/hails/mmlu_no_train
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
2024-02-03:04:37:23,076 INFO     [task.py:360] Building contexts for task on rank 0...
Traceback (most recent call last):
  File "/opt/conda/envs/lm-eval/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/__main__.py", line 241, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/evaluator.py", line 179, in simple_evaluate
    results = evaluate(
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/evaluator.py", line 328, in evaluate
    task.build_all_requests(limit=limit, rank=lm.rank, world_size=lm.world_size)
  File "/home/ayushthakur/lm-eval/lm-evaluation-harness/lm_eval/api/task.py", line 385, in build_all_requests
    assert len(self._instances) != 0, "task.build_requests() did not find any docs!"
AssertionError: task.build_requests() did not find any docs!

@haileyschoelkopf (Collaborator) commented:

Thanks for reporting!

I think that using --limit 0.01 or, say, --limit 10 would fix this; I believe it's because mmlu_abstract_algebra has fewer than 1000 docs, but your --limit passes 1/1000, which rounds down to zero docs.

One of the incoming PRs fixes this by rounding up the number of docs when a float is passed to limit; I will see if we can't get that merged sooner.
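
For illustration, a sketch of the rounding behaviour described above; the function name and signature are assumed here, not taken from the harness.

import math

def resolve_limit(limit, num_docs):
    # Interpret --limit as a fraction of the docs when it is below 1.0,
    # rounding up so at least one doc survives; otherwise treat it as a count.
    if limit is None:
        return num_docs
    if limit < 1.0:
        return max(1, math.ceil(limit * num_docs))
    return int(limit)

# e.g. resolve_limit(0.001, 100) -> 1 rather than 0, avoiding the
# "task.build_requests() did not find any docs!" assertion shown above.
print(resolve_limit(0.001, 100))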

@Luodian commented on Feb 4, 2024

Is there a way to disable this line's log?

import wandb.apis.reports as wr

It gets printed multiple times on different ranks.

[screenshot: the same log line repeated once per rank]
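
One possible workaround, sketched under the assumption that the duplicate lines come from every process importing the reports API: gate the import on the process rank. RANK here is the environment variable set by torchrun-style launchers, which is an assumption about the launch setup.

import os

wr = None
# Only touch the reports API on rank 0 so the notice is printed once.
if int(os.environ.get("RANK", "0")) == 0:
    try:
        import wandb.apis.reports as wr  # noqa: F401
    except ImportError:
        pass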

@haileyschoelkopf (Collaborator) commented:

Hi @ayulockin, tested this out a bit and it looks really nice, thank you for the work on it!

> Hey @haileyschoelkopf, does the --predict_only flag write model outputs only for those tasks whose output type is generate_until?

Regarding this, #1441 should fix it!

> Hey @haileyschoelkopf, I was able to add some logic in this commit to log model outputs to a W&B Table.

These tables look very nice overall! One thing:

  • For multiple_choice tasks, a column providing the list of possible answer strings would be really nice (see the sketch below). I like what you did with loglikelihood_rolling!

Approved, conditional on the minor updates to the tables!
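
As an illustration of the requested column, a sketch of adding the candidate answer strings for multiple_choice tasks to a W&B Table; the per-sample dict layout here is assumed for illustration, not taken from the PR.

import wandb

# Illustrative per-sample records in a loosely lm-eval-like shape (assumed):
# "arguments" holds (context, continuation) pairs and
# "filtered_resps" holds (loglikelihood, is_greedy) pairs.
samples = [
    {
        "doc_id": 0,
        "arguments": [("Q: 2 + 2 = ?", " 3"), ("Q: 2 + 2 = ?", " 4")],
        "target": 1,
        "filtered_resps": [(-4.2, False), (-0.3, True)],
    }
]

run = wandb.init(project="lm-eval-harness-integration")
table = wandb.Table(columns=["doc_id", "choices", "target", "prediction"])

for s in samples:
    choices = [continuation for _, continuation in s["arguments"]]  # candidate answer strings
    pred_idx = max(range(len(choices)), key=lambda i: s["filtered_resps"][i][0])
    table.add_data(s["doc_id"], " | ".join(choices), s["target"], choices[pred_idx])

run.log({"multiple_choice_eval_samples": table})
run.finish()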

@ayulockin (Contributor, Author) commented:

Hey @haileyschoelkopf, thank you for the feedback. I am working on fixing the table formatting.

@ayulockin (Contributor, Author) commented on Feb 21, 2024

Hey @haileyschoelkopf, what would be the best way to find which task belongs to which group? In the case of mmlu, mmlu is the parent group, with four sub-groups and multiple tasks within each sub-group.

@haileyschoelkopf (Collaborator) commented:

Hi @ayulockin, you should be able to get group membership via list(reversed(task_hierarchy.items()))!
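
For illustration, a sketch of turning that hierarchy into a task-to-group lookup; the exact shape of task_hierarchy (a group name mapped to a list of member task/group names) is an assumption here, and the entries are made up.

# Hedged sketch: derive a task -> parent group lookup from an assumed
# {group_name: [member task/group names]} mapping.
task_hierarchy = {
    "mmlu": ["mmlu_stem", "mmlu_humanities"],
    "mmlu_stem": ["mmlu_abstract_algebra"],
    "mmlu_humanities": ["mmlu_philosophy"],
}

task_to_group = {}
for group, members in reversed(list(task_hierarchy.items())):
    for name in members:
        task_to_group[name] = group

print(task_to_group.get("mmlu_abstract_algebra"))  # -> mmlu_stem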

@haileyschoelkopf (Collaborator) commented:

I think this PR is good to go now? @ayulockin were there any other last-minute things you wanted to address?

@ayulockin (Contributor, Author) commented:

Hey @haileyschoelkopf I am addressing one final bit. The PR will be ready in a couple of hrs.

@ayulockin (Contributor, Author) commented on Feb 22, 2024

Hey @haileyschoelkopf, one final nit that I tried and want to bring to your attention for transparency.

Currently the wandb logic in the __main__.py file is confined to one block, as shown below (L319-L328):

            try:
                wandb_logger = WandbLogger(results, args)
                wandb_logger.log_eval_result()
                if args.log_samples:
                    wandb_logger.log_eval_samples(samples)
                # Tear down wandb run once all the logging is done.
                wandb_logger.run.finish()
            except Exception as e:
                eval_logger.info(f"Logging to Weights and Biases failed due to {e}")

I did this so that I don't spread the logic all over the __main__.py file. However, if you are okay with me doing so, I can split the above block of code.

This goes right after the arguments are parsed and will initialise a W&B run:

if args.wandb_args:
    wandb_logger = WandbLogger(args)

And at the end of the script do:

if args.wandb_args:
    wandb_logger.run.finish()

The rest of the diff stays as before.

The benefits of doing so:

  • we get the exact run time of the script, so the user knows how long the evaluation took;
  • we get the GPU and CPU utilisation, which helps the user see how well their hardware is being used;
  • the stdout is logged as well.

Check out this run page: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/runs/jz3fuidc/system?workspace=user-ayush-thakur

If you are okay with it, I will make the commit now and we are good to go from my end. Otherwise, the PR is complete from my end.

@haileyschoelkopf (Collaborator) commented:

This sounds good to me, thanks @ayulockin for all your work on this!

@ayulockin (Contributor, Author) commented:

Thanks @haileyschoelkopf. The PR is complete from my end. :)

@haileyschoelkopf (Collaborator) commented:

Unsure why the linter test job is failing; the checks all seem to pass when I run them locally!

@haileyschoelkopf merged commit 2683fbb into EleutherAI:main on Feb 22, 2024 (7 of 8 checks passed).
A review thread was opened on this diff snippet:

        filtered_resps = [
            np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
        ]
    elif config["output_type"] == "loglikelihood_rolling":
@LSinev (Contributor) commented:

Is it intended that the code in the two elifs (for generate_until and loglikelihood_rolling) is the same? If it is not a copy-paste mistake, could this be changed to a single check like config["output_type"] in {"generate_until", "loglikelihood_rolling"}?

@ayulockin (Contributor, Author) replied:

Oh yeah, you are right; an oversight on my part. I am still working with this repo on a few projects, so I will incorporate the code trim in a separate PR. Thanks for catching this, @LSinev :)
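
For concreteness, the suggested trim might look like the following sketch; the shared branch body is elided because the snippet above does not show it in full, and config is a placeholder here.

config = {"output_type": "loglikelihood_rolling"}  # placeholder for the task config dict

if config["output_type"] in ("generate_until", "loglikelihood_rolling"):
    ...  # handling that was previously duplicated across the two elif branches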

A second review thread was opened on this diff snippet:

    metrics[metric] = [x[metric] for x in data]

    if config["output_type"] == "loglikelihood":
        instance = [x["arguments"][0][0] for x in data]
A reviewer (Contributor) commented, quoting this line:

    instance = [x["arguments"][0][0] for x in data]

This part seems to be the same in all the if cases; is it really necessary to define it inside each conditional branch?
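
A sketch of the suggested hoist, with placeholder data and config since the surrounding code is not shown in full above:

data = [{"arguments": [("some context", " a continuation")]}]  # placeholder records
config = {"output_type": "loglikelihood"}                       # placeholder task config

# Compute `instance` once instead of repeating the same comprehension in every branch.
instance = [x["arguments"][0][0] for x in data]
if config["output_type"] == "loglikelihood":
    ...  # only branch-specific handling remains here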

wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request Mar 13, 2024
* add wandb as extra dependency

* wandb metrics logging

* refactor

* log samples as tables

* fix linter

* refactor: put in a class

* change dir

* add panels

* log eval as table

* improve tables logging

* improve reports logging

* precommit run

* ruff check

* handle importing reports api gracefully

* ruff

* compare results

* minor pre-commit fixes

* build comparison report

* ruff check

* log results as artifacts

* remove comparison script

* update dependency

* type annotate and docstring

* add example

* update readme

* fix typo

* teardown

* handle outside wandb run

* gracefully fail reports creation

* precommit checks

* add report url to summary

* use wandb  printer for better url stdout

* fix ruff

* handle N/A and groups

* fix eval table

* remove unused var

* update wandb version req + disable reports stdout

* remove reports feature to TODO

* add label to multi-choice question data

* log model predictions

* lints

* loglikelihood_rolling

* log eval result for groups

* log tables by group for better handling

* precommit

* choices column for multi-choice

* graciously fail wandb

* remove reports feature

* track system metrics + total eval time + stdout

---------

Co-authored-by: Lintang Sutawika <[email protected]>
nightingal3 pushed a commit to mycoalchen/lm-evaluation-harness that referenced this pull request May 2, 2024 (same commit list as above).

djstrong pushed a commit to speakleash/lm-evaluation-harness that referenced this pull request Aug 2, 2024 (same commit list as above).