[WIP] Error logging callback #2533

Closed
wants to merge 178 commits into from

bmosaicml
Contributor

@bmosaicml bmosaicml commented Sep 12, 2023

What does this PR do?

This PR adds a callback that logs in-context learning (ICL) outputs during eval. It modifies the custom metrics so that each one caches the model outputs it scored as incorrect. Each metric is responsible for specifying the table schema used to log its cached responses and for formatting those responses with the tokenizer.

The EvalOutputLogging callback is then responsible for logging the cached results in table format after each evaluation.
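
To make the division of labor concrete, here is a minimal, library-free sketch of the idea described above. It is not the code in this PR: `ExactMatchWithCache`, `format_response_cache`, `log_eval_outputs`, the `decode` function, and the `log_table` sink are all hypothetical names chosen for illustration, and a toy vocabulary stands in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class ExactMatchWithCache:
    """Hypothetical metric that remembers the examples the model got wrong."""

    correct: int = 0
    total: int = 0
    # Each cached row holds (prompt ids, predicted ids, label ids).
    response_cache: List[Tuple[List[int], List[int], List[int]]] = field(default_factory=list)
    # Table schema the metric asks the logging step to use.
    columns: Tuple[str, ...] = ("prompt", "prediction", "label")

    def update(self, prompts, predictions, labels) -> None:
        for prompt, pred, label in zip(prompts, predictions, labels):
            self.total += 1
            if list(pred) == list(label):
                self.correct += 1
            else:
                # Only incorrect outputs are cached for later inspection.
                self.response_cache.append((list(prompt), list(pred), list(label)))

    def compute(self) -> float:
        return self.correct / self.total if self.total else 0.0

    def format_response_cache(self, decode: Callable[[Sequence[int]], str]) -> List[List[str]]:
        # The metric decides how its cached token ids become readable cells.
        return [[decode(ids) for ids in row] for row in self.response_cache]


def log_eval_outputs(metric, decode, log_table) -> None:
    """Hypothetical stand-in for the callback's end-of-eval step."""
    rows = metric.format_response_cache(decode)
    log_table(list(metric.columns), rows)
    metric.response_cache.clear()  # start the next evaluation with an empty cache


# Toy usage: a three-token vocabulary stands in for a tokenizer.
vocab = {0: "a", 1: "b", 2: "c"}
decode = lambda ids: "".join(vocab[i] for i in ids)

metric = ExactMatchWithCache()
metric.update(prompts=[[0, 1], [1, 2]], predictions=[[2], [0]], labels=[[2], [1]])

logged = []
log_eval_outputs(metric, decode, lambda cols, rows: logged.append((cols, rows)))
```

In this sketch the metric owns both the schema (`columns`) and the formatting, so the logging step only has to forward a table to whatever sink is configured. That mirrors the split described above, under the stated assumptions.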

Design doc

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@bmosaicml bmosaicml requested review from a team, eracah and dakinggg as code owners September 12, 2023 20:11
bmosaicml and others added 19 commits September 12, 2023 16:31
Add pytorch nightly and CUDA 12.1 support for composer docker images

What issue(s) does this change relate to?
Related to https://mosaicml.atlassian.net/browse/GRT-2305

Tests
docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0)

mcli connect temp-test-ZAVxMh
Python 3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version)
<module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'>
>>> print(torch.__version__)
2.1.0.dev20230623+cu121
>>> print(torch.version.cuda)
12.1
Integration Test
@mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by 7-8%, from 0.25 MFU to 0.27 MFU
* fix autoresume with slashed directory

* Revert "fix autoresume with slashed directory"

This reverts commit 3dfb5f5.

revert

* fix

* fix precommit

* Update in_context_learning_evaluation.py

* Update in_context_learning_evaluation.py

* Update in_context_learning_evaluation.py

* add tests
Signed-off-by: Prithvi Kannan <[email protected]>
Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: eracah <[email protected]>
Upstreams and generalizes the callback that logs generations to wandb from foundry to composer.
…2476)

Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23.
* Update RTD build config with build.os
* Remove python.version

---------

Co-authored-by: Bandish Shah <[email protected]>
@maxisawesome
Contributor

Successfully merged here
