Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script to automatically split off eval set #1525

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

mattyding
Copy link

@mattyding mattyding commented Sep 16, 2024

What changes are proposed in this pull request?

See go/sweeps-eval for more context. Corresponding MAPI changes are contained in https://github.com/databricks-mosaic/mcloud/pull/4562.

How is this tested?

Added unit tests.
Screenshot 2024-09-16 at 1 05 50 AM

The predecessor to the unit tests was this bash script I used to manually inspect the file outputs. Tests are the same as unit tests, but it invokes the scripts/... file instead of the command_utils/... one. Confirmed file outputs looked OK.
_test.txt

Everything passes:
Screenshot 2024-09-16 at 12 56 37 AM

)
if _is_empty_or_nonexistent(dirpath=dataset_name):
log.error("Failed to safely load the dataset from HF Hub.")
raise InvalidFileExtensionError(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved HF safe download logic into a separate function so I could reuse it. Code is basically unchanged.

Unrelated to meat of the PR, but per Daniel request, tried refactoring further to get rid of try-except block. I don't think it can be done without significant added complexity. You want to barrier/sync before throwing this InvalidFileExtensionError; however, this must be nested within this download logic so that only rank0 encounters it. So you can't both separate out this logic and have graceful exit without try-except. :'(

import numpy as np
from typing import Optional

import composer.utils as utils
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import is formatted this way due to mocking difficulties - https://bhfsteve.blogspot.com/2012/06/patching-tip-using-mocks-in-python-unit.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant