Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Remote and HF promptfiles in hf_generate script #786

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

samhavens
Copy link
Contributor

@samhavens samhavens commented Dec 7, 2023

Small QoL improvement for testing generation. Uses composer's get_file to support remote prompt files, and adds a bit of syntax for pointing to huggingface hub datasets as well.

-p file::/local/path was already supported. This adds -p file::s3://remote/path and -p dataset::mosaicml/some-hub-dataset. For HF datasets, it defaults to looking for a column named prompt but will use any string passed as the prompt_delimiter as the column name (kind of abuses the API, but it felt understandable)

@dakinggg if this is out of scope let me know and I can close this out. Most workloads like this end up going to inference... but I can see it being useful both internally and for customers.

EDIT: updated the syntax so remote files are just s3:// and HF datasets are hf://

example invocations:

python hf_generate.py -n mosaicml/mpt-7b -p hf://mosaicml/some-prompts --model_dtype bf16   --max_batch_size 8

and

python hf_generate.py -n mosaicml/mpt-7b -p s3://our-bucket/path/prompts.txt --prompt-delimiter $'\n' --max_batch_size 8

Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality seems good and I'd be happy to merge wrt that. I don't like the invention of a new URI format here tho. We use the standard like s3:// in other places, and making it different here is pretty confusing.

To not break existing functionality, I suggest you leave the file:: thing for local files, and then you just use s3:// etc for remote files. hf:// is the canonical prefix for huggingface things.

)

local_path = prompt_path.split('/')[-1]
get_file(path=prompt_path, destination=local_path, overwrite=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably want to make local_path a tmp location?



def load_prompts_from_dataset(dataset_path: str,
prompt_delimiter: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer you just add a new argument to the script for the dataset column name instead of overloading the delimiter param

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants