
Train mem overhaul #23

Open
daniel-z-kaplan opened this issue Jan 24, 2024 · 1 comment

@daniel-z-kaplan (Collaborator)

Set up the code for individual clusters more cleanly.

@Alexis-BX (Member)

Rework the scripts folder completely:

- Have folders for llava_v1, llava_v1.5, robin_v1, robin_v2, and evals.
- In robin_v2, have a folder for each cluster (including cedar and frontier) with install, pretrain, and finetune scripts; see the sketch after this list.
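
A possible layout for the reworked folder (the script file names are illustrative assumptions; additional cluster folders would follow the same pattern):

```
scripts/
├── llava_v1/
├── llava_v1.5/
├── robin_v1/
├── robin_v2/
│   ├── cedar/
│   │   ├── install.sh
│   │   ├── pretrain.sh
│   │   └── finetune.sh
│   └── frontier/
│       ├── install.sh
│       ├── pretrain.sh
│       └── finetune.sh
└── evals/
```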

Use of train_mem.py: when doing multinode training, environment variables are not properly set by the launch script (it sets them on the main node but not on the others). Since train_mem.py is run on every node, it sets the variables correctly on each one.
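
A minimal sketch of what such a train_mem.py could look like. The variable names, values, and the llava.train.train entry point are assumptions (the entry point is borrowed from upstream LLaVA), not this repo's confirmed code:

```python
# Hypothetical train_mem.py for one cluster. Variable names/values and the
# import path below are illustrative assumptions, not the repo's actual code.
import os

# The launch script only exports these on the main node; because this file
# is executed on every node, setting them here makes them visible everywhere.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "hsn0")    # assumed network interface
os.environ.setdefault("HF_HOME", "/scratch/hf_cache")  # assumed shared cache path

# Import the trainer only after the environment is configured, so any
# module-level reads of os.environ pick up the values above.
from llava.train.train import train

if __name__ == "__main__":
    train()
```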

Once the above reorganization is done, split train_mem.py into a separate file for each cluster and put it in that cluster's folder.
