- Rework the `scripts` folder completely:
  - Have folders for `llava_v1`, `llava_v1.5`, `robin_v1`, `robin_v2` and `evals`.
  - In `robin_v2`, have a folder for each cluster containing install, pretrain, and finetune scripts (include `cedar` and `frontier` folders).
- Use of `train_mem.py`: in multinode training, the launch script does not set environment variables properly (they are set on the main node but not on the others). Because `train_mem.py` runs on every node, setting them there makes sure every node gets them.
- Once the above reorganization is done, split `train_mem.py` into a separate file for each cluster and put each one in that cluster's folder.
- Set up the code for individual clusters more cleanly.
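A minimal sketch of the per-node environment setup that a per-cluster `train_mem.py` could do, assuming a Slurm launcher. The issue does not say which variables go missing on worker nodes, so the Slurm-to-`torch.distributed` mapping, the helper name `set_node_env`, and the `CLUSTER_ENV` contents are all hypothetical placeholders. `setdefault` and the "only if missing" checks are used so that anything the launcher already exported is left untouched.

```python
import os
from typing import Mapping, MutableMapping

# Hypothetical per-cluster defaults (placeholder values, not taken from the issue).
# Each cluster folder (e.g. cedar/, frontier/) would carry its own version.
CLUSTER_ENV: dict[str, str] = {
    "NCCL_DEBUG": "WARN",
}

# Hypothetical mapping: torch.distributed variable -> Slurm variable it can be derived from.
SLURM_TO_DIST = {
    "RANK": "SLURM_PROCID",
    "WORLD_SIZE": "SLURM_NTASKS",
    "LOCAL_RANK": "SLURM_LOCALID",
}


def set_node_env(env: MutableMapping[str, str], cluster_env: Mapping[str, str]) -> None:
    """Fill in missing variables on the current node.

    Because train_mem.py runs on every node, calling this at the top of it
    sets the variables on worker nodes too, not only on the main node.
    """
    # Derive distributed variables from Slurm ones only if the launcher did not set them.
    for target, source in SLURM_TO_DIST.items():
        if target not in env and source in env:
            env[target] = env[source]
    # Apply cluster-specific defaults without overriding exported values.
    for key, value in cluster_env.items():
        env.setdefault(key, value)
```

In the per-cluster `train_mem.py`, this would be called as `set_node_env(os.environ, CLUSTER_ENV)` before importing and calling the trainer, so that libraries which read environment variables at import time see the final values.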