
Train mem overhaul #23

Open
daniel-z-kaplan opened this issue Jan 24, 2024 · 1 comment

@daniel-z-kaplan (Collaborator)

Set up the code for individual clusters more cleanly.

@Alexis-BX (Member)

Rework the scripts folder completely:

- Have folders for llava_v1, llava_v1.5, robin_v1, robin_v2, and evals.
- In robin_v2, have a folder for each cluster (including cedar and frontier) with install, pretrain, and finetune scripts; see the sketch after this list.
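
A possible layout for the reworked folder (the script file names are illustrative assumptions; additional cluster folders would follow the same pattern):

```
scripts/
├── llava_v1/
├── llava_v1.5/
├── robin_v1/
├── robin_v2/
│   ├── cedar/
│   │   ├── install.sh
│   │   ├── pretrain.sh
│   │   └── finetune.sh
│   └── frontier/
│       ├── install.sh
│       ├── pretrain.sh
│       └── finetune.sh
└── evals/
```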

Use of train_mem.py: when doing multinode training, environment variables are not properly set by the launch script (it sets them on the main node but not on the others). Since train_mem.py is run on every node, it sets the variables correctly on each one.
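
A minimal sketch of what such a train_mem.py could look like. The variable names, values, and the llava.train.train entry point are assumptions (the entry point is borrowed from upstream LLaVA), not this repo's confirmed code:

```python
# Hypothetical train_mem.py for one cluster. Variable names/values and the
# import path below are illustrative assumptions, not the repo's actual code.
import os

# The launch script only exports these on the main node; because this file
# is executed on every node, setting them here makes them visible everywhere.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "hsn0")    # assumed network interface
os.environ.setdefault("HF_HOME", "/scratch/hf_cache")  # assumed shared cache path

# Import the trainer only after the environment is configured, so any
# module-level reads of os.environ pick up the values above.
from llava.train.train import train

if __name__ == "__main__":
    train()
```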

Once the above reorganization is done, split train_mem.py into a separate file for each cluster and put it in that cluster's folder.
