
Questions and Suggestions for Enhancing Galore v2 #2

Open
kostum123 opened this issue Jul 12, 2024 · 0 comments
First of all, I would like to thank you for adding this wonderful training method to the literature. I have some questions regarding the method:

  1. Is it possible to use Flash Attention, xFormers, or torch.compile with this method? (A sketch of how Galore v1 already combines with these through the Hugging Face Trainer follows this list.)
  2. Does VRAM usage grow sub-quadratically or quadratically as the maximum sequence length increases? Quantization and LoRA-style optimizations reduce VRAM usage, but if Flash Attention cannot be used, pretraining or fine-tuning on long texts remains difficult. Are you planning optimizations to address this?
  3. Are the model weights saved by this method stored in bf16, and can they be used in other training software (e.g., TRL for SFT after pretraining) without any problems?
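For context on questions 1 and 3: because Galore v1 lives entirely in the optimizer, it can already be combined with Flash Attention 2, bf16, and torch.compile through the Hugging Face Trainer. The following is a minimal sketch of that combination, assuming transformers >= 4.39 with the galore-torch package installed; the model name, dataset, and hyperparameters are placeholders, and whether the same path carries over to Galore v2 is exactly what these questions ask.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Question 1: Flash Attention 2 is requested per model at load time;
# question 3: bf16 is requested via torch_dtype and is kept when saving.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

train_ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="galore-flash-test",        # placeholder output directory
    per_device_train_batch_size=1,
    bf16=True,
    torch_compile=True,                    # question 1: torch.compile
    optim="galore_adamw",                  # Galore v1 optimizer built into transformers
    optim_target_modules=["attn", "mlp"],  # layers whose gradients get low-rank projection
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model()  # question 3: checkpoint is written as bf16 safetensors
```

A checkpoint saved this way is an ordinary bf16 safetensors model, so TRL's SFTTrainer can load it with from_pretrained like any other model; the open question is whether v2 changes any of this.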

Finally, I have a suggestion. Integrating Galore v2 into LLaMA Factory, as was done for Galore v1, would allow it to be combined with continual-pretraining methods such as LLaMA Pro. Please consider this integration.

Thank you.
