Does each parameter take 1 byte, 2 bytes, or 4 bytes? The memory-efficiency story does not seem clear.
Answered by ptrendx, May 19, 2023
Currently the FP8 weights are only internal, so the actual model weights take the same amount of memory as without FP8 execution (e.g. 2 bytes per parameter for FP16+FP8 training). We are working together with Meta on exposing FP8 tensors in PyTorch, which will enable storing only the FP8 weights, resulting in memory savings over the base model as well as e.g. faster communication in FSDP, but this is currently in the PoC stage.
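To make the arithmetic in the answer concrete, here is a hypothetical back-of-envelope sketch (not a Transformer Engine API) of weight memory per parameter: today the weights stay in FP16 (2 bytes each, with FP8 copies cast internally during execution), while FP8-native storage would cut that to 1 byte each. Optimizer and gradient state are excluded for simplicity.

```python
def weight_bytes(n_params: int, fp8_native: bool = False) -> int:
    """Bytes used by the model weights alone (optimizer state excluded).

    Hypothetical helper for illustration; mirrors the answer above.
    """
    if fp8_native:
        # Future scenario: weights stored directly as FP8 tensors -> 1 byte each.
        return n_params * 1
    # Current behavior: weights kept in FP16 (2 bytes each);
    # the FP8 copies are internal casts, not the stored weights.
    return n_params * 2

n = 7_000_000_000  # e.g. a 7B-parameter model
print(f"FP16 storage:     {weight_bytes(n) / 2**30:.1f} GiB")
print(f"FP8-only storage: {weight_bytes(n, fp8_native=True) / 2**30:.1f} GiB")
```

The halving of weight memory is also what would speed up FSDP: sharded weights would be communicated as 1-byte FP8 values instead of 2-byte FP16 values.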
Answer selected by ksivaman