Can't train in FP16 on Turing #747

Open
jafioti opened this issue Aug 24, 2024 · 1 comment

jafioti commented Aug 24, 2024

Hi,
I have a Turing card (2080 Super) and I'm trying to run training in FP16. I can't run in BF16 because the card doesn't support it, and when I try to run in FP16 I get `build_from_checkpoint() does not support fp16 right now.`. Is there any way to initialize the weights randomly instead of building from a checkpoint? My understanding is that weight initialization is currently handled entirely by the Python script, and the C code can only load a checkpoint.

jafioti commented Aug 24, 2024

I somewhat solved this by adding FP16 exporting to the Python file. Now it works without cuDNN (albeit with an increasing loss). With cuDNN on, I get:

W! CuDNN (v90300 75) function cudnnBackendFinalize() called:
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: userGraph->getEntranceNodesSize() != 2
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: numUserNodes != 5 && numUserNodes != 6
w! Time: 2024-08-24T17:17:17.366897 (0d+0h+0m+0s since start)
w! Process=30528; Thread=30528; GPU=NULL; Handle=NULL; StreamId=NULL.

[CUDNN ERROR] at file llmc/cudnn_att.cpp:137:
[cudnn_frontend] Error: No execution plans support the graph.
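
For reference, here is a rough sketch of the kind of FP16 export I mean (simplified and illustrative; the actual export code in train_gpt2.py has its own helpers and checkpoint header format, which are not reproduced here). The idea is just to cast each parameter tensor to half precision before writing its raw bytes, and to flag the dtype in the checkpoint header so the C loader knows what to expect:

```python
# Illustrative sketch only -- not the actual llm.c export code.
import torch

def write_tensor_fp16(tensor, f):
    # Cast to half precision and append the raw bytes.
    f.write(tensor.detach().cpu().to(torch.float16).numpy().tobytes())

def export_fp16(model, filename):
    with open(filename, "wb") as f:
        # A real exporter would first write a header that marks the
        # payload dtype as fp16 so the C side can parse it correctly.
        for _, param in model.named_parameters():
            write_tensor_fp16(param, f)
```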
