LSTM - different outputs for same weights across CPU and GPU, when using float32 + tf-keras + NVIDIA A100 #772
@sachinprasadhs,
@tilakrayal - the gist shows a very small difference between CPU/GPU predictions, similar to what I see on my V100 host. I wouldn't be surprised if differences that small were in fact expected. But on my A100 host the difference becomes orders of magnitude larger. Is there a way to replicate my "problematic system" (NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4) on Colab, so that hopefully you can also see the extent of the problem, beyond the screenshots I can share? Thanks!
I've updated the V100 system. It now has the exact same driver + CUDA as the A100 system (Driver Version: 550.54.14 + CUDA Version: 12.4), and still does not replicate the issue. So the issue seems specific to execution on the A100. How can we replicate this on Colab? Thanks.
Latest update: I got hold of an H200 system, which demonstrates the same issue I see on the A100. I've also become aware of the relatively new TF32 data type (https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution), which is apparently enabled by default on the A100 and newer! Indeed, if I modify my example script and call tf.config.experimental.enable_tensor_float_32_execution(False), the numerical issues disappear, and the A100 system produces the same output as the V100 and the CPUs. I find it quite concerning that TensorFlow would take such liberties with data types. In any case, the main question mark I have at this point is why I don't see the same numerical issues with multi-backend Keras. Is it actually using float32, rather than TF32? Which Keras implementation is doing the right thing?
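For reference, a minimal sketch of the change: querying the current TF32 state and then disabling it, before any model is built (API names as in TF 2.16):

```python
import tensorflow as tf

# TF32 is enabled by default on Ampere (A100) and newer GPUs.
print("TF32 enabled:",
      tf.config.experimental.tensor_float_32_execution_enabled())

# Force true float32 math for GPU matmuls/convolutions.
tf.config.experimental.enable_tensor_float_32_execution(False)
```

The call is global process state, so it affects every subsequent op; it is a no-op on GPUs without TF32 support and on CPU.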
System information
Describe the problem
I have a model composed almost entirely of LSTM layers. If I load the same weights into copies of the model instantiated to run on CPU and on GPU, the results differ.
This issue disappears (the GPU results change to match the CPU results) if I change any of the following:
In all these cases, I'm running the same (official) Docker image; my only modification has been to install tf-keras==2.16.0 and plotly.
Standalone code to reproduce the issue.
Resulting plot:
As mentioned at the beginning:
# keras.backend.set_floatx('float64')
USE_TF_KERAS = False
Both of these work around the issue, and the GPU prediction then matches the CPU prediction.
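Both workarounds sidestep TF32's reduced precision. As intuition for why the A100 differences are so much larger than ordinary float32 noise: TF32 keeps float32's 8-bit exponent but only 10 explicit mantissa bits, so each matmul input carries a relative error around 1e-3 instead of float32's ~1e-7, and an LSTM's recurrence then compounds that step after step. A rough NumPy simulation (truncation instead of hardware rounding, sizes arbitrary):

```python
import numpy as np

def truncate_to_tf32(x):
    """Simulate TF32 by zeroing the low 13 mantissa bits of float32.

    TF32 keeps float32's 8-bit exponent but only 10 explicit mantissa
    bits; real hardware rounds rather than truncates, so this is only
    an approximation of the precision loss.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

exact = a.astype(np.float64) @ b.astype(np.float64)   # reference
fp32 = (a @ b).astype(np.float64)                     # plain float32
tf32 = (truncate_to_tf32(a) @ truncate_to_tf32(b)).astype(np.float64)

rel = lambda y: np.max(np.abs(y - exact)) / np.abs(exact).max()
print(f"float32 rel. error: {rel(fp32):.1e}")
print(f"TF32-ish rel. error: {rel(tf32):.1e}")  # orders of magnitude larger
```

A single matmul already shows the gap; the multi-layer, many-timestep recurrence in the reported model gives the error far more opportunity to accumulate.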
I also reiterate that all of this has been run in the official tensorflow/tensorflow:2.16.1-gpu-jupyter container, on both hosts.