
Potential pytorch incompatibility #37

Open
ElliottKasoar opened this issue Sep 20, 2023 · 6 comments
Labels: bug, hackathon

Comments

@ElliottKasoar (Contributor) commented Sep 20, 2023

This is not an issue I've encountered, but for anyone following the FTorch build instructions, the version of libtorch/pytorch installed may mean that FTorch is incompatible with the model saved in the examples, since the examples pip install torch into a (new) virtual environment, which may not match the LibTorch that FTorch is built against.

This would only lead to errors if breaking changes were made to the TorchScript format between the versions, and in many cases the same pip-installed torch would be used anyway.
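
Something like the following could be used to compare the two versions before running the examples. This is just a sketch: it assumes the unpacked LibTorch binary distribution ships a plain-text build-version file at its root, and the LIBTORCH_DIR path is a placeholder.

# Compare the torch version in the active virtual environment with the
# LibTorch distribution that FTorch was built against.
from pathlib import Path

import torch

LIBTORCH_DIR = Path("/path/to/libtorch")  # placeholder: unpacked LibTorch distribution

venv_version = torch.__version__
libtorch_version = (LIBTORCH_DIR / "build-version").read_text().strip()

print(f"pip-installed torch: {venv_version}")
print(f"LibTorch build:      {libtorch_version}")

# Versions often carry a backend suffix such as "+cpu" or "+cu118", so
# compare the numeric part only.
if venv_version.split("+")[0] != libtorch_version.split("+")[0]:
    print("Warning: versions differ; the saved TorchScript model may not load in FTorch.")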

@jatkinson1000 (Member)

Mmmmm this is a good point - I can see it being an issue if someone uses newer features of pytorch in their model, but then builds FTorch linked against an older version of LibTorch without those features.

I can't immediately think of an easy way around this other than recommending that users ensure that their LibTorch version is at least as new as the one their model was built with.

I don't think we need to change anything code-wise, but it would be interesting to know what the error raised would be so that we can recognise this in future should users come across it.

@ElliottKasoar (Contributor, Author) commented Sep 20, 2023

In terms of LibTorch, I don't think users can go too far back, as I get errors when trying to build FTorch of the form:

/home/ek/ICCS/fortran-pytorch-lib/fortran-pytorch-lib/ctorch.cpp:235:18: error: ‘synchronize’ is not a member of ‘torch::cuda’
  235 |     torch::cuda::synchronize();

when running make for versions <= 1.7 (which is probably worth noting in itself).

For versions between 1.8 and 1.10, I can build FTorch successfully, but encounter errors of the following form when going through the example:

 ./resnet_infer_fortran ../saved_resnet18_model_cpu.pt
[ERROR]: terminate called after throwing an instance of 'c10::Error'
  what():  isTuple()INTERNAL ASSERT FAILED at "../aten/src/ATen/core/ivalue_inl.h":1306, please report a bug to PyTorch. Expected Tuple but got String
Exception raised from toTuple at ../aten/src/ATen/core/ivalue_inl.h:1306 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd324cac302 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd324ca8c9b in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7fd324ca918e in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #3: <unknown function> + 0x3877287 (0x7fd315c50287 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x3878325 (0x7fd315c51325 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::SourceRange::highlight(std::ostream&) const + 0x36 (0x7fd313327e06 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::ErrorReport::what() const + 0x2c5 (0x7fd313308b85 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4664 (0x7fd3251d9664 in /home/ek/lib/test/lib/libftorch.so)
frame #8: __ftorch_MOD_torch_module_load + 0x9 (0x7fd3251dcd99 in /home/ek/lib/test/lib/libftorch.so)
frame #9: <unknown function> + 0x17d8 (0x56364eecb7d8 in ./resnet_infer_fortran)
frame #10: <unknown function> + 0x117f (0x56364eecb17f in ./resnet_infer_fortran)
frame #11: __libc_start_main + 0xf3 (0x7fd324d27083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11be (0x56364eecb1be in ./resnet_infer_fortran)


Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fd324f15d4a
#1  0x7fd324f14ee5
#2  0x7fd324d4608f
#3  0x7fd324d4600b
#4  0x7fd324d25858
#5  0x7fd3122908d0
#6  0x7fd31229c37b
#7  0x7fd31229b358
#8  0x7fd31229bd10
#9  0x7fd3121e7bfe
#10  0x7fd3121e85b9
#11  0x7fd315c4ff49
#12  0x7fd315c51324
#13  0x7fd313327e05
#14  0x7fd313308b84
#15  0x7fd3251d9663
#16  0x7fd3251dcd98
#17  0x56364eecb7d7
#18  0x56364eecb17e
#19  0x7fd324d27082
#20  0x56364eecb1bd
#21  0xffffffffffffffff
Aborted

I think versions 1.11+ work ok all the way through.

(This was tested against torch==2.0.1 installed with pip, Python 3.9.18.)
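
For testing, the TorchScript file can be regenerated with whichever torch version is installed, so the save-side version can be varied independently of the LibTorch used to build FTorch. A rough sketch (not the repository's own save script; untrained weights and the filename are just for the compatibility check):

# Regenerate a ResNet18 TorchScript file with the torch version installed in
# the current environment, so save-side and load-side versions can be varied
# independently when testing FTorch.
import torch
import torchvision

model = torchvision.models.resnet18()  # untrained weights are fine for a format check
model.eval()

# Trace with a dummy input of the shape the example expects (N, C, H, W).
dummy_input = torch.ones(1, 3, 224, 224)
traced = torch.jit.trace(model, dummy_input)

traced.save("saved_resnet18_model_cpu.pt")  # illustrative filename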

@jatkinson1000 (Member) commented Sep 20, 2023

Interesting. The first error is CUDA-related, which suggests that torch is trying to use some GPU routines somewhere, even though you used the CPU-only binary(?).
Looking at the line referenced in our code, it is annotated with a FIXME.
I'm not clear why the run thinks that the out pointer is_cuda, however!

Perhaps this is just an out-of-date issue and we require libtorch >= 1.8.


On the latter, the advice here seems to be 'use the latest' -_-
A similar issue raises the possibility of CPU/GPU incompatibility.

Perhaps most relevant is the suggestion that it may be an issue when saving the model to TorchScript with one version of LibTorch and then running it from Fortran with another. This is something that could definitely be tested and would be useful to know - if so, it would need to go into a README and perhaps mean the 'preferred' approach is to link FTorch against a venv-installed LibTorch.
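
One way to narrow things down when testing would be to confirm the saved model still loads and runs under the Python torch in the venv; if that passes but FTorch aborts, the mismatch is on the LibTorch side. A minimal sketch (it assumes the ResNet example's (1, 3, 224, 224) input, and the default filename is illustrative):

# Check that a saved TorchScript model loads and runs under the torch version
# in the current environment. If this succeeds but FTorch still aborts, the
# mismatch is between the save-side torch and the LibTorch FTorch links against.
import sys

import torch

model_path = sys.argv[1] if len(sys.argv) > 1 else "saved_resnet18_model_cpu.pt"

model = torch.jit.load(model_path, map_location="cpu")
model.eval()

with torch.no_grad():
    output = model(torch.ones(1, 3, 224, 224))

print(f"Loaded {model_path} with torch {torch.__version__}; output shape {tuple(output.shape)}")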

@jatkinson1000 (Member)

This came up when I was doing work that led to #100.

It should be documented somewhere that the libtorch and pytorch versions should match.

@jatkinson1000 (Member)

This should be added to the troubleshooting and/or FAQ documentation.

@jatkinson1000 (Member)

After discussion with @TomMelt, we should put a note in the troubleshooting documentation asking users to use consistent versions, or pointing them to where to check this if they have issues.

We will tackle this as a small issue in an upcoming hackathon.

jwallwork23 added the bug label on Jul 15, 2024