
Potential pytorch incompatibility #37

Open
ElliottKasoar opened this issue Sep 20, 2023 · 6 comments
Labels: bug, hackathon

Comments

@ElliottKasoar (Contributor) commented Sep 20, 2023

This is not an issue I've encountered, but for anyone following the FTorch build instructions, the version of libtorch/pytorch installed may mean that FTorch is incompatible with the model saved in the examples, since the examples pip install torch into a (new) virtual environment, which may not match the LibTorch that FTorch is built against.

This would only lead to errors if breaking changes were made to the TorchScript format between the versions, and in many cases the same pip-installed torch would be used anyway.
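
Something like the following could be used to compare the two versions before running the examples. This is just a sketch: it assumes the unpacked LibTorch binary distribution ships a plain-text build-version file at its root, and the LIBTORCH_DIR path is a placeholder.

# Compare the torch version in the active virtual environment with the
# LibTorch distribution that FTorch was built against.
from pathlib import Path

import torch

LIBTORCH_DIR = Path("/path/to/libtorch")  # placeholder: unpacked LibTorch distribution

venv_version = torch.__version__
libtorch_version = (LIBTORCH_DIR / "build-version").read_text().strip()

print(f"pip-installed torch: {venv_version}")
print(f"LibTorch build:      {libtorch_version}")

# Versions often carry a backend suffix such as "+cpu" or "+cu118", so
# compare the numeric part only.
if venv_version.split("+")[0] != libtorch_version.split("+")[0]:
    print("Warning: versions differ; the saved TorchScript model may not load in FTorch.")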

@jatkinson1000 (Member)

Mmmmm this is a good point - I can see it being an issue if someone uses newer features of pytorch in their model, but then builds FTorch linked against an older version of LibTorch without those features.

I can't immediately think of an easy way around this other than recommending that users ensure that their LibTorch version is at least as new as the one their model was built with.

I don't think we need to change anything code-wise, but it would be interesting to know what the error raised would be so that we can recognise this in future should users come across it.

@ElliottKasoar (Contributor, Author) commented Sep 20, 2023

In terms of LibTorch, I don't think users can go too far back, as I get errors when trying to build FTorch of the form:

/home/ek/ICCS/fortran-pytorch-lib/fortran-pytorch-lib/ctorch.cpp:235:18: error: ‘synchronize’ is not a member of ‘torch::cuda’
  235 |     torch::cuda::synchronize();

when running make for versions <= 1.7 (which is probably worth noting in itself).

For versions between 1.8 and 1.10, I can build FTorch successfully, but encounter errors of the following form when going through the example:

 ./resnet_infer_fortran ../saved_resnet18_model_cpu.pt
[ERROR]: terminate called after throwing an instance of 'c10::Error'
  what():  isTuple()INTERNAL ASSERT FAILED at "../aten/src/ATen/core/ivalue_inl.h":1306, please report a bug to PyTorch. Expected Tuple but got String
Exception raised from toTuple at ../aten/src/ATen/core/ivalue_inl.h:1306 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd324cac302 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd324ca8c9b in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7fd324ca918e in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #3: <unknown function> + 0x3877287 (0x7fd315c50287 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x3878325 (0x7fd315c51325 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::SourceRange::highlight(std::ostream&) const + 0x36 (0x7fd313327e06 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::ErrorReport::what() const + 0x2c5 (0x7fd313308b85 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4664 (0x7fd3251d9664 in /home/ek/lib/test/lib/libftorch.so)
frame #8: __ftorch_MOD_torch_module_load + 0x9 (0x7fd3251dcd99 in /home/ek/lib/test/lib/libftorch.so)
frame #9: <unknown function> + 0x17d8 (0x56364eecb7d8 in ./resnet_infer_fortran)
frame #10: <unknown function> + 0x117f (0x56364eecb17f in ./resnet_infer_fortran)
frame #11: __libc_start_main + 0xf3 (0x7fd324d27083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11be (0x56364eecb1be in ./resnet_infer_fortran)


Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fd324f15d4a
#1  0x7fd324f14ee5
#2  0x7fd324d4608f
#3  0x7fd324d4600b
#4  0x7fd324d25858
#5  0x7fd3122908d0
#6  0x7fd31229c37b
#7  0x7fd31229b358
#8  0x7fd31229bd10
#9  0x7fd3121e7bfe
#10  0x7fd3121e85b9
#11  0x7fd315c4ff49
#12  0x7fd315c51324
#13  0x7fd313327e05
#14  0x7fd313308b84
#15  0x7fd3251d9663
#16  0x7fd3251dcd98
#17  0x56364eecb7d7
#18  0x56364eecb17e
#19  0x7fd324d27082
#20  0x56364eecb1bd
#21  0xffffffffffffffff
Aborted

I think versions 1.11+ work ok all the way through.

(This was tested against torch==2.0.1 installed with pip, Python 3.9.18.)
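
For testing, the TorchScript file can be regenerated with whichever torch version is installed, so the save-side version can be varied independently of the LibTorch used to build FTorch. A rough sketch (not the repository's own save script; untrained weights and the filename are just for the compatibility check):

# Regenerate a ResNet18 TorchScript file with the torch version installed in
# the current environment, so save-side and load-side versions can be varied
# independently when testing FTorch.
import torch
import torchvision

model = torchvision.models.resnet18()  # untrained weights are fine for a format check
model.eval()

# Trace with a dummy input of the shape the example expects (N, C, H, W).
dummy_input = torch.ones(1, 3, 224, 224)
traced = torch.jit.trace(model, dummy_input)

traced.save("saved_resnet18_model_cpu.pt")  # illustrative filename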

@jatkinson1000 (Member) commented Sep 20, 2023

Interesting. The first error is CUDA-related, which suggests that torch is trying to use some GPU routines somewhere, even though you used the CPU-only binary(?).
Looking at the line referenced in our code, it is annotated with a FIXME.
I'm not clear why the run thinks that the out pointer is_cuda, however!

Perhaps this is just an out-of-date issue and we require libtorch >= 1.8.


On the latter, the advice here seems to be 'use the latest' -_-
A similar issue raises the possibility of CPU/GPU incompatibility.

Perhaps most relevant is the suggestion that it may be an issue when saving the model to TorchScript with one version of LibTorch and then running it from Fortran with another. This is something that could definitely be tested and would be useful to know - if so, it would need to go into a README and perhaps mean the 'preferred' approach is to link FTorch against a venv-installed LibTorch.
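
One way to narrow things down when testing would be to confirm the saved model still loads and runs under the Python torch in the venv; if that passes but FTorch aborts, the mismatch is on the LibTorch side. A minimal sketch (it assumes the ResNet example's (1, 3, 224, 224) input, and the default filename is illustrative):

# Check that a saved TorchScript model loads and runs under the torch version
# in the current environment. If this succeeds but FTorch still aborts, the
# mismatch is between the save-side torch and the LibTorch FTorch links against.
import sys

import torch

model_path = sys.argv[1] if len(sys.argv) > 1 else "saved_resnet18_model_cpu.pt"

model = torch.jit.load(model_path, map_location="cpu")
model.eval()

with torch.no_grad():
    output = model(torch.ones(1, 3, 224, 224))

print(f"Loaded {model_path} with torch {torch.__version__}; output shape {tuple(output.shape)}")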

@jatkinson1000 (Member)

This came up when I was doing work that led to #100.

It should be documented somewhere that the libtorch and pytorch versions should match.

@jatkinson1000 (Member)

This should be added to the troubleshooting and/or FAQ documentation.

@jatkinson1000 (Member)

After discussion with @TomMelt, we should put a note in the troubleshooting documentation asking users to use consistent versions, or pointing them to where to check this if they have issues.

We will tackle this as a small issue in an upcoming hackathon.

jwallwork23 added the bug label on Jul 15, 2024