Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setup-ocaml@v3: uses cache from wrong runner OS! #839

Closed
edwintorok opened this issue Aug 5, 2024 · 12 comments · Fixed by #841
Closed

setup-ocaml@v3: uses cache from wrong runner OS! #839

edwintorok opened this issue Aug 5, 2024 · 12 comments · Fixed by #841
Labels
bug Something isn't working

Comments

@edwintorok
Copy link

edwintorok commented Aug 5, 2024

I haven't seen this failure on setup-ocaml@v2, but ever since we switched to setup-ocaml@v3 we started getting these failures:
https://github.com/xapi-project/xen-api/actions/runs/10248744378/job/28350652648?pr=5910

Retrieve the opam cache
  Received 176160768 of 186735370 (94.3%), 168.0 MBs/sec
  Cache Size: ~178 MB (186735370 B)
  /usr/bin/tar -xf /home/runner/work/_temp/299ab2db-12c7-4633-989d-0e75c2e8beff/cache.tzst -P -C /home/runner/work/xen-api/xen-api --use-compress-program unzstd
  Received 186735370 of 186735370 (100.0%), 89.0 MBs/sec
  Cache restored successfully
  Cache restored from key: v1-setup-ocaml-opam-e12cd600e6d6884128d41217fa4937e2747c543a86ae90e2c4ff585433b945c7
Install opam
Initialise opam state
Remove the opam repositories
Initialise the opam repositories
Retrieve the dune cache
Install dune
  /opt/hostedtoolcache/opam/2.2.0/x86_64/opam install dune
  The following actions will be performed:
  === install 1 package
    ∗ dune 3.15.3
  
  <><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
  Received 488 of 488 (100.0%), 0.0 MBs/sec
  ⬇ retrieved dune.3.15.3  (https://opam.ocaml.org/cache)
  [ERROR] The compilation of dune.3.15.3 failed at "ocaml boot/bootstrap.ml -j 3".
  
  #=== ERROR while compiling dune.3.15.3 ========================================#
  # context     2.2.0 | linux/x86_64 | ocaml-base-compiler.4.14.2 | git+https://github.com/xapi-project/xs-opam.git
  # path        ~/work/xen-api/xen-api/_opam/.opam-switch/build/dune.3.15.3
  # command     ~/.opam/opam-init/hooks/sandbox.sh build ocaml boot/bootstrap.ml -j 3
  # exit-code   1
  # env-file    ~/.opam/log/dune-3081-79f2b5.env
  # output-file ~/.opam/log/dune-3081-79f2b5.out
  ### output ###
  # /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun)
  # /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun)
  # /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/runner/work/xen-api/xen-api/_opam/bin/ocamlrun)

We have a mix of ubuntu-22.04 and ubuntu-latest in our github workflows (by accident rather than intentionally) and it looks like they overwrite each-other's opam cache.
(we can fix the mix of runners on 'master', but that won't help with open PRs which will still have a mix).

The runner's OS should probably be included in the cache key.

@edwintorok
Copy link
Author

edwintorok commented Aug 5, 2024

I think this is a regression caused by 9dc6d97.
Previously the cache prefix would include a lot more information, but that commit dropped too much and now the action tries to incorrectly share caches between different OSes which isn't going to work:

-  const key = `${CACHE_PREFIX}-setup-ocaml-dune-${platform}-${architecture}-${ocamlVersion}-${workflow}-${job}-${runId}`;
-  const restoreKeys = [
-    `${CACHE_PREFIX}-setup-ocaml-dune-${platform}-${architecture}-${ocamlVersion}-${workflow}-${job}-${runId}`,
-    `${CACHE_PREFIX}-setup-ocaml-dune-${platform}-${architecture}-${ocamlVersion}-${workflow}-${job}-`,
-  ];
+  const { workflow, job } = github.context;
+  const ocamlCompiler = await resolveCompiler(OCAML_COMPILER);
+  const sha256 = crypto.createHash("sha256");
+  const hash = sha256
+    .update([architecture, job, ocamlCompiler, platform, workflow].join(""))
+    .digest("hex");
+  const key = `${CACHE_PREFIX}-setup-ocaml-dune-${hash}`;

Although the information still seems to be there in the hash (it is just now harder to debug because we can't see what goes into the hash), maybe the runner OS version was missing all along?

edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix key as suggested in EXAMPLES.md

Signed-off-by: Edwin Török <[email protected]>
@edwintorok
Copy link
Author

FWIW using ${{ matrix.container }} as in EXAMPLES.md doesn't work either, that value is empty.

edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix key as suggested in EXAMPLES.md

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix key as suggested in EXAMPLES.md

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix key as suggested in EXAMPLES.md

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix key as suggested in EXAMPLES.md

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Signed-off-by: Edwin Török <[email protected]>
@smorimoto
Copy link
Member

The cache key should not actually contain less information.

edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Signed-off-by: Edwin Török <[email protected]>
edwintorok added a commit to edwintorok/xen-api that referenced this issue Aug 5, 2024
This is a bug in setup-ocaml@v3 ocaml/setup-ocaml#839

Work it around by defining our own cache prefix based on runner OS version.

Unfortunately the version itself doesn't seem to be available as a variable in GH actions.
There is 'runner.os', but that is just a generic Linux, there is 'matrix.os', but that is only present when using a matrix,
and there is '..container' which is only present when containers are used.

Use another GH action to determine the version, and now the cache-prefix looks like this:
```
cache-prefix: v3-Ubuntu-22.04
```

Signed-off-by: Edwin Török <[email protected]>
@edwintorok
Copy link
Author

edwintorok commented Aug 5, 2024

Hmm maybe this information was wrong before too, and it is just a timing issue (the new v3 action is quicker because pinning can be avoided).
Looks like PLATFORM is a very generic OS, doesn't include version information.

As a workaround I used this https://github.com/marketplace/actions/github-actions-runner-os-system-information actions and composed a cache prefix that included name and release, that relies on the getos package.

Perhaps setup-ocaml could call this function internally to figure out the runner's OS version?

@smorimoto
Copy link
Member

I have been considering about making it a little more strict. Let me try it.

@smorimoto smorimoto added the enhancement New feature or request label Aug 5, 2024
@smorimoto smorimoto linked a pull request Aug 5, 2024 that will close this issue
@smorimoto smorimoto added bug Something isn't working and removed enhancement New feature or request labels Aug 5, 2024
@smorimoto
Copy link
Member

@smorimoto
Copy link
Member

(Of course, the v3 tag is already moved. Just restarting the job should be enough!)

@edwintorok
Copy link
Author

Thanks for the quick fix.

Although I think the dune cache key should also include the OS, I'm not sure whether dune would hash the system binaries when calculating its input hash, so there is still a chance we might end up copying a binary compiled on a different OS. But even if Dune does that the cache will likely won't be effective if we mix different OS versions.

@smorimoto
Copy link
Member

If my "guess" is correct, I believe the dune cache should be OS independent. Does it actually depend on the OS and / or OS version? @rgrinberg @mefyl

@edwintorok
Copy link
Author

edwintorok commented Aug 5, 2024

the ocamlopt output will be OS specific (e.g. it may be linked to a newer libc, so it won't run on an older libc).
If dune hashes ocamlopt (I've taken a quick look with strace, and it doesn't appear to, although it stats it) then that'd be fine: different OS -> different ocamlopt hash -> different binaries. (although in this case restoring the cache will be wasted, and storing the cache will overwrite the other job's cache).

If dune doesn't hash ocamlopt (and just looks at its PATH which will be identical across OSes) and reuses the dune cache from another (newer) OS version then the output can potentially be wrong and won't run, because different OS -> same source input hash -> same binaries retrieved from the cache.

You can share the dune cache between different workspaces/opam switches on the same OS though.

@mefyl
Copy link

mefyl commented Aug 6, 2024

Indeed, AFAICT the dune cache cannot be shared between different OSes, it entirely relies on dune's decision to rebuild or not something. Sharing the cache between two machines is similar to copying a build tree from one machine to the other. If you do this between two different architecture, and if dune doesn't think it should rebuild the artifact and thus tries to reuse invalid compiled files, then the cache will suffer the same problem.

One could argue this is a misfeature :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants