Facilitate or at least document how to share the OAK cache #1051

Open
gouttegd opened this issue Apr 26, 2024 · 1 comment · May be fixed by #1109

gouttegd commented Apr 26, 2024

The Ontology Access Kit (OAK, aka oaklib from Python’s point of view, aka runoak from the command line’s point of view) is one of the tools/libraries provided by the ODK.

In fact, the ODK is supposedly one of the easiest ways for “non-technical” users to get access to OAK, because installing Python programs is still too difficult for many people.

When OAK is used to access online resources (for example with -i sqlite:obo:uberon, which accesses a pre-built SQLite version of Uberon), it attempts to cache a copy of those resources on the local filesystem, so that they do not have to be downloaded again on every call. The default location for the cache is ~/.data, or the value of the PYSTOW_HOME environment variable if that variable is set.

(As an aside, defaulting to such a generic name under the user’s home directory is a terrible move, but that’s a deliberate decision that’s unlikely to ever change.)

Now when OAK is used from the ODK, the ~/.data directory is within the Docker container, so any file that OAK stores there only exists for as long as the container itself exists. That means that when people run several OAK commands like this:

sh run.sh runoak -i sqlite:obo:uberon command1 ...
sh run.sh runoak -i sqlite:obo:uberon command2 ...
sh run.sh runoak -i sqlite:obo:uberon command3 ...

none of these commands will benefit from the cache. They will all download a fresh copy of Uberon.

One workaround is of course to run a shell within a container, instead of running runoak directly, and to then invoke runoak from that shell:

sh run.sh bash
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command1 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command2 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command3 ...

But that is not really a satisfying solution, as Uberon will still be re-downloaded every time the user starts a new container to work with it, even if they already downloaded it the day before.

It is possible to configure the ODK to make the local cache visible from the container by “binding” the ~/.data directory from the local filesystem to the /home/odkuser/.data directory within the container, by adding the following in the src/ontology/run.sh.conf file:

ODK_BINDS=~/.data:/home/odkuser/.data

(This will only work once #1050 has been fixed.)

Another solution would be to set the PYSTOW_HOME variable to a directory within the repository (most likely somewhere under src/ontology/tmp), which is already bound to a mount point within the container. That would at least allow sharing the cache between ODK/OAK invocations that are run from within the same repository.
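To make the PYSTOW_HOME idea concrete, here is a minimal sketch of what a user could do today from a shell inside the container. The helper name use_repo_cache and the directory name tmp/pystow are purely hypothetical (they are not ODK conventions); the only assumption about Pystow is that it honours PYSTOW_HOME as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ODK): point Pystow at a cache
# directory located inside the repository, so that it survives the container.
# $1: the ontology directory (defaults to the current directory).
# The "tmp/pystow" name is an assumption made for this example only.
use_repo_cache() {
    ontology_dir=${1:-$PWD}
    PYSTOW_HOME="$ontology_dir/tmp/pystow"
    # Create the directory up front so that it belongs to the current user.
    mkdir -p "$PYSTOW_HOME" || return 1
    export PYSTOW_HOME
}

# Possible usage, from the shell started with "sh run.sh bash":
#   use_repo_cache
#   runoak -i sqlite:obo:uberon command1 ...
```

Since src/ontology is already bound into the container, whatever OAK stores under that directory should still be there the next time a container is started from the same repository.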

At the very least, the ODK should provide documentation on how to do that.

Should the ODK try to do that automatically? I am on the fence here. On the one hand, it’d be nice for users if the OAK cache could work “out of the box” without any extra configuration. On the other hand, the ODK container is supposed to shield the local filesystem (except the actual repository) from any side-effects (everything that happens in the container stays in the container), so it may not be a good idea to silently break a hole through the container’s wall: what if an interrupted download corrupts the cache? Users could expect that it would have no consequence, since the command was run inside a container. Except that no, actually the cache is outside the container, so you’ve just corrupted your actual cache, oops!

Thoughts?

gouttegd commented Oct 22, 2024

From experimenting with supporting this in the ODKRunner, I think we can do the following:

If the ODK_SHARE_OAK_CACHE variable is set (in the environment or in the run.sh.conf file), it is expected to point to the OAK cache directory. Then, we simply bind that directory to the directory /home/odkuser/.data/oaklib within the container (or /root/.data/oaklib, if we are running as root), so that any OAK process started from within the container can access the cache.
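As a rough illustration of that binding (a sketch only, not the actual code of the linked pull request), the run.sh script could derive a Docker bind specification along these lines. The function name oak_cache_bind and its "root" argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: turn ODK_SHARE_OAK_CACHE into a "host:container"
# bind specification. Prints nothing if the variable is unset or empty.
# $1: "root" if the container runs as root (assumed parameter, for
#     illustration only); anything else means the odkuser account.
oak_cache_bind() {
    [ -n "${ODK_SHARE_OAK_CACHE:-}" ] || return 0
    if [ "${1:-}" = "root" ]; then
        container_home=/root
    else
        container_home=/home/odkuser
    fi
    printf '%s:%s\n' "$ODK_SHARE_OAK_CACHE" "$container_home/.data/oaklib"
}
```

The caller would then add a -v option to the docker command line only when the function prints something, so that nothing changes for users who have not set the variable.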

For a little bit more ease of use, we could also support two special values for ODK_SHARE_OAK_CACHE:

(A) If ODK_SHARE_OAK_CACHE is set to user, then we automatically find the OAK cache directory, regardless of how Pystow is configured.

It wouldn’t be hard to do, but it would clutter the run.sh script quite a bit, because we’d need to basically replicate Pystow’s logic to determine the location of the cache directory.

I am on the fence about whether this is really useful or not. The only way the OAK cache directory could be elsewhere than in ~/.data/oaklib is if people have explicitly told Pystow to use another directory (by playing with the OAKLIB_HOME, PYSTOW_HOME, PYSTOW_NAME, or PYSTOW_USE_APPDIRS variables), and if they have done that then they know exactly where the cache directory is, and they can explicitly set ODK_SHARE_OAK_CACHE to the correct path.
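For reference, replicating that logic in shell could look something like the sketch below. It reflects my understanding of how Pystow resolves the directory for the oaklib module and should be checked against Pystow itself; in particular, the PYSTOW_USE_APPDIRS branch only approximates the Linux (XDG) layout, and the function name find_oak_cache is hypothetical.

```shell
#!/bin/sh
# Simplified sketch of how Pystow appears to pick the directory for the
# "oaklib" module. Assumptions are flagged in the comments below.
find_oak_cache() {
    if [ -n "${OAKLIB_HOME:-}" ]; then
        # A module-specific override takes precedence over everything else.
        printf '%s\n' "$OAKLIB_HOME"
    elif [ "${PYSTOW_USE_APPDIRS:-}" = "true" ] || [ "${PYSTOW_USE_APPDIRS:-}" = "True" ]; then
        # Assumption: Linux/XDG layout only; other platforms store
        # application data elsewhere and are not handled here.
        printf '%s\n' "${XDG_DATA_HOME:-$HOME/.local/share}/oaklib"
    elif [ -n "${PYSTOW_HOME:-}" ]; then
        # Custom Pystow base directory, with the module name appended.
        printf '%s\n' "$PYSTOW_HOME/oaklib"
    else
        # Default: ~/.data/oaklib, unless PYSTOW_NAME renames ".data".
        printf '%s\n' "$HOME/${PYSTOW_NAME:-.data}/oaklib"
    fi
}
```

Even in this reduced form it is four branches and one platform-specific caveat, which gives an idea of the clutter it would add to run.sh.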

(B) If ODK_SHARE_OAK_CACHE is set to repo, then the cache directory is assumed to be in the src/ontology/tmp/oaklib directory in the current repo (this is the case of a user who would like to share the cache across multiple invocations of the ODK in the same repo, but not across all their repos).

This does not necessarily make things easier (it would be equivalent to ODK_SHARE_OAK_CACHE=$PWD/tmp/oaklib, which is not much harder than ODK_SHARE_OAK_CACHE=repo), but it would have the benefit of allowing for standardisation (the per-repo cache would always be located at the same place, instead of allowing people to use sometimes tmp/oaklib and sometimes something else like tmp/oaklib-cache, tmp/cache/oaklib, etc).
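Handling the repo value would be cheap, as the following sketch suggests. The function name expand_oak_cache_setting is hypothetical, and the sketch assumes it is called from src/ontology (where run.sh lives), so that $PWD/tmp/oaklib is the per-repo location mentioned above. The user value is deliberately left out.

```shell
#!/bin/sh
# Hypothetical sketch: expand the special value "repo" of
# ODK_SHARE_OAK_CACHE into an actual path; any other value is taken
# to be an explicit path and is passed through unchanged.
# Assumes the current directory is src/ontology.
expand_oak_cache_setting() {
    case "${ODK_SHARE_OAK_CACHE:-}" in
        repo) printf '%s\n' "$PWD/tmp/oaklib" ;;
        *)    printf '%s\n' "${ODK_SHARE_OAK_CACHE:-}" ;;
    esac
}
```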

@gouttegd gouttegd linked a pull request Oct 22, 2024 that will close this issue