Hi,
I thought it might be a good idea to put the lazy reference parquet files into git. Using this data directly from git does not seem to be possible, e.g. our GitLab server does not allow byte-range requests, which I guess are required at some point.
So I thought I could add a simplecache:: prefix to the URL and ended up with a catalog which contains entries configured like this:
I thought that this would trigger a download of the required reference files before opening them. For opening (getting metadata and coordinates) it seems to work. But for getting the actual variable data, it seems like the caching is not fast enough or not synchronized correctly. Especially if I use it with dask, I get a lot of OS errors or incomplete parquet errors when accessing it the first time. When accessing it a third time, it usually works, as if the caching has only finished by then.
Is there a way to configure the process to wait for the caching before using the parquets? Or am I on the wrong track here?
Btw, I also tried simplecache::reference::, but then all data that is referenced in the reference files is cached as well. However, I only want to cache the reference parquets...
Thanks and best,
Fabi
My guess is that you have multiple threads on the worker (or multiple workers that see the same filesystem). Since simplecache really is simple, it assumes that if a file is present, it is the whole cached file. So if one thread starts to download and another tries to open the file before that finishes, it will read the partial file on disk. That would account for what you see. I haven't yet thought of how you can solve this...
It may be worth opening an issue on fsspec proposing that the cacher download to a different filename and move the file to its final destination when done (which may result in some files being downloaded multiple times, but that's not too bad).
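To make the "download to a different filename, then move" idea concrete, here is a rough sketch using only the Python standard library. It is not fsspec's code and not how simplecache currently behaves; the function name `cache_atomically` and the `fetch_bytes` callable are hypothetical stand-ins for whatever actually fetches the remote file.

```python
import os
import tempfile


def cache_atomically(fetch_bytes, final_path):
    """Fetch into a temporary file, then rename it to final_path.

    fetch_bytes is a hypothetical zero-argument callable returning the
    file content as bytes (a stand-in for the real download).
    """
    # Only completed downloads ever get the final name, so if the file
    # exists it can be trusted to be whole.
    if os.path.exists(final_path):
        return final_path

    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)

    # The temp file lives in the same directory so the rename below does
    # not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch_bytes())
        # os.replace swaps the file in under its final name in one step,
        # so readers see either no file or the complete file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave partial files behind if the download fails.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Two threads that race here may both download the same file, as noted above, but neither can observe a half-written parquet under the final name.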