Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CacheMapper to map from remote URL to local cached basename #1296

Merged
merged 3 commits into from
Jul 26, 2023

Conversation

ianthomas23
Copy link
Collaborator

@ianthomas23 ianthomas23 commented Jun 20, 2023

This is the start of work to make the caching system more configurable by splitting up different areas of responsibility of CachingFileSystem into separate class hierarchies that will eventually be pluggable.

It starts here by separating out the mapping from remote URL to local cached basename into a new class hierarchy based on AbstractCacheMapper. There is no change in functionality here, the two concrete derived classes cover the same_names=True and same_names=False options of CachingFileSystem. It will easy in future to add new CacheMapper classes or to modify the existing ones with extra attributes, e.g. number of directories to consider as well as the basename.

Implementation details:

  1. CacheMapper is the best name I can think of but I am happy to change it if someone has a better idea.
  2. CachingFileSystem no longer stores the same_names attribute, it is just used to create the CacheMapper which is stored. Removal of the same_names attribute could be considered an API change as it was technically public.
  3. I have kept the CachingFileSystem.hash_name function even though it now defers to the CacheMapper. This is just for backwards compatibility in case downstream libraries are using it. The local fsspec tests all pass if it is removed.

In future, to support more pluggable options we will need to be able to pass CacheMapper instances to the CachingFileSystem constructor, but we will probably also have to keep the same_names arg for backward compatibility.

(Edited: typo)

@martindurant
Copy link
Member

Sorry for not touching this, coming to it soon

@ianthomas23
Copy link
Collaborator Author

Sorry for not touching this, coming to it soon

Thanks, that would be useful. There is quite a lot of refactoring and new functionality to be added on top of this, if you think it is going in the right direction.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can find nothing wrong with this at all, the one tiniest question. Can merge anyway.

Also, would be happy to remove the eq/hash stuff - does it look like it has any use? Maybe no need to do that right now.

fn = os.path.join(self.storage[-1], detail["fn"])
fn = getattr(detail, "fn", "")
if not fn:
# fn should always be set, but be defensive here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like if we have a condition we expect never to hit, it's probably an exception or warning. I suppose this line is not covered by tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to raise an exception, and added an explicit test that the exception is raised in test_cached.py::test_clear_expired as that already contains the machinery to modify the cached metadata.

@ianthomas23
Copy link
Collaborator Author

Also, would be happy to remove the eq/hash stuff - does it look like it has any use? Maybe no need to do that right now.

We can safely remove it here, we'd just need to remove 2 lines of object and hash equality tests in test_chained_equivalent. But I'd be concerned that someone downstream might be using the public hash_name function. If you think it is safe to remove, maybe that should be in a separate PR so it is easy for us to revert if the need should arise?

@ianthomas23
Copy link
Collaborator Author

I'll merge this tomorrow morning (UK time), leaving time for any further comments before then.

@martindurant
Copy link
Member

I'd be concerned that someone downstream might be using the public hash_name function

I meant only __eq__/__hash__. The instances of cachers don't hold resources, and we use the init kwargs for determining whether to make a new instance anyway. I can't remember why they exist - returning the same instance is more important than testing equality, since the cache info dicts are mutable!
I wonder, for the sake of the test, whether is would work because of this, or what the default __eq__ would be for a class like this. In any case, no other fsspec implementations bother to have this.

@ianthomas23
Copy link
Collaborator Author

I meant only __eq__/__hash__.

I see now.

CachingFileSystem has __eq__ and __hash__ methods which previously used, amongst other attributes, self.same_names. Now this has been replaced with self._mapper we need both CacheMapper classes to have identity that depends just on their class.

@martindurant
Copy link
Member

I'll merge this tomorrow

whenever you're ready

@ianthomas23 ianthomas23 merged commit aacc5f2 into fsspec:master Jul 26, 2023
11 checks passed
@ianthomas23 ianthomas23 deleted the cache_mapper branch July 26, 2023 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants