allow repeated extra arguments #1673

Open
wants to merge 1 commit into
base: master

karasikov

This fixes the error raised when the user passes extra parameters that are also inferred automatically. For example, it happens in the datasets library:

File ./python-default/3.10.14/lib/python3.10/site-packages/datasets/load.py:2692, in load_from_disk(dataset_path, fs, keep_in_memory, storage_options)
   2689     storage_options = fs.storage_options
   2691 fs: fsspec.AbstractFileSystem
-> 2692 fs, *_ = url_to_fs(dataset_path, **(storage_options or {}))
   2693 if not fs.exists(dataset_path):
   2694     raise FileNotFoundError(f"Directory {dataset_path} not found")

File ./3.10.14/lib/python3.10/site-packages/fsspec/core.py:396, in url_to_fs(url, **kwargs)
    385 known_kwargs = {
    386     "compression",
    387     "encoding",
   (...)
    393     "num",
    394 }
    395 kwargs = {k: v for k, v in kwargs.items() if k not in known_kwargs}
--> 396 chain = _un_chain(url, kwargs)
    397 inkwargs = {}
    398 # Reverse iterate the chain, creating a nested target_* structure

File ./3.10.14/lib/python3.10/site-packages/fsspec/core.py:349, in _un_chain(path, kwargs)
    347 if bit is bits[0]:
    348     kws.update(kwargs)
--> 349 kw = dict(**extra_kwargs, **kws)
    350 bit = cls._strip_protocol(bit)
    351 if (
    352     protocol in {"blockcache", "filecache", "simplecache"}
    353     and "target_protocol" not in kw
    354 ):

TypeError: dict() got multiple values for keyword argument 'account_name'

Demonstration of this fix:

from fsspec.core import url_to_fs
url_to_fs('az://[email protected]/DATA', **{'anon': False, 'account_name': 'ACCOUNT'})

Out before:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from fsspec.core import url_to_fs
----> 2 url_to_fs('az://[email protected]/DATA', **{'anon': False, 'account_name': 'ACCOUNT'})

...
...
    347 if bit is bits[0]:
    348     kws.update(kwargs)
--> 349 kw = dict(**extra_kwargs, **kws)
    350 bit = cls._strip_protocol(bit)
    351 if (
    352     protocol in {"blockcache", "filecache", "simplecache"}
    353     and "target_protocol" not in kw
    354 ):

TypeError: dict() got multiple values for keyword argument 'account_name'

Out after:

(<adlfs.spec.AzureBlobFileSystem at 0x10708f460>, 'DIR/DATA')

@@ -346,7 +346,7 @@ def _un_chain(path, kwargs):
         kws = kwargs.pop(protocol, {})
         if bit is bits[0]:
             kws.update(kwargs)
-        kw = dict(**extra_kwargs, **kws)
+        kw = dict(**{k: v for k, v in extra_kwargs.items() if k not in kws or v != kws[k]}, **kws)
Member

Given the update on the line above, this could be written as repeated updates; that way we can be a little more explicit about the order of precedence. In your model, should user-supplied arguments always win, overriding inferred ones?
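
For illustration, the repeated-updates alternative raised in this comment might look roughly like the sketch below. The helper name merge_with_precedence and the sample values are hypothetical and not part of fsspec; the sketch assumes extra_kwargs holds the automatically inferred values, kws holds the user-supplied ones, and user-supplied values should win.

```python
def merge_with_precedence(extra_kwargs, kws):
    """Merge two kwarg dicts with an explicit order of precedence.

    Hypothetical helper, not part of fsspec. Later updates override
    earlier ones, so user-supplied values win over inferred values.
    """
    kw = {}
    kw.update(extra_kwargs)  # assumed: values inferred automatically (lowest precedence)
    kw.update(kws)           # assumed: values passed by the user (highest precedence)
    return kw


# Made-up sample values, for illustration only.
inferred = {"account_name": "FROM_URL", "container": "DIR"}
supplied = {"account_name": "ACCOUNT", "anon": False}
merged = merge_with_precedence(inferred, supplied)
print(merged)
```

With this approach a conflicting key never raises; the later update silently replaces the earlier value, which is why the order of the two update calls is the whole design decision.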

Author

Exactly. To avoid having to think about priorities at all, here we simply deduplicate identical key-value pairs, so the priority is irrelevant. If the same key has two different values, both are passed as before, which raises the same error, e.g., TypeError: dict() got multiple values for keyword argument 'account_name'

extra_kwargs = {'x': 5, 'y': 4}, kws = {'z': 4} becomes {'x': 5, 'y': 4, 'z': 4}
extra_kwargs = {'x': 5, 'y': 4}, kws = {'x': 5} becomes {'x': 5, 'y': 4}
extra_kwargs = {'x': 5, 'y': 4}, kws = {'x': 4} becomes dict(**{'x': 5, 'y': 4}, **{'x': 4}) and raises TypeError
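
The three cases above can be checked with a small standalone sketch. The function name dedupe_merge is hypothetical; its body reuses the expression from the proposed change outside of fsspec.

```python
def dedupe_merge(extra_kwargs, kws):
    """Standalone copy of the expression from the proposed change.

    Hypothetical helper name. Entries of extra_kwargs that appear in kws
    with an equal value are dropped before merging; a key that has two
    different values is kept in both dicts, so dict() raises TypeError.
    """
    filtered = {
        k: v for k, v in extra_kwargs.items() if k not in kws or v != kws[k]
    }
    return dict(**filtered, **kws)


# Case 1: no overlapping keys, plain merge.
print(dedupe_merge({"x": 5, "y": 4}, {"z": 4}))

# Case 2: the same key with the same value is deduplicated.
print(dedupe_merge({"x": 5, "y": 4}, {"x": 5}))

# Case 3: the same key with different values still raises TypeError.
try:
    dedupe_merge({"x": 5, "y": 4}, {"x": 4})
except TypeError as exc:
    print("TypeError:", exc)
```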

Member

"so the priority is irrelevant"

Checking the values isn't always straightforward; they might not be simple ints and strs. Could we catch the error and pass a useful message to the user?
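
One possible reading of this suggestion is sketched below: skip the value comparison entirely, catch the TypeError, and re-raise it with a message that names the clashing keys. The helper name merge_with_message and the message wording are assumptions made for illustration, not the behaviour of fsspec.

```python
def merge_with_message(extra_kwargs, kws):
    """Merge kwargs, turning a duplicate-key TypeError into a clearer one.

    Hypothetical helper, not part of fsspec. No values are compared, so
    it does not depend on how the values implement equality.
    """
    try:
        return dict(**extra_kwargs, **kws)
    except TypeError as exc:
        dupes = sorted(set(extra_kwargs) & set(kws))
        if not dupes:
            # Some other TypeError (e.g. non-string keys): leave it untouched.
            raise
        raise TypeError(
            f"Arguments {dupes} were inferred automatically and also passed "
            "explicitly; pass each of them only once."
        ) from exc


# Made-up sample values, for illustration only.
try:
    merge_with_message({"account_name": "FROM_URL"}, {"account_name": "ACCOUNT"})
except TypeError as exc:
    print(exc)
```

Unlike the deduplicating expression in the diff, this variant also rejects identical repeated values; the two ideas could be combined by filtering first and wrapping only the final dict() call.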

Member

Can we please include the two passing examples in some sort of test?
