The problem is that when we try to instantiate a remote dataset after pulling it, we get errors about missing listing datasets that appear in the sources of the dataset we just pulled.
We need to adjust the logic of the cp method (used for instantiating datasets) so that it does not depend on those listing datasets being present in the local DB.
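The direction of the fix can be sketched as follows. This is a minimal, self-contained illustration, not DataChain's actual API: the names is_listing_dataset and enlist_sources_for_cp, and the "lst__" prefix check, are assumptions based on the dataset name seen in the error below (lst__gs://datachain-demo/). The idea is that cp, when instantiating a pulled dataset, should tolerate listing datasets that are absent from the local DB instead of failing with DatasetNotFoundError.

```python
class DatasetNotFoundError(Exception):
    """Stand-in for datachain.error.DatasetNotFoundError."""


LISTING_PREFIX = "lst__"  # listing datasets appear as e.g. "lst__gs://bucket/"


def is_listing_dataset(name: str) -> bool:
    return name.startswith(LISTING_PREFIX)


def enlist_sources_for_cp(source_names, get_dataset):
    """Collect datasets for instantiation, skipping missing listing datasets.

    `get_dataset` is a callable that raises DatasetNotFoundError when the
    dataset is absent from the local metastore.
    """
    found = []
    for name in source_names:
        try:
            found.append(get_dataset(name))
        except DatasetNotFoundError:
            if is_listing_dataset(name):
                # Listing datasets are local artifacts of indexing a bucket;
                # after a pull they may legitimately be absent locally, so
                # skip them instead of aborting the whole cp.
                continue
            raise
    return found
```

With this shape, a pulled dataset whose sources reference lst__gs://datachain-demo/ would still instantiate, while a genuinely missing non-listing dataset would keep raising.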
Error example:
Version not specified, pulling the latest one (v1)
Saving dataset ds://02jpg_files@v1 locally: 100%|█████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 10.2 rows/s]
Dataset ds://02jpg_files@v1 saved locally
_request non-retriable exception: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 126, in retry_request
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
validate_response(status, contents, path, args)
File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 113, in validate_response
raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Error: Dataset lst__gs://datachain-demo/ not found.
Traceback (most recent call last):
File ".../datachain/src/datachain/cli.py", line 1016, in main
catalog.pull_dataset(
File ".../datachain/src/datachain/catalog/catalog.py", line 1454, in pull_dataset
_instantiate_dataset()
File ".../datachain/src/datachain/catalog/catalog.py", line 1324, in _instantiate_dataset
self.cp(
File ".../datachain/src/datachain/catalog/catalog.py", line 1563, in cp
node_groups = self.enlist_sources_grouped(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../iterative/datachain/src/datachain/catalog/catalog.py", line 703, in enlist_sources_grouped
listing = Listing(st, client, self.get_dataset(dataset_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../datachain/src/datachain/catalog/catalog.py", line 1090, in get_dataset
return self.metastore.get_dataset(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../src/datachain/data_storage/metastore.py", line 704, in get_dataset
raise DatasetNotFoundError(f"Dataset {name} not found.")
datachain.error.DatasetNotFoundError: Dataset lst__gs://datachain-demo/ not found.
Telemetry is disabled by environment variable.
Description
This is a follow-up to issue #539 (see #560 (comment)).
To reproduce:
Version Info
0.3.11.dev99+g0eabe20
Python 3.12.4