Instantiation after pulling remote dataset is failing #566

ilongin · 2024-11-06T00:46:36Z

Description

This is a follow-up to #539 (see #560 (comment)).

When we instantiate a remote dataset right after pulling it, we get errors about missing listing datasets: the ones referenced in the sources of the remote dataset we just pulled.
We need to adjust the logic of the cp method (used for instantiating datasets) so that it does not require those listing datasets to be present in the DB.
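
The failure mode can be sketched in isolation. The snippet below is only an illustration and does not use datachain code: `InMemoryMetastore`, `listing_name`, and `split_sources` are hypothetical names, and the `lst__` + URI naming is an assumption inferred solely from the error message in this report. It shows one possible way to detect sources whose listing dataset is absent locally instead of raising.

```python
class DatasetNotFoundError(Exception):
    """Stand-in for datachain.error.DatasetNotFoundError (not the real class)."""


class InMemoryMetastore:
    """Hypothetical stand-in for the metastore: just a set of known dataset names."""

    def __init__(self, names):
        self._names = set(names)

    def get_dataset(self, name):
        # Mirrors the behaviour seen in the traceback: unknown names raise.
        if name not in self._names:
            raise DatasetNotFoundError(f"Dataset {name} not found.")
        return name


def listing_name(source_uri):
    # Assumption: a listing dataset is named "lst__" + source URI with a trailing
    # slash, as in "lst__gs://datachain-demo/" from the error message.
    return "lst__" + source_uri.rstrip("/") + "/"


def split_sources(metastore, source_uris):
    """Partition sources into those with a local listing dataset and those without."""
    listed, missing = [], []
    for uri in source_uris:
        try:
            metastore.get_dataset(listing_name(uri))
        except DatasetNotFoundError:
            missing.append(uri)
        else:
            listed.append(uri)
    return listed, missing


# A freshly pulled dataset: its rows exist locally, but the listing of its source does not.
metastore = InMemoryMetastore(["02jpg_files"])
listed, missing = split_sources(metastore, ["gs://datachain-demo"])
print(listed, missing)
```

With a check like this, an instantiation step could decide what to do with the sources in `missing` (for example, work from the pulled rows alone) rather than failing on the lookup. Whether that matches the eventual fix is not something this report confirms.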

Error example:

Version not specified, pulling the latest one (v1)
Saving dataset ds://02jpg_files@v1 locally: 100%|█████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 10.2 rows/s]
Dataset ds://02jpg_files@v1 saved locally
_request non-retriable exception: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 113, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Error: Dataset lst__gs://datachain-demo/ not found.
Traceback (most recent call last):
  File ".../datachain/src/datachain/cli.py", line 1016, in main
    catalog.pull_dataset(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1454, in pull_dataset
    _instantiate_dataset()
  File ".../datachain/src/datachain/catalog/catalog.py", line 1324, in _instantiate_dataset
    self.cp(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1563, in cp
    node_groups = self.enlist_sources_grouped(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../iterative/datachain/src/datachain/catalog/catalog.py", line 703, in enlist_sources_grouped
    listing = Listing(st, client, self.get_dataset(dataset_name))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/src/datachain/catalog/catalog.py", line 1090, in get_dataset
    return self.metastore.get_dataset(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../src/datachain/data_storage/metastore.py", line 704, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
datachain.error.DatasetNotFoundError: Dataset lst__gs://datachain-demo/ not found.
Telemetry is disabled by environment variable.

To reproduce:

from datachain import DataChain, C

ds = DataChain.from_storage("gs://datachain-demo")
ds1 = ds.filter(C('file.path').glob('*.jpg')).save("jpg_files")
ds2 = ds.filter(C('file.path').glob('*.png')).save("png_files")
ds4 = ds1.union(ds2)
ds5 = ds4.save("jpg_png_files")

dsf1 = ds.filter(C("file.path").glob("*02.jpg")).limit(5).save("02jpg_files")

ds6 = ds5.union(dsf1)
ds6.save("all_files")

Version Info

0.3.11.dev99+g0eabe20
Python 3.12.4

ilongin added the bug and priority-p1 labels Nov 6, 2024

ilongin self-assigned this Nov 6, 2024

ilongin mentioned this issue Nov 6, 2024

Dataset pull fixes #560

Merged