add docs and script to fetch rucio dataset files #165

garciagenrique · 2024-06-20T15:59:04Z

Summary

This PR adds a bash script, to be run on VEGA, that fetches all the files of a Rucio dataset. The script assumes that the full dataset is already present on the VEGA RSE; otherwise, a replication rule would have to be run first.

The script creates one symlink per file, so the user can access the dataset without searching or interacting with the Rucio file structure.

Related issue: #156
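As an illustration of the symlink-per-file idea described above, here is a minimal Python sketch. It is not the bash script from this PR: the function name `link_dataset_files` and its arguments are hypothetical, and it assumes the physical paths of the dataset files on the local storage are already known.

```python
from pathlib import Path
from typing import Iterable, List


def link_dataset_files(file_paths: Iterable[str], link_dir: str) -> List[Path]:
    """Create one symlink per file inside link_dir and return the links.

    Hypothetical helper: file_paths are assumed to be the physical
    locations of the dataset files on the local storage.
    """
    target_dir = Path(link_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for file_path in file_paths:
        source = Path(file_path).resolve()
        # The link keeps the file name, so users see a flat directory.
        link = target_dir / source.name
        # Replace an existing symlink so the function can be re-run.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(source)
        links.append(link)
    return links
```

With this layout, all files of the dataset show up in a single flat directory, regardless of how deeply they are nested in the storage tree.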

Co-authored-by: Giovanni Guerrieri [email protected]

matbun

It would be nice to have an example with a dataset already on the interTwin data lake, e.g., using the CERN use case dataset.
I am a bit unsure about these aspects:

- Usage of the .txt file format: is this a constraint? If not, would it be possible to make it more general?
- Does this script allow copying a whole dataset in one shot?

Also, before merging, I would kindly ask you to fix the linter problems.

tutorials/data-lake/pull-dataset/README.md

tutorials/data-lake/pull-dataset/rucio_dataset_files.sh

matbun · 2024-06-22T10:43:05Z

On second thought, I would suggest the following improvements:

First, incorporating the shell script logic into itwinai would simplify the life of the users. Instead of copying and executing the script, they could call it from within Python. We could also add it to the itwinai CLI. Example:

itwinai get-rucio-dataset <SCOPE:DataSet> <output_file> <output_symlink_dir>

This would save users from managing yet another script, and would let us always ship its latest version with itwinai.
If you think that this is doable, could you implement a function under src/itwinai/rucio.py, please? I will take care of integrating it into the itwinai CLI.

Second, the tutorial on how to get data from Rucio should come after the explanation of how to properly set up the Rucio client. Thus, I have created another tutorial folder under tutorials/data-lake/01-configure-rucio. I know you have some info on that, so could you please add a bit of documentation in the README file?
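Regarding the first suggestion, a Python entry point would need to interpret the `<SCOPE:DataSet>` argument. The sketch below shows one way to split it into its two parts; the helper name `split_did` is hypothetical and is not taken from itwinai or from the Rucio client.

```python
from typing import Tuple


def split_did(did: str) -> Tuple[str, str]:
    """Split a 'scope:name' identifier into (scope, name).

    Hypothetical helper: raises ValueError when either part is missing.
    """
    # Split at the first colon only; anything after it belongs to the name.
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"Expected '<scope>:<name>', got {did!r}")
    return scope, name
```

Validating the identifier up front lets the command fail with a clear message before any call to the data lake is made.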

garciagenrique · 2024-06-25T16:44:18Z

> On second thought, I would suggest the following improvements:
>
> First, incorporating the shell script logic into itwinai would simplify the life of the users. Instead of copying and executing the script, they could call it from within Python. We could also add it to the itwinai CLI. Example:
> itwinai get-rucio-dataset <SCOPE:DataSet> <output_file> <output_symlink_dir>

What about just running itwinai get-rucio-dataset and managing <output_file> and <output_symlink_dir> internally? It would basically amount to (more or less) an os.listdir(output_file).
I can implement it, of course.

> Second, the tutorial on how to get data from Rucio should come after the explanation of how to properly set up the Rucio client. Thus, I have created another tutorial folder under tutorials/data-lake/01-configure-rucio. I know you have some info on that, so could you please add a bit of documentation in the README file?

I have improved the README of this PR. Tomorrow I will add a more detailed tutorial to the directory you pointed to.

matbun · 2024-06-28T10:31:15Z

> What about just running itwinai get-rucio-dataset and managing <output_file> and <output_symlink_dir> internally? It would basically amount to (more or less) an os.listdir(output_file). I can implement it, of course.

Indeed, the simpler the better! I would still keep an optional argument to specify the output folder, so that users are not forced to cd into the target folder before executing the command.

> I have improved the README of this PR. Tomorrow I will add a more detailed tutorial to the directory you pointed to.

Thanks!
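The optional output folder discussed in this comment could look like the following sketch. It uses argparse from the standard library purely for illustration; the framework behind the actual itwinai CLI may differ, and the `--output-dir` flag name and both function names are assumptions.

```python
import argparse
import os
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical get-rucio-dataset command."""
    parser = argparse.ArgumentParser(prog="itwinai get-rucio-dataset")
    parser.add_argument("did", help="Dataset identifier, e.g. SCOPE:DataSet")
    # Optional: when omitted, the current working directory is used.
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Folder in which to create the symlinks (default: current directory)",
    )
    return parser


def resolve_output_dir(output_dir: Optional[str]) -> str:
    """Return the absolute output folder, falling back to the cwd."""
    return os.path.abspath(output_dir if output_dir is not None else os.getcwd())
```

Resolving the default at call time, rather than when the parser is built, means the fallback always reflects the directory the user is in when the command runs.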

garciagenrique force-pushed the 156-easily-access-datasets-on-rucio-data-lake branch from 37c18fe to ef20ca2 Compare June 20, 2024 16:01

garciagenrique requested a review from matbun June 20, 2024 16:01

matbun requested changes Jun 21, 2024


add docs and script to fetch rucio dataset files

76fc9b9

garciagenrique force-pushed the 156-easily-access-datasets-on-rucio-data-lake branch from ef20ca2 to 76fc9b9 Compare June 25, 2024 16:40

matbun linked an issue Jul 2, 2024 that may be closed by this pull request

Easily access datasets on Rucio data lake #156

Open
