add docs and script to fetch rucio dataset files #165
base: 156-easily-access-datasets-on-rucio-data-lake
37c18fe to ef20ca2
It would be nice to have an example with a dataset already on the interTwin data lake, e.g., using the CERN use case dataset.
I am a bit unsure about these aspects:
- Usage of the .txt file format: is this a constraint? If not, would it be possible to make it more general?
- Does this script allow copying a whole dataset in one shot?
Also, before merging, I would kindly ask you to fix the linter problems.
On second thought, I would suggest the following improvements.
First, incorporating the shell script logic into itwinai would make life simpler for users. Instead of copying and executing the script, they could call it from within Python. We could also add it to the itwinai CLI. Example: itwinai get-rucio-dataset <SCOPE:DataSet> <output_file> <output_symlink_dir> This would spare users from managing yet another script, and would let us always ship the latest version of it with itwinai.
Second, the tutorial on how to get data from Rucio should come after the explanation of how to properly set up the Rucio client. Thus, I have created another tutorial folder under |
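To make the CLI proposal above concrete, here is a minimal sketch of how such a subcommand could be parsed. It uses only the standard-library `argparse` and is an assumption for illustration: the actual itwinai CLI may be built differently, and `build_parser` and `split_did` are hypothetical helper names, not existing itwinai functions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser mirroring the proposed command line."""
    parser = argparse.ArgumentParser(prog="itwinai")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Subcommand name and positional arguments follow the example in the
    # review comment; everything else here is an assumption.
    get_ds = subparsers.add_parser(
        "get-rucio-dataset",
        help="Fetch the files of a Rucio dataset and symlink them.",
    )
    get_ds.add_argument("dataset", help="Dataset identifier, as SCOPE:DataSet")
    get_ds.add_argument("output_file", help="File that will list the dataset files")
    get_ds.add_argument("output_symlink_dir", help="Directory for the symlinks")
    return parser


def split_did(did: str) -> tuple[str, str]:
    """Split 'SCOPE:DataSet' into (scope, name), rejecting malformed input."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"Expected SCOPE:DataSet, got {did!r}")
    return scope, name
```

With this sketch, `build_parser().parse_args(["get-rucio-dataset", "myscope:mydataset", "files.txt", "links"])` yields a namespace whose `dataset` attribute can be passed to `split_did`. The scope and dataset names in that call are placeholders.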
ef20ca2 to 76fc9b9
What about just doing a
I have improved the |
Indeed, the simpler the better! I would still leave an optional argument to specify the output folder, to avoid over-constraining the users to
Thanks! |
Summary
This PR adds a bash script, to be run on VEGA, that fetches all the files from a Rucio dataset. The script assumes that the full dataset is present on the VEGA RSE; otherwise, a replication rule would have to be run first.
The script creates one symlink per file, allowing the user to access the dataset without searching or interacting with the Rucio file structure.
Related issue: #156
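The symlink step described above can be sketched in Python as follows. This is an illustration, not the script added in this PR: it assumes the physical file paths have already been resolved and written to a text file, one path per line, and `link_files` is a hypothetical helper name.

```python
from pathlib import Path


def link_files(listing: Path, link_dir: Path) -> list[Path]:
    """Create one symlink in link_dir for each path listed in the file."""
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for line in listing.read_text().splitlines():
        target = line.strip()
        if not target:
            continue  # skip blank lines in the listing
        # Name the link after the file itself, hiding the storage layout.
        link = link_dir / Path(target).name
        if link.is_symlink():
            link.unlink()  # replace a stale link left by a previous run
        # Raises FileExistsError if a regular file already has this name.
        link.symlink_to(target)
        links.append(link)
    return links
```

Naming each link after the file's base name gives the user a flat directory to read from. One consequence of that choice is that two files with the same base name in different storage directories would collide, so a real implementation may need a different naming scheme.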
Co-authored-by: Giovanni Guerrieri [email protected]