Improve usability of Directory datatype #17614

wm75 · 2024-03-06T09:53:51Z

~~This adds functionality to the Directory datatype class, which can now be displayed and downloaded as an archive.~~
It also adds a new archive_to_directory converter that generalizes the existing tar_to_directory one to work with tar and zip archives. Also updates the older converter's requirement to an existing version of the galaxy-util package. Previously the exact requirement wasn't installable via conda.
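As a rough illustration of what generalizing from tar to "tar or zip" involves, here is a minimal standard-library sketch. It is not the converter's actual implementation: the function name `unpack_archive_to_directory` and the `demo_unpack` helper are hypothetical. It detects the format from the file content, on the assumption that the dataset file name may not carry a meaningful extension.

```python
import os
import tarfile
import tempfile
import zipfile


def unpack_archive_to_directory(archive_path: str, target_dir: str) -> None:
    """Extract a tar (optionally compressed) or zip archive into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    # Detect the format from the file content, not from the file name.
    if tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile pick the compression (none, gz, bz2, xz) itself.
        with tarfile.open(archive_path, mode="r:*") as archive:
            archive.extractall(target_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    else:
        raise ValueError(f"Not a tar or zip archive: {archive_path}")


def demo_unpack() -> bool:
    """Build a tiny zip archive, unpack it, and report whether the member exists."""
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = os.path.join(workdir, "example.dat")
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("testdir/a", "hello")
        target = os.path.join(workdir, "unpacked")
        unpack_archive_to_directory(archive_path, target)
        return os.path.isfile(os.path.join(target, "testdir", "a"))
```

A real converter would additionally need to guard against archive members that point outside the target directory; the sketch leaves that out.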

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. upload an archive as either zip or tar
2. convert the datatype to directory
3. explore

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

wm75 · 2024-03-06T09:54:42Z

@davelopez step 1 for zarr datatype integration

wm75 · 2024-03-06T09:58:56Z

@astrovsky01 could maybe be an interesting alternative to your colabfold tar archive?

bernt-matthias

It would be great to have some unit tests for setting metadata. Maybe here https://github.com/galaxyproject/galaxy/blob/dev/test/unit/data/datatypes/test_data.py.

There should be a bit of inspiration for such tests in the folder.

lib/galaxy/datatypes/converters/archive_to_directory.xml

lib/galaxy/config/sample/datatypes_conf.xml.sample

lib/galaxy/datatypes/data.py

wm75 · 2024-03-06T18:37:30Z

OK, your opinion about the display_data part is a bit disappointing for me. I've spent a considerable amount of time looking at the existing code for the Data datatype class and actually thought that I had implemented the display part in the spirit of that parent class's code, but it seems you don't think so.

Unfortunately, I don't understand, in particular, your distinction between the datatype's and the client's concern:
Data.display_data itself implements a poor man's directory display for the case that a link like datasets/ddaca2bad6847b13/display/dataset_22960784-8945-48d8-86fb-46d3c00a8b3e points to a filename in extra_files_path, which is actually a directory. In the case that the link leads to a file the same method will serve that file. So depending on its input params that method is making lots of decisions and can return very different things.
My proposed subclass method leaves almost all of this untouched, but special-cases one more situation, which is links like datasets/ddaca2bad6847b13/preview/. For this particular case, it will use the parent class' already existing directory display to show the contents of the root folder, which I don't think is a very far-fetched preview for a directory datatype.

Now for potential subclasses of Directory that may define index files, the display_data method will display that index file's content as a preview instead.

Data.display_data, in its docstring, also carries the warning "Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change.", so I thought it'd be better to implement enough flexibility in Directory.display_data that subclasses won't immediately have to override the method again; instead, there's one hopefully carefully reviewed place where the magic happens. That's why I handle the index file case in the Directory class even though I agree with you that it's of no relevance for that class itself. It also seems logical that a number of directory subclasses will have something to use as an index file.
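The design argued for here (one base-class method with a hook, so subclasses only declare an index file) can be sketched with toy classes. This is an illustration only: `DirectorySketch`, `IndexedDirectorySketch`, the `index_file` attribute and the `preview` method are hypothetical names and do not reproduce Galaxy's Directory class or the signature of display_data.

```python
import os
import tempfile
from typing import Optional, Tuple


class DirectorySketch:
    """Toy stand-in for a directory datatype; not Galaxy's Directory class."""

    # Hypothetical hook: subclasses may name an index file relative to the root.
    index_file: Optional[str] = None

    def preview(self, root: str) -> str:
        if self.index_file is not None:
            # A subclass declared an index file, so show its content instead.
            with open(os.path.join(root, self.index_file), encoding="utf-8") as fh:
                return fh.read()
        # Default preview: a sorted listing of the root folder's entries.
        return "\n".join(sorted(os.listdir(root)))


class IndexedDirectorySketch(DirectorySketch):
    """Subclass that only sets the hook and inherits the preview logic."""

    index_file = "index.txt"


def demo_preview() -> Tuple[str, str]:
    """Return the previews of the plain and the indexed sketch for one folder."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "index.txt"), "w", encoding="utf-8") as fh:
            fh.write("summary")
        open(os.path.join(root, "a"), "w").close()
        return DirectorySketch().preview(root), IndexedDirectorySketch().preview(root)
```

The subclass never overrides `preview`, which is the point of the argument: the branching logic stays in one reviewed place.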

In general, there is no urgency here, and I do not intend to get into any heated argument over this. I'm willing to adjust the code and learn about your opinion and the reason behind it. Above are my reasons for implementing this first version like I did, and all I can say is that I gave this quite some thought, but never expected my first attempt to be perfect nor even close to it.

mvdbeek · 2024-03-07T09:28:36Z

I should say that our datatype code is quite sub-optimal in so many places, in part due to us not being very strict with reviewing them (for a good reason, we want to collect all those domain-specific datatypes), and having had to resort to server-side templating for the longest time in Galaxy's history. That in turn means that it's not always appropriate to just model new code on existing code, especially if you're working on important datatypes that we have to build on going forward. Now I don't mind altering stuff in display_data, we can always remove that later, but I don't think we should add more data to the database that we're not using.

> itself implements a poor man's directory display for the case

All of these should eventually go away, just like the "bam-to-sam-to-tabular" display. You can produce a listing of extra files via the API, which is what a directory browser visualization should use to implement a browsable interface. We shouldn't have to stick that into the database. Making the visualizations first-class is a priority on the roadmap and we're not far away from getting there.
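To make the idea of a client-side directory browser more concrete, the following sketch produces the kind of flat listing such a visualization could consume. It walks a local folder and is not a call to Galaxy's API; the function name `list_directory_entries` and the `class`/`path` keys are assumptions, loosely modeled on the "Directory"/"File" naming used in the tests later in this thread.

```python
import os
import tempfile
from typing import Dict, List


def list_directory_entries(root: str) -> List[Dict[str, str]]:
    """Return one entry per directory and file below root, with relative paths."""
    entries: List[Dict[str, str]] = []
    for current, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place also makes the walk order deterministic
        for name in dirnames:
            rel = os.path.relpath(os.path.join(current, name), root)
            entries.append({"class": "Directory", "path": rel})
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(current, name), root)
            entries.append({"class": "File", "path": rel})
    return entries


def demo_listing() -> List[Dict[str, str]]:
    """List a small temporary tree containing a file `a` and a file `c/d`."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "c"))
        open(os.path.join(root, "a"), "w").close()
        open(os.path.join(root, "c", "d"), "w").close()
        return list_directory_entries(root)
```

With a listing like this available to the client, nothing about the folder structure needs to be stored as datatype metadata.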

> Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change.

I would take this comment quite literally: the interface can change, in terms of function signature or expected return values. That comment likely comes from an era when we still had datatypes on the tool shed. I don't think that the number of subclasses matters, and in my previous comment I suggested that you can implement a directory-style subclass and a concrete implementation that uses the additional data you want to store in the database, which is my number one issue with adding unused metadata elements. And then we'll also see how all that is actually used.

What's here is great. Could you not just split off the extra metadata elements and add them to a different parent class for your zarr datatype?

bgruening · 2024-10-07T11:55:03Z

The Biohackathon is coming up soon again. What is the status here? My understanding is that we need this for the upcoming Zarr datatype and Zarr Visualisation?

lib/galaxy/datatypes/data.py

lib/galaxy/datatypes/converters/archive_to_directory.xml

mvdbeek · 2024-10-14T10:19:41Z

lib/galaxy/datatypes/data.py

@@ -1212,6 +1212,18 @@ def regex_line_dataprovider(
 class Directory(Data):
    """Class representing a directory of files."""

+    file_ext = "directory"
+
+    def _archive_main_file(


Can you add a test for roundtripping a directory via the API?

Sure I can try!

I've added a test, but I'm not sure why I'm getting ModuleNotFoundError: No module named 'galaxy' when running the converter tool... using the UI seems to work fine 🤔

I don't think the API tests are setting up dependency resolution. I am mostly interested in verifying that the structure of the tar archive is the same pre- and post-upload. In fact, I think even the checksum should match if we're not compressing the archive. In either case, if you upload a tar file and download it again, that should be sufficient. The converter is tested in the test framework.
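A structural comparison of the kind described here can be sketched without any Galaxy machinery. The helpers below are hypothetical (`tar_structure`, `build_sample_tar`) and use only the standard library: they reduce a tar archive to a mapping of member name to entry kind, so two archives can be compared even if their bytes differ.

```python
import io
import tarfile
from typing import Dict


def tar_structure(data: bytes) -> Dict[str, str]:
    """Map each member name in a tar archive to "Directory" or "File"."""
    structure: Dict[str, str] = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for member in tar.getmembers():
            structure[member.name] = "Directory" if member.isdir() else "File"
    return structure


def build_sample_tar() -> bytes:
    """Build a small uncompressed tar in memory: one directory with one file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        directory = tarfile.TarInfo("testdir")
        directory.type = tarfile.DIRTYPE
        tar.addfile(directory)
        payload = b"hello"
        info = tarfile.TarInfo("testdir/a")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


uploaded = build_sample_tar()
downloaded = bytes(uploaded)  # stand-in for the bytes fetched back after upload
same_structure = tar_structure(uploaded) == tar_structure(downloaded)
```

If the archive is stored and served unmodified, comparing the raw bytes directly is the stricter check; the structural comparison is the fallback when the bytes may legitimately differ.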

Ahh I see, so you mean something like this fc7e959#diff-a6ab1700bcef9e1585a2bb0f84e8888470a770fb81c3e0337930e7cad573093fR662

I'll do that 👍

I've tried this:

```python
def test_fetch_directory(self, history_id):
    testdir = TestDataResolver().get_filename("testdir.tar")
    with open(testdir, "rb") as fh:
        details = self._upload_and_get_details(
            fh, api="fetch", history_id=history_id, ext="directory", assert_ok=True
        )
    assert details["file_ext"] == "directory"
    assert details["file_size"] == 3584
    content = self.dataset_populator.get_history_dataset_content(
        history_id, dataset=details, to_ext="directory", type="bytes"
    )
    dir_path = decompress_bytes_to_directory(cast(bytes, content))
    assert dir_path.endswith("testdir")
    for path, entry_class in EXPECTED_CONTENTS.items():
        path = os.path.join(dir_path, os.path.pardir, path)
        if entry_class == "Directory":
            assert os.path.isdir(path)
        else:
            assert os.path.isfile(path)
```

But unless I run the converter manually instead of using to_ext="directory", the extra_files_path is empty. I guess that is why you have more changes persisting extra files in the object store in your referenced branch dev...mvdbeek:galaxy:directory_datatype_improvements#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R132

Should we merge your branch instead? Is something still missing in your branch, or is that how it should be done?

Sorry, I'm a bit lost 😅

@mvdbeek, re-reading your comment:

> In either case if you upload a tar file and download it again that should be sufficient.

do you mean something simpler like this instead?

```python
def test_upload_tar_roundtrip(self, history_id):
    testdir = TestDataResolver().get_filename("testdir.tar")
    expected_hash = md5_hash_file(testdir)
    expected_size = os.path.getsize(testdir)
    with open(testdir, "rb") as fh:
        details = self._upload_and_get_details(fh, api="fetch", history_id=history_id, assert_ok=True)
    assert details["file_ext"] == "tar"
    assert details["file_size"] == expected_size
    content = cast(
        bytes, self.dataset_populator.get_history_dataset_content(history_id, dataset=details, type="bytes")
    )
    assert len(content) == expected_size
    dir_path = decompress_bytes_to_directory(content)
    expected_contents = {
        "testdir": "Directory",
        "testdir/c": "Directory",
        "testdir/a": "File",
        "testdir/b": "File",
        "testdir/c/d": "File",
    }
    assert dir_path.endswith("testdir")
    for path, entry_class in expected_contents.items():
        path = os.path.join(dir_path, os.path.pardir, path)
        if entry_class == "Directory":
            assert os.path.isdir(path)
        else:
            assert os.path.isfile(path)
    with tempfile.NamedTemporaryFile("wb") as temp:
        temp.write(content)
        actual_hash = md5_hash_file(temp.name)
        assert actual_hash == expected_hash
```

If this is what you mean, the uploaded vs downloaded tar size and contents match, but the hashes don't (not sure why).
I still don't see the connection with the directory datatype or the converter changes in this PR, so I might be misunderstanding something 😞

You've overridden _archive_main_file; as a result, you've made sure you're not getting an (empty) extra file added to the archive.

test_upload_tar_roundtrip looks fine to me. Are you sure you don't need to flush? That may be why the checksums don't match. You could also skip the hash and just compare the bytes. If the contents are the same, with no added or removed files or structure, then it's all good.

Ahhh that was it... I was missing the flush 🙈

Thank you very much!!
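The flush issue resolved above can be reproduced outside the test suite. The sketch below uses hypothetical helper names and SHA-256 instead of the md5_hash_file helper from the thread, since the choice of hash does not matter for the effect. It writes bytes to a file, flushes, and only then hashes the file by path while the handle is still open.

```python
import hashlib
import os
import tempfile


def hash_file(path: str) -> str:
    """Hash a file by path, reading it from disk in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_via_temp_file(content: bytes) -> str:
    """Write content to a temporary file and hash it by name while still open."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "roundtrip.tar")
        with open(path, "wb") as temp:
            temp.write(content)
            # Without this flush, written bytes can still sit in Python's
            # buffer, so a reader opening the path may see a truncated or
            # empty file and compute a different hash.
            temp.flush()
            return hash_file(path)


content = b"example archive bytes"
roundtrip_hash = hash_via_temp_file(content)
```

Comparing `content` against the original bytes directly, as suggested in the review, avoids the temporary file and the buffering question altogether.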

bgruening · 2024-10-30T20:03:43Z

Thanks everyone!

github-actions · 2024-10-30T20:04:05Z

This PR was merged without a "kind/" label, please correct.

github-actions bot added the area/datatypes label Mar 6, 2024

github-actions bot added this to the 24.1 milestone Mar 6, 2024

wm75 requested review from mvdbeek and bernt-matthias March 6, 2024 09:55

bernt-matthias reviewed Mar 6, 2024

mvdbeek reviewed Mar 6, 2024

wm75 commented Mar 6, 2024

mvdbeek reviewed Mar 6, 2024

mvdbeek removed this from the 24.1 milestone May 14, 2024

mvdbeek self-requested a review May 14, 2024 14:29

davelopez marked this pull request as draft October 9, 2024 08:14

davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 687d71e to aebb357 Compare October 9, 2024 13:10

davelopez marked this pull request as ready for review October 9, 2024 13:18

github-actions bot added this to the 24.2 milestone Oct 9, 2024

davelopez mentioned this pull request Oct 9, 2024

Fix extra files path type hint #18958

Merged

4 tasks

davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 0d5123b to af45cd1 Compare October 10, 2024 10:49

mvdbeek reviewed Oct 10, 2024

mvdbeek reviewed Oct 10, 2024

davelopez marked this pull request as draft October 11, 2024 08:34

davelopez force-pushed the archive-to-dir-converter branch from aa91fbd to ee24ca3 Compare October 11, 2024 08:42

davelopez reviewed Oct 11, 2024

mvdbeek reviewed Oct 11, 2024

davelopez marked this pull request as ready for review October 11, 2024 14:37

davelopez force-pushed the archive-to-dir-converter branch from b9df77a to 6f0713b Compare October 14, 2024 07:34

mvdbeek reviewed Oct 14, 2024

davelopez marked this pull request as draft October 14, 2024 11:59

davelopez mentioned this pull request Oct 22, 2024

Add some Zarr-based datatypes #19040

Merged

4 tasks

davelopez requested a review from mvdbeek October 29, 2024 08:46

wm75 and others added 16 commits October 30, 2024 16:59

Improve usability of Directory datatype

7858f26

Use dataset's total_size property instead of calculating it again

7833391

Add auto-converters for tar.gz and tar.bz2

fcdc27d

as suggested by @bernt-matthias

Make new converter testable

2ff6d33

Add file_ext and missing type annotation

8e538f2

Fix linting

7484a6c

Fix code formatting

566d526

Drop index-related metadata from Directory datatype

4f64885

This will go into a sub-datatype when needed.

Drop root_folder metadata from Directory datatype

5530e15

This should probably go in sub-classes that expect specific directory structures.

Do not set peek for directories

4a506fe

Set output format to directory

5ce4581

Copy metadata to galaxy.json in archive_to_directory converter

673dd37

Revert back to use provided_metadata_file

75f9de3

Remove set_peek method from Directory datatype

dfd2184

Add roundtrip test for downloading directories

ccd7b52

Compressed (Upload) -> Directory (Unpack) -> Compressed (Download)

Add TAR and ZIP roundtrip tests

9ef9a43

davelopez force-pushed the archive-to-dir-converter branch from a9d4a60 to 9ef9a43 Compare October 30, 2024 15:59

mvdbeek approved these changes Oct 30, 2024

Flush temporary file before testing hashes

14eb0c3

davelopez marked this pull request as ready for review October 30, 2024 17:15

bgruening merged commit c6a10d6 into galaxyproject:dev Oct 30, 2024
55 checks passed

bgruening added the kind/enhancement label Oct 30, 2024
