Improve usability of Directory datatype #17614
Conversation
@davelopez step 1 for zarr datatype integration

@astrovsky01 this could maybe be an interesting alternative to your colabfold tar archive?
It would be great to have some unit tests for setting metadata. Maybe here https://github.com/galaxyproject/galaxy/blob/dev/test/unit/data/datatypes/test_data.py.
There should be a bit of inspiration for such tests in the folder.
Ok, your opinion about the display_data part is a bit disappointing for me. I spent a considerable amount of time looking at the existing code for the Data datatype class and actually thought I had implemented the display part in the spirit of that parent class's code, but it seems you don't think so. Unfortunately, I don't understand, in particular, your distinction between the datatype's and the client's concern: for potential subclasses of Directory that may define index files, the display_data method will display that index file's content as a preview instead. Data.display_data, in its docstring, also has the warning: In general, there is no urgency here, and I do not intend to get into any heated argument over this. I'm willing to adjust the code and learn about your opinion and the reason behind it. Above are my reasons for implementing this first version the way I did, and all I can say is that I gave this quite some thought, but never expected my first attempt to be perfect, nor even close to it.
I should say that our datatype code is quite sub-optimal in so many places, in part due to us not being very strict with reviewing them (for a good reason, we want to collect all those domain-specific datatypes), and having had to resort to server-side templating for the longest time in Galaxy's history. That in turn means that it's not always appropriate to just model new code on existing code, especially if you're working on important datatypes that we have to build on going forward. Now I don't mind altering stuff in
All of these should eventually go away, just like the "bam-to-sam-to-tabular" display. You can produce a listing of extra files via the API, which is what a directory browser visualization should use to implement a browsable interface. We shouldn't have to stick that into the database. Making the visualizations first-class is a priority on the roadmap and we're not far away from getting there.
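A directory-browser client along those lines could be sketched as follows; the server URL and API key are placeholders, and the extra-files listing endpoint name (`/api/datasets/{id}/extra_files`) is assumed here from the comment above.

```python
import json
import urllib.request


def extra_files_url(galaxy_url: str, dataset_id: str) -> str:
    # Endpoint name assumed: Galaxy's listing of a dataset's extra files.
    return f"{galaxy_url.rstrip('/')}/api/datasets/{dataset_id}/extra_files"


def list_extra_files(galaxy_url: str, api_key: str, dataset_id: str) -> list:
    # Returns entries such as {"class": "File", "path": "..."}; a directory
    # browser visualization would render this listing client-side instead
    # of relying on anything stored in the database.
    request = urllib.request.Request(
        extra_files_url(galaxy_url, dataset_id),
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The point being made above is that the listing is computed on demand from the object store, so nothing about the directory structure needs to be persisted as metadata.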
I would take this comment quite verbatim to mean that the interface can change, in terms of function signature or expected return values. That comment likely comes from an era when we still had datatypes on the tool shed. I don't think the number of subclasses matters, and in my previous comment I suggested that you can implement a directory-style subclass and a concrete implementation that uses the additional data you want to store in the database, which is my number one issue with adding unused metadata elements. And then we'll also see how all that is actually used. What's here is great; could you not just break away the extra metadata elements and add them as a different parent class for your zarr datatype?
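For illustration only, the suggested split could be sketched with stub classes (these are plain Python classes standing in for Galaxy's actual datatype classes, and `ZarrDirectory` is a hypothetical name): a bare directory-style base class with no format-specific metadata, and a concrete subclass that carries whatever additional state it needs.

```python
# Stub sketch, not the real Galaxy datatype hierarchy.
class Directory:
    """Bare directory datatype: no format-specific metadata elements."""

    file_ext = "directory"


class ZarrDirectory(Directory):
    """Hypothetical concrete subclass; zarr-specific metadata and
    behaviour would live here instead of on the generic Directory."""

    file_ext = "zarr"
```

This keeps the generic base free of unused metadata elements while still giving the zarr integration a place for its extras.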
The Biohackathon is coming up soon again. What is the status here? My understanding is that we need this for the upcoming Zarr datatype and Zarr visualisation?
```
@@ -1212,6 +1212,18 @@ def regex_line_dataprovider(
class Directory(Data):
    """Class representing a directory of files."""

    file_ext = "directory"

    def _archive_main_file(
```
Can you add a test for roundtripping a directory via the API ?
Sure I can try!
I've added a test, but I'm not sure why I'm getting `ModuleNotFoundError: No module named 'galaxy'` when running the converter tool... using the UI seems to work fine 🤔
I don't think the API tests are setting up dependency resolution. I am mostly interested in verifying that the structure of the tar archive is the same pre- and post-upload. In fact, I think even the checksum should match if we're not compressing the archive. In either case, if you upload a tar file and download it again, that should be sufficient. The converter is tested in the test framework.
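A structure comparison along those lines can be sketched with the standard library's `tarfile` module (the function name here is illustrative, not part of the test framework):

```python
import tarfile


def tar_structure(path: str) -> dict:
    # Map each member name to its entry type, so the comparison catches
    # added, removed, or retyped entries, not just name differences.
    with tarfile.open(path) as tar:
        return {
            member.name: "Directory" if member.isdir() else "File"
            for member in tar.getmembers()
        }


# After downloading, compare against the uploaded archive, e.g.:
# assert tar_structure("uploaded.tar") == tar_structure("downloaded.tar")
```

If the archives are uncompressed and byte-identical, a checksum comparison subsumes this; the structural check is the weaker but more diagnostic assertion.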
Ahh I see, so you mean something like this fc7e959#diff-a6ab1700bcef9e1585a2bb0f84e8888470a770fb81c3e0337930e7cad573093fR662
I'll do that 👍
I've tried this:

```python
def test_fetch_directory(self, history_id):
    testdir = TestDataResolver().get_filename("testdir.tar")
    with open(testdir, "rb") as fh:
        details = self._upload_and_get_details(
            fh, api="fetch", history_id=history_id, ext="directory", assert_ok=True
        )
    assert details["file_ext"] == "directory"
    assert details["file_size"] == 3584
    content = self.dataset_populator.get_history_dataset_content(
        history_id, dataset=details, to_ext="directory", type="bytes"
    )
    dir_path = decompress_bytes_to_directory(cast(bytes, content))
    assert dir_path.endswith("testdir")
    for path, entry_class in EXPECTED_CONTENTS.items():
        path = os.path.join(dir_path, os.path.pardir, path)
        if entry_class == "Directory":
            assert os.path.isdir(path)
        else:
            assert os.path.isfile(path)
```
But unless I run the converter manually, the `to_ext="directory"` download has an empty extra_files_path. I guess that is why you have more changes persisting extra files in the object store in your referenced branch dev...mvdbeek:galaxy:directory_datatype_improvements#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R132
Should we merge your branch instead? Is something still missing in your branch, or should that be how to do it?
Sorry, I'm a bit lost 😅
@mvdbeek, re-reading your comment:

> In either case if you upload a tar file and download it again that should be sufficient.

do you mean something simpler like this instead?
```python
def test_upload_tar_roundtrip(self, history_id):
    testdir = TestDataResolver().get_filename("testdir.tar")
    expected_hash = md5_hash_file(testdir)
    expected_size = os.path.getsize(testdir)
    with open(testdir, "rb") as fh:
        details = self._upload_and_get_details(fh, api="fetch", history_id=history_id, assert_ok=True)
    assert details["file_ext"] == "tar"
    assert details["file_size"] == expected_size
    content = cast(
        bytes, self.dataset_populator.get_history_dataset_content(history_id, dataset=details, type="bytes")
    )
    assert len(content) == expected_size
    dir_path = decompress_bytes_to_directory(content)
    expected_contents = {
        "testdir": "Directory",
        "testdir/c": "Directory",
        "testdir/a": "File",
        "testdir/b": "File",
        "testdir/c/d": "File",
    }
    assert dir_path.endswith("testdir")
    for path, entry_class in expected_contents.items():
        path = os.path.join(dir_path, os.path.pardir, path)
        if entry_class == "Directory":
            assert os.path.isdir(path)
        else:
            assert os.path.isfile(path)
    with tempfile.NamedTemporaryFile("wb") as temp:
        temp.write(content)
        actual_hash = md5_hash_file(temp.name)
        assert actual_hash == expected_hash
```
If this is what you mean: the uploaded vs. downloaded tar size and contents match, but the hashes don't (not sure why). I still don't see the connection with the Directory datatype or the converter changes in this PR, so I might be misunderstanding something 😞
You've overwritten `_archive_main_file`; as a result, you've made sure you're not getting an (empty) extra file added to the archive.
`test_upload_tar_roundtrip` looks fine to me. Are you sure you don't need to flush, and that that is why the checksums don't match? You could also skip the hash and just compare the bytes. If the contents are the same, with no added or removed files and the same structure, then it's all good.
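The buffering issue can be seen in isolation with a small sketch: writing to a `NamedTemporaryFile` and then reading it back by name before flushing can observe a truncated file, because part of the data may still sit in Python's write buffer (reopening by name works on POSIX; Windows restricts it).

```python
import hashlib
import tempfile

payload = b"x" * 1_000_000  # large enough not to fit entirely in the buffer

with tempfile.NamedTemporaryFile("wb") as temp:
    temp.write(payload)
    # Without this flush, part of `payload` may still be in the
    # Python-level buffer, so reading temp.name back would hash
    # a truncated file and the checksums would not match.
    temp.flush()
    with open(temp.name, "rb") as fh:
        on_disk = fh.read()

assert hashlib.md5(on_disk).digest() == hashlib.md5(payload).digest()
```

Comparing the raw bytes directly, as suggested above, sidesteps the temporary file entirely.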
Ahhh that was it... I was missing the flush 🙈
Thank you very much!!
as suggested by @bernt-matthias
This will go into a sub-datatype when needed.
This should probably go in sub-classes that expect specific directory structures.
Compressed (Upload) -> Directory (Unpack) -> Compressed (Download)
Thanks everyone!

This PR was merged without a "kind/" label, please correct.
This adds functionality to the Directory datatype class, which can now be displayed and downloaded as an archive. It also adds a new archive_to_directory converter that generalizes the existing tar_to_directory one to work with both tar and zip archives, and updates the older converter's requirement to an existing version of the galaxy-util package; previously the exact requirement wasn't installable via conda.
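The converter's core idea, handling both tar and zip input by sniffing the actual archive content rather than trusting the file extension, can be sketched with the standard library (the function name is illustrative, not the converter's real code):

```python
import tarfile
import zipfile


def unpack_archive_to_directory(archive_path: str, dest: str) -> None:
    # is_tarfile / is_zipfile inspect the file's magic bytes, so a
    # mislabelled extension does not matter.
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tar:
            tar.extractall(dest)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"{archive_path} is neither a tar nor a zip archive")
```

A single entry point like this is what lets one converter replace the tar-only tar_to_directory logic.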