
Usecases question #48

Open
aplavin opened this issue Jun 20, 2024 · 7 comments

aplavin commented Jun 20, 2024

Nice to see a modern take on dataset handling in Julia!
I've been looking at DataToolkit, trying to understand how to apply it and what specific advantages it would bring. I have three different use cases in mind, and cannot really work out how to plug DataToolkit into any of them.
I'm briefly outlining them below; any suggestions are welcome!

  1. Small and sporadically updated table, like 500 rows. Currently, I just put a CSV file into a data folder of the Julia package, and provide a function that reads it into a Julia table with some minor cleanup.
    What can DataToolkit improve here?

  2. An online collection of publicly-available tables. For a specific example, astronomical catalogs at https://vizier.cds.unistra.fr. Currently (in VirtualObservatory.jl) I provide a function that's basically download_and_read_table(catalog_id::String) with some conveniences.
    There are obvious issues with that: the dataset is downloaded anew every time, and one cannot access a dataset without internet (or when the archive is down), even if it was downloaded previously. Some transparent caching would be nice (I've put a sketch of what I currently hand-roll after this list).

  3. A large well-structured collection of files (tables, images, ...), think hundreds of GBs. Currently, I manually ensure that the collection is available on the machine I need to work at, and have an interface like MyDataset("path-to-the-directory").
    It would be nice to have a per-machine config file where the path is defined, so that MyDataset() finds it automatically. Also, maybe some basic presence/sanity checks...
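
For use case 2, here is a minimal sketch of the manual caching I currently end up hand-rolling (all names and paths are purely illustrative, not actual VirtualObservatory.jl code):

using Downloads

# download a file once and reuse the local copy on subsequent calls
function cached_download(url::AbstractString, filename::AbstractString;
                         cachedir = joinpath(homedir(), ".cache", "vizier"))
    mkpath(cachedir)
    path = joinpath(cachedir, filename)
    isfile(path) || Downloads.download(url, path)  # only hit the network when not cached
    return path
end

This is exactly the kind of bookkeeping I'd hope DataToolkit could take over.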

tecosaur (Owner) commented Jul 5, 2024

Hi Alexander, thanks for your interest 🙂. By the sounds of it, DataToolkit should be able to help with some of the use cases you describe, and I'd be very happy to explore how it could do so better :)

(1) Small sporadically updated table

Without more details, it sounds like it would be hard to make much of an improvement over just calling CSV.read/CSV.write.

(2) An online collection of tabular data

This sounds rather applicable. I'd recommend holding off till the 0.10 release, but using DataToolkitBase + a Data.toml in the package, you'd get transparent caching via the "data store" (store plugin).
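
For instance, once a catalogue is declared as a data set in that Data.toml (with the store plugin enabled), user-side access would look something like this (names here are illustrative, and I'm assuming a tabular loader is declared):

using DataToolkit, DataFrames

# the first call fetches the data into the store; subsequent calls
# (even from other projects) reuse the locally cached copy
tbl = read(dataset("J/ApJ/923/67/table2"), DataFrame)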

That said, if you've got 1000s of entries, asking for them to be put in a Data.toml is probably a bit much. You could use the (v0.10) API to construct anonymous DataCollections. This comes with an earlier "automatic cleanup" age (30 days by default), but I imagine that's still helpful.

Something else you could consider is an API for adding a particular catalog to a user's project Data.toml, which would by extension allow them to download all the data referenced in that project using the data> store fetch command. Whether or not this makes sense will depend on how you see the package being used, but if it does I'd be happy to help.
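
Purely as a sketch of how that might look from the user's side (register_catalog is a hypothetical function VirtualObservatory.jl could provide, not something that exists today):

using VirtualObservatory

# hypothetical: append this catalogue as an entry in the active project's Data.toml
VirtualObservatory.register_catalog("J/ApJ/923/67/table2")

# afterwards, everything the project references can be fetched in one go
# from the data REPL:
#   data> store fetch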

(3) A large well-structured collection of files

How large is "large?" I've only tested DataToolkit with a few hundred data sets myself, but it should scale further, and I'd like to improve this area if it doesn't.

To put this collection in a Data.toml, an entry would need to be made for each "dataset" within the collection, but this could be done using the (v0.10) API for ease.

I think the main benefit you'd get from this is data existence and integrity checks.

The per-machine config makes things a little trickier. Since the "data store" is content-addressed, if you have the checksum for a dataset and it exists in the store, it doesn't need to be told where to find it.

That said, currently no work has been done for per-machine config files, beyond cache settings for the "data store".

aplavin (Author) commented Jul 6, 2024

Thanks for the detailed response!

(1) Small sporadically updated table

Without more details, it sounds like it would be hard to make much of an improvement over just calling CSV.read/CSV.write.

It's generally fine, but could use features like accessing older versions of the same dataset. Is that in scope for this package?

using DataToolkitBase + a Data.toml in the package, you'd get transparent caching via the "data store" (store plugin).
That said, if you've got 1000s of entries, asking for them to be put in a Data.toml is probably a bit much.

Oh, there are definitely thousands of entries, if not more! And new ones are regularly being added.
Still, all of them have "permanent" keys, basically paper_id/table_id. Transparent caching would make sense here, together with the "automatic cleanup" you mentioned. Maybe other DataToolkit functions could be useful as well, not sure – caching is just what immediately pops into my mind.

Something else you could consider is an API for adding a particular catalog to a user's project Data.toml, which would by extension allow them to download all the data referenced in that project using the data> store fetch command. Whether or not this makes sense will depend on how you see the package being used, but if it does I'd be happy to help.

Interesting approach... Does it work nicely with Pluto and temp envs more generally?
The current API in its most basic form is just

vizcat = VizierCatalog("J/ApJ/923/67/table2")  # metadata only, no actual data loaded
tbl = table(vizcat)  # downloads and reads the actual table

How do you think it would look in this scenario from the user side? I wonder what exactly "API for adding a particular catalog to a user's project Data.toml" should entail...

(3) A large well-structured collection of files

How large is "large?"

I'm thinking a few × 10^5 files, some hundreds of GB. I don't think it's reasonable to put each file individually into the TOML list.

The per-machine config makes things a little trickier. Since the "data store" is content-addressed, if you have the checksum for a dataset and it exists in the store, it doesn't need to be told where to find it.

What do you mean by "exists in the store"? The data is just available on each machine, either locally at some path or on a remotely-mounted disk. And it should definitely stay that way, since some other tools also access it. I'm just looking for a nice approach to access it from Julia without hardcoding the path in the code, and potentially with some consistency checks.

tecosaur (Owner) commented:

It's generally fine, but could use features like accessing older versions of the same dataset. Is that in scope for this package?

Well, there is the versions plugin already — does that do what you want?

Interesting approach... Does it work nicely with Pluto and temp envs more generally?

My gut feeling is that it should be fine, but I do recall some issues with Pluto specifically when relying on packages made available in the notebook: it's to do with the way Pluto creates temporary modules for each cell.

I wonder what exactly "API for adding a particular catalog to a user's project Data.toml" should entail...

This is currently being reworked in v0.10. Taking your mention of "a particular catalogue" as a DataSet, it might look something like this:

using DataToolkitCore

usrcol = getlayer()  # the active data collection (backed by the project's Data.toml)

# register the catalogue as a data set in that collection
ds = create!(usrcol, DataSet, "J/ApJ/923/67/table2", "description" => "...", ...)
storage!(ds, :web, "url" => "...")  # where the data can be fetched from
loader!(ds, :something, params...)  # how to turn it into a Julia object
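
And then, assuming a suitable loader is set up (and DataFrames is loaded), reading it back would just be:

tbl = read(ds, DataFrame)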

I'm thinking a few × 10^5 files, some hundreds of GB. I don't think it's reasonable to put each file individually into the TOML list.

Yeah, some new tooling/capabilities will be needed to handle something like that.

What do you mean by "exists in the store"? The data is just available on each machine, either locally at some path or on a remotely-mounted disk. And it should definitely stay that way, since some other tools also access it.

"The store" is a directory managed by DataToolkitStore that serves as a garbage-collected content-addressed archive.

You can set a path and hard-code DataToolkit and other tools to use it, or use the store and get the path from DataToolkit.

aplavin (Author) commented Sep 26, 2024

Taking your mention of "a particular catalogue" as a DataSet, it might look something like this: <...>

Hm, that does sound reasonable! I wonder how caching would work when I have the same "catalogue" added independently in several different Julia environments: will it only download and store it once? Would it need internet access at all to add the same catalogue in a new env?

tecosaur (Owner) commented:

I wonder how caching would work when I have the same "catalogue" added independently in several different Julia environments: will it only download and store it once? Would it need internet access at all to add the same catalogue in a new env?

The whole point of a central store is that multiple projects can all reference the same data, and it will only be downloaded/stored once 🙂 (with no internet access needed so long as different projects look to be accessing the same file: by checksum or dataset attributes)

aplavin (Author) commented Sep 26, 2024

Thanks, I'll probably try DataToolkit in this scenario in the near future! I have a couple of private packages that serve the purpose of convenient access to specific datasets, and they are quite isolated, so they make a nice playground to see how it works :)

tecosaur (Owner) commented:

Nice! I suspect this will be a bit more pleasant after v0.10 is out (the example I gave uses the new API), but if you're interested in taking it for a test run I'd be happy to help you work out details/fix bugs you may run into (maybe even improve the API).
