roadmap: what should ipfsspec do? #7

d70-t · 2021-10-27T19:15:22Z

This issue is meant to discuss the purpose of the ipfsspec fsspec backend and to sharpen the overall design.

background

Due to the availability of IPFS -> HTTP gateways, a specialized IPFS backend for fsspec based read access is not required, as it is possible to open any CID using the http backend by accessing

http(s)://<gateway>/ipfs/<CID>

The downside of this approach is that it requires translating from content-based addressing to location-based addressing in user code. Using gateway-aware URLs in user code makes it harder

to use local gateways
to do automatic fallback between multiple gateways
to define a preferred gateway based on the local computing environment

To overcome these downsides, it seems to be beneficial to refer to IPFS resources via a gateway-unaware url like

ipfs://<CID>

and do the translation to HTTP or IPFS when accessing the resource, based on the local computing environment and settings. This was the initial idea of ipfsspec.
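As a rough illustration of that translation step, the sketch below maps a gateway-unaware URL onto a gateway URL. The function name, the IPFS_GATEWAY environment variable and the default gateway are assumptions made for this example, not a description of what ipfsspec actually does.

```python
import os
from typing import Optional

# Assumed default for this sketch; a local node's gateway would work the same way.
DEFAULT_GATEWAY = "https://ipfs.io"


def ipfs_to_http(url: str, gateway: Optional[str] = None) -> str:
    """Translate ipfs://<CID>[/path] into <gateway>/ipfs/<CID>[/path]."""
    prefix = "ipfs://"
    if not url.startswith(prefix):
        raise ValueError(f"not an ipfs:// URL: {url!r}")
    # IPFS_GATEWAY is a hypothetical setting name used only in this example.
    gateway = gateway or os.environ.get("IPFS_GATEWAY", DEFAULT_GATEWAY)
    return f"{gateway.rstrip('/')}/ipfs/{url[len(prefix):]}"
```

User code keeps the ipfs:// form, and only this one function needs to know which gateway the local environment prefers.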

design questions

Is such a library useful at all?

Or should this translation be implemented on a different layer?

Should this library do automatic load balancing / fallback between multiple gateways?

Doing load balancing or fallback properly is not trivial to implement (especially with async).
If the library should just work without user configuration, a solution with fallback is likely required, as otherwise it is not possible to use public gateways and still prefer the local gateway if it is available.
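A deliberately naive, synchronous version of such a fallback could look like the sketch below. It only tries gateways in order and does not attempt load balancing or async handling, which is where the real difficulty lies. The function name and the injected fetch callable are assumptions for this example.

```python
from typing import Callable, Iterable


def fetch_with_fallback(
    cid: str,
    gateways: Iterable[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each gateway in order (local one first) and return the first success."""
    errors = []
    for gateway in gateways:
        url = f"{gateway.rstrip('/')}/ipfs/{cid}"
        try:
            return fetch(url)
        except OSError as exc:  # covers refused connections and timeouts
            errors.append(f"{gateway}: {exc}")
    raise OSError("all gateways failed: " + "; ".join(errors))
```

Listing the local gateway first gives the "prefer local, fall back to public" behaviour. A real fetch could wrap urllib.request.urlopen, whose URLError is a subclass of OSError.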

Should the library provide write support?

... and if yes, how?

IPFS is a content-addressable store, so one cannot choose the filename when adding content. Instead, the "filename" is computed from the stored content. As a result, the signature of a put function would rather look like

cid = put(content)

instead of

put(content, filename)

and thus wouldn't directly fit into fsspec.
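To make the difference in signature concrete, here is a toy in-memory content-addressed store. It is only a sketch: a plain SHA-256 hex digest stands in for a CID (real CIDs are encoded differently and depend on how content is chunked), and the class name is made up for this example.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the content."""

    def __init__(self):
        self._blocks = {}

    def put(self, content: bytes) -> str:
        # The caller cannot pick the name; it is computed from the bytes.
        key = hashlib.sha256(content).hexdigest()
        self._blocks[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]
```

Storing the same bytes twice yields the same key, which is exactly what makes a caller-chosen filename argument awkward.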

A way out might be to use the IPFS mutable filesystem (MFS), which adds a local mutable overlay on top of the immutable filesystem. Using MFS it would be possible to incrementally construct a local filesystem hierarchy and ask for a root CID after construction has finished. The downside of this approach is that it only works locally (or at least local to one gateway) and thus is probably not suited for larger datasets. So there's probably not much benefit compared to writing data into a local temporary folder and then running ipfs add -r -H on the entire folder.

A related option might be to pin data blocks one by one and keep the virtual directory in memory. After writing out a larger dataset this way, a root CID for the remotely stored dataset could be created. An advantage of this approach might be that writing could be distributed across multiple remote gateways.
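The idea of writing blocks one by one and deriving a root identifier at the end can be sketched as follows. This is a toy model only: a dict stands in for the remote store, SHA-256 hex digests stand in for CIDs, and a JSON listing stands in for IPFS's real directory objects. All names are hypothetical.

```python
import hashlib
import json


def put_block(store: dict, content: bytes) -> str:
    """Store one block and return its content-derived key."""
    key = hashlib.sha256(content).hexdigest()
    store[key] = content
    return key


def write_dataset(store: dict, files: dict) -> str:
    """Store each file as a block, then store the directory listing itself."""
    # The virtual directory lives in memory while the blocks are written.
    directory = {name: put_block(store, data) for name, data in files.items()}
    # Sorting keys makes the root independent of the order files were written in.
    listing = json.dumps(directory, sort_keys=True).encode()
    return put_block(store, listing)
```

Since each put_block call is independent, the individual writes could in principle go to different stores, with only the final listing tying them together.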


Erotemic · 2023-10-17T16:53:50Z

I'm interested in this project, here are my thoughts on the questions:

Is such a library useful at all?

With gateways only? No, not really. Gateways are slow and unreliable. They are a fine fallback for content that the gateways have easy access to, but if I'm hosting data on my home network with a bad upload speed and I try to access it from a gateway, it often times out.

Should this library do automatic load balancing / fallback between multiple gateways?

No, not yet. Gateways are a crutch, but a useful one. Allow them to exist as a fallback, but I think the best course of action would be to use an installed IPFS implementation such as kubo as the primary method of accessing data. This would require a Python wrapper library around kubo that abstracts requests to it and is ideally duck-typed with a gateway version of the abstraction, so that both can be used to implement the fsspec hooks.
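A minimal sketch of what such a pair of duck-typed clients might look like is given below. The class and method names are hypothetical, and the /api/v0/cat endpoint on port 5001 is assumed here to be kubo's default RPC API; neither is taken from an existing wrapper library.

```python
import urllib.parse
import urllib.request
from typing import Protocol


class IPFSClient(Protocol):
    """Anything with a cat(cid) -> bytes method can back the fsspec hooks."""

    def cat(self, cid: str) -> bytes: ...


class GatewayClient:
    """Reads content through an HTTP gateway."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def request_for(self, cid: str) -> urllib.request.Request:
        return urllib.request.Request(f"{self.base_url}/ipfs/{cid}")

    def cat(self, cid: str) -> bytes:
        with urllib.request.urlopen(self.request_for(cid), timeout=30) as resp:
            return resp.read()


class KuboRPCClient:
    """Reads content through the RPC API of a locally running node (assumed URL)."""

    def __init__(self, api_url: str = "http://127.0.0.1:5001"):
        self.api_url = api_url.rstrip("/")

    def request_for(self, cid: str) -> urllib.request.Request:
        query = urllib.parse.urlencode({"arg": cid})
        # The RPC API is assumed to expect POST requests.
        return urllib.request.Request(
            f"{self.api_url}/api/v0/cat?{query}", method="POST"
        )

    def cat(self, cid: str) -> bytes:
        with urllib.request.urlopen(self.request_for(cid), timeout=30) as resp:
            return resp.read()
```

Code written against IPFSClient does not need to know which of the two it was handed.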

Should the library provide write support?

Not until read support is very good. Pinning to ipfs has a lot of nuances that users will have various ways of performing. What this library should focus on is being able to access data already pinned on IPFS as efficiently as possible.

In terms of future write support, I think fsspec needs to expand its API to embrace the idea of content-addressable data first. I think such a proposal is in scope for the fsspec project and something its maintainers could be convinced is a good idea.

d70-t · 2023-10-17T18:39:08Z

Thanks a lot for this feedback 🎉. This repo has been relatively quiet for a while (back then, I guess it was go-ipfs 0.12.0), but I hope that things can slowly ramp up again.

With gateways only? No, not really. Gateways are slow and unreliable.

I'm not sure we mean the same thing here: I consider a locally running kubo instance to be a gateway (from ipfsspec's point of view, one can use just the same API). Not-a-gateway would be if the ipfsspec library itself spoke the IPFS protocol and talked to other IPFS nodes that way.

So far, my understanding is that having only one long-running IPFS node on a machine is better than having multiple short-running ones (due to a larger pinset and less impact from startup time). Thus, it could actually be better to talk to the local kubo "gateway" than to have a full-blown IPFS protocol stack inside Python.

Allow them to exist as a fallback, but I think the best course of action would be to utilize an installed ipfs implementation such as kubo as the primary method of accessing data.

This is tricky business: at some point, a decision has to be made about which gateway to use. This could be decided by

user configuration (makes things complicated if mandatory, doesn't help with adoption)
once, when the ipfsspec library is started (can be annoying if you are in a longer-running Python session and start up kubo afterwards, because requests wouldn't be routed to your local kubo)
once per request (probably not super nice and maybe not the best in performance either)
something in between once per interpreter startup and once per request (this essentially boils down to something similar to load balancing, probably one would want to give a very high priority to the local kubo)

So as far as my current understanding goes, there's either manual configuration or load balancing if we want to have public gateways as a fallback option (and I know a couple of users who rely on that fallback option).
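The "something in between" option could be sketched as a selection that is cached for a short time and then re-evaluated, with the local gateway listed first so it always wins when reachable. This is only an illustration: the class name, the injected is_alive probe and the 30 second default are assumptions.

```python
import time


class GatewaySelector:
    """Pick the first reachable gateway and re-check after ttl seconds."""

    def __init__(self, gateways, is_alive, ttl=30.0, clock=time.monotonic):
        self._gateways = list(gateways)  # ordered by priority, local first
        self._is_alive = is_alive        # callable: gateway URL -> bool
        self._ttl = ttl
        self._clock = clock
        self._choice = None
        self._checked_at = 0.0

    def current(self) -> str:
        now = self._clock()
        if self._choice is None or now - self._checked_at >= self._ttl:
            # Re-probe in priority order; a newly started local node is
            # picked up at the next refresh instead of never.
            self._choice = next(
                (g for g in self._gateways if self._is_alive(g)), None
            )
            self._checked_at = now
        if self._choice is None:
            raise RuntimeError("no gateway is reachable")
        return self._choice
```

With this, starting kubo in the middle of a long session costs at most one ttl period before requests move to the local node, without probing on every request.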

Should the library provide write support?

Not until read support is very good.

I agree. Maybe we even want a different kind of library for write support.

d70-t mentioned this issue Oct 27, 2021

NOAA OISST Zarr is now on IPFS - next steps w/ Filecoin? pangeo-forge/roadmap#40
