
RFC Request: dynamic p2p dataset replication #37

Open
b5 opened this issue Feb 27, 2019 · 3 comments
Comments

@b5
Member

b5 commented Feb 27, 2019

We've been thinking about dynamic dataset replication for a while. While this is definitely future work, @camhunt had the following question:

Has there been any thought/discussion in leveraging gossipsub so that any node that subscribes to a channel will automatically replicate any data set advertised?

So I think it's time we laid out some thinking on this & started roadmapping. This is all very preliminary, but here's what I've got so far.

First, here's what I think we want:
Qri users should be able to express the kinds of dataset they're interested in, and Qri should automatically grab that kind of data when others publish it. Qri shouldn't do stupid things like eat the user's entire hard drive, or download stuff that could get the user in trouble

Some example kinds of dataset:

  • data published by a particular peer. eg: gimme everything fivethirtyeight publishes!
  • datasets that match particular metadata fields. eg: I want climate change datasets
  • datasets that exactly match a schema. eg: I know users are running a script that produces a dataset with a specific schema that I know about; grab any data that exactly matches that schema
  • datasets that kinda match a schema. eg: If this dataset works with the geojson spec, grab it
  • datasets whose data contains lots of repeating patterns. eg: matching based on a Rabin fingerprint

Users should also be able to define constraints on what they grab. Examples:

  • No dataset larger than 5MB
  • I only want to keep the latest version of a dataset
  • if a dataset ever contains this phrase, drop it.
  • I don't want datasets with visualizations
  • I don't want to pin this if it's already pinned by lots of other people (highly available)
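To make that concrete, here's a rough sketch of what a combined interest-filter-plus-constraints rule could look like. Everything here (`ReplicationRule`, `DatasetInfo`, `ShouldPin`) is hypothetical for illustration, not anything that exists in Qri today:

```go
package main

import "fmt"

// DatasetInfo is a stand-in for the metadata a peer would learn
// about a newly published dataset. Hypothetical type.
type DatasetInfo struct {
	Peer     string
	Tags     []string
	SizeByte int64
}

// ReplicationRule pairs an interest filter (who/what to grab) with
// hard constraints (what to refuse). Hypothetical type.
type ReplicationRule struct {
	FromPeer    string   // only datasets published by this peer ("" = any)
	RequireTags []string // metadata tags the dataset must carry
	MaxSizeByte int64    // constraint: skip anything larger (0 = no limit)
}

// ShouldPin applies the interest filters first, then the constraints.
func (r ReplicationRule) ShouldPin(ds DatasetInfo) bool {
	if r.FromPeer != "" && ds.Peer != r.FromPeer {
		return false
	}
	for _, want := range r.RequireTags {
		found := false
		for _, t := range ds.Tags {
			if t == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if r.MaxSizeByte > 0 && ds.SizeByte > r.MaxSizeByte {
		return false
	}
	return true
}

func main() {
	rule := ReplicationRule{FromPeer: "fivethirtyeight", RequireTags: []string{"climate"}, MaxSizeByte: 5 << 20}
	ds := DatasetInfo{Peer: "fivethirtyeight", Tags: []string{"climate", "weather"}, SizeByte: 1 << 20}
	fmt.Println(rule.ShouldPin(ds)) // true
}
```

The point of keeping filters and constraints in one rule is that constraints always veto: a dataset that matches every interest still gets skipped if it blows the size budget.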

I don't think we need gossipsub for this (yet). We have a custom p2p protocol that's more bandwidth-efficient than gossipsub (simply by avoiding an extra source of bandwidth consumption). More importantly, Qri p2p "understands" datasets, which we're going to need in the long term to support rules like: "I'd like to replicate any dataset from peer steve that has the tag climate and is less than 100MB". Most of those rules require understanding what a dataset is, so it'll be easier to build this into Qri from the start.

All the work we're doing on remotes builds toward this. After we land the first remotes RFC (the over-HTTP version of remotes), the next step will be to move dsync into p2p, so peers can sync only the blocks of a dataset they're missing when directly connected, and fall back to pin-and-wait via the IPFS DHT when no direct connection can be established (publishing to places like Qri & non-NAT remotes will help here).

After that's in place, we'll need two things that will require some thought:

  1. Some way to notify peers about new datasets so they can decide whether or not to pin them
  2. A system for making pinning decisions we can grow into

Announcing New Datasets

Qri's p2p protocol already has the notion of announcing events. So far we can only announce that a peer has connected: https://github.com/qri-io/qri/blob/7fc0f8d44c32dbc89b850e986d40cc1afc801e10/p2p/connected.go#L40, so the distance here is shorter than it looks. Once we can do a dsync over p2p, all we need is an Announce message for whenever a new dataset is created, structured so that peers selectively propagate that message to the peers they're connected to (selectively, to avoid DDoSing the network with announce messages). This is the exact same pattern gossipsub uses, but we can make smarter decisions about where to send those messages because Qri is domain-specific to datasets. We have lots to learn about how our network behaves at scale before going there.
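A sketch of what such an announce message and bounded fan-out could look like. The message fields and the `forwardTo` policy are assumptions for illustration, not Qri's actual p2p protocol:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Announce is a hypothetical message a peer would broadcast when a
// new dataset version is created.
type Announce struct {
	PeerID  string // publisher's peer ID
	Ref     string // dataset reference, e.g. "peer/dataset_name"
	HeadCID string // content hash of the new head version
	TTL     int    // remaining hops before the message is dropped
}

// forwardTo picks at most fanout peers to relay the announce to, so
// one new dataset doesn't flood every connection. Bounding fan-out
// instead of broadcasting is the "selective propagation" idea above.
func forwardTo(peers []string, fanout int) []string {
	if len(peers) <= fanout {
		return peers
	}
	idx := rand.Perm(len(peers))[:fanout]
	out := make([]string, 0, fanout)
	for _, i := range idx {
		out = append(out, peers[i])
	}
	return out
}

func main() {
	msg := Announce{PeerID: "peer-a", Ref: "peer-a/example", HeadCID: "Qm...", TTL: 3}
	peers := []string{"b", "c", "d", "e", "f"}
	fmt.Println(msg.Ref, len(forwardTo(peers, 3))) // relays to 3 of 5 peers
}
```

Decrementing `TTL` at each hop (and dropping at zero) would bound how far a single announce can travel, which is where domain-specific smarts about *which* peers to pick could slot in later.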

When combined with IPFS public proxies, it would even mean that nodes that can't directly connect to each other (e.g., due to NAT issues) could still replicate datasets.

There are a few nice things coming down the pike on this topic: AutoNAT will land in go-ipfs 0.4.19, and we'll start setting up a NAT relay on our bootstrap nodes, which should mean Qri peers have an easier time finding each other. That's part of the reason I'd like to wait.

Pinning Decisions

This doesn't solve the bad actor problem (which we could solve - crudely - by a whitelist of nodes which any given node would auto-replicate new data sets)

Totally agreed here. Once we can do all of the above, the last thing is a way to describe what you want to pin in a way that is "safe". Manually maintained lists of trusted peers are the perfect place to start on pinning decisions. All the filtering based on types of datasets can wait, because I think we can get a long way on just "pin whatever peer_a publishes". We can use that time to work on the other bit.

To me the big goal here is getting peers into a position of mutual support. That constraint of "don't pin this if the network has it in at least 3 places all the time" is the thing I've wanted since data rescue started.

I'm really intrigued by the work @nehbit has been doing on Aether. His notion of "neighborhoods" of slow-rotating peers for propagating content is really great work. I'm hoping we can come up with something that builds toward the notion of a neighborhood, where peers make pinning decisions not only based on what they want to keep, but also, to what degree a dataset is available on the network. This was part of the original promise of IPFS's bitswap algorithm and it hasn't lived up to the high-availability standard, so I don't consider this an easy problem.

I think the best place to start there is to ask people to manually tell us who they trust, which will help us understand how people want to organize themselves. We can use that to inform what will eventually need to be a formally specified algorithm.

@nehbit

nehbit commented Feb 27, 2019

Hey there 👋

Qri users should be able to express the kinds of dataset they're interested in, and Qri should automatically grab that kind of data when others publish it. Qri shouldn't do stupid things like eat the user's entire hard drive, or download stuff that could get the user in trouble

This is something I'm working on with Aether as well. Not because the dataset I have is large, but because people do have certain preferences about which types of content they want to host, regardless of the content's legal status. Some people don't want to host content from certain communities for their own personal reasons. I'm investigating several possible avenues for this as well.

I'm really intrigued by the work @nehbit has been doing on Aether. His notion of "neighborhoods" of slow-rotating peers for propagating content is really great work. I'm hoping we can come up with something that builds toward the notion of a neighborhood, where peers make pinning decisions not only based on what they want to keep, but also, to what degree a dataset is available on the network. This was part of the original promise of IPFS's bitswap algorithm and it hasn't lived up to the high-availability standard, so I don't consider this an easy problem.

For neighbourhoods to work, you have to build them across peers that are equivalents in terms of the data they want. Aether can do them over the whole network because the app itself is defined in such a way that all nodes will want all the data, so that anyone can be in a neighbourhood with anyone else.

If you want nodes that want different pieces of data (or track different data streams), you might want to create neighbourhoods in a pattern that groups people who want the exact same dataset combination, and make them more stable (fewer inserts/ejects) so that in effect they become semi-permanent swarms. If you want datasets A, B, and C, in the best case you find a neighbourhood composed of people tracking all of A, B, and C; in the worst case you find three neighbourhoods that track A, B, and C separately.

The current implementation in Aether is very simplistic and aided massively by the fact that Aether nodes want all the data, so all potential neighbours are valid neighbours. When people start to choose the data streams they want to share with others via choosing communities they want to broadcast, then I'll have to expand on this.
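A rough sketch of that best-case/worst-case matching, with hypothetical types (here a neighbourhood is just a name mapped to the datasets it tracks — not how Aether actually models this):

```go
package main

import "fmt"

// contains reports whether item appears in set.
func contains(set []string, item string) bool {
	for _, s := range set {
		if s == item {
			return true
		}
	}
	return false
}

// containsAll reports whether every item appears in set.
func containsAll(set, items []string) bool {
	for _, it := range items {
		if !contains(set, it) {
			return false
		}
	}
	return true
}

// findNeighbourhoods prefers a single neighbourhood tracking every
// wanted dataset (best case); otherwise it falls back to one
// neighbourhood per dataset (worst case).
func findNeighbourhoods(wanted []string, hoods map[string][]string) []string {
	// best case: one hood tracks all of A, B, and C
	for name, tracks := range hoods {
		if containsAll(tracks, wanted) {
			return []string{name}
		}
	}
	// worst case: one hood per wanted dataset
	var out []string
	for _, w := range wanted {
		for name, tracks := range hoods {
			if contains(tracks, w) {
				out = append(out, name)
				break
			}
		}
	}
	return out
}

func main() {
	hoods := map[string][]string{"h1": {"A", "B", "C"}, "h2": {"A"}}
	fmt.Println(findNeighbourhoods([]string{"A", "B", "C"}, hoods)) // [h1]
}
```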

@b5
Member Author

b5 commented Feb 28, 2019

For neighbourhoods to work, you have to build them across peers that are equivalents in terms of the data they want.

this ☝️ strikes me as a product of deep thought. Seems like the challenge here is expressive equivalency. Two equal things have only one permutation that works (either they're equal, or they aren't), which is a great place to start.

In Aether today all peers are equivalent, because they're hard-wired to want the same data. It seems to me you've introduced nuance around what peers do with that information (if I understand how you do variable-length rolling graph retention correctly). More reasons I think Aether is the bee's knees.

On the Qri side of things, we've started on the other end of the spectrum. By default all Qri peers want no data at all, and must be explicitly told to go get datasets, which is why it's not surprising to me at all that our users want ways to automate data grabbing.

But in both cases I think we're looking for the same thing, some way to broaden the definition of what they want.

Just jotting an idea down here to explore properties of the problem:

  • however people manage to express statements about desired data, I think it would be a very useful property to record each of those statements monotonically, such that all of a peer's data interests could be stored as a CRDT. A CRDT would naturally abstract away unnecessary details like the order statements arrived in
  • this CRDT would form a ruleset that a peer could apply to make a keep-or-skip decision on whatever the atomic unit of data is (example: a thread in Aether, a dataset in Qri)
  • if each peer could sum that CRDT into a single figure, that would produce a method for checking whether two peers' data interests are equal regardless of the order in which they've expressed their interests. Any variation in their interests would produce inequality, which is a good property in this context
  • We could use a trick similar to Kademlia's binary XOR method to establish a routing table of data interests (sums of CRDTs), which could mean peers that want exactly the same data land in the same neighborhood, and each neighborhood lands in the same place on the "city map" of neighborhoods. If this existed in both Aether & Qri today, all peers within each app would be in the same neighborhood, using the same ruleset, and Qri peers & Aether peers would be in different neighborhoods: keep everything for Aether, keep nothing for Qri.
  • If an Aether peer changes their ruleset to say "don't keep thread x", their sum-of-ruleset changes, and they leave the default neighborhood for a new place in the city, but most importantly they know where their new hood is in relation to their old haunt. All neighbourhoods are now arranged in a ring.
  • Peers in each neighborhood try to keep a connection to at least one peer in the hood on either side of them. Regardless of what they decide to do with new data that's announced, they always pass data through from the receiving side to the neighbourhood on the other side. This means all peers propagate all content, but in a ring form instead of a single-degree connection fashion.
  • to avoid hotspots (eg: someone changing a ruleset to a new neighborhood with one tenant & having the whole network flow all data through them as a result), peers would need to keep a balance of connections that are inside & outside their neighborhood, classifying all non-local-neighbourhood peers as either on the left or right of them
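One way the sum-of-CRDT idea could work: hash each interest statement and XOR the hashes together. XOR is commutative and associative, so the sum ignores the order interests were expressed in, and since the set is monotonic and deduplicated, double-counting doesn't arise. Purely a sketch, not a spec:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// interestSum folds a set of interest statements into one figure by
// XORing a 64-bit digest of each statement. Order-independent: any
// permutation of the same statements yields the same sum, and any
// difference in statements yields (with overwhelming probability) a
// different sum.
func interestSum(statements []string) uint64 {
	var sum uint64
	for _, s := range statements {
		h := sha256.Sum256([]byte(s))
		sum ^= binary.BigEndian.Uint64(h[:8])
	}
	return sum
}

// xorDistance is the Kademlia-style distance between two peers'
// interest sums: 0 means identical interests (same neighborhood),
// and nearby sums could map to nearby spots on the "city map".
func xorDistance(a, b uint64) uint64 { return a ^ b }

func main() {
	alice := interestSum([]string{"keep:thread-a", "drop:thread-b"})
	bob := interestSum([]string{"drop:thread-b", "keep:thread-a"}) // same interests, different order
	fmt.Println(xorDistance(alice, bob)) // 0: same neighborhood
}
```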

Anyway, that's my first crack at expressiveness-with-equivalency in terms of data interest. I have no idea if this will make sense 😄

@nehbit

nehbit commented Mar 6, 2019

That's interesting! An alternative implementation that might be useful: bloom filters. In this case, the whole network is one neighbourhood that is infinitely subdivisible, and for the data you want, you merge the fingerprints of the data (in Aether's case, communities) into a single bloom filter. Then you can be part of any neighbourhood whose bloom filter includes yours. A fingerprint being in the bloom filter means that peer will likely have at least the data you want.

The good thing is that these neighbourhoods don't need to exist as 'entities'; they are, rather, emergent properties that occur when a local node picks its peers based on whether the remote node's bloom filter includes the local node's.

There are some practical concerns with that, especially that bloom filters can only answer 'might have it', never 'definitely has it'. They also have a finite capacity before their error rate exceeds the expected bounds. But in the worst case, a bloom filter would assume everybody is in its neighbourhood (a false positive rate of 100%), which is exactly the current behaviour of Aether, so any implementation of a bloom filter as a neighbourhood selector would improve the current state. Users can also choose the size of their bloom filters: a larger filter allows a more granular choice of neighbourhoods (i.e. less unwanted data fetched and then discarded) and a lower false positive rate, in exchange for a larger filter that needs to be transmitted. Bloom filters also improve node privacy: one of the major concerns with choosing a subset of data to take in and broadcast is that it publicly declares the communities you are tracking, which bloom filters avoid.
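A minimal sketch of bloom-filter neighbourhood selection; the filter size, probe count, and hash choice here are illustrative, not Aether's actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a tiny fixed-size bloom filter: 256 bits, 3 probes per
// item. Real deployments would size this to the expected item count
// and target false-positive rate.
type Bloom struct{ bits [256]bool }

// positions derives three probe positions from one 64-bit FNV hash.
func positions(item string) [3]int {
	h := fnv.New64a()
	h.Write([]byte(item))
	v := h.Sum64()
	return [3]int{int(v % 256), int((v >> 16) % 256), int((v >> 32) % 256)}
}

// Add records an item's fingerprint in the filter.
func (b *Bloom) Add(item string) {
	for _, p := range positions(item) {
		b.bits[p] = true
	}
}

// MightHave can return false positives but never false negatives.
func (b *Bloom) MightHave(item string) bool {
	for _, p := range positions(item) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

// goodNeighbour: a remote peer is a candidate neighbour if its
// advertised filter might have every fingerprint the local node
// wants — i.e. the remote's filter "includes" the local node's.
func goodNeighbour(remote *Bloom, wanted []string) bool {
	for _, w := range wanted {
		if !remote.MightHave(w) {
			return false
		}
	}
	return true
}

func main() {
	var remote Bloom
	remote.Add("community:a")
	remote.Add("community:b")
	fmt.Println(goodNeighbour(&remote, []string{"community:a"})) // true
}
```

The worst-case behaviour mentioned above falls out naturally: a saturated filter answers "might have it" for everything, so every peer looks like a neighbour, which is exactly connect-to-everybody.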

At this point in Aether, I think there aren't enough peers (it briefly reached about 500 concurrent a week ago) to warrant this kind of logic for choosing between peers; it is still more efficient to just connect to everybody and discard the data you don't actually want. But as it grows beyond that, this might be a good way to find peers that are similar to you, without needing to know exactly how similar, beyond the fact that they are likely tracking the data you want to track as well.
