
Partial handshake support: Request queue for nodes that haven't completed the handshake yet #77

Open
gavin-norman-sociomantic opened this issue Dec 8, 2017 · 19 comments


@gavin-norman-sociomantic

(This issue came from investigations of starting an app after a partial handshake with the legacy client, but I think the same principle will apply to the neo client as well.)

When a DHT client application starts up, it's possible that not all nodes are accessible. In that case, the handshake with those nodes will not complete.

Then, when the client tries to send requests, some of them will fail because the client cannot identify the responsible node (the handshake has not completed for all nodes, so the client does not know which node is responsible for certain hashes).

Currently, in both the legacy and neo clients, this is simply an error. This is, however, not very helpful from the application's point of view. Ideally, there should be support in the client for queueing up such requests for assignment once the responsible node is known.

(It may be possible to write something that can be used by both the legacy and neo clients.)

@gavin-norman-sociomantic

v13.2.0 needs to be released soon, in order to release v14.0.0 (compatible with the pending major releases of ocean and swarm). This issue can be delayed until the next minor release.

@gavin-norman-sociomantic gavin-norman-sociomantic changed the title Request queue for nodes that haven't completed the handshake yet Partial handshake support: Request queue for nodes that haven't completed the handshake yet Mar 15, 2018
@gavin-norman-sociomantic

v13.3.0 needs to be released and this new feature is not urgent. Moved to v13.4.0.

@gavin-norman-sociomantic

This is pretty difficult. Here's what would be required of a data structure suitable for addressing this:

  • Must be able to overflow from memory onto disk.
  • Must retain records in order.
  • Must be iterable (i.e. a simple queue with just push and pop is not enough).
  • Must be able to remove elements in the middle. (This would be necessary so that, once a node reconnects, the queued records just for that node can be popped and sent on to the DHT.)

Objections:

  1. I don't think we have a data structure that fulfils those criteria.
  2. It feels like we're attempting to duplicate much of the functionality of the DMQ inside applications.
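
For concreteness, here's roughly the interface such a structure would need. This is just a sketch with invented names; nothing like it exists in ocean or swarm today:

```d
/// Hypothetical interface for the orphaned-record store described above.
interface OrphanStore
{
    /// Appends a record destined for a node that hasn't completed the
    /// handshake yet. (A real implementation would spill to disk once a
    /// memory limit is hit.)
    void push ( ulong key, in void[] record );

    /// Iterates over the queued records, in order, without consuming them.
    int opApply ( scope int delegate ( ulong key, in void[] record ) dg );

    /// Removes all records whose key falls in the given (inclusive) hash
    /// range and passes them, in order, to `sink`, e.g. once the responsible
    /// node has completed the handshake.
    void popRange ( ulong min, ulong max,
        scope void delegate ( in void[] record ) sink );
}
```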

@gavin-norman-sociomantic

Thinking about this the other way around:

  • The problem is that when an app like this starts up, it blindly starts consuming all records from its input DMQ channel, then ends up not being able to handle half of them.
  • So what if the app had a way to only start consuming records destined for DHT nodes that have already been handshaked?
  • We'd have to add a feature to DMQ Consume to allow subscription to a subsection of a channel, based on a hash range.

Of course, this idea is also very complicated, but feels like it might be going in the right direction, rather than each application duplicating loads of DMQ functionality internally.
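
For illustration, this is the effect such a restricted Consume would have, written as a client-side filter (hypothetical names; the real feature would have to filter on the DMQ node side, so unwanted records never reach the consumer):

```d
/// Tracks the (inclusive) hash ranges of the DHT nodes that have completed
/// the handshake so far.
struct HandshakedRanges
{
    ulong[2][] ranges; // [min, max] pairs, one per handshaked node

    /// True if a record with this key can already be assigned to a node,
    /// i.e. the consumer may safely process it now.
    bool canHandle ( ulong key )
    {
        foreach (range; this.ranges)
            if (key >= range[0] && key <= range[1])
                return true;
        return false;
    }
}
```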

@gavin-norman-sociomantic

See sociomantic-tsunami/dmqproto#68.

@nemanja-boric-sociomantic

The main complication here is that the DMQ writer needs its logic updated: with this, the producer, and not the consumer, would have to know how to deduce the key from the record's content.

@gavin-norman-sociomantic

Yeah, we'd need to add a key/value Push request. I don't think that's a huge problem, though.

@nemanja-boric-sociomantic

Yes, I don't think that's a problem from the DMQ's side, but from the writer's side. Imagine an application that receives some free text (say, something that looks like an Apache log line) and pushes it to the DMQ for the subscriber to consume. Right now, the consumer is the one that figures out the key from the text content, and the writer is just responsible for multiplexing the record to consumers.

However, this logic would have to be moved from the consumer to the producer, which may introduce accidental complexity on the producer's side.
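
For example, the producer would end up carrying something like this (the log format and the parsing rule are made up for illustration):

```d
import std.string : indexOf;

/// Extracts the field the consumer currently derives the key from: here,
/// naively, the text between the first pair of double quotes of an
/// apache-style log line. Every producer would need to replicate whatever
/// rule its consumers apply today.
const(char)[] deriveKey ( const(char)[] logline )
{
    auto start = logline.indexOf('"');
    if (start < 0)
        return logline;
    auto rest = logline[start + 1 .. $];
    auto end = rest.indexOf('"');
    return end < 0 ? rest : rest[0 .. end];
}
```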

@gavin-norman-sociomantic

I see what you mean. I think there are probably some very simple cases (e.g. the id of a product), but there may also be cases that are not so simple.

@nemanja-boric-sociomantic

nemanja-boric-sociomantic commented Jun 25, 2018

@gautam-kotian-sociomantic's idea is to write the records back into the DMQ node and use the DMQ node as the overflow mechanism. I would also suggest making application-specific overflow channels, as they should be there just to support the application.

This also requires sociomantic-tsunami/dmqproto#68, as we want to pop only records for the DHT nodes that are back online. However, as this is a consumer-specific channel, the consumer already knows the key of the record.

@gavin-norman-sociomantic

Yeah, pushing orphaned records back to the DMQ would be a reasonable (and much simpler) solution. There are several possible levels, with ascending complexity and performance:

  1. Push to the DMQ channel that you consumed from. The records will naturally be processed again, and maybe the DHT node's hash range is known then.
  2. Push to a separate channel. Consume from it when a DHT node handshake succeeds.
  3. Push to a separate channel and use key-based channels (sociomantic-tsunami/dmqproto#68).
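
For 2, the application side might look roughly like this (the types, methods and channel name are invented stand-ins, not the real swarm client API):

```d
/// Stand-in for the DMQ client (not the real API).
interface Dmq
{
    void push ( const(char)[] channel, in void[] record );
    void consume ( const(char)[] channel,
        scope void delegate ( in void[] record ) process );
}

/// A record's responsible DHT node is not yet handshaked: park it in a
/// dedicated overflow channel instead of failing.
void parkOrphan ( Dmq dmq, in void[] record )
{
    dmq.push("myapp_orphans", record);
}

/// A DHT node completed its handshake: drain the overflow channel and retry
/// each record. Records for still-unknown nodes get parked again.
void onNodeHandshaked ( Dmq dmq, scope void delegate ( in void[] ) retry )
{
    dmq.consume("myapp_orphans", retry);
}
```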

@gautam-kotian-sociomantic
  1. Push to the DMQ channel that you consumed from. The records will naturally be processed again, and maybe the DHT node's hash range is known then.

For channels that are mostly busy, this may work well, but if the writer only occasionally writes to the channel, we may end up with a tight loop where an application constantly pops a record that it pushed to the DMQ only moments earlier. We'd need some special handling to prevent this kind of situation.

So at the moment, I like No. 2 the most.

@nemanja-boric-sociomantic

I think 2 is not good, given the problem when you have a large number of DHT nodes down and a single node that's recovered: you can't fetch the data for that particular DHT node from the channel without circularly reprocessing the data (unless we do what the legacy overflow did and create a channel per node).

@gavin-norman-sociomantic

I think 2 is not good, given the problem when you have a large number of DHT nodes down and a single node that's recovered: you can't fetch the data for that particular DHT node from the channel without circularly reprocessing the data

Circular reprocessing is assumed in 2, yes. Still, it's better than not being able to support this at all, right?

(unless we do what the legacy overflow did and create a channel per node).

This isn't possible with partial handshakes, as we don't know which node orphaned records should be handled by :(

@nemanja-boric-sociomantic

Circular reprocessing is assumed in 2, yes. Still, it's better than not being able to support this at all, right?

If we can make it work, then yes. I'm sceptical about that, though. The problem is that you always need to keep popping and popping, and pushing back and pushing back (unless we wait for all nodes to be available). Or we could push some kind of sentinel item to prevent going in circles. I haven't thought this through yet, but my gut feeling tells me I should be very suspicious here.
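
The sentinel variant could look something like this, purely illustrative, with an in-memory list standing in for the DMQ channel:

```d
import std.container.dlist : DList;

/// Marker pushed at the start of each reprocessing round. (A real
/// implementation would need a marker that can't collide with real records.)
enum roundMarker = "__round_marker__";

/// Pops until the marker comes back, so each round touches every queued
/// record exactly once instead of spinning forever on its own re-pushes.
void reprocessOnce ( ref DList!string channel,
    scope bool delegate ( string record ) tryProcess )
{
    channel.insertBack(roundMarker);
    while (!channel.empty)
    {
        auto record = channel.front;
        channel.removeFront();
        if (record == roundMarker)
            break;                        // end of this round
        if (!tryProcess(record))
            channel.insertBack(record);   // still orphaned: park it again
    }
}
```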

@gavin-norman-sociomantic

Agreed. On the whole, it's not an ideal solution.

@gavin-norman-sociomantic

Or... a kind of dumb solution that wouldn't be too hard to implement:

  • When the DHT node for a record is not known, push the record to a queue.
  • When a node completes the handshake, pop all records from the queue and process them again.
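
A minimal sketch of that, with invented names:

```d
/// Holds records whose responsible node is unknown; replays them wholesale
/// whenever a node finishes its handshake.
struct OrphanBuffer
{
    private const(void)[][] pending;

    /// Record could not be assigned to a node: hold onto it.
    void park ( in void[] record )
    {
        this.pending ~= record.dup;
    }

    /// A node completed its handshake: hand every held record back to the
    /// normal processing path. Records that still can't be assigned will
    /// simply be parked again.
    void replay ( scope void delegate ( in void[] record ) process )
    {
        auto batch = this.pending;
        this.pending = null;
        foreach (record; batch)
            process(record);
    }
}
```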

@gavin-norman-sociomantic

Thinking about this again. The most common situation is that a single DHT server is inaccessible when an app starts up. This may mean multiple DHT nodes are inaccessible, but they'd presumably all become accessible again at roughly the same time.

So I wonder if we should make a solution that works well for this most common case and not horribly for other cases. The proposal above covers this, I think.

@gavin-norman-sociomantic

gavin-norman-sociomantic commented Jan 8, 2019

I just had a sneaky thought:

  1. We're talking about partial handshakes and how to handle them.
  2. There's no reason an app would proceed to its main logic if no DHT nodes were connected at all.
  3. As long as at least one node is connected, we have a metric (that node's hash range) to guess the hash ranges of other nodes. (Though of course not which node is responsible for which hash range. But that doesn't matter for this purpose.)
  4. It could be possible to set up multiple queues of records based on guessed hash ranges.
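
A rough sketch of 4, under two assumptions of my own: that all nodes cover equal-width, contiguous ranges (as per 3), and that hashes live in a 32-bit space:

```d
import std.algorithm.comparison : min;

/// Queues orphaned records in buckets whose boundaries are guessed from the
/// hash range of the one node that has completed its handshake.
struct GuessedBuckets
{
    private const(void)[][][] queues;
    private ulong bucket_width;

    /// Guesses the bucket layout from the handshaked node's inclusive range.
    this ( uint known_min, uint known_max )
    {
        this.bucket_width = cast(ulong) known_max - known_min + 1;
        this.queues.length = cast(size_t)((uint.max + 1UL) / this.bucket_width);
    }

    private size_t bucketOf ( uint hash )
    {
        return min(cast(size_t)(hash / this.bucket_width),
            this.queues.length - 1);
    }

    /// Queues a record whose responsible node is not yet known.
    void park ( uint hash, in void[] record )
    {
        this.queues[this.bucketOf(hash)] ~= record.dup;
    }

    /// A node has handshaked and reported its real range: drains every
    /// bucket overlapping it, so those records can finally be assigned.
    const(void)[][] drain ( uint range_min, uint range_max )
    {
        const(void)[][] ready;
        foreach (b; this.bucketOf(range_min) .. this.bucketOf(range_max) + 1)
        {
            ready ~= this.queues[b];
            this.queues[b] = null;
        }
        return ready;
    }
}
```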

@gavin-norman-sociomantic gavin-norman-sociomantic removed this from the v14.6.0 milestone Apr 23, 2019