IP allocation design
See the requirements.
Basic idea: We divide the IP address space between all nodes, and adjust that division dynamically in order to ensure a supply of IPs to every node. This allows nodes to allocate and free individual IPs locally in a very efficient manner, generally without synchronous interaction with other nodes. It also provides some inherent partition tolerance. We use a CRDT to represent shared knowledge about the space.
The allocator running at each node has two interfaces: a command interface which the container infrastructure (e.g. the Weave script, or a Docker plugin) uses to request IP addresses, and a messaging interface which it uses to exchange information with other nodes on the network.
- Allocations. We use the word 'allocation' to refer to a specific IP address being assigned, e.g. so it can be given to a container.
- Reservations. Most of the time, instead of dealing with individual IP addresses, we operate on them in groups, so-called "reservations".
- Subnet. All addresses for one instance of the allocator come from a single subnet, e.g. 10.0.0.0/8.
- CRDT. All reservations in the system are recorded in a map from reservation start address to node ID. This map is a CRDT, referred to as 'the ring'. Nodes only ever make changes to reservations that they own (except under administrator command - see later). This makes the data structure inherently convergent.
- Gossiping. Updates to the CRDT are communicated between nodes through gossiping, which scales well and copes well with changing topologies and partitions.
- Protocol. Nodes request reservations through a mini messaging protocol. The messaging is async, with no delivery or ordering guarantees.
The commands supported by the allocator are:
- Allocate: request one IP address for a container
- Delete records for one container: all IP addresses allocated to that container are freed.
- Claim: request a specific IP address for a container (e.g. because it is already using that address)
- Free: return an IP address that is currently allocated
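These four commands could be sketched as a Go interface; the names and signatures here are illustrative assumptions, not the actual weave API:

```go
package allocator

import "net"

// Allocator sketches the command interface exposed to the container
// infrastructure. Names and signatures are illustrative assumptions.
type Allocator interface {
	// Allocate returns one free IP address for the container.
	Allocate(containerID string) (net.IP, error)
	// Claim requests a specific address for the container, e.g.
	// because it is already using that address.
	Claim(containerID string, addr net.IP) error
	// Free returns a currently allocated address to the pool.
	Free(containerID string, addr net.IP) error
	// Delete frees all IP addresses allocated to the container.
	Delete(containerID string) error
}
```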
- The ring is a CRDT made up of a mapping from token -> {host, version, tombstone flag}; see the sketch after this list
- Host identifier must be unique and survive across restarts.
- A token is an IP address, indicating the start of a reservation.
- The mapping is sorted by token
- A reservation is defined as the range from one token to the next non-tombstone token; this range can wrap around the end of the subnet.
- A host owns the reservations indicated by the tokens it owns.
- A token can only be inserted by the host owning the range it is inserted into
- Entries in the map can only be updated by the owning host, and when this is done the version is incremented
- The map is always gossiped in its entirety
- The merge operation when a host receives a map is:
- Disjoint tokens are just copied into the combined map
- For entries with the same token, pick the highest version number
- (this also works correctly in the presence of tombstones - the host replacing a tombstone must increment the version, just as if it were changing the host)
- If a token maps to a tombstone, this indicates that the owning host has left the network.
- For the purpose of ranges, tombstones are ignored, i.e. ranges extend past tombstones.
- Tombstones are only inserted by an administrative action (see below)
- If a node knows it is leaving and never coming back (as the same node), it can insert its own tombstones.
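To make these rules concrete, here is a minimal Go sketch of the ring; all names are illustrative assumptions, tokens are plain uint32s rather than IP addresses, and error handling is omitted. It shows the entry structure, the highest-version-wins merge, and the reservation lookup that ignores tombstones:

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is what each token on the ring maps to.
type Entry struct {
	Host      string // unique host identifier, stable across restarts
	Version   int    // incremented on every change by the owning host
	Tombstone bool   // set when the owning host has left the network
}

// Ring maps each token (the start address of a reservation) to its
// entry. Tokens are uint32s here for brevity; really they are IPs.
type Ring map[uint32]Entry

// Merge folds a received ring into ours: disjoint tokens are copied,
// and for shared tokens the higher version wins. Tombstones need no
// special casing, since replacing one also increments the version.
func (r Ring) Merge(other Ring) {
	for token, theirs := range other {
		if ours, ok := r[token]; !ok || theirs.Version > ours.Version {
			r[token] = theirs
		}
	}
}

// Owner finds the host owning the reservation that contains addr:
// the nearest non-tombstone token at or below addr, wrapping to the
// highest token if addr precedes them all. Tombstoned tokens are
// skipped, so ranges extend past them.
func (r Ring) Owner(addr uint32) (string, bool) {
	var tokens []uint32
	for t, e := range r {
		if !e.Tombstone {
			tokens = append(tokens, t)
		}
	}
	if len(tokens) == 0 {
		return "", false
	}
	sort.Slice(tokens, func(i, j int) bool { return tokens[i] < tokens[j] })
	owner := tokens[len(tokens)-1] // wrap-around default
	for _, t := range tokens {
		if t <= addr {
			owner = t
		}
	}
	return r[owner].Host, true
}

func main() {
	a := Ring{0: {Host: "host-a", Version: 1}}
	b := Ring{0: {Host: "host-a", Version: 2}, 128: {Host: "host-b", Version: 1}}
	a.Merge(b)
	fmt.Println(a.Owner(200)) // host-b true
}
```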
The allocator can allocate freely to containers on its own machine from ranges it owns
- This data does not need to be persisted, since the allocator is assumed to share a failure domain with its containers: if one is lost, so is the other.
If the allocator runs out of space (all owned ranges are full), it will ask another host for space
- we pick a host to ask at random, weighted by the amount of space owned by each host (this requires that the ring gossip includes the amount of free space); see the sketch after this list
- if the target host decides to give up space, it unicasts a message back to the asker with the newly-updated ring.
- we will continue to ask for space until we receive some via the gossip mechanism, or our copy of the ring tells us all nodes are full.
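The weighted random choice might look like the following Go sketch; the function name and the free-space map are assumptions, and a real implementation would also track which peers it has already asked:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// pickHostToAsk picks a peer at random, weighted by the free space
// each peer advertises in its ring gossip. `free` maps host ID to
// free address count and is assumed not to include ourselves.
func pickHostToAsk(free map[string]uint32) (string, error) {
	var total uint64
	for _, f := range free {
		total += uint64(f)
	}
	if total == 0 {
		return "", errors.New("all peers are full")
	}
	target := rand.Uint64() % total // sketch: ignores modulo bias
	var cum uint64
	for host, f := range free {
		cum += uint64(f)
		if target < cum {
			return host, nil
		}
	}
	panic("unreachable: target < total and cum reaches total")
}

func main() {
	free := map[string]uint32{"host-a": 100, "host-b": 300}
	host, _ := pickHostToAsk(free) // host-b roughly 75% of the time
	fmt.Println(host)
}
```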
When hosts are asked for space, there are four scenarios (sketched in code after this list):
- We have an empty range; we can change the host associated with the token at the beginning of the range, increment the version and gossip that
- We have a range which can be subdivided by a single token to form a free range. We insert said token, mapping to the host requesting space, and gossip that.
- We have a 'hole' in the middle of a range; an empty range can be created by inserting two tokens, one at the beginning of the hole mapping to the host requesting the space, and one at the end of the hole mapping to us.
- We have no space.
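The first three scenarios amount to small edits of the ring. Here is a Go sketch reusing the Entry/Ring shapes from the earlier sketch, with function names and version handling as illustrative assumptions:

```go
package main

import "fmt"

type Entry struct {
	Host      string
	Version   int
	Tombstone bool
}

type Ring map[uint32]Entry

// Scenario one: the range starting at token is entirely free, so we
// donate it by re-pointing the existing entry at the requester and
// incrementing the version (only the current owner may do this).
func (r Ring) giveWholeRange(token uint32, requester string) {
	e := r[token]
	e.Host = requester
	e.Version++
	r[token] = e
}

// Scenario two: the tail of our range, from mid onwards, is free.
// One new token at mid, owned by the requester, donates that tail
// while we keep everything before mid.
func (r Ring) splitRangeAt(mid uint32, requester string) {
	r[mid] = Entry{Host: requester}
}

// Scenario three: a free hole [holeStart, holeEnd) in the middle of
// our range takes two tokens: one at holeStart for the requester,
// and one at holeEnd mapping back to us so we keep the remainder.
func (r Ring) donateHole(holeStart, holeEnd uint32, us, requester string) {
	r[holeStart] = Entry{Host: requester}
	r[holeEnd] = Entry{Host: us}
}

func main() {
	r := Ring{0: {Host: "host-a", Version: 1}}
	r.splitRangeAt(128, "host-b") // host-b now owns from 128 onwards
	fmt.Println(r)
}
```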
NB in this scheme there are only two messages hosts can exchange:
- This is the state of the ring (the gossip)
- Please may I have some space?
If we need claiming, we can extend the space request message to optionally include an IP address.
- This extension requires a response message for the case where the request cannot be satisfied.
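Sketched as Go structs, with the type names and wire format as assumptions:

```go
package protocol

// RingState stands in for however the ring is serialised on the wire.
type RingState []byte

// GossipMsg: "this is the state of the ring".
type GossipMsg struct {
	Ring RingState
}

// SpaceRequestMsg: "please may I have some space?". Under the claim
// extension, Claim optionally names the specific wanted address;
// nil means a plain space request.
type SpaceRequestMsg struct {
	Claim *uint32
}
```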
Alternatively, we can keep claim in the API, and forgo it in the protocol.
- The implementation would therefore be: a claim can only succeed if we already own a range containing the IP; otherwise the claim fails.
- This is the minimum requirement for restarting the router while containers are running.
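A minimal sketch of that rule, with the ring lookup passed in as a function (corresponding to the Owner helper sketched earlier); `self` is our own host ID:

```go
package main

import "fmt"

// claimLocal succeeds only if addr already falls in a range we own.
// ownerOf is the ring lookup (see the Owner helper sketched earlier).
func claimLocal(self string, addr uint32, ownerOf func(uint32) (string, bool)) error {
	owner, ok := ownerOf(addr)
	if !ok || owner != self {
		return fmt.Errorf("cannot claim %d: not in a range owned by %s", addr, self)
	}
	return nil
}

func main() {
	ownerOf := func(uint32) (string, bool) { return "host-a", true }
	fmt.Println(claimLocal("host-a", 42, ownerOf)) // <nil>: claim succeeds
}
```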
Nodes are told the subnet - the IP range from which all allocations are made - when starting up. Each node must be given the same range.
We deal with concurrent start-up through a process of leader election. In essence, the node with the highest ID claims the entire range for itself, and then other nodes can begin to request sub-spaces from it.
When a node starts up, it initialises the CRDT to an empty ring. It then waits for ring updates and/or commands. Until a node receives a command to allocate or claim an IP address, it takes no action on these updates, but it does record them.
When a node first receives such a command, it consults its map. If any other node claims to have reservations, it proceeds with normal allocation [see above]. Otherwise (i.e. it has received no updates, or only updates in which no node holds reservations), the node starts a leader election: if it has the highest ID, it claims the entire IP range for itself, inserting one token at the beginning of the ring.
If it sees that another node has the highest ID, it sends a message to that node proposing that it be leader. A node receiving such a message proceeds as above: only if it sees no other nodes with reservations and it sees itself as the node with the highest ID does it take charge.
Note that the chosen leader can be a node other than the one with the overall highest ID. This happens when a node with a higher ID starts up just as another node has decided to become leader. That is ok: the new node will not itself claim leadership until enough time has passed for it to receive an update from the existing leader, which preempts its own claim.
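The take-charge rule condenses to a few lines of Go; this is a sketch under the assumption that `peers` lists every peer we currently know about, including ourselves:

```go
package main

import "fmt"

// shouldTakeCharge: with a ring that holds no reservations, a node
// claims the whole subnet only if it has the highest ID it knows of;
// otherwise it messages the higher-ID peer, proposing that it lead.
func shouldTakeCharge(self string, peers []string, ringHasReservations bool) bool {
	if ringHasReservations {
		return false // someone already owns space: allocate normally
	}
	for _, p := range peers {
		if p > self {
			return false // propose leadership to the higher peer
		}
	}
	return true
}

func main() {
	fmt.Println(shouldTakeCharge("host-b", []string{"host-a", "host-b"}, false)) // true
}
```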
Failures:
- two nodes that haven't communicated with each other yet can each decide to be leader -> this is a fundamental problem: if you don't know about someone else then you cannot make allowances for them. The resulting situation is effectively a network partition, and can be resolved in the same way when the nodes do make contact. To mitigate this issue, nodes could start allocating IPs at a random point to minimise the chance that both allocate the same IP before they get to merge their maps.
- prospective leader dies before sending map -> This failure will be detected by the underlying Weave peer topology. The node requiring space will retry, re-running the leader election across all connected peers.
When a node leaves, it updates all its own tokens to be tombstones, then broadcasts the updated ring. This causes its space to be inherited by the owner of the previous tokens on the ring, for each range.
After sending the message, the node terminates - it does not wait for any response. Overall, very little work is performed by the node on shutdown, so as not to unduly delay it.
Failures:
- message lost -> the space will be unusable by other nodes because it will still be seen as owned.
To cope with the situation where a node has left or died without managing to tell its peers, an administrator may go to any other node and command that it mark the dead node's tokens as tombstones. This information will then be gossiped out to the network.
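Both the voluntary leave and this administrative removal reduce to the same ring operation, sketched below with the types from the first sketch: mark every token owned by the departing host as a tombstone, bumping each version so the change wins merges everywhere.

```go
package main

import "fmt"

type Entry struct {
	Host      string
	Version   int
	Tombstone bool
}

type Ring map[uint32]Entry

// tombstoneHost marks every token owned by host as a tombstone,
// incrementing each version so the change wins the merge on every
// peer. A leaving node calls this on its own ID before its final
// gossip; an administrator runs it on any live node, naming the
// dead node's ID instead.
func (r Ring) tombstoneHost(host string) {
	for token, e := range r {
		if e.Host == host && !e.Tombstone {
			e.Tombstone = true
			e.Version++
			r[token] = e // updating existing keys during range is safe
		}
	}
}

func main() {
	r := Ring{0: {Host: "host-a", Version: 1}, 128: {Host: "host-b", Version: 1}}
	r.tombstoneHost("host-b")
	fmt.Println(r) // host-b's token remains, now flagged as a tombstone
}
```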