---
I fully understand the above considerations. We had exactly the same ones when we designed our infrastructure at Ver.iD. We could either go for a full-blown Kafka setup with strong consistency guarantees, or use Redis, which was a lot easier to implement and maintain, and extremely fast. In the end our solution was to use Redis in high-availability mode, i.e. one Redis primary node and one or two Redis standby nodes.

In normal operation this is an ideal setup, since maintenance can be done on the Redis standby nodes while keeping the service live. When the failed node is the Redis primary, due to updates or actual failure, the standby node with the most up-to-date replication log is promoted to primary and immediately starts serving clients. A new replacement node is automatically scheduled and becomes the new standby node, and the old Redis primary is cycled. The same process applies whenever the failed node is a standby node, except that in that case the service is not impacted at all.

Unfortunately this setup does NOT guarantee strong consistency, but it provides a simple way to do maintenance on Redis instances. When the maintenance is done correctly, you experience almost no downtime and almost no loss of database entries. To improve the odds that the latest replication log is available on the standby nodes, we execute our Redis commands in combination with the WAIT command (see the sketch below). Example: with 3 Redis nodes in total, the quorum would be 2 nodes, i.e. the primary plus at least one acknowledging standby. In our opinion, the risk of loss is minimized and acceptable for our purposes. I would expect a similar trade-off can be made here. Another plus is that you would not need to create a different implementation.
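A minimal sketch of the WAIT pattern described above, assuming go-redis v9 and a three-node setup (one primary, two standbys); the key name, replica count and timeout are illustrative, not taken from the actual Ver.iD setup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// In an HA setup you would connect to the current primary,
	// e.g. via Redis Sentinel; a plain address is used here for brevity.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Write the value on the primary.
	if err := rdb.Set(ctx, "session:example", "state", 5*time.Minute).Err(); err != nil {
		panic(err)
	}

	// WAIT blocks until at least 1 standby has acknowledged the write (or
	// the timeout expires), so the data then exists on 2 of the 3 nodes.
	// This improves durability across failover but is not strong consistency.
	acked, err := rdb.Wait(ctx, 1, 200*time.Millisecond).Result()
	if err != nil {
		panic(err)
	}
	fmt.Printf("write acknowledged by %d standby node(s)\n", acked)
}
```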
---
I first tried the fourth option, implementing a fallback standalone Redis approach. This however was a bit tricky, because there were all kinds of corner cases I had to find a solution for. For example, the keyshare server really needs strong consistency, and that was difficult given the auto-snapshot functionality of standalone Redis. Of course, that's something you can disable and discourage (see the sketch below), but in my opinion it was still a bit tricky; it could have become a recipe for disaster. The keyshare server state that needs strong consistency is only relevant for a few seconds, so in my view it's easier for now to go with sticky sessions there. That makes the problem a bit easier. The implementation can be found in #354.
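For reference, a hedged sketch of disabling the auto-snapshot behaviour mentioned above at runtime, assuming go-redis v9; the same effect can be achieved statically with `save ""` in `redis.conf`. Note that this only prevents stale snapshots from being restored after a restart, it does not provide strong consistency:

```go
package redisutil

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// disableSnapshots turns off automatic RDB snapshotting on a standalone
// Redis instance, so a restarted instance cannot come back with stale
// session state from an old snapshot. Illustrative helper, not from #354.
func disableSnapshots(ctx context.Context, rdb *redis.Client) error {
	// CONFIG SET save "" is the runtime equivalent of `save ""` in redis.conf.
	return rdb.ConfigSet(ctx, "save", "").Err()
}
```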
---
Currently, only the `irma server` has a stateless implementation using Redis. We'd like to add stateless implementations for the keyshare server and the myirmaserver. A downside of the current `irma server` implementation is that the solution does not support high availability: it relies on a standalone Redis instance. If that instance is in maintenance, experiences downtime, or a network partition arises between the Redis instance and the application, then we experience downtime.

We have the following options:
### Use standalone Redis
The simple option is to just accept the same risk for the keyshare server and the myirmaserver as we do for the `irma server` concerning high availability. This means that we cannot withstand Redis downtime or a severe network partition. The advantage is that the solution is quite straightforward given the Redis implementation we already have (a sketch follows below).

Impact: straightforward to implement; only a standalone Redis is needed in operations.
Operational costs: lowest of the 4 options.
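A minimal sketch of what this option amounts to, assuming go-redis v9; the key prefix, token and TTL are hypothetical and only illustrate the pattern of keeping all session state in Redis so that any application replica can serve any request:

```go
package redisstate

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// saveState stores serialized session state under a token with a short TTL,
// so state expires by itself. The single Redis instance remains the one
// point of failure this option accepts.
func saveState(ctx context.Context, rdb *redis.Client, token string, state []byte) error {
	return rdb.Set(ctx, "session:"+token, state, 5*time.Minute).Err()
}

// loadState fetches the session state for a token; any replica of the
// application server can do this, which is what makes the server stateless.
func loadState(ctx context.Context, rdb *redis.Client, token string) ([]byte, error) {
	return rdb.Get(ctx, "session:"+token).Bytes()
}
```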
### Use Redis Cluster (with `etcd` for distributed locking)

Redis has a cluster mode that can be used for high availability. However, Redis Cluster does not guarantee strong consistency. We need this for storing the keyshare commitments and for (distributed) locking, so we cannot use Redis Cluster out of the box.
This means that we would have to add `etcd` for the distributed locking and for checking whether keyshare commitments and IRMA server nonces are being consumed (a locking sketch follows below). This introduces more complexity, both in code and in operations.

Impact: complex to implement, complex ecosystem in operations.
Operational costs: highest of the 4 options (both Redis and etcd have to be maintained, although with immutable infrastructure that is not very complex).
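A sketch of what the `etcd` locking part could look like, using the `concurrency` package of the official Go client; the endpoint and lock key are hypothetical:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease that is kept alive; if this process dies,
	// the lease expires and the lock is released automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// Hypothetical key guarding the one-time consumption of a commitment.
	mu := concurrency.NewMutex(sess, "/locks/keyshare-commitment/example")

	ctx := context.Background()
	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	// ... check in Redis whether the commitment was already consumed, and
	// mark it as consumed, knowing no other node can interleave here ...
	if err := mu.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
}
```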
### Use `etcd`

`etcd` itself also has a key-value store, so we could use `etcd` for both storage and distributed locking. In this way we don't need Redis, which makes the deployment a bit easier and less complex.

A downside is that `etcd` uses write-ahead logging to keep track of all changes. Write-ahead logs are stored for a long time so that lagging nodes can easily recover. This means that if we use `etcd` to store the IRMA server state, personal data will be stored far longer than necessary (illustrated in the sketch below). That's a major issue.

We made a PoC implementation of this solution.
Impact: somewhat more work to implement than Redis, but not more complex. The impact in terms of personal data processing is high.
Operational costs: medium (etcd is a bit more complex than Redis).
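To illustrate the data-retention problem, a sketch of storing session state in `etcd` with a lease-based TTL; the key, value and TTL are hypothetical. Even though the key becomes unreadable when the lease expires, the write has already been appended to etcd's write-ahead log, where (per the retention behaviour described above) it outlives the key itself:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Attach the value to a 300-second lease so it expires like a Redis TTL.
	lease, err := cli.Grant(ctx, 300)
	if err != nil {
		log.Fatal(err)
	}

	// The key disappears from reads after the lease expires, but the write
	// itself persists in the write-ahead log long after that.
	if _, err := cli.Put(ctx, "session/example", "personal data",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
}
```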
### Use standalone Redis, with a fallback standalone Redis instance

This is a slightly improved variant of the first solution. When there is a fallback standalone Redis instance, we can fall back to that instance if the main instance is unavailable. Every session is handled either by the main Redis instance or by the fallback instance, so there is no inconsistency risk. When a Redis instance fails there is user impact: all sessions that were handled by that instance are lost. However, the user can immediately start a new session using the fallback Redis instance. In this way we reduce the impact of the downtime (a sketch follows below).

Impact: a bit more complex than standalone Redis.
Operational costs: same as the first option, but with two instances instead of one.
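A minimal sketch of the fallback idea, assuming go-redis v9; the function and key names are hypothetical. The important property is that a session is bound to exactly one instance for its whole lifetime, which is why no inconsistency can arise:

```go
package redisstate

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// startSession tries to create the session on the main instance and falls
// back to the second instance if the main one is unreachable. It returns
// the client the session is bound to; all later reads and writes for this
// session must go to that same instance.
func startSession(ctx context.Context, main, fallback *redis.Client,
	token string, state []byte) (*redis.Client, error) {
	if err := main.Set(ctx, "session:"+token, state, 5*time.Minute).Err(); err == nil {
		return main, nil
	}
	// Main instance unavailable: sessions it held are lost, but new
	// sessions can start immediately on the fallback instance.
	if err := fallback.Set(ctx, "session:"+token, state, 5*time.Minute).Err(); err != nil {
		return nil, err
	}
	return fallback, nil
}
```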
### Other

A promising new development is RedisRaft, which implements strong consistency on top of Redis. However, it is still a proof of concept at the moment and cannot be used in production.