---
I fully understand the above considerations. We had exactly the same ones when we designed our infrastructure at Ver.iD. We could either go for a full-blown Kafka setup with strong consistency guarantees, or use Redis, which was a lot easier to implement and maintain, and extremely fast. In the end our solution was to use Redis in high-availability mode, i.e. one Redis primary node and one or two Redis standby nodes.

In normal operation this is an ideal setup, since maintenance can be done on the Redis standby nodes while keeping the service live. When the failed node is the Redis primary, due to updates or actual failure, the standby node with the most up-to-date replication log is promoted to primary and immediately starts serving clients. A new replacement node is automatically scheduled and becomes the new standby node, and the old Redis primary is cycled. The same process applies whenever the failed node is a standby node, except that in that case the service is not impacted at all.

Unfortunately this setup does NOT guarantee strong consistency, but it provides a simple way to do maintenance on Redis instances. When the maintenance is done correctly, you experience almost no downtime and almost no loss of database entries. To improve the odds that the latest replication log is available on the standby nodes, we execute our Redis commands in combination with the WAIT command (see the sketch below). Example: with 3 Redis nodes in total, the quorum would be 2 nodes, i.e. the primary plus at least one acknowledging standby. In our opinion, the risk of loss is minimized and acceptable for our purposes. I would expect a similar trade-off can be made here. Another plus is that you would not need to create a different implementation.
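A minimal sketch of the WAIT pattern described above, assuming go-redis v9 and a three-node setup (one primary, two standbys); the key name, replica count and timeout are illustrative, not taken from the actual Ver.iD setup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// In an HA setup you would connect to the current primary,
	// e.g. via Redis Sentinel; a plain address is used here for brevity.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Write the value on the primary.
	if err := rdb.Set(ctx, "session:example", "state", 5*time.Minute).Err(); err != nil {
		panic(err)
	}

	// WAIT blocks until at least 1 standby has acknowledged the write (or
	// the timeout expires), so the data then exists on 2 of the 3 nodes.
	// This improves durability across failover but is not strong consistency.
	acked, err := rdb.Wait(ctx, 1, 200*time.Millisecond).Result()
	if err != nil {
		panic(err)
	}
	fmt.Printf("write acknowledged by %d standby node(s)\n", acked)
}
```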
---
I first tried the fourth option, implementing a fallback standalone Redis approach. This however was a bit tricky, because there were all kinds of corner cases I had to find a solution for. For example, the keyshare server really needs strong consistency, and that was difficult given the auto-snapshot functionality of standalone Redis. Of course, that's something you can disable and discourage (see the sketch below), but in my opinion it was still a bit tricky; it could have become a recipe for disaster. The keyshare server state that needs strong consistency is only relevant for a few seconds, so in my view it's easier for now to go with sticky sessions there. That makes the problem a bit easier. The implementation can be found in #354.
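For reference, a hedged sketch of disabling the auto-snapshot behaviour mentioned above at runtime, assuming go-redis v9; the same effect can be achieved statically with `save ""` in `redis.conf`. Note that this only prevents stale snapshots from being restored after a restart, it does not provide strong consistency:

```go
package redisutil

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// disableSnapshots turns off automatic RDB snapshotting on a standalone
// Redis instance, so a restarted instance cannot come back with stale
// session state from an old snapshot. Illustrative helper, not from #354.
func disableSnapshots(ctx context.Context, rdb *redis.Client) error {
	// CONFIG SET save "" is the runtime equivalent of `save ""` in redis.conf.
	return rdb.ConfigSet(ctx, "save", "").Err()
}
```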
---
Currently, only the `irma server` has a stateless implementation using Redis. We'd like to add stateless implementations for the keyshare server and the myirmaserver. A downside of the current `irma server` implementation is that the solution does not support high availability: it relies on a standalone Redis instance. If that instance is in maintenance, experiences downtime, or a network partition arises between the Redis instance and the application, then we experience downtime.

We have the following options:
### Use standalone Redis
The simple option is to just accept the same risk for the keyshare server and the myirmaserver as we do for the `irma server` concerning high availability. This means that we cannot withstand Redis downtime or a severe network partition. The advantage is that the solution is quite straightforward given the Redis implementation we already have (a sketch follows below).

Impact: straightforward to implement; only a standalone Redis is needed in operations.
Operational costs: lowest of the 4 options.
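A minimal sketch of what this option amounts to, assuming go-redis v9; the key prefix, token and TTL are hypothetical and only illustrate the pattern of keeping all session state in Redis so that any application replica can serve any request:

```go
package redisstate

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// saveState stores serialized session state under a token with a short TTL,
// so state expires by itself. The single Redis instance remains the one
// point of failure this option accepts.
func saveState(ctx context.Context, rdb *redis.Client, token string, state []byte) error {
	return rdb.Set(ctx, "session:"+token, state, 5*time.Minute).Err()
}

// loadState fetches the session state for a token; any replica of the
// application server can do this, which is what makes the server stateless.
func loadState(ctx context.Context, rdb *redis.Client, token string) ([]byte, error) {
	return rdb.Get(ctx, "session:"+token).Bytes()
}
```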
### Use Redis Cluster (with `etcd` for distributed locking)

Redis has a cluster mode that can be used for high availability. However, Redis Cluster does not guarantee strong consistency. We need this for storing the keyshare commitments and for (distributed) locking, so we cannot use Redis Cluster out of the box.
This means that we would have to add `etcd` for the distributed locking and for checking whether keyshare commitments and IRMA server nonces are being consumed (a locking sketch follows below). This introduces more complexity, both in code and in operations.

Impact: complex to implement, complex ecosystem in operations.
Operational costs: highest of the 4 options (both Redis and etcd have to be maintained, although with immutable infrastructure that is not very complex).
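A sketch of what the `etcd` locking part could look like, using the `concurrency` package of the official Go client; the endpoint and lock key are hypothetical:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease that is kept alive; if this process dies,
	// the lease expires and the lock is released automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// Hypothetical key guarding the one-time consumption of a commitment.
	mu := concurrency.NewMutex(sess, "/locks/keyshare-commitment/example")

	ctx := context.Background()
	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	// ... check in Redis whether the commitment was already consumed, and
	// mark it as consumed, knowing no other node can interleave here ...
	if err := mu.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
}
```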
### Use `etcd`

`etcd` itself also has a key-value store, so we could use `etcd` for both storage and distributed locking. In this way we don't need Redis, which makes the deployment a bit easier and less complex.

A downside is that `etcd` uses write-ahead logging to keep track of all changes. Write-ahead logs are stored for a long time so that lagging nodes can easily recover. This means that if we use `etcd` to store the IRMA server state, personal data will be stored far longer than necessary (illustrated in the sketch below). That's a major issue.

We made a PoC implementation of this solution.
Impact: somewhat more work to implement than Redis, but not more complex. The impact in terms of personal data processing is high.
Operational costs: medium (etcd is a bit more complex than Redis).
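To illustrate the data-retention problem, a sketch of storing session state in `etcd` with a lease-based TTL; the key, value and TTL are hypothetical. Even though the key becomes unreadable when the lease expires, the write has already been appended to etcd's write-ahead log, where (per the retention behaviour described above) it outlives the key itself:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Attach the value to a 300-second lease so it expires like a Redis TTL.
	lease, err := cli.Grant(ctx, 300)
	if err != nil {
		log.Fatal(err)
	}

	// The key disappears from reads after the lease expires, but the write
	// itself persists in the write-ahead log long after that.
	if _, err := cli.Put(ctx, "session/example", "personal data",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
}
```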
### Use standalone Redis, with a fallback standalone Redis instance

This is a slightly improved variant of the first solution. When there is a fallback standalone Redis instance, we can fall back to that instance if the main instance is unavailable. Every session is handled either by the main Redis instance or by the fallback instance, so there is no inconsistency risk. When a Redis instance fails there is user impact: all sessions that were handled by that instance are lost. However, the user can immediately start a new session using the fallback Redis instance. In this way we reduce the impact of the downtime (a sketch follows below).

Impact: a bit more complex than standalone Redis.
Operational costs: same as the first option, but with two instances instead of one.
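A minimal sketch of the fallback idea, assuming go-redis v9; the function and key names are hypothetical. The important property is that a session is bound to exactly one instance for its whole lifetime, which is why no inconsistency can arise:

```go
package redisstate

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// startSession tries to create the session on the main instance and falls
// back to the second instance if the main one is unreachable. It returns
// the client the session is bound to; all later reads and writes for this
// session must go to that same instance.
func startSession(ctx context.Context, main, fallback *redis.Client,
	token string, state []byte) (*redis.Client, error) {
	if err := main.Set(ctx, "session:"+token, state, 5*time.Minute).Err(); err == nil {
		return main, nil
	}
	// Main instance unavailable: sessions it held are lost, but new
	// sessions can start immediately on the fallback instance.
	if err := fallback.Set(ctx, "session:"+token, state, 5*time.Minute).Err(); err != nil {
		return nil, err
	}
	return fallback, nil
}
```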
### Other

A promising new development is RedisRaft, which implements strong consistency on top of Redis. However, it is still a proof of concept at the moment and cannot be used in production.