-
@Surax98, please expand the context a little, possibly with a schema of the current workflows in the main outage scenarios. Then we can discuss the details more carefully; otherwise we risk going in the wrong direction. For example: which information is needed to restore a cache, where it can be recovered from, and what the impact is in terms of work to be done and user/dev experience.
-
Added schemas and references.
-
Alright @Surax98, if I read the schema correctly, I think we should go for scenario 2, where InterLink knows where to look to restore the status of submitted jobs/containers. In my view, the plugin should be as stateless as possible. @spigad, this is what we discussed last time; if you confirm, I think @Surax98 should go into the details of scenario 2 here in the discussion:
-
Ok, recap:
-
I was thinking about how to implement a cache restoring system on InterLink's side, but before asking for opinions, let's quickly walk through how things work right now.
To better follow the workflow, you can find schemas at the following link:
https://excalidraw.com/#json=wwvC3eA1fjRhDBrCD9P3Y,nTrhSA-H_k0doBesYAsxDw
I also added state diagrams for VK/InterLink/Sidecar. They are deliberately simple and only meant to make our case easier to understand.
Diagram A) The VK already restores its pods by querying InterLink at startup (1). InterLink returns its cached pods (2), and the VK restores everything by comparing those cached pods with the ones registered in the cluster (3). This requires no disk space at all and relies on InterLink keeping a coherent cache. It needs to be improved, but it works for now.
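The comparison in step (3) can be sketched roughly as follows. This is only an illustration in Python, not the actual VK code: the function name `reconcile`, the result keys, and the representation of pods as sets of UIDs are all assumptions made for the example. What the VK should do with each group is a policy decision that the sketch leaves open.

```python
from __future__ import annotations


def reconcile(cached_pods: set[str], cluster_pods: set[str]) -> dict[str, set[str]]:
    """Compare pods cached by InterLink with pods registered in the cluster.

    Both arguments are sets of pod UIDs (a hypothetical representation).
    """
    return {
        # Known to both sides: the VK can resume tracking these.
        "restore": cached_pods & cluster_pods,
        # Cached by InterLink but no longer registered in the cluster.
        "only_in_cache": cached_pods - cluster_pods,
        # Registered in the cluster but unknown to InterLink's cache.
        "only_in_cluster": cluster_pods - cached_pods,
    }


# Stand-in data for the reply (2) and for the pods registered in the cluster.
result = reconcile({"pod-a", "pod-b"}, {"pod-b", "pod-c"})
print(result)
```

The sketch also shows why the approach needs no disk space: everything is derived from the two live views, so it is only as reliable as InterLink's cache.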
However, it is not possible to guarantee that InterLink will never go down, so a cache restoring system must be implemented.
Diagram B) My idea is to use an approach similar to the VK's (diagram A): InterLink queries the underlying Sidecar (1), which returns all running jobs, as well as ended and scheduled ones (2), and InterLink then rebuilds its cache from that reply (3). This mechanism relies in turn on the sidecar's caching system, which must be kept coherent too.
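To make step (3) concrete, here is a minimal Python sketch of rebuilding the cache from the sidecar's reply. It is not the InterLink implementation: the `JobStatus` fields, the state strings, and the `rebuild_cache` helper are hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class JobStatus:
    pod_uid: str  # hypothetical: UID of the pod the job belongs to
    job_id: str   # hypothetical: identifier assigned by the sidecar
    state: str    # assumed values: "running", "ended", "scheduled"


def rebuild_cache(sidecar_jobs: list[JobStatus]) -> dict[str, JobStatus]:
    """Rebuild an in-memory cache, keyed by pod UID, from the sidecar's job list."""
    cache: dict[str, JobStatus] = {}
    for job in sidecar_jobs:
        # Keep every job, including ended and scheduled ones.
        cache[job.pod_uid] = job
    return cache


# Stand-in for the sidecar's reply (2) to InterLink's query (1).
reply = [
    JobStatus("pod-a", "1001", "running"),
    JobStatus("pod-b", "1002", "ended"),
]
cache = rebuild_cache(reply)
print(sorted(cache))
```

In this sketch the rebuilt cache can only contain what the sidecar reports, which is why the sidecar's own cache has to stay coherent.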
At this point, the problem is quite obvious: what happens if the sidecar goes down? That is why I am asking here how to proceed. I have thought about 2 different approaches:
Any feedback would be much appreciated, since it is in everyone's interest to know how the caching mechanism for InterLink will be implemented (especially if it requires effort from sidecar developers).