Replies: 6 comments 3 replies
-
@dshulyak How are we going to build the following features:
Current spacecraft has subcommands for these.
-
Usually terraform is written in a way that allows creating a number of instances based on a configuration parameter; if this parameter changes, terraform deploys more. This should be straightforward, as it is a very common requirement.
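As a sketch, the usual terraform pattern looks like this (resource names, machine type, and zone are illustrative, not taken from any actual spacecraft config):

```hcl
variable "bootnode_count" {
  type    = number
  default = 5
}

resource "google_compute_instance" "bootnode" {
  count        = var.bootnode_count
  name         = "bootnode-${count.index}"
  machine_type = "e2-standard-4"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {} # allocate an external IP
  }
}
```

Raising `bootnode_count` and re-running `terraform apply` creates only the additional instances; the existing ones are left untouched.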
Same here: if the version changes, the deployment needs to be updated. I never had any issues with this in previous projects; it should be straightforward, as it is a very common requirement.
What does that mean, exactly?
This is part of the deployment; what is the problem?
It is OK to have some helper scripts that print the necessary information using the k8s or spacemesh API. Is this related to deployment?
-
Yes
-
I'd like to add to this: if bootnodes are to be persistent (they should be!) we should have them backed up and recoverable. This should not be hard. IIRC, the p2p keypair is currently ephemeral (generated on node bootup), but it's part of the p2p identity, and a bootnode at the same address with a different public key will (and should) be rejected. OTOH, we don't currently have a way to specify a desired p2p keypair for a node. We should implement this and then back up the p2p keypair so we can recover a bootnode if it fails. I could be wrong, btw: we might already load the p2p keypair from disk. That would make things easier in the short term, but eventually we do want to generate a new key on startup to increase privacy.
-
We should talk about this. PoETs are currently a SPOF. We should do several things to harden them, both on the node side and the PoET. I think I wrote a plan for this in the past - I should look it up. In short, the PoET itself should, first of all, be DNS addressable and backed up in short intervals. It has a recovery mechanism that allows it to continue where it left off when recovered. Then we can split the service into three main parts:
Since we want to have PoETs starting at different times, we can do an optimization where instead of duplicating this entire setup for every start time, we have a single instance of parts 1 and 2 above and only a separate part 3 for every parallel instance.
-
We should benchmark how long it takes to create a PoST proof per space unit and make each "big smesher" as big as possible, as long as it can still produce a PoST proof quickly enough to fit in the "PoET gap" (a name I just invented for the gap between when a PoET proof is published and when the next PoET round begins). This is the easiest solution, technically. We can also do all kinds of optimizations to make creating huge PoST proofs faster, like doing more hashing in parallel on the GPU and considering how to arrange the data on RAID arrays to allow fast parallel reads of related data. But for now, just controlling how big each smesher is should be enough, imo.
-
motivation
Originally spacecraft was meant both as a tool to roll out an open network and to run longevity tests on that network.
In practice, effective testing requires a controlled environment where we can run reproducible tests. For all kinds of testing we will continue to use the systest tooling; with best testing practices and chaos-mesh we should be able to cover all our needs with it.
Additionally, there are other requirements that are not implemented in spacecraft, which I will cover below.
And lastly, all operations on the cluster can be greatly simplified by following common practices.
requirements
bootnodes
Bootnodes are entry points to the network. They should be geographically distributed and have additional measures in place to remain available.
In our current deployment model we create ~5 bootnodes, all of them in the same google cloud zone and tied to specific ip addresses.
Instead, we should split them between several regions, for example 2 each in europe/usa/china, and make the nodes addressable using dns.
poets
No special requirements. Poet will run on more expensive VMs and use DNS entries for gateways.
smeshers
For an open network we will want some number (1-10) of smeshers with large storage capacity. For mainnet we will likely run them on custom hardware; for testing the open network they will be VMs with more disk space and a slightly different config.
observability
We need two clusters: one for managed smeshers and one for community smeshers.
implementation
changes in go-spacemesh
#3156 and #3157 will enable simpler rollout of bootnodes.
deployment tooling
We should not use golang or any other general-purpose programming language; all of the steps below can be implemented with devops tooling (terraform, ansible).
For observability we should use the grafana+prometheus+loki stack, with 2 isolated clusters: one for managed smeshers and one for community smeshers.
They can be deployed on k8s.
The output of this step is a set of dns or ip addresses that all other components will use.
Create VMs for bootnodes (without k8s) on multiple providers (amazon, digital ocean, google cloud, alibaba cloud) and make sure docker is installed on them.
Initially we don't need to support multiple providers and can use any one of them.
Each VM should have a public dns name attached (https://cloud.google.com/dns/docs/tutorials/create-domain-tutorial).
hw reqs:
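As a sketch, attaching a dns name to each bootnode VM in terraform could look like the fragment below, assuming the VMs are created by a `google_compute_instance.bootnode` resource with `count = var.bootnode_count` (the zone and domain names are placeholders):

```hcl
resource "google_dns_record_set" "bootnode" {
  count        = var.bootnode_count
  managed_zone = "spacemesh-testnet" # assumed managed zone name
  name         = "boot-${count.index}.testnet.example.org."
  type         = "A"
  ttl          = 300
  rrdatas = [
    google_compute_instance.bootnode[count.index].network_interface[0].access_config[0].nat_ip,
  ]
}
```

This keeps the dns records in lockstep with the instance count, so adding bootnodes automatically adds their names.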
Generate and distribute keys to all VMs. As a result of this step we will have a list of dns identities and identities for the poet server.
Deploy the bootnodes using the dns names from step 2. Logs should be pushed with filebeat.
Metrics will be pushed by every client. In the long term logs will be pushed as well (#3134); in the short term we may need to set up log collection with filebeat (or a similar tool).
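For the short-term filebeat setup, a minimal config could look like this (the output host is a placeholder for whatever dns name the observability step produces; the docker log path assumes the default docker json-file logging driver):

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

output.logstash:
  hosts: ["logs.testnet.example.org:5044"] # placeholder dns name
```

The same file would be distributed to every VM by the deployment tooling, so only the output host differs between environments.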
For testnets we just need to run them on slightly different VMs (more storage space). For genesis this needs to be clarified; I'm not sure we will need to automate anything.
run in the same k8s cluster as logging infra