Replies: 6 comments 3 replies
-
@dshulyak How are we going to build the following features:
Current spacecraft has subcommands for these.
-
Usually terraform is written in a way that allows creating a number of instances based on a configuration parameter; if this parameter changes, terraform deploys more. This should be straightforward, as it is a very common requirement.
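As a sketch, the usual terraform pattern looks like this (resource names, machine type, and zone are illustrative, not taken from any actual spacecraft config):

```hcl
variable "bootnode_count" {
  type    = number
  default = 5
}

resource "google_compute_instance" "bootnode" {
  count        = var.bootnode_count
  name         = "bootnode-${count.index}"
  machine_type = "e2-standard-4"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {} # allocate an external IP
  }
}
```

Raising `bootnode_count` and re-running `terraform apply` creates only the additional instances; the existing ones are left untouched.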
Same here: if the version changes, the deployment needs to be updated. I never had any issues with this in previous projects; it should be straightforward, as it is a very common requirement.
What does that mean, exactly?
This is part of the deployment; what is the problem?
It is OK to have some helper scripts that print the necessary information using the k8s or spacemesh API. Is this related to deployment?
-
Yes
-
I'd like to add to this: if bootnodes are to be persistent (they should be!) we should have them backed up and recoverable. This should not be hard. IIRC, the p2p keypair is currently ephemeral (generated on node bootup), but it's part of the p2p identity, and a bootnode at the same address with a different public key will (and should) be rejected. OTOH, we don't currently have a way to specify a desired p2p keypair for a node. We should implement this and then back up the p2p keypair so we can recover a bootnode if it fails. I could be wrong, btw: we might already load the p2p keypair from disk. That would make things easier in the short term, but eventually we do want to generate a new key on startup to increase privacy.
-
We should talk about this. PoETs are currently a SPOF. We should do several things to harden them, both on the node side and the PoET. I think I wrote a plan for this in the past - I should look it up. In short, the PoET itself should, first of all, be DNS addressable and backed up in short intervals. It has a recovery mechanism that allows it to continue where it left off when recovered. Then we can split the service into three main parts:
Since we want to have PoETs starting at different times, we can do an optimization where instead of duplicating this entire setup for every start time, we have a single instance of parts 1 and 2 above and only a separate part 3 for every parallel instance.
-
We should benchmark how long it takes to create a PoST proof per space unit and make each "big smesher" as big as possible, as long as it can still produce a PoST proof quickly enough to fit in the "PoET gap" (a name I just invented for the gap between when a PoET proof is published and when the next PoET round begins). This is the easiest solution, technically. We can also do all kinds of optimizations to make creating huge PoST proofs faster, like doing more hashing in parallel on the GPU and considering how to arrange the data on RAID arrays to allow fast parallel reads of related data. But for now, just controlling how big each smesher is should be enough, imo.
-
motivation
Originally spacecraft was meant both as a tool to roll out an open network and to run longevity tests on that network.
In practice, effective testing requires a controlled environment where we can run reproducible tests. For all kinds of testing we will continue to use the systest tooling; with best testing practices and chaos-mesh we should be able to cover all our needs with it.
Additionally, there are other requirements that are not implemented in spacecraft, which I will cover below.
And lastly, all operations on the cluster can be greatly simplified by following common practices.
requirements
bootnodes
Bootnodes are entry points to the network. They should be geographically distributed and have additional measures in place to remain available.
In our current deployment model we create ~5 bootnodes, all of them in the same google cloud zone and tied to specific ip addresses.
Instead, we should split them between several regions, for example 2 each in europe/usa/china, and make the nodes addressable using dns.
poets
No special requirements. Poet will run on more expensive VMs and use DNS entries for gateways.
smeshers
For an open network we will want some number (1-10) of smeshers with large storage capacity. For mainnet we will likely run them on custom hardware; for testing the open network they will be VMs with more disk space and a slightly different config.
observability
We need two clusters: one for managed smeshers and one for community smeshers.
implementation
changes in go-spacemesh
#3156 and #3157 will enable simpler rollout of bootnodes.
deployment tooling
We should not use golang or any other general-purpose programming language; all of the steps below can be implemented with devops tooling (terraform, ansible).
For observability we should use the grafana+prometheus+loki stack, with 2 isolated clusters: one for managed smeshers and one for community smeshers.
They can be deployed on k8s.
The output of this step is a set of dns or ip addresses that all other components will use.
Create VMs for bootnodes (without k8s) on multiple providers (amazon, digital ocean, google cloud, alibaba cloud) and make sure docker is installed on them.
Initially we don't need to support multiple providers and can use any one of them.
Each VM should have a public dns name attached (https://cloud.google.com/dns/docs/tutorials/create-domain-tutorial).
hw reqs:
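As a sketch, attaching a dns name to each bootnode VM in terraform could look like the fragment below, assuming the VMs are created by a `google_compute_instance.bootnode` resource with `count = var.bootnode_count` (the zone and domain names are placeholders):

```hcl
resource "google_dns_record_set" "bootnode" {
  count        = var.bootnode_count
  managed_zone = "spacemesh-testnet" # assumed managed zone name
  name         = "boot-${count.index}.testnet.example.org."
  type         = "A"
  ttl          = 300
  rrdatas = [
    google_compute_instance.bootnode[count.index].network_interface[0].access_config[0].nat_ip,
  ]
}
```

This keeps the dns records in lockstep with the instance count, so adding bootnodes automatically adds their names.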
Generate and distribute keys to all VMs. As a result of this step we will have a list of dns identities and identities for the poet server.
Deploy the bootnodes using the dns names from step 2. Logs should be pushed with filebeat.
Metrics will be pushed by every client. In the long term logs will be pushed as well (#3134); in the short term we may need to set up log collection with filebeat (or a similar tool).
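For the short-term filebeat setup, a minimal config could look like this (the output host is a placeholder for whatever dns name the observability step produces; the docker log path assumes the default docker json-file logging driver):

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

output.logstash:
  hosts: ["logs.testnet.example.org:5044"] # placeholder dns name
```

The same file would be distributed to every VM by the deployment tooling, so only the output host differs between environments.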
For testnets we just need to run them on slightly different VMs (more storage space). For genesis this needs to be clarified; I'm not sure we will need to automate anything.
run in the same k8s cluster as logging infra