-
Notifications
You must be signed in to change notification settings - Fork 89
Bacalhau project report 20220729
OK so the scaling goal for this month was 1000 nodes, we made it to 750 - which I'm pretty pleased with since this was a pretty aggressive target.
We discovered and fixed tons of bugs to get to this target:
- fixed race conditions due to not having consistent local state about our previous local actions
- loads lots and lots of panics & deadlocks under load
- we were misusing a libp2p field about the identity of the sending node (which wasn't always the original node which sent the message - especially in non-trivial topologies) was causing very hard to diagnose issues
- fixed a bug that was causing a node to bid on jobs multiple times under load
- the configuration of libp2p with gossipsub that we were using didn't have peer exchange enabled, so the nodes were forming a star network (everyone connected through the center) - this was causing gossipsub to not work reliably, and drop messages randomly - see "visualization tool" below for how we confirmed the fix for this
- missing mutexes and other bugs
- updated devstack so you can peer one devstack with another - this way you can construct huge Bacalhau networks by running large devstack deployments on multiple powerful servers and then peering them with eachother
Here's a full video report on the scaling work: https://drive.google.com/file/d/1hSD-PKqjy7zBSabkY4h_HZpsSQieaD3K/view?usp=sharing
Understanding the shape and behavior of the network is essential as we scale it and get more interesting topologies. To that end, I wrote a simple visualization tool. For example, you can understand the behavior of the WithPeerExchange
pubsub option in libp2p clearly by looking at the output:
ps, err := pubsub.NewGossipSub(ctx, h)
ps, err := pubsub.NewGossipSub(ctx, h, pubsub.WithPeerExchange(true))
Another example, here is a 2nd devstack peering with one that's already running:
And a few seconds later:
It's really nice to watch these systems evolve in front of your eyes.
Phil finished up GPU support and nearly has it deployed in production now.
Demo in the first few minutes of this standup: https://drive.google.com/file/d/1wFIL3HKUwQQPqImzM8Xwr-1djSOB_OHS/view
Enrico finished support for providing external URLs as input data for Bacalhau jobs: https://drive.google.com/file/d/1dXx7mW2lr3dbNLVCYA2-Sk08mbINjL8g/view?usp=sharing
This really helps mitigate the limitation that you can't use the network inside Bacalhau jobs. Now you no longer need to have data accessible on IPFS to get started!
Kai is hard at work on a branch which knocks down globbing, sharding and data parallelism in one fell swoop. This will be ready early next week. The demo of this will be to submit one CID with lots of files in it and automatically split the work across 10 nodes :-)
Vedant landed a great UX improvement, you can now run bacalhau run --wait
to wait for the job to finish and then automatically print the output of the job. No more copy pasting the job id into bacalhau get
🔥❤️
- Finish globing/sharding/parallelism
- Performance of large networks!
- Filecoin integration!