
Bacalhau project report 20220729

lukemarsden edited this page Jul 29, 2022 · 3 revisions

750 nodes 🚀🚀🚀

OK, so the scaling goal for this month was 1000 nodes and we made it to 750, which I'm pretty pleased with since this was an aggressive target.

We discovered and fixed tons of bugs to get to this target:

  • fixed race conditions due to not having consistent local state about our previous local actions
  • fixed lots and lots of panics & deadlocks under load
  • fixed our misuse of a libp2p field about the identity of the sending node: it isn't always the original node which sent the message (especially in non-trivial topologies), and this was causing very hard-to-diagnose issues
  • fixed a bug that was causing a node to bid on jobs multiple times under load
  • the configuration of libp2p with gossipsub that we were using didn't have peer exchange enabled, so the nodes were forming a star network (everyone connected through the center) - this was causing gossipsub to not work reliably, and drop messages randomly - see "visualization tool" below for how we confirmed the fix for this
  • added missing mutexes and fixed other bugs
  • updated devstack so you can peer one devstack with another - this way you can construct huge Bacalhau networks by running large devstack deployments on multiple powerful servers and then peering them with each other

Here's a full video report on the scaling work: https://drive.google.com/file/d/1hSD-PKqjy7zBSabkY4h_HZpsSQieaD3K/view?usp=sharing

Visualization tool

Understanding the shape and behavior of the network is essential as we scale it and get more interesting topologies. To that end, I wrote a simple visualization tool. For example, you can understand the behavior of the WithPeerExchange pubsub option in libp2p clearly by looking at the output:

Without peer exchange:

ps, err := pubsub.NewGossipSub(ctx, h)

Screenshot 2022-07-28 at 17 48 11

With peer exchange:

ps, err := pubsub.NewGossipSub(ctx, h, pubsub.WithPeerExchange(true))

Screenshot 2022-07-28 at 17 48 39

Another example: here is a 2nd devstack peering with one that's already running:

Screenshot 2022-07-28 at 20 26 41 (1)

And a few seconds later:

Screenshot 2022-07-28 at 20 26 44

It's really nice to watch these systems evolve in front of your eyes.
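The star-versus-mesh distinction can also be checked programmatically from a list of peer connections. The following is a rough sketch and not part of the visualization tool: the Edge type and the isStar function are hypothetical, and it assumes the edge list has no duplicate edges or self-connections.

```go
package main

import "fmt"

// Edge is an undirected connection between two peers.
// Hypothetical type for illustration only.
type Edge struct{ A, B string }

// degrees counts how many connections each peer has.
func degrees(edges []Edge) map[string]int {
	d := make(map[string]int)
	for _, e := range edges {
		d[e.A]++
		d[e.B]++
	}
	return d
}

// isStar reports whether the graph is a star: one hub connected to
// every other peer, and every other peer connected only to the hub.
// Assumes no duplicate edges and no self-connections.
func isStar(edges []Edge) bool {
	d := degrees(edges)
	n := len(d)
	if n < 3 || len(edges) != n-1 {
		return false
	}
	hubs := 0
	for _, deg := range d {
		switch deg {
		case n - 1:
			hubs++ // connected to everyone else
		case 1:
			// leaf: connected only to one peer
		default:
			return false
		}
	}
	return hubs == 1
}

func main() {
	star := []Edge{{"hub", "a"}, {"hub", "b"}, {"hub", "c"}}
	mesh := append(star, Edge{"a", "b"})
	fmt.Println(isStar(star), isStar(mesh))
}
```

A check like this could flag the "everyone connected through the center" shape automatically, though watching the graph is what actually confirmed the fix here.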

GPU support

Phil finished up GPU support and nearly has it deployed in production now.

Demo in the first few minutes of this standup: https://drive.google.com/file/d/1wFIL3HKUwQQPqImzM8Xwr-1djSOB_OHS/view

HTTP(S) URL support

Enrico finished support for providing external URLs as input data for Bacalhau jobs: https://drive.google.com/file/d/1dXx7mW2lr3dbNLVCYA2-Sk08mbINjL8g/view?usp=sharing

This really helps mitigate the limitation that you can't use the network inside Bacalhau jobs. Now you no longer need to have data accessible on IPFS to get started!

Sharding nearly finished

Kai is hard at work on a branch which knocks down globbing, sharding and data parallelism in one fell swoop. This will be ready early next week. The demo of this will be to submit one CID with lots of files in it and automatically split the work across 10 nodes :-)
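To make the idea concrete, here is a rough sketch of globbing plus sharding: select files by pattern, then split them across a fixed number of nodes. This is an illustration only and not the code on Kai's branch. The function names matchFiles and shard, the file names, and the round-robin assignment are all assumptions.

```go
package main

import (
	"fmt"
	"path"
)

// matchFiles returns the names that match a glob pattern.
// It uses path.Match, where "*" does not match "/".
func matchFiles(pattern string, names []string) ([]string, error) {
	var out []string
	for _, n := range names {
		ok, err := path.Match(pattern, n)
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			out = append(out, n)
		}
	}
	return out, nil
}

// shard distributes files round-robin across n shards.
// Round-robin is an assumption; any balanced split would do.
func shard(files []string, n int) [][]string {
	if n <= 0 {
		return nil
	}
	shards := make([][]string, n)
	for i, f := range files {
		shards[i%n] = append(shards[i%n], f)
	}
	return shards
}

func main() {
	// Hypothetical contents of a single CID: 25 images and one text file.
	names := []string{"notes.txt"}
	for i := 0; i < 25; i++ {
		names = append(names, fmt.Sprintf("img-%02d.png", i))
	}

	files, err := matchFiles("*.png", names)
	if err != nil {
		panic(err)
	}
	for i, s := range shard(files, 10) {
		fmt.Printf("node %d: %d files\n", i, len(s))
	}
}
```

Each shard could then run as an independent job on its own node, which is where the data parallelism comes from.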

bacalhau run --wait

Vedant landed a great UX improvement: you can now run bacalhau run --wait to wait for the job to finish and then automatically print the output of the job. No more copy-pasting the job id into bacalhau get 🔥❤️

What's next

  • Finish globbing/sharding/parallelism
  • Performance of large networks!
  • Filecoin integration!