
Bacalhau project report 20220722

lukemarsden edited this page Jul 22, 2022 · 7 revisions

Ramping up quickly 🚀

The new team members Phil and Enrico are ramping up quickly and have already delivered a major feature:

GPU support

It's now possible to schedule jobs on GPUs using the Docker driver in Bacalhau. The feature is documented here from the user's perspective, and here from the service provider's perspective.

Phil landed this feature in record time, which is very impressive 🚀. It was requested by a prospective user, and we are happy to be responsive to user requests.

The next step is to deploy some GPUs on our own production network.

Support for external HTTPS URLs

Enrico is working on adding a new storage driver for fetching data from a list of HTTPS URLs, so that users can use Bacalhau as a bridge (with compute in the middle) between data available via HTTPS and ingesting that data, or a processed version of it, into IPFS.

This is also a feature requested by a prospective user, and we look forward to shipping it shortly.

Support for bacalhau apply -f job.yaml

Vedant landed his first code change, which adds support for specifying a job as a YAML file (which can be version controlled in the manner of Kubernetes YAMLs) rather than having to specify all the options as command-line parameters. This feature pairs nicely with the support for external HTTPS URLs: if you have thousands of URLs to download, you don't want to have to specify them all as command-line parameters!

Passing tests on the datastore interface

The datastore refactor that Kai is working on is now passing the test suite! This refactor eliminates race conditions by having local knowledge about actions (e.g. compute node: "I will only bid on as many jobs as I have CPU and memory for" and requestor node: "I will only accept as many bids as the job's concurrency setting") live in a synchronous local metadata store rather than relying on network roundtrips.
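The local-knowledge idea can be sketched roughly as follows. This is a minimal illustration in Python, not Bacalhau's actual datastore: the class names `ComputeNodeStore` and `RequesterStore`, their methods, and the resource units are all hypothetical. The point it tries to show is that each bid or accept decision is made under a local lock against local state, so two concurrent decisions cannot both claim the same capacity.

```python
import threading


class ComputeNodeStore:
    """Hypothetical local store of the resources this compute node has left."""

    def __init__(self, total_cpu: float, total_memory_mb: int):
        self._lock = threading.Lock()
        self._free_cpu = total_cpu
        self._free_memory_mb = total_memory_mb

    def try_reserve(self, cpu: float, memory_mb: int) -> bool:
        """Reserve capacity for a bid; return False if the job does not fit."""
        with self._lock:
            if cpu > self._free_cpu or memory_mb > self._free_memory_mb:
                return False
            self._free_cpu -= cpu
            self._free_memory_mb -= memory_mb
            return True


class RequesterStore:
    """Hypothetical local store of the bids this requestor node has accepted."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepted = {}  # job id -> set of accepted node ids

    def try_accept_bid(self, job_id: str, node_id: str, concurrency: int) -> bool:
        """Accept a bid only while fewer than `concurrency` bids are accepted."""
        with self._lock:
            nodes = self._accepted.setdefault(job_id, set())
            if node_id in nodes:
                return True  # already accepted; repeating the call is harmless
            if len(nodes) >= concurrency:
                return False
            nodes.add(node_id)
            return True
```

Because the check and the update happen in one locked step, the decision never depends on a network roundtrip completing in between.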

We also designed a future change to make all the objects in the system into explicit state machines, using a pattern that has worked well for us on previous projects.

Next up on this track -- which will lean on the datastore work -- is to implement sharding and parallelism of jobs.

Investigating an issue on the production network

Wes reported that the production network occasionally seems to experience a sort of netsplit, whereby some nodes stop hearing about other nodes' jobs. We added instrumentation to the system so that you can make an API request to query the libp2p peers the nodes are connected to at runtime. This will help us track down and fix this issue next time it crops up in production.

Continuously benchmarking every commit

Dave is working on a benchmark setup so that every commit to every PR gets a corresponding PR comment with the timing info and how it relates to the latest benchmarking run on main. This will help us avoid regressing our performance achievements!
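As a rough sketch of what such a PR comment might contain, the snippet below compares a PR's timings against the latest run on main and reports the relative change. The function name, the input shape (benchmark name mapped to seconds), and the comment wording are assumptions for illustration only; the actual benchmark setup may look quite different.

```python
def format_benchmark_comment(pr_timings: dict, main_timings: dict) -> str:
    """Build a plain-text comment comparing PR timings with main (hypothetical format)."""
    lines = ["Benchmark results (seconds, lower is better):"]
    for name in sorted(pr_timings):
        pr = pr_timings[name]
        base = main_timings.get(name)
        if base is None or base == 0:
            # Nothing to compare against, e.g. a benchmark added in this PR.
            lines.append(f"- {name}: {pr:.3f}s (no baseline on main)")
            continue
        change = (pr - base) / base * 100.0  # positive means slower than main
        lines.append(f"- {name}: {pr:.3f}s vs {base:.3f}s on main ({change:+.1f}%)")
    return "\n".join(lines)
```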

Spinning up stress tests

We have started spinning up the nodes that will be used for much larger-scale stress testing. We are aiming to simulate 1K nodes by running 10 chunky nodes, each hosting 100 Bacalhau and IPFS instances, all cross-connected and spread across different cloud regions. This will be the first real test of how the network performs with a large number of geographically distributed nodes.

Plan for post October 📆

We've started planning what comes next after the "Master Plan - Part 1" goals are achieved, hopefully in October. Here's a preview of our thinking!

Key deliverables (e.g. over next 12-24 months)

This would form the Master Plan - Part 2, following on from Part 1.

CoD WG success

  • Align with making the Compute-over-Data Working Group (CoD WG) successful
  • Meet with key participants in CoD WG to establish useful collaborations
  • Splitting useful parts of Bacalhau into reusable pieces for other projects
  • This will be an ongoing theme throughout all of the future development as well, and collaboration on all of the following topics must be encouraged

User success

  • Listen to user feedback as they’re onboarded and develop new features and improvements to make them successful
  • Examples so far: GPU support, support for external HTTP(S) URLs as input data

Consensus & verification

  • Keep track of state of jobs as well as verification of jobs
  • Iterate on the verification protocol for deterministic WASM workloads (discussion already underway with Consensus team)
  • Prototype and test an implementation of the verification protocol

Smart contract

  • Smart contract implementation of scheduler in FVM
  • Integrating smart contract into Transport and Controller interfaces in Bacalhau

Efficiency

  • Work to ensure smart contract implementation can approach efficiency of libp2p based solution
  • Throughput is probably more important than latency for batch jobs

Formal verification

  • Formally verifying the Bacalhau smart contract protocol will help ensure correctness and eliminate protocol bugs
  • See: Glow, Dafny, Coq, Why3

BFT up to ⅓

  • Support Byzantine Fault Tolerance assuming ⅔ of the participants are honest
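To make the ⅓ threshold concrete, here is a small sketch assuming the classical Byzantine agreement bound n ≥ 3f + 1, where n is the number of participants and f the number of faulty ones. The report does not say which protocol Bacalhau would use, so this bound and the function names are assumptions for illustration. Under this bound, strictly fewer than one third of participants may be faulty.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f with n >= 3f + 1, i.e. the faults tolerable among n participants."""
    if n < 1:
        raise ValueError("need at least one participant")
    return (n - 1) // 3


def is_tolerable(n: int, faulty: int) -> bool:
    """True if `faulty` Byzantine participants out of n stay within the bound."""
    return 0 <= faulty <= max_byzantine_faults(n)
```

For example, under this assumption a network of 4 participants tolerates 1 faulty participant, while a network of 3 tolerates none.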

Nondeterministic execution

  • Per the original prototype, bring back support for verifying nondeterministic workloads (e.g. the docker driver, GPU workloads) via evidence of work

Plugin system

  • Support various flavors of evidence provided to support verifiable non-deterministic execution

Reputation system

  • Build a reputation system around the judgements being made by the verification protocol
  • This would allow a public dashboard of providers and how trustworthy they are

Incentive model

  • Based on the consensus protocol, ensure the incentive model is effective from a game-theoretic perspective, building on the formal verification work

Developer experience

  • Continuously improving the UX of the system for users and service providers

Hardening evidence of work system

  • Making the nondeterministic execution more robust via more sophisticated signal processing

Secrecy

  • Support private data and code

Long-running servers

  • Support for long-running servers (e.g. web applications, microservices) as well as data processing use-cases

If you have comments about what you think we should build, please let us know on the Filecoin Slack, #bacalhau channel 😄

What's next

  • Big push to get scale testing & sharding/parallelism done by the end of the month!