Bacalhau project report 20220722

Ramping up quickly 🚀

The new team members Phil and Enrico are ramping up quickly and have already delivered a major feature:

GPU support

It's now possible to schedule jobs on GPUs using the Docker driver in Bacalhau. The feature is documented from both the user's perspective and the service provider's perspective.

Phil landed this feature in record time, which is very impressive 🚀. It was requested by a prospective user, and we are happy to be responsive to user requests.

The next step is to deploy some GPUs on our own production network.

Support for external HTTPS URLs

Enrico is working on adding a new storage driver for fetching data from a list of HTTPS URLs, so that users can use Bacalhau as a bridge (with compute in the middle) between data available via HTTPS and ingesting that data, or a processed version of it, into IPFS.

This is also a feature requested by a prospective user, and we look forward to shipping it shortly.

Support for `bacalhau apply -f job.yaml`

Vedant landed his first code change, which adds support for specifying a job as a YAML file (which can be version controlled in the manner of Kubernetes YAMLs) rather than having to specify all the options as command-line parameters. This feature pairs nicely with the support for external HTTPS URLs: if you have thousands of URLs to download, you don't want to have to specify them all as command-line parameters!

Passing tests on the datastore interface

The datastore refactor that Kai is working on is now passing the test suite! This refactor eliminates race conditions by having local knowledge about actions (e.g. compute node: "I will only bid on as many jobs as I have CPU and memory for" and requestor node: "I will only accept as many bids as the job's concurrency setting") live in a synchronous local metadata store rather than relying on network roundtrips.
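To make the compute-node rule concrete, here is a minimal sketch of a synchronous local store that only admits a bid when the job fits in the remaining CPU and memory. It is an illustration of the idea rather than the actual datastore interface: `Resources`, `LocalCapacityStore`, `try_reserve` and `release` are hypothetical names, and the capacity figures are made up.

```python
import threading
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: float   # CPU cores
    memory: int  # bytes


class LocalCapacityStore:
    """Hypothetical in-process store of resources already committed to bids."""

    def __init__(self, total: Resources) -> None:
        self._total = total
        self._used = Resources(cpu=0.0, memory=0)
        self._lock = threading.Lock()

    def try_reserve(self, need: Resources) -> bool:
        """Check and reserve in one locked step, so two bids cannot both
        claim the same remaining capacity."""
        with self._lock:
            fits = (
                self._used.cpu + need.cpu <= self._total.cpu
                and self._used.memory + need.memory <= self._total.memory
            )
            if fits:
                self._used.cpu += need.cpu
                self._used.memory += need.memory
            return fits

    def release(self, need: Resources) -> None:
        """Return resources once a job finishes or a bid is rejected."""
        with self._lock:
            self._used.cpu -= need.cpu
            self._used.memory -= need.memory


# Illustrative numbers: a 4-core, 8 GiB node offered three identical jobs.
store = LocalCapacityStore(Resources(cpu=4, memory=8 * 1024**3))
job = Resources(cpu=2, memory=3 * 1024**3)
decisions = [store.try_reserve(job) for _ in range(3)]
print(decisions)  # [True, True, False]
```

Because the check and the reservation happen under one lock in local memory, the decision to bid never depends on a network roundtrip completing in time.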

We also designed a future change to make all the objects in the system into explicit state machines, using a pattern that has worked well for us on previous projects.
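As a rough illustration of what "explicit state machine" means here, the sketch below gives a job a fixed set of states and a table of allowed transitions, and rejects anything else. The state names and the `Job` class are assumptions chosen for the example, not the design we settled on.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical states, for illustration only.
    BIDDING = "bidding"
    RUNNING = "running"
    COMPLETE = "complete"
    ERROR = "error"


# Every legal transition is listed explicitly; anything else is a bug.
ALLOWED_TRANSITIONS = {
    JobState.BIDDING: {JobState.RUNNING, JobState.ERROR},
    JobState.RUNNING: {JobState.COMPLETE, JobState.ERROR},
    JobState.COMPLETE: set(),
    JobState.ERROR: set(),
}


class Job:
    def __init__(self) -> None:
        self.state = JobState.BIDDING

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, or raise if the transition is not allowed."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state


job = Job()
job.transition(JobState.RUNNING)
job.transition(JobState.COMPLETE)
print(job.state.value)  # complete
```

The benefit of this pattern is that an out-of-order event (for example a late message trying to move a completed job back to running) fails loudly instead of silently corrupting state.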

Next up on this track -- which will lean on the datastore work -- is to implement sharding and parallelism of jobs.

Investigating an issue on the production network

Wes reported that the production network occasionally seems to experience a sort of netsplit, whereby some nodes stop hearing about other nodes' jobs. We added instrumentation to the system so that you can make an API request to query the libp2p peers the nodes are connected to at runtime. This will help us track down and fix this issue next time it crops up in production.

Continuously benchmarking every commit

Dave is working on a benchmark setup so that every commit to every PR gets a corresponding PR comment with the timing info and how it relates to the latest benchmarking run on main. This will help us avoid regressing our performance achievements!
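The comparison step could look something like the sketch below, which turns two sets of timings into the lines of a PR comment. This is not Dave's setup: the function name, the benchmark names and the timings are all invented for illustration, and it assumes each benchmark is summarised as a single duration in seconds.

```python
from typing import Dict, List


def compare_to_main(
    pr_seconds: Dict[str, float], main_seconds: Dict[str, float]
) -> List[str]:
    """Format each PR timing with its percentage change relative to main."""
    lines = []
    for name in sorted(pr_seconds):
        current = pr_seconds[name]
        baseline = main_seconds.get(name)
        if not baseline:
            # No (or zero) baseline: report the raw timing only.
            lines.append(f"{name}: {current:.2f}s (no baseline on main)")
            continue
        change = (current - baseline) / baseline * 100
        lines.append(f"{name}: {current:.2f}s ({change:+.1f}% vs main)")
    return lines


# Hypothetical benchmark names and timings.
main_run = {"submit_job": 1.00, "list_jobs": 0.50}
pr_run = {"submit_job": 1.10, "list_jobs": 0.45, "describe_job": 0.20}
comment = compare_to_main(pr_run, main_run)
print("\n".join(comment))
```

A positive percentage means the PR is slower than the latest run on main for that benchmark, which is the signal a reviewer would look for.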

Spinning up stress tests

We have started spinning up the nodes that are going to be used for much larger scale stress testing. We are aiming to simulate 1K nodes by running 10 chunky nodes, each hosting 100 bacalhau and IPFS instances, all cross-connected and spread across different cloud regions. This will be the first real test of how the network performs with a large number of geographically distributed nodes.
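The arithmetic behind that target is simple enough to sketch. The instance naming scheme below is made up, and the link count assumes that "all cross-connected" means a full mesh where every instance peers with every other, which may be stricter than what the test ends up doing.

```python
HOSTS = 10                # chunky nodes
INSTANCES_PER_HOST = 100  # bacalhau + IPFS instance pairs per host


def instance_ids(hosts: int, per_host: int) -> list:
    """Enumerate one (hypothetical) identifier per simulated node."""
    return [
        f"host{h:02d}-node{i:03d}"
        for h in range(hosts)
        for i in range(per_host)
    ]


def full_mesh_links(n: int) -> int:
    """Number of distinct peer pairs if all n nodes connect to each other."""
    return n * (n - 1) // 2


ids = instance_ids(HOSTS, INSTANCES_PER_HOST)
print(len(ids))                   # 1000
print(full_mesh_links(len(ids)))  # 499500
```

Under the full-mesh assumption, 1K simulated nodes implies roughly half a million peer connections, which is part of why this is a meaningful stress test.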

Plan for post October 📆

We've started thinking about and planning what comes next after the "Master Plan - Part 1" goals are achieved, hopefully in October. Here's a preview of our thinking!

Key deliverables (e.g. over next 12-24 months)

This would form the Master Plan - Part 2, following on from Part 1.

CoD WG success

Align with making the Compute-over-Data Working Group (CoD WG) successful
Meet with key participants in CoD WG to establish useful collaborations
Splitting useful parts of Bacalhau into reusable pieces for other projects
This will be an ongoing theme throughout all of the future development as well, and collaboration on all of the following topics must be encouraged

User success

Listen to user feedback as they’re onboarded and develop new features and improvements to make them successful
Examples so far: GPU support, support for external HTTP(S) URLs as input data

Consensus & verification

Keep track of state of jobs as well as verification of jobs
Iterate on the verification protocol for deterministic WASM workloads (discussion already underway with Consensus team)
Prototype and test an implementation of the verification protocol

Smart contract

Smart contract implementation of scheduler in FVM
Integrating smart contract into Transport and Controller interfaces in Bacalhau

Efficiency

Work to ensure smart contract implementation can approach efficiency of libp2p based solution
Throughput is probably more important than latency for batch jobs

Formal verification

Formally verifying the Bacalhau smart contract protocol will help ensure correctness and eliminate protocol bugs
See: Glow, Dafny, Coq, Why3

BFT up to ⅓

Support Byzantine Fault Tolerance assuming ⅔ of the participants are honest
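For reference, the classical bound for Byzantine agreement is that n participants can tolerate f faulty ones only when n ≥ 3f + 1, so strictly fewer than a third may be faulty. The sketch below just works through that arithmetic; it says nothing about which protocol Bacalhau will use, and the function names are illustrative.

```python
def max_faulty(n: int) -> int:
    """Largest f satisfying n >= 3f + 1 for n participants."""
    if n < 1:
        raise ValueError("need at least one participant")
    return (n - 1) // 3


def min_participants(f: int) -> int:
    """Smallest n that tolerates f Byzantine participants."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


for n in (3, 4, 10, 100):
    print(n, max_faulty(n))
# 3 0
# 4 1
# 10 3
# 100 33
```

Note that with exactly 3 participants no Byzantine fault can be tolerated, which is why 4 is the smallest interesting network size.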

Nondeterministic execution

Per the original prototype, bring back support for verifying nondeterministic workloads (e.g. the docker driver, GPU workloads) via evidence of work

Plugin system

Support various flavors of evidence provided to support verifiable non-deterministic execution

Reputation system

Build a reputation system around the judgements being made by the verification protocol
This would allow a public dashboard of providers and how trustworthy they are

Incentive model

Based on the consensus protocol, ensure the incentive model is effective from a game-theoretic perspective, building on the formal verification work

Developer experience

Continuously improving the UX of the system for users and service providers

Hardening evidence of work system

Making the nondeterministic execution more robust via more sophisticated signal processing

Secrecy

Support private data and code

Long-running servers

Support for long-running servers (e.g. web applications, microservices) as well as data processing use-cases

If you have comments about what you think we should build, please let us know on the Filecoin Slack, #bacalhau channel 😄

What's next

Big push to get scale testing & sharding/parallelism done by the end of the month!
