Design Rationale

Zhenyu Guo edited this page Jun 8, 2015 · 3 revisions

The idea for this framework arose from the team's past efforts to (semi-)automatically test, debug, optimize, operate, scale, replicate, compose, and reason about existing distributed systems, during which we encountered many obstacles. It became more desirable when we tried to study and reason about the numerous data center outages, which happen at all major companies that host internet services. In both cases, our tools and runtimes proved unreliable because there were always hidden dependencies, flows, and so on that they could not capture or control, even though capturing and controlling them is critical to their success. We therefore set out to design a new framework for building distributed systems, one that makes all of this work easy to enable while remaining general enough to support many distributed system applications.

We summarized the requirements and found that we needed to answer a basic question: what are the key challenges that distributed systems impose, and how can they be dealt with holistically? Compared to a single-threaded program, we believe the challenges are the new system complexities, such as concurrency, asynchrony, network delay, message loss, machine crashes, and all kinds of other faults. These sources of non-determinism, and their combinations, make it much more difficult to build and manage a robust, high-performance distributed system, as well as the automation tools and runtime frameworks around it.

To address these challenges, we arrived at three design principles.

  • rDSN must be able to monitor and manipulate all the dependencies and non-determinisms in the system, at the appropriate abstraction level, so as to avoid the reliability problem and close the semantic gap. We therefore discourage benign data races in your code, except inside self-contained high-performance data structures, and even there some of rDSN's capabilities may be lost.

  • Decouple the development of applications, frameworks, and tools so that they can be integrated transparently. The goal is a well-defined dependency contract among these large pieces, which dramatically reduces the space developers have to reason about when something goes wrong.

  • Last but not least, rDSN must be highly extensible so that developers can apply low-level optimizations and choose the components they are familiar with whenever they want. However, this must not jeopardize the previous two principles, which means that optimizations need to be decoupled from the application logic.
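The three principles can be illustrated with a minimal sketch. It is not rDSN's actual API: `env_provider`, `native_env`, `simulated_env`, and `next_retry_deadline` are hypothetical names invented for this example. The assumption is that every source of non-determinism the application touches (here only the clock and a random generator) goes through one contract, so that a tool can observe or replace it without any change to the application logic.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical contract (not an rDSN type): the application reaches every
// source of non-determinism through this interface, never the OS directly.
struct env_provider {
    virtual ~env_provider() = default;
    virtual std::uint64_t now_ms() = 0;
    virtual std::uint64_t random64() = 0;
};

// Provider for normal runs: real clock, randomly seeded generator.
struct native_env : env_provider {
    std::mt19937_64 rng{std::random_device{}()};

    std::uint64_t now_ms() override {
        using namespace std::chrono;
        return static_cast<std::uint64_t>(
            duration_cast<milliseconds>(
                steady_clock::now().time_since_epoch()).count());
    }
    std::uint64_t random64() override { return rng(); }
};

// Provider a testing or replay tool might plug in: the clock only moves
// when the tool advances it, and the generator is seeded explicitly.
struct simulated_env : env_provider {
    std::uint64_t clock_ms = 0;
    std::mt19937_64 rng;

    explicit simulated_env(std::uint64_t seed) : rng(seed) {}

    std::uint64_t now_ms() override { return clock_ms; }
    std::uint64_t random64() override { return rng(); }
    void advance(std::uint64_t ms) { clock_ms += ms; }
};

// Application logic: depends only on the contract, so it is unaware of
// which provider is installed. base_ms must be greater than zero.
std::uint64_t next_retry_deadline(env_provider& env, std::uint64_t base_ms) {
    std::uint64_t jitter = env.random64() % base_ms;  // in [0, base_ms)
    return env.now_ms() + base_ms + jitter;
}
```

Under these assumptions, two `simulated_env` instances built with the same seed and advanced by the same amount yield the same deadline, which is the kind of control a replay or model-checking tool needs. Swapping in `native_env` changes the behavior of the environment but not a line of `next_retry_deadline`.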

With these design principles, rDSN introduces a three-layer meta stack for development at three scales: a single node, a self-contained distributed service, and an end-to-end workflow atop many services.