AsyncBril #305 (Open)

evanmwilliams wants to merge 34 commits into main
Conversation

@evanmwilliams (Contributor) commented Dec 11, 2023

AsyncBril: Enhancing Bril with Asynchronous Programming and Thread-Style Features

By Andy He, Emily Wang, and Evan Williams

Overview and Motivation

In the final lecture of CS 6120, we talked about concurrency and parallelism and
the challenges they raise for compiler developers in the modern era of
computing. In particular, we examined how shared-memory multithreading is
inherently a question of language semantics, and we encountered Boehm's key
observation: threads cannot be implemented as a library.

The end of Moore's law indicates a need for parallelism to sustain advances
in computing. Thus, our goal in this project was to implement asynchronous
programming features for Bril. Modern programming languages rely heavily on
asynchronous programming and multithreading to optimize performance and manage
complex tasks. We thought the best way to learn how these features are
implemented in real languages was to try implementing them ourselves. Further,
Bril's easily extensible nature makes it ideal for experimentation. Finally,
we thought future CS 6120 students would benefit from being able to use our
extension to implement their own synchronization primitives and other
asynchronous programming features.

Design and Implementation

In this project, we added two main features to Bril: promises and atomic
primitive types (i.e., atomicint). Promises are a fundamental
part of asynchronous programming. They allow the language to handle operations
that might take an unknown amount of time to complete (like I/O operations)
without blocking the execution of the rest of the program. Atomic operations are
crucial for safe concurrency. They ensure that operations on shared data are
completed as indivisible steps, preventing race conditions where multiple
threads might try to modify the same data simultaneously.

To implement these new language features, we directly modified the Rust
interpreter for Bril: brilirs. We chose Rust for a few reasons. First, the
three of us are more experienced with Rust than with TypeScript, so it was
the natural choice for us. Second, we wanted
to take advantage of Rust's safe-concurrency features while implementing the
asynchronous extensions, since concurrency can be tricky to reason about.

We did not want to make major changes to the structure of brilirs, and we
wanted our changes to be as modular as possible, so as to minimize the changes
needed to extend brilirs in the future. A major design goal for this extension
was that future contributors should not have to understand our work to build
features that do not require asynchronous programming. For instance, any type
extension in Bril will automatically be supported by AsyncBril.

The heap is deliberately thread-unsafe: multiple threads may concurrently read
and write to it without any checking, which we implement with Rust unsafe
blocks. Alternatively, we could have used a RwLock<T> to lock and unlock the
heap while threads modify it, but we ultimately decided that leaving the heap
unlocked and providing synchronization primitives instead would be more
performant and more flexible.
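As a rough illustration, here is a minimal sketch of that design, assuming a
simplified heap of i64 cells (the real brilirs heap stores full Values and is
more involved):

use std::cell::UnsafeCell;

// Sketch only: a heap shared across threads with no locking. The
// `unsafe impl Sync` deliberately opts out of Rust's checking, matching
// the design above; correctness is left to Bril-level synchronization.
struct SharedHeap(UnsafeCell<Vec<i64>>);

unsafe impl Sync for SharedHeap {}

impl SharedHeap {
    fn write(&self, idx: usize, val: i64) {
        // Unchecked concurrent mutation, by design.
        unsafe { (*self.0.get())[idx] = val }
    }
    fn read(&self, idx: usize) -> i64 {
        unsafe { (*self.0.get())[idx] }
    }
}

A RwLock<T> would have made every heap access pay for synchronization even in
single-threaded programs; this design shifts that cost onto the programs that
actually need it.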

Although the interpreter's heap is not thread-safe, calls to alloc are still
made atomically, so no two threads can accidentally be assigned the same index
in state.heap.memory when they call alloc simultaneously.
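Concretely, this guarantee can come from a fetch_add on an atomic counter;
the following is a hypothetical sketch, not the exact brilirs allocator:

use std::sync::atomic::{AtomicUsize, Ordering};

struct Allocator {
    next_free: AtomicUsize,
}

impl Allocator {
    // Atomically reserve `size` slots: each concurrent call observes a
    // distinct counter value, so no two threads get the same base index.
    fn alloc(&self, size: usize) -> usize {
        self.next_free.fetch_add(size, Ordering::SeqCst)
    }
}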

Promises

We have introduced two critical syntax features to Bril: Promise<T> and
Resolve, significantly enhancing its capabilities for threaded execution and
asynchronous programming. The core of this implementation lies in how threads
are created and managed. Upon calling a function defined with a Promise<T>
return type, a thread is immediately started. This allows for concurrent
function execution, enabling the language to handle complex, asynchronous
tasks more effectively. Each thread has its own local stack, so threads
operate independently without interfering with each other's state, while all
threads share a single global heap, which is crucial for modifying pointers
shared across threads.

Here's an example of how our notion of promises is implemented in Bril:

@main {
  size: int = const 4;
  size2: bool = const false;
  prom: promise<int> = call @mod size;
  prom2: promise<bool> = call @mod2 size2;
  x: int = resolve prom;
  y: bool = resolve prom2;
  print x;
  print y;
}

@mod(r: int): promise<int> {
  ret r;
}

@mod2(r: bool): promise<bool> {
  ret r;
}

This Bril program exemplifies the newly implemented asynchronous features in
the language, particularly demonstrating the use of Promise<T> and the
Resolve mechanism. In the @main function, two promises are created: prom
and prom2. prom is a promise of an integer (promise<int>) and is
assigned the result of calling the @mod function with an integer argument
(size). Similarly, prom2 is a promise of a boolean (promise<bool>) and is
linked to the outcome of the @mod2 function, which is called with a
boolean argument (size2).

These promises, once initialized, signify the start of asynchronous operations.
The @mod and @mod2 functions, defined with return types promise<int> and
promise<bool> respectively, indicate that the functions will be executed
asynchronously. Each function simply returns the passed argument, but in a
real-world scenario, these could represent more complex operations executed in
separate threads.

The use of the resolve keyword in the @main function is critical. resolve
is blocking: it waits for the promise to be resolved to a value, synchronizing
the asynchronous function calls. The variables x and y are assigned the
resolved values of prom and prom2, respectively. The print statements at the
end of the @main function display these resolved values, showcasing how
asynchronous tasks can be integrated into the main flow of a Bril program.

The Resolve feature, synchronized with the promise mechanism, is instrumental
in managing the life cycle of asynchronous operations. It allows threads to
signal the completion of tasks and communicate results back to the main
execution flow.
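The overall shape of this machinery, pieced together from the interpreter
snippets quoted in the review below (a handlers map keyed by the promise's
destination variable, joined on resolve), looks roughly like this hedged
sketch; the real map stores richer values than a bare JoinHandle:

use std::collections::HashMap;
use std::thread::{self, JoinHandle};

fn main() {
    // Variable index -> running thread; `handlers` mirrors the name in
    // the review snippets below.
    let mut handlers: HashMap<usize, JoinHandle<i64>> = HashMap::new();

    // `prom: promise<int> = call @mod size;` spawns the callee eagerly.
    handlers.insert(0, thread::spawn(|| 4));

    // `x: int = resolve prom;` blocks until the spawned thread returns.
    let handle = handlers.remove(&0).expect("promise already resolved");
    let x = handle.join().unwrap();
    println!("{x}");
}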

Unresolved promises are automatically cleaned up when they go out of scope. This
is implemented through an atomic flag which indicates to all sub-threads to clean
up all resources used and abort. However, as this may cause undefined behavior
when interacting with the global heap, a warning is printed to indicate that a
promise was left unresolved. In general, all promises should be resolved.
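A hedged sketch of that cleanup mechanism (names hypothetical): the parent
sets a shared AtomicBool when an unresolved promise goes out of scope, and
the worker checks it between instructions:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let stop_flag = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&stop_flag);

    let worker = thread::spawn(move || loop {
        // Abort as soon as the parent signals that the promise was dropped.
        if worker_flag.load(Ordering::SeqCst) {
            return;
        }
        // ... execute the next instruction ...
    });

    // The promise went out of scope unresolved: warn, signal, and reap.
    eprintln!("warning: promise left unresolved");
    stop_flag.store(true, Ordering::SeqCst);
    worker.join().unwrap();
}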

The introduction of these asynchronous capabilities necessitated a medium-sized
refactor of the brilirs interpreter, including parameterizing the types of
State to accommodate the new features. A significant enhancement is the use of
atomics in the Rust interpreter. The atomics ensure a consistent and thread-safe
mechanism to increment the dynamic count across all threads, which is crucial
for tracking and managing active threads. Overall, this implementation not only
enriches Bril with robust asynchronous programming and safe concurrency
management but also significantly extends its utility as a tool for learning
and experimenting with compiler design and parallel computing concepts.
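For instance, the per-thread increments to the dynamic instruction count can
all be funneled into one shared atomic counter; a minimal sketch, with
hypothetical names:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Every thread bumps the same counter; fetch_add makes the increment
// indivisible, so no updates are lost to interleaving.
fn count_instruction(dyn_count: &Arc<AtomicU64>) {
    dyn_count.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    let dyn_count = Arc::new(AtomicU64::new(0));
    count_instruction(&dyn_count);
    println!("dynamic count: {}", dyn_count.load(Ordering::Relaxed));
}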

Atomics

In the context of Bril, atomic primitives provide a way to perform certain
operations on integers in a way that is guaranteed to be atomic, meaning that
the operations are indivisible. This atomicity is crucial in multi-threaded
environments to prevent race conditions. Race conditions occur when two or more
threads access shared data and try to change it simultaneously. If one thread's
operations interleave with another's in an uncontrolled manner, it can lead to
inconsistent or unpredictable results. Atomics prevent this by ensuring that
operations like incrementing a value, setting it, or comparing and swapping it
happen as a single, uninterruptible step.

A particularly powerful application of atomic primitives is in implementing
mutexes (mutual exclusion locks). Mutexes are a foundational concept in
concurrent programming, used to prevent multiple threads from accessing a
critical section of code simultaneously. By using atomic primitives to implement
mutexes, you can ensure that only one thread can enter a critical section at a
time, thereby protecting shared resources from concurrent access.

In a typical mutex implementation using atomics, an atomic variable is used as
a lock flag. Before a thread enters a critical section, it uses atomic
operations to check and set this flag. If the flag indicates that the mutex
is already locked, the thread will wait (or "spin") until the flag changes.
This is often implemented using a compare-and-swap atomic operation, where the
flag is only set if it is currently in the expected unlocked state. Once the
thread has finished executing the critical section, it uses another atomic
operation to set the lock flag back to its unlocked state.
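In Rust terms, that pattern looks like the following sketch, using 0 for
unlocked and 1 for locked, with compare_exchange playing the role of Bril's
cas:

use std::sync::atomic::{AtomicI64, Ordering};

fn acquire(lock: &AtomicI64) {
    // Spin until we successfully flip the flag from 0 (unlocked) to 1.
    while lock
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        std::hint::spin_loop();
    }
}

fn release(lock: &AtomicI64) {
    // A plain atomic store is enough to hand the lock back.
    lock.store(0, Ordering::Release);
}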

Here's an example of how atomics work in Bril:

@main {
  one : int = const 1;
  size : atomicint = newatomic one;
  expect : int = const 1;
  upd : int = const 2;
  res : int = cas size expect upd;

  three : int = const 3;
  res1 : int = loadatomic size;
  res2 : int = swapatomic size three;
  res3 : int = loadatomic size;
  print res;
  print res1;
  print res2;
  print res3;
}

In our Bril program, we demonstrate the use of atomic operations to manage a
shared atomic integer. The program starts by initializing a regular integer
one with a value of 1, which is then used to create an atomic integer size
through the newatomic operation. The utilization of atomic integers is crucial
in concurrent programming environments, as it ensures that the integer can be
safely read and modified by multiple threads without causing race conditions.

We then perform a compare-and-swap (CAS) operation on the atomic integer size.
This atomic action is essential for achieving synchronization in multi-threaded
programs. The cas operation attempts to update size from its expected value
(expect, which is 1) to a new value (upd, which is 2), and it returns the value
size held just before the operation (stored here in res). The update takes
effect only if the current value of size matches the expected value, ensuring
that the update occurs only if size has not been altered by another thread.
Printing res lets us check whether the operation succeeded: it succeeded
exactly when res equals the expected value.

Furthermore, we employ the loadatomic operation to safely read the current
value of size into res1, ensuring a consistent and uncorrupted value.
Following this, we use the swapatomic operation, which atomically swaps
size's value with three (3). The previous value of size is captured in
res2. This swap operation is particularly useful for updating a shared
variable while simultaneously retrieving its old value. Another loadatomic
operation follows, storing the post-swap value of size in res3.

Finally, our program concludes by printing the results of these operations
(res, res1, res2, and res3). This output showcases the effects of
atomic operations on the shared atomic integer, providing a clear example of
how atomics can be used to manage shared state in a concurrent environment
without the need for complex and potentially cumbersome lock-based
synchronization mechanisms.
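For intuition, here is a hedged sketch of how these four Bril operations map
onto Rust's std::sync::atomic::AtomicI64; the exact interpreter code differs,
and the memory orderings here are illustrative:

use std::sync::atomic::{AtomicI64, Ordering};

fn main() {
    // size: atomicint = newatomic one;
    let size = AtomicI64::new(1);

    // res: int = cas size expect upd;  -- yields the previous value
    let res = size
        .compare_exchange(1, 2, Ordering::SeqCst, Ordering::SeqCst)
        .unwrap_or_else(|actual| actual);

    // res1: int = loadatomic size;
    let res1 = size.load(Ordering::SeqCst);

    // res2: int = swapatomic size three;  -- yields the previous value
    let res2 = size.swap(3, Ordering::SeqCst);

    // res3: int = loadatomic size;
    let res3 = size.load(Ordering::SeqCst);

    println!("{res}\n{res1}\n{res2}\n{res3}"); // 1, 2, 2, 3
}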

To elaborate on how our atomics can implement a mutex lock that lets threads
concurrently modify an array, here is another useful code snippet:

@acquire_lock(lock: atomicint) {
.loop:
  res: int = loadatomic lock;
  one: int = const 1;
  eq: bool = eq one res;
  br eq .loop .acquire_lock;
.acquire_lock:
  upd: int = add res one;
  old: int = cas lock res upd;
  eq: bool = eq old res;
  br eq .end .loop;
.end:
  ret;
}

@release_lock(lock: atomicint) {
  zero: int = const 0;
  val: int = swapatomic lock zero;
}

The @acquire_lock function tries to acquire a lock on a shared resource
represented by lock, an atomic integer. The function first loads the current
value of lock atomically, and then loops to keep retrying to acquire the lock
if it's already being used by another thread. To attempt to acquire the lock, a
Compare-and-Swap (CAS) operation is used.

The @release_lock function releases the lock. It sets the lock to 0 using
an atomic swap operation, indicating that the lock is free. The use of atomic
operations ensures that these lock acquire and release operations are
thread-safe and prevent race conditions in concurrent environments.

Challenges

The original implementation of brilirs was understandably not designed with
multithreading in mind. As a result, we had to change some of the architecture
of the data structures and main function headers. To simplify the design,
State no longer has a parameterized lifetime, and instead uses an atomic
reference counter, Arc<T>, to hold the program and environment. Further, to
implement support for atomic ints, we needed a secondary state variable that
would hold all the instantiated atomics. We could not repurpose the heap for this, as
Value needs to derive the Copy trait, but this is not supported for Rust
AtomicI64. To remedy this, we added an additional data structure, modeled
after Heap, to store all atomics.
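The resulting structure, sketched here with hypothetical names, is essentially
a growable table of AtomicI64 cells that a Copy-able Value::Atomic(usize)
indexes into:

use std::sync::atomic::AtomicI64;

#[derive(Default)]
struct AtomicHeap {
    memory: Vec<AtomicI64>,
}

impl AtomicHeap {
    // `newatomic` appends a fresh cell and hands back its index, which
    // is what Value::Atomic then carries around.
    fn alloc(&mut self, init: i64) -> usize {
        self.memory.push(AtomicI64::new(init));
        self.memory.len() - 1
    }

    fn get(&self, idx: usize) -> &AtomicI64 {
        &self.memory[idx]
    }
}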

Evaluation

To evaluate our extension, we primarily focused on two key attributes:
correctness and runtime. To evaluate the runtime gains from asynchronous
programming, consider the following simple benchmark: we use a loop that just
adds 1 to a variable. On the first thread it performs 2000000 iterations, on
the second thread it performs 1000000, and so on. A workload with this shape
arises, for example, when multiplying two large matrices, with each thread
responsible for a portion of the output. Below, we show the execution time in
milliseconds against the number of threads used in the task:

We observe that from 1 thread to 5 threads, we see substantial performance
improvements. This makes sense, as we are taking direct advantage of
parallelization in our programs. From 5 threads to 50 threads however, there
is actually a slight increase in the execution time. This also makes sense,
as an increased number of threads requires more careful resource allocation by
the operating system. And since our implementation still serializes some
shared operations (such as atomic allocation and locked output), more threads
mean more actors vying for those resources, causing a general slowdown in
overall performance.

To evaluate the correctness of our programs we noted that it is not particularly
useful to use pre-existing Bril benchmarks, as none of them are written with
asynchronous features in mind. We instead took a different approach: we look
at asynchronous Bril programs that are implemented incorrectly and observe how
their behavior diverges from the expected behavior. For example, concurrent
array accesses through variables that are not atomic lead to undefined
behavior. Likewise, when atomics are used correctly, the result
of the program is deterministic and predictable. This can easily be observed
by taking the Bril program above used to showcase atomics and removing the
atomicity of the integer (i.e. make it an int instead of an atomicint).
We quickly see why the atomics are necessary and how they guarantee the
correctness of the program.

Overall, evaluating asynchronous programming features can be challenging, but
we believe that through our thorough testing within the interpreter and our
Bril benchmarks, we have created a robust extension that can be used by many
future Bril users.

Conclusion

Concurrent programming is hard to reason about, and it is important to
understand how to implement concurrent programming features at the compiler
level. As we noted at the start, threads cannot be implemented as a library:
threading is inherently embedded in the semantics of the language. We gained
a lot of experience in this project with asynchronous programming and
concurrent programming features in Rust. Overall, we are excited to see this
extension be used in future iterations of CS 6120, and hope that more students
will build upon the work we've done to create many interesting optimizations
involving parallel programming in Bril.

@Pat-Lafon (Contributor)

Hey! This is a really cool project! I'm glad that, overall, the Rust code base was extensible enough to support your extension. I created a PR, #306, to help address one of your failing CI checks. I think the other is an easy cargo fmt fix.

If you don't mind, I have a few suggestions/questions about your extension.

It would be helpful if your extension were testable with make test/benchmark (I think some of these files are really more tests than benchmarks). That should be as simple as adding some output files and adding the corresponding paths to the Makefile lists.

@@ -286,6 +294,7 @@ fn type_check_instruction<'a>(
check_asmt_type(ty, &expected_arg.arg_type)
})?;

// print to console
Unclear?

check_num_funcs(0, funcs)?;
check_num_labels(0, labels)?;
check_asmt_type(&Type::Int, get_type(env, 0, args)?)?;
update_env(env, dest, op_type)
I think you forgot to check the op_type here as is done in all the other cases?

@@ -400,7 +497,11 @@ fn type_check_instruction<'a>(
Some(t) => {
check_num_args(1, args)?;
let ty0 = get_type(env, 0, args)?;
check_asmt_type(t, ty0)
if let Some(promise_t) = promise {
It's unclear to me why the promise variable needs to be threaded through as an argument to type_check_instruction. Isn't this information available inside of t here? If so, that would be simpler

use fxhash::FxHashMap;
use std::collections::HashMap;
FxHashMap should just be a drop-in replacement, no?

@@ -6,6 +6,7 @@ use std::fs::File;
use std::io::Read;

fn main() {
std::env::set_var("RUST_BACKTRACE", "1");
FYI, this can be set in the command line without needing to modify the code.

@@ -0,0 +1,81 @@
Asynchronous Bril
Is this extra file intended?

@@ -201,12 +250,17 @@ impl fmt::Display for Value {
Self::Float(v) => write!(f, "{v:.17}"),
Self::Char(c) => write!(f, "{c}"),
Self::Pointer(p) => write!(f, "{p:?}"),
Self::Atomic(a) => write!(f, "atomic_{a}"),
In the memory docs, they specify that the result of printing a pointer is unspecified (an implementation can do anything except error). Is that the same here?

out: &Mutex<T>,
val: &Value,
) -> Result<(), std::io::Error> {
let mut out = out.lock().unwrap();
Maybe a nit, but there isn't really a reason to change this function, right? Can we acquire the lock first and then call this function, as a separation of concerns? That is equivalent to what you need to do if you are using Display instead.

@Pat-Lafon (Contributor)

> Unresolved promises are automatically cleaned up when they go out of scope.

This is a fine design decision, but at the risk of overstepping, I think this is unnecessarily restrictive. Right now, promises need to get cleaned up at the end of each function call, but intuitively, there isn't any reason they can't span function boundaries, right? Another approach is for these promises to have similar semantics to memory: everything needs to be cleaned up by the time the program terminates.

@@ -87,17 +91,61 @@ impl Environment {
}
}

// A second heap for atomics, since AtomicI64 cannot be added to the Value enum without a lot of work
Purely out of curiosity, did you consider making them a part of the Value enum more seriously? Let's say that Value didn't implement Copy and you modeled these atomics as a variant of Arc<AtomicI64> instead of this map, would that be equivalent? What would the implications of that design decision be?

state.env.set(dest, Value::Int(res));
}
LoadAtomic => {
let arg0 = get_arg::<usize>(&state.env, 0, args) as usize;
Unnecessary as usize cast. I think clippy has some suggestions that could be applied.

@@ -172,10 +221,10 @@ enum Value {
Float(f64),
Char(char),
Pointer(Pointer),
Atomic(usize),
You might consider using the "NewType" idiom here to not mix this usize with the usize used for variable names.

op: bril_rs::EffectOps,
args: &[usize],
funcs: &[usize],
curr_block: &BasicBlock,
// There are two output variables where values are stored to effect the loop execution.
next_block_idx: &mut Option<usize>,
result: &mut Option<Value>,
out: Arc<Mutex<T>>,
stop_flag: Option<Arc<AtomicBool>>,
This stop_flag is somewhat mysterious to me. Why do we need to do this when the program ends up panicking anyway?

I think this could be implemented as &Option<AtomicBool>.

Could this fit into State?

@@ -745,6 +926,8 @@ fn parse_args(
Ok(())
}
bril_rs::Type::Pointer(..) => unreachable!(),
bril_rs::Type::Promise(..) => unreachable!("line 839"),
bril_rs::Type::AtomicInt => unreachable!("line 840"),
What are these line numbers in reference to?

@Pat-Lafon (Contributor)

> The heap is deliberately thread-unsafe: multiple threads may concurrently
> read and write to it without any checking, which we implement with Rust
> unsafe blocks. Alternatively, we could have used a RwLock<T> to lock and
> unlock the heap while threads modify it, but we ultimately decided that
> leaving the heap unlocked and providing synchronization primitives instead
> would be more performant and more flexible.

Hmmm, this seems scary. I'm concerned that it's not just thread-safety but that there is also a possibility of memory corruption. Did you try profiling this to see what the penalty is? One observation: pointers can't be used to write to different memory blocks, so you actually might be able to write to multiple memory regions simultaneously without contention. I guess you only need to take a whole-heap write lock during alloc/free? Maybe something smart can be done here... memory allocation is by far brilirs's least optimized part.

@@ -516,19 +622,58 @@ fn execute_value_op<T: std::io::Write>(
let res = Value::Pointer(arg0.add(arg1));
state.env.set(dest, res);
}
Resolve => {
let res = match handlers.remove(&args[0]) {
When I first saw this line, I thought it was a bug, since you are using the variable name (usize'd index) to find the handler, which also happens to be a usize. Looking at the rest of this code, I realize it probably isn't.

Still, this seems like a serious hit to modularity: each promise is tied to the variable name it is bound to, so you can't call id on a promise, pass it as an argument, or store it in an array. I think this should be documented. For what it's worth, I think this is a straightforward limitation to lift, though it is somewhat related to your scoping restriction.

Resolve => {
let res = match handlers.remove(&args[0]) {
Some(handle) => handle.0.join().unwrap(),
None => panic!("Unexpected execution error"),
Can we use a Result error instead, with a more descriptive error message?


let val = match result {
Some(val) => val,
None => Value::Uninitialized,
This result was originally an unwrap, which I think better aligns with the program semantics. Typechecking should ensure that there is always a value here, so it is not possible to create an Uninitialized value at this point.

@Pat-Lafon (Contributor)

Sorry for all the comments; hopefully some of them were helpful, but otherwise take them all with a grain of salt. Your team thought about this a lot more than I have, and it's your project. 🦀

@sampsyo (Owner) commented Dec 24, 2023

Very cool implementation, everybody! @Pat-Lafon has some good advice here, if you're interested in continuing the hacking.
