Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-evaluate more safe ways to store types within compiler #7171

Open
Chriscbr opened this issue Sep 25, 2024 · 1 comment
Open

Re-evaluate more safe ways to store types within compiler #7171

Chriscbr opened this issue Sep 25, 2024 · 1 comment

Comments

@Chriscbr
Copy link
Contributor

Chriscbr commented Sep 25, 2024

Sharing some thoughts / notes on a possible code smell in our compiler architecture.

Our compiler stores type information internally using the Type enum data structure. Throughout the compiler, we reference and pass around these Type structures (as well as some other data structures, like SymbolEnv and Namespace) fairly freely using raw pointers throughout the compiler. To facilitate this, we use a custom type named UnsafeRef:

pub struct UnsafeRef<T>(*const T);
impl<T> Copy for UnsafeRef<T> {}
impl<'a, T> From<&'a T> for UnsafeRef<T> {
fn from(t: &'a T) -> Self {
UnsafeRef(t)
}
}
impl<T> Clone for UnsafeRef<T> {
fn clone(&self) -> Self {
*self
}
}
impl<T> std::ops::Deref for UnsafeRef<T> {
type Target = T;
fn deref(&self) -> &Self::Target {
unsafe { &*self.0 }
}
}
impl<T> std::ops::DerefMut for UnsafeRef<T> {
fn deref_mut(&mut self) -> &mut Self::Target {
unsafe { &mut *(self.0 as *mut T) }
}
}

TypeRef is defined as an alias to UnsafeRef<Type>.

The unsafe implementation of the Deref trait for UnsafeRef allows anyone to dereference a TypeRef value (e.g. *some_typeref) to access the underlying type. This pattern is generally "unsafe" in Rust since it's possible to write code like this:

fn make_type() -> {
  let t = Type::Int;
  return UnsafeRef(&t);
}

In the code above, we create an UnsafeRef using a pointer to a local variable. t goes out of scope at the end of the function, so the pointer is no longer valid, and accessing it would be a dangling pointer dereference, causing undefined behavior.

The way we avoid this class of issue today is by convention always creating TypeRef objects through a helper method on the Types singleton struct. The helper method uses Box to store each Type object on the heap, and then stores all of these Box objects within a Vec field on the Types struct. An UnsafeRef object with a pointer to the value on the heap is returned. We also ensure that the Types singleton is always kept in scope through the entire life of the compilation or language server - we don't let it go out of scope. By pointing to objects on the heap that we never deallocate, we've guaranteed that all of these raw pointers will always continue to be valid.

The downsides of this approach, besides using unsafe, include:

  1. No way to free memory for types that are no longer being used or referenced. This is most relevant for the LSP.
  2. Potential for data races if in the future we wanted to parallelize work in the compiler with threads.
  3. It may harder or impossible to implement optimizations like type interning. (Today, a new Type may be allocated on the heap for each Array<str> modeled in a user program. Interning would allow these to be de-duplicated.)

Pragmatically, this isn't a pressing issue for the compiler today, but it's worth keeping an eye on.


One alternative approach could be to use Rc<Type> instead of UnsafeRef<Type>. Rc reference counts usages of the pointer so that the underlying Type would be automatically deallocated when it reaches 0. But an obstacle implementing this today is that we have many places throughout the compiler today where it's expected to be able to mutate the internal structures referred to by Type (see as_class_mut, as_struct_mut, and update_known_inferences as examples). Rc only makes its contents readable, so we'd have to find ways to avoid these mutations. (For example, perhaps a unique StructId is stored inside Type, and the actual contents with a list of mutable fields is stored in a separate side table).

@MarkMcCulloh
Copy link
Contributor

An option would be to avoid pointers and instead use an ID to reference the data. Normally this isn't a super safe idea but luckily compilers have very static(-ish) lifetimes.

The simplest approach would be some arena built around storing items in a list. See id-arena for a nice example of that. It's both very simple and fast, although it still has the downside that individual elements can't be freed. That's where things get complicated.

Instead, worth looking into https://github.com/orlp/slotmap. Not quite as simple but you get this incredibly useful result:

use slotmap::{new_key_type, SlotMap};

new_key_type! {
    pub struct TypeId;
}

#[derive(Debug, Eq, PartialEq)]
pub enum Type {
    String,
    Number,
    Option(TypeId),
}

fn main() {
    // Create a new arena and keep it alive throughout compilation
    let mut types = SlotMap::<TypeId, Type>::with_key();

    // Allocate types
    let number = types.insert(Type::Number);
    let optional_number = types.insert(Type::Option(number));

    // Sometime later, retrieve an `Type` from the arena
    if let Type::Option(mut outer_type) = types[optional_number] {
        // The types are mutable here
        match &mut types[outer_type] {
            Type::Number => println!("Number"),
            _ => println!("Not a number"),
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 🆕 New - not properly defined
Development

No branches or pull requests

2 participants