Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[llvm][Support][Memory] Add memfd based fallback for strict W^X Linux systems #98538

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

minipli-oss
Copy link

PaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent a transition from a once writable memory mapping to an executable one. This obviously breaks JIT code generation.

As both of these are Linux specific and the Linux kernel gained memfd support almost 10 years ago in v3.17, making use of it to implement a fallback seems like a viable option to get JIT code generation fixed for such systems.

Implement a detour through a memfd for systems that are detected to deny the W<->X transition of memory mappings. For PaX this can be easily achieved by evaluating the per-process PaX flags, if access to /proc/self/status is available. For others, and as a fallback for PaX, a runtime test is done once.

This enables such systems to make use of JIT code generation without completely abandoning their W^X policy.

Regarding the implementation, the runtime test for detecting PaX's MPROTECT shamelessly shows I'm a C developer at heart. This can -- and should! -- be made more C++ish by making use of a MemoryBuffer or StringRef maybe coupled with something fancy from lvm::sys::fs. However, I failed to find something as simple as only opening a file with O_CLOEXEC, reading it line by line, looking for the one I'm interested in. So there it is in plain C. 🫤

Also please excuse my clunky FDWrapper class. Again, I failed to find a simple thing that does RAII for fds, so implemented my own to ease erroring out of the code.

An example of what this change achieves, is to make use of LLVMpipe on such system out of the box by implementing a fallback to what normally would be denied by policy. This, in turn, will give users of such systems a working graphical desktop if there's no native video driver available, e.g. what is commonly the case in (wrongly configured) VMs.

PS: First time contributor, suggestions for improvement heavily welcomed!

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 11, 2024

@llvm/pr-subscribers-llvm-support

Author: None (minipli-oss)

Changes

PaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent a transition from a once writable memory mapping to an executable one. This obviously breaks JIT code generation.

As both of these are Linux specific and the Linux kernel gained memfd support almost 10 years ago in v3.17, making use of it to implement a fallback seems like a viable option to get JIT code generation fixed for such systems.

Implement a detour through a memfd for systems that are detected to deny the W<->X transition of memory mappings. For PaX this can be easily achieved by evaluating the per-process PaX flags, if access to /proc/self/status is available. For others, and as a fallback for PaX, a runtime test is done once.

This enables such systems to make use of JIT code generation without completely abandoning their W^X policy.

Regarding the implementation, the runtime test for detecting PaX's MPROTECT shamelessly shows I'm a C developer at heart. This can -- and should! -- be made more C++ish by making use of a MemoryBuffer or StringRef maybe coupled with something fancy from lvm::sys::fs. However, I failed to find something as simple as only opening a file with O_CLOEXEC, reading it line by line, looking for the one I'm interested in. So there it is in plain C. 🫤

Also please excuse my clunky FDWrapper class. Again, I failed to find a simple thing that does RAII for fds, so implemented my own to ease erroring out of the code.

An example of what this change achieves, is to make use of LLVMpipe on such system out of the box by implementing a fallback to what normally would be denied by policy. This, in turn, will give users of such systems a working graphical desktop if there's no native video driver available, e.g. what is commonly the case in (wrongly configured) VMs.

PS: First time contributor, suggestions for improvement heavily welcomed!


Full diff: https://github.com/llvm/llvm-project/pull/98538.diff

3 Files Affected:

  • (modified) llvm/lib/Support/Unix/Memory.inc (+59-3)
  • (added) llvm/lib/Support/Unix/MemoryLinux.h (+105)
  • (modified) llvm/unittests/Support/MemoryTest.cpp (+6)
diff --git a/llvm/lib/Support/Unix/Memory.inc b/llvm/lib/Support/Unix/Memory.inc
index bac208a7d543c..f894e24ea2919 100644
--- a/llvm/lib/Support/Unix/Memory.inc
+++ b/llvm/lib/Support/Unix/Memory.inc
@@ -18,6 +18,10 @@
 #include "llvm/Support/Process.h"
 #include "llvm/Support/Valgrind.h"
 
+#ifdef __linux__
+#include "MemoryLinux.h"
+#endif
+
 #ifdef HAVE_SYS_MMAN_H
 #include <sys/mman.h>
 #endif
@@ -177,6 +181,55 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M,
       alignAddr((const uint8_t *)M.Address + M.AllocatedSize, PageSize);
 
   bool InvalidateCache = (Flags & MF_EXEC);
+  bool SkipMprotect = false;
+
+#if defined(__linux__)
+  // Check for cases where the EXEC protection flag changes and a possible
+  // strict W^X policy cannot be bypassed via mprotect() alone, e.g. under
+  // PaX's MPROTECT or SELinux's deny_execmem.
+  //
+  // To support such systems, we need to create a fresh mapping with the
+  // target protection flags.
+  if ((M.Flags ^ Flags) & MF_EXEC && execProtChangeNeedsNewMapping()) {
+    class FDWrapper {
+    public:
+      FDWrapper(int fd) : fd(fd) {}
+      ~FDWrapper() { ::close(fd); }
+      operator int() const { return fd; }
+    private:
+      int fd;
+    } fd(memfd_create("llvm", MFD_CLOEXEC));
+
+    if (fd < 0)
+      return errnoAsErrorCode();
+
+    const char *data = reinterpret_cast<char *>(Start);
+    uintptr_t len = End - Start;
+    uintptr_t left = len;
+
+    while (left) {
+      ssize_t cnt = ::write(fd, data, left);
+      if (cnt < 0) {
+        if (errno == EINTR)
+          continue;
+
+        return errnoAsErrorCode();
+      }
+      left -= cnt;
+      data += cnt;
+    }
+
+    void *addr = ::mmap(reinterpret_cast<void *>(Start), len, Protect,
+                        MAP_PRIVATE | MAP_FIXED, fd, 0);
+    if (addr == MAP_FAILED)
+      return errnoAsErrorCode();
+
+    // We created a new mapping with the final protection bits, therefore
+    // don't need to call mprotect() with the very same flags again -- unless
+    // we have to toggle PROT_READ for ARM.
+    SkipMprotect = true;
+  }
+#endif
 
 #if defined(__arm__) || defined(__aarch64__)
   // Certain ARM implementations treat icache clear instruction as a memory
@@ -190,13 +243,16 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M,
 
     Memory::InvalidateInstructionCache(M.Address, M.AllocatedSize);
     InvalidateCache = false;
+    SkipMprotect = false;
   }
 #endif
 
-  int Result = ::mprotect((void *)Start, End - Start, Protect);
+  if (!SkipMprotect) {
+    int Result = ::mprotect((void *)Start, End - Start, Protect);
 
-  if (Result != 0)
-    return errnoAsErrorCode();
+    if (Result != 0)
+      return errnoAsErrorCode();
+  }
 
   if (InvalidateCache)
     Memory::InvalidateInstructionCache(M.Address, M.AllocatedSize);
diff --git a/llvm/lib/Support/Unix/MemoryLinux.h b/llvm/lib/Support/Unix/MemoryLinux.h
new file mode 100644
index 0000000000000..e1a642f121394
--- /dev/null
+++ b/llvm/lib/Support/Unix/MemoryLinux.h
@@ -0,0 +1,105 @@
+//===- Unix/MemoryLinux.h - Linux specific Helper Fuctions ------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines Linux specific helper functions for memory management.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_SUPPORT_UNIX_MEMORYLINUX_H
+#define LLVM_LIB_SUPPORT_UNIX_MEMORYLINUX_H
+
+#ifndef __linux__
+#error Linux only support header!
+#endif
+
+#include "llvm/Support/Process.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/mman.h>
+#include <sys/syscall.h>
+
+#ifndef MFD_CLOEXEC
+#define MFD_CLOEXEC 1U
+#endif
+
+namespace llvm {
+namespace sys {
+namespace {
+
+static inline bool isPermissionError(int err) {
+  // PaX uses EPERM, SELinux uses EACCES
+  return err == EPERM || err == EACCES;
+}
+
+// FIXME: Make this either more lol-level C'ish or C++'ish
+static inline bool execProtChangeNeedsNewMapping() {
+  static int status = -1;
+
+  if (status != -1)
+    return status;
+
+  // Try to get the status from /proc/self/status, looking for PaX flags.
+  FILE *f = fopen("/proc/self/status", "re");
+  if (f) {
+    char *buf = NULL;
+    size_t len;
+
+    while (getline(&buf, &len, f) != -1) {
+      if (strncmp(buf, "PaX:", 4))
+        continue;
+
+      // Look for 'm', indicating PaX MPROTECT is disabled.
+      status = !strchr(buf + 4, 'm');
+      break;
+    }
+
+    fclose(f);
+    free(buf);
+
+    if (status != -1)
+      return status;
+  }
+
+  // Create a temporary writable mapping and try to make it excecutable.  If
+  // this fails, test 'errno' to ensure it failed because we were not allowed
+  // to create such a mapping and not because of some transient error.
+  size_t size = Process::getPageSizeEstimate();
+  void *addr = ::mmap(NULL, size, PROT_READ | PROT_WRITE,
+                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+  if (addr == MAP_FAILED) {
+    // Must be low on memory or have too many mappings alrady, not much we can
+    // do here.
+    status = 0;
+  } else {
+    if (::mprotect(addr, size, PROT_READ | PROT_EXEC) < 0)
+      status = isPermissionError(errno);
+    else
+      status = 0;
+    ::munmap(addr, size);
+  }
+
+  return status;
+}
+
+static inline int memfd_create(const char *name, int flags) {
+#ifdef SYS_memfd_create
+  return syscall(SYS_memfd_create, name, flags);
+#else
+  return -1;
+#endif
+}
+
+} // anonymous namespace
+} // namespace sys
+} // namespace llvm
+
+#endif
diff --git a/llvm/unittests/Support/MemoryTest.cpp b/llvm/unittests/Support/MemoryTest.cpp
index 9daa6d0ff9e4d..dbee1b62b25d1 100644
--- a/llvm/unittests/Support/MemoryTest.cpp
+++ b/llvm/unittests/Support/MemoryTest.cpp
@@ -11,6 +11,10 @@
 #include "gtest/gtest.h"
 #include <cstdlib>
 
+#if defined(__linux__)
+#include "../lib/Support/Unix/MemoryLinux.h"
+#endif
+
 #if defined(__NetBSD__)
 // clang-format off
 #include <sys/param.h>
@@ -40,6 +44,8 @@ bool IsMPROTECT() {
     err(EXIT_FAILURE, "sysctl");
 
   return !!(paxflags & CTL_PROC_PAXFLAGS_MPROTECT);
+#elif defined(__linux__)
+  return execProtChangeNeedsNewMapping();
 #elif (defined(__APPLE__) && defined(__aarch64__)) || defined(__OpenBSD__)
   return true;
 #else

@minipli-oss
Copy link
Author

Should have mentioned, this fixes #7073.

@minipli-oss
Copy link
Author

Ping, maybe @chandlerc, @andykaylor, @hubert-reinterpretcast can take a look?

@minipli-oss
Copy link
Author

@tstellar, can you please take a look? This change makes LLVM's JIT code generation SELinux deny_execmem compatible, so might be valuable for RedHat based systems?

PaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent
a transition from a once writable memory mapping to an executable one.
This obviously breaks JIT code generation.

As both of these are Linux specific and the Linux kernel gained memfd
support almost 10 years ago in v3.17, making use of it to implement a
fallback seems like a viable option to get JIT code generation fixed
for such systems.

Implement a detour through a memfd for systems that are detected to deny
the W<->X transition of memory mappings. For PaX this can be easily
achieved by evaluating the per-process PaX flags, if access to
/proc/self/status is available. For others, and as a fallback for PaX, a
runtime test is done once.

This enables such systems to make use of JIT code generation without
completely abandoning their W^X policy.

Signed-off-by: Mathias Krause <[email protected]>
- set errno if SYS_memfd_create is unknown so error handling actually
  works in this case
- try making use of MFD_EXEC as demanded by recent kernels
@minipli-oss
Copy link
Author

I did a review and fixes myself and added them to the branch as follol-up commits, to ease review. My plan is to squash these into the first commit for a merge, though.

The changes fix error handling for old systems as well as supporting recent kernels better. I also added an explicit test for the W->X memory transition which is now supported through this change.

Copy link

github-actions bot commented Jul 24, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@minipli-oss
Copy link
Author

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format ... --extensions cpp,inc,h ...

Ahh, that's the key! I was wondering myself why git clang-format wasn't working for me. But, apparently, it ignores *.inc files by default. :/

@tstellar tstellar requested a review from lhames July 25, 2024 00:47
@DavidSpickett
Copy link
Collaborator

I did a review and fixes myself and added them to the branch as follol-up commits, to ease review. My plan is to squash these into the first commit for a merge, though.

llvm only allows the "squash and merge" option, so yes, you're doing the right thing :)

@DavidSpickett DavidSpickett changed the title [Support][Memory] Add memfd based fallback for strict W^X Linux systems [llvm][Support][Memory] Add memfd based fallback for strict W^X Linux systems Jul 25, 2024
Copy link
Collaborator

@DavidSpickett DavidSpickett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not my area of expertise but here are some general suggestions.

llvm/lib/Support/Unix/MemoryLinux.h Outdated Show resolved Hide resolved
@@ -177,6 +181,78 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M,
alignAddr((const uint8_t *)M.Address + M.AllocatedSize, PageSize);

bool InvalidateCache = (Flags & MF_EXEC);
bool SkipMprotect = false;

#if defined(__linux__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this all be sunk into a helper function?

#if defined(__linux__)
whatever = do_a_bunch_of_stuff();
#endif

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's much less code in the current version and moving it to a helper function would require passing quite some arguments. So, maybe, it's okay now?

I could move the memfd part to a helper class to squeeze out a few more lines, but that feels like overengineering just for the cause.

llvm/lib/Support/Unix/Memory.inc Outdated Show resolved Hide resolved
llvm/lib/Support/Unix/MemoryLinux.h Outdated Show resolved Hide resolved
Remove the "private" header file and make execProtectionChangeNeedsNewMapping()
a protected static member function of llvm::sys::Memory, allowing it to be
accessed by the unit tests as well.
For very simple fd streams we don't need to know the current position
nor the file's type. Support skipping these syscalls.

Signed-off-by: Mathias Krause <[email protected]>
… it supports a lightweight mode

Signed-off-by: Mathias Krause <[email protected]>
@minipli-oss
Copy link
Author

Quite some changes, but most important ones are:

  • made execProtectionChangeNeedsNewMapping() a protected static method of llvm::sys::Memory and moved its implementation back to .../Memory.inc and providing a stub implementation for Windows
  • made execProtectionChangeNeedsNewMapping() look much more like C++ by using MemoryBuffer and StringRef to do /proc/self/status parsing
  • changed raw_fd_ostream to allow skipping lseek() and fstat() syscalls as these would only cause unnecessary overhead for memfds
  • use these lightweight raw_fd_ostreams to replace the home-made FDWrapper class

Commit fc3e577 should maybe be its own PR and not squashed into this one?

status = 0;
} else {
if (::mprotect(addr, size, PROT_READ | PROT_EXEC) < 0)
status = isPermissionError(errno);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will trigger an AVC denial, which might kill the process depending on host configuration, but at the very least will cause log spam and confusion (does this executable require WX because of the jit, or because it was compromised?).

Ideally you would check for the selinux execmem permission beforehand, here's an example:

#include <functional>
#include <iostream>
#include <selinux/selinux.h>

namespace {

int noop_log(int, const char *, ...) { return 0; };

bool selinux_can_execmem() {

    if (is_selinux_enabled() != 1) {
        return true;
    }

    // this is required as libselinux installs printf() as a default log callback, or because consumers may
    // install their own
    const auto previous_callback = selinux_get_callback(SELINUX_CB_LOG);
    selinux_set_callback(SELINUX_CB_LOG, {.func_log = &noop_log});
    struct ScopeGuard {
        std::function<void()> func;
        ~ScopeGuard() { func(); }
    };
    const ScopeGuard scope_guard{
        [previous_callback]() { selinux_set_callback(SELINUX_CB_LOG, previous_callback); }};

    char *user_context_raw{};

    if (getcon_raw(&user_context_raw) < 0) {
        std::cerr << "error in getting my selinux context\n";
        return true;
    }
    if (selinux_check_access(user_context_raw, user_context_raw, "process", "execmem", nullptr) == 0) {
        return false;
    }
    return true;
}

} // namespace

int main() {
    if (selinux_can_execmem()) {
        std::cout << "I am allowed to execmem\n";
    } else {
        std::cout << "I am not allowed to execmem\n";
    }
}

Note that this will require linking with -lselinux. There is no freestanding way to check for selinux perms afaik.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I wasn't aware that there's a mode where an "execmem" denial would kill the involved process. During my tests on Fedora I only observed failing syscalls returning -EACCESS (this very mprotect() call).

While I'd like to have a better detection of SELinux's "deny_execmem", there are basically two reasons against it:

  1. If this sequence of syscalls (addr = mmap(RW); mprotect(addr, RX)) ends up killing the process instead of only making the second syscall fail, it probably shouldn't attempt to make use of a JIT anyways -- with a host configuration that strict.
  2. Linking against libselinux by default just for this test is a deal breaker as we (a) shouldn't require all users of libLLVM to link against libselinux (well, technically we could but) and (b) overriding library internal state -- even if only briefly -- will mess with the application also trying to make use of libselinux and that'll lead to unintentional misbehavior for applications linking against libLLVM. A library shouldn't do that.

We can, however, add a mode (build time option) where we do the specific checks for distributions that do care and want to have such a narrow check. It should, however, try to make use of libselinux without changing any of its internal state.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmph, I didn't think of the multithreaded issue of "one thread is the consumer setting a selinux callback, the other is the JIT doing it's thing". I'm not sure how to best solve this then, as I'd like to avoid having llvm print selinux log messages.

I guess at worst we could simply not check for execmem specifically and instead just bail out when we find selinux enabled? I'd argue that any sane selinux instance disabled execmem anyways :P

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmph, I didn't think of the multithreaded issue of "one thread is the consumer setting a selinux callback, the other is the JIT doing it's thing". I'm not sure how to best solve this then, as I'd like to avoid having llvm print selinux log messages.

Yeah, me too. Having a library printing stuff at "arbitrary" times isn't nice ;)

I guess at worst we could simply not check for execmem specifically and instead just bail out when we find selinux enabled? I'd argue that any sane selinux instance disabled execmem anyways :P

From my experience, quite the opposite. You won't find a current desktop system that has SELinux's "deny_execmem" enabled because there's just so many JITs out there all trying to either directly mmap(RWX) or do the transition RW- -> R-X.

So, no, just because SELinux is enabled, doesn't mean its policy will deny generating code at runtime. That's why the test in Memory::execProtectionChangeNeedsNewMapping() has a generic fallback if a specific test (for PaX only, so far) fails.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I (and I bet everyone else who works on selinux policies) would still much prefer being overly cautious rather than causing an AVC denial to detect execmem support.
It means we'd have to add a suppression for everything that uses the llvm jit, which means losing any information / audit logging on when a program unintentionally tries to use execmem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I just want to stress, the status quo would be to generate the very same AVC denial and having the JIT code generation fail with EACCESS or even killing the process, as you've hinted before.

My changes do not make it worse, in fact, they make it better for configurations that do not kill a process on an AVC denial. But yeah, I see that preventing the AVC denial is a sensible goal as well.

I'll try to see what I can do about it without making libLLVM mess with libselinux's global state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing a freestanding check was actually relatively easy. ...well, kind of. I was able to implement a C version in basically no time but I struggled to make that comply with LLVM's set of C++ helpers around file i/o. In fact, I failed to find something that would suit my needs for opening a file as a stream for reading and writing. That's why I went for std::fstream instead. If that's a no-go, please, PLEASE, point me to some abstraction that's better suited. (Honestly, I even find the hops one has to go through just to make use of MemoryBuffer::getFileAsStream() ugly but beauty lies in the eye of the beholder, they say....). Just for the record, that would be the C pendant:

#include <sys/stat.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int is_selinux_enabled(void) __attribute__((weak));

#ifndef SEL_INO_MASK
#define SEL_INO_MASK 0x00ffffff
#endif
#ifndef SEL_CLASS_INO_OFFSET
#define SEL_CLASS_INO_OFFSET 0x04000000
#endif
#ifndef SEL_VEC_MAX
#define SEL_VEC_MAX 32
#endif
#ifndef SELINUX_MAGIC
#define SELINUX_MAGIC 0xf97cff8c
#endif

#define SEL_MNT "/sys/fs/selinux"

int selinux_can_execmem(void) {
    if (is_selinux_enabled && !is_selinux_enabled())
        return 1;

    // get current context
    char ctx[1024] = {};
    FILE *f = fopen("/proc/self/attr/current", "r");
    if (!f)
        return -1;
    if (!fgets(ctx, sizeof(ctx), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    // The inode encodes class and permission bits
    struct stat sbuf;
    if (stat(SEL_MNT "/class/process/perms/execmem", &sbuf))
        return -1;

    if (!(sbuf.st_ino & SEL_CLASS_INO_OFFSET))
        return -1;

    int clazz = (sbuf.st_ino & SEL_INO_MASK) / (SEL_VEC_MAX + 1);
    int xperm = (sbuf.st_ino & SEL_INO_MASK) % (SEL_VEC_MAX + 1) - 1;
    if (xperm < 0 || xperm > 31)
        return -1;

    unsigned int execmem = 1U << xperm;

    f = fopen(SEL_MNT "/access", "r+");
    if (!f)
        return -1;

    if (fprintf(f, "%s %s %d %x", ctx, ctx, clazz, execmem) < 0) {
        fclose(f);
        return -1;
    }

    unsigned int lo, hi;
    if (fscanf(f, "%x %x", &lo, &hi) != 2) {
        fclose(f);
        return -1;
    }
    fclose(f);

    uint64_t allowed = (uint64_t)hi << 32 | lo;
    return !!(allowed & execmem);

}
int main(void) {
    int status = selinux_can_execmem();
    printf("I am %sallowed to execmem\n",
           status == 0 ? "not " :
           status == 1 ? "" :
           "maybe ");
    return 0;
}

Either way, the current version avoids the AVC denial by implementing a specific test which should address your concerns.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to go this route too, but gave up trying to decipher the class encoding - reading the kernel source never occured to me :P

LGTM, no clue whether llvm has an alternative to std::fstream so I'll leave this open.

llvm/lib/Support/Unix/Memory.inc Outdated Show resolved Hide resolved
llvm/include/llvm/Support/raw_ostream.h Show resolved Hide resolved
If MFD_CLOEXEC is defined, assume the libc function is so as well.

Signed-off-by: Mathias Krause <[email protected]>
@lhames
Copy link
Contributor

lhames commented Aug 1, 2024

I'm out of the loop on Linux memory protections. If I understand correctly this works around W^X by copying the content of the memory to the memfd, then mapping the memfd back in at the target address with R-X flags?

If so that's cool, but seems surprising behavior for sys::Memory.

What about adding a new W^X compliant JITLinkMemoryManager for this? Or a new Mapper for MapperJITLinkMemoryManager? The JIT linker only ever reads / writes memory, so the idea would be to create dual mappings -- one with RW- protections that can be used as the working memory for the linker, and a second with the requested permissions (which would simply have to be compliant with the system's policies) for use in JIT'd code.

@minipli-oss
Copy link
Author

I'm out of the loop on Linux memory protections. If I understand correctly this works around W^X by copying the content of the memory to the memfd, then mapping the memfd back in at the target address with R-X flags?

That's correct. It tries to be clever about it by only using the memfd if needed, e.g. if there's a change from NX -> X or vice versa. Other protection changes will still be handled by making use of mprotect().

If so that's cool, but seems surprising behavior for sys::Memory.

Depends on what the user expects to happen, I guess. The API is to change the protection bits for a given memory range and using the memfd hack^Wworkaround for systems that need it fulfills that goal.

IMHO, from an API user's perspective, it's an implementation detail how that gets done as long as the method does what the API demands, getting the protection bits modified. But yeah, changing the underlying mapping may be surprising. But, after all, there's no guaranties about how these are allocated in the first place, so there should also be no assumptions on how they're managed through calls to sys::Memory API methods.

What about adding a new W^X compliant JITLinkMemoryManager for this? Or a new Mapper for MapperJITLinkMemoryManager?

That's what I wanted to do initially -- either change how DefaultMMapper behaves or change SectionMemoryManager's constructor to detect if the system needs a W^X-compatible mapper and use that instead of DefaultMMapper. But I figured, changing DefaultMMapper makes no sense as it is only a very slim wrapper around sys::Memory so I could change it directly.

What actually requires implementing the fallback to memfds at this level is, there are quite a few "raw" calls to sys::Memory::protectMappedMemory(), e.g. in orc::LocalTrampolinePool, orc::LocalIndirectStubsInfo, orc:: InProcessMemoryMapper, jitlink::InProcessMemoryManager::IPInFlightAlloc, orc::rt_bootstrap::SimpleExecutorMemoryManager, llvm-rtdyld and the unit tests.

So, unfortunately, implementing a new mapper or memory manager won't help for these. That's why I implemented the handling directly at the lowest level that needs it.

The JIT linker only ever reads / writes memory, so the idea would be to create dual mappings -- one with RW- protections that can be used as the working memory for the linker, and a second with the requested permissions (which would simply have to be compliant with the system's policies) for use in JIT'd code.

I explicitly tried to avoid the dual-mapping approach as it causes way too much trouble for no real gain. Having two mappings for the same data implies having different addresses for the writable and executable parts, which, in turn, makes generating proper references for generated code more complicated than it needs to be (relative addresses used in the writable mapping will refer to a different memory address when executed from the other mapping, likely causing access faults).

Also, there's no need to have these dual mappings as there seems to be a streamlined approach to JIT code gen to write/modify as needed and when done, change protections to R-X. The latter happens even lazy, only on first use.

@lhames
Copy link
Contributor

lhames commented Aug 1, 2024

What actually requires implementing the fallback to memfds at this level is, there are quite a few "raw" calls to sys::Memory::protectMappedMemory(), e.g. in orc::LocalTrampolinePool, orc::LocalIndirectStubsInfo, orc:: InProcessMemoryMapper, jitlink::InProcessMemoryManager::IPInFlightAlloc, orc::rt_bootstrap::SimpleExecutorMemoryManager, llvm-rtdyld and the unit tests.

So, unfortunately, implementing a new mapper or memory manager won't help for these. That's why I implemented the handling directly at the lowest level that needs it.

They'll take a little work to adapt, but I believe we can fix these to use the SimpleSegmentAlloc utility, which uses a JITLinkMemoryManager for the underlying implementation.

The JIT linker only ever reads / writes memory, so the idea would be to create dual mappings -- one with RW- protections that can be used as the working memory for the linker, and a second with the requested permissions (which would simply have to be compliant with the system's policies) for use in JIT'd code.

I explicitly tried to avoid the dual-mapping approach as it causes way too much trouble for no real gain. Having two mappings for the same data implies having different addresses for the writable and executable parts, which, in turn, makes generating proper references for generated code more complicated than it needs to be (relative addresses used in the writable mapping will refer to a different memory address when executed from the other mapping, likely causing access faults).

ORC already handles this: it's required to support out-of-process JITing. Dual-mapping in the same process shouldn't require any additional effort.

@minipli-oss
Copy link
Author

What actually requires implementing the fallback to memfds at this level is, there are quite a few "raw" calls to sys::Memory::protectMappedMemory(), e.g. in orc::LocalTrampolinePool, orc::LocalIndirectStubsInfo, orc:: InProcessMemoryMapper, jitlink::InProcessMemoryManager::IPInFlightAlloc, orc::rt_bootstrap::SimpleExecutorMemoryManager, llvm-rtdyld and the unit tests.
So, unfortunately, implementing a new mapper or memory manager won't help for these. That's why I implemented the handling directly at the lowest level that needs it.

They'll take a little work to adapt, but I believe we can fix these to use the SimpleSegmentAlloc utility, which uses a JITLinkMemoryManager for the underlying implementation.

Hmm, feels a little odd to switch classes like jitlink::InProcessMemoryManager::IPInFlightAlloc over to use yet another JITLinkMemoryManager to implement... a JITLinkMemoryManager.

Going through the code base and getting rid of sys::Memory::protectMappedMemory() calls also feels like an orthogonal issue, as support for a strict W^X policy can very well be implemented at the sys::Memory layer without having to switch all users over to a memory manager based scheme. I'd rather defer that work to a later stage, as it's not strictly required to solve the W^X support issue and can be done one user at a time which should not only ease review but also prevents this PR becoming just one big patch with lots of loosely related changes.

The JIT linker only ever reads / writes memory, so the idea would be to create dual mappings -- one with RW- protections that can be used as the working memory for the linker, and a second with the requested permissions (which would simply have to be compliant with the system's policies) for use in JIT'd code.

I explicitly tried to avoid the dual-mapping approach as it causes way too much trouble for no real gain. Having two mappings for the same data implies having different addresses for the writable and executable parts, which, in turn, makes generating proper references for generated code more complicated than it needs to be (relative addresses used in the writable mapping will refer to a different memory address when executed from the other mapping, likely causing access faults).

ORC already handles this: it's required to support out-of-process JITing. Dual-mapping in the same process shouldn't require any additional effort.

There are several other issues with dual mappings. The probably obvious one is that one has to keep them in sync (what gets written to one mapping needs to be read back when executed from the other). The easiest way to achieve this is to created MAP_SHARED file-based mappings. The file can also be a memfd which simply gets mapped twice. However, using MAP_SHARED has the unfortunate side-effect that these mappings will not only be kept in-sync for the creating process but also for forked child processes which is actually bad as a child can tamper with mappings in the parent (MFD_CLOEXEC for the memfd won't help, as the mapping has a reference to the underlying file that won't be invalidated). One may try to work around that by calling madvise(MADV_DONTFORK) on the mappings, making them disappear in the child. However, that would be bad as well as the code would vanish, possibly breaking existing use-cases.

Another reason to avoid dual-mappings is that with the writable mapping one can always tamper with the code, i.e. inject new code at arbitrary times. And, in fact, there is no need to do that, as current behaviour clearly shows. There's only a need to have a RW-R-X transition -- no need to modify once generated and finalized code later on. That simply vanishes the need to have two mappings as only one is used at a given time, allowing to simply replace the writable mapping with an executable one in place.

Using a dual-mapping approach also requires more syscalls, slowing down JIT code generation further. As the mapping has to be file-based, the size needs to be set in advance (unlike in my current version where it's implicitly updated by simply calling write() on the memfd).

A dual-mapping scheme therefore provides no benefits over the replacement-mapping approach, which would even better mirror the visible semantics of the current mprotect() based implementation -- simply change the protection bits for a given virtual address range.

I therefore would vote against a dual-mapping approach and keep the replacement-mapping scheme, reusing the same VA range.

@lhames
Copy link
Contributor

lhames commented Sep 29, 2024

What actually requires implementing the fallback to memfds at this level is, there are quite a few "raw" calls to sys::Memory::protectMappedMemory(), e.g. in orc::LocalTrampolinePool, orc::LocalIndirectStubsInfo, orc:: InProcessMemoryMapper, jitlink::InProcessMemoryManager::IPInFlightAlloc, orc::rt_bootstrap::SimpleExecutorMemoryManager, llvm-rtdyld and the unit tests.
So, unfortunately, implementing a new mapper or memory manager won't help for these. That's why I implemented the handling directly at the lowest level that needs it.

They'll take a little work to adapt, but I believe we can fix these to use the SimpleSegmentAlloc utility, which uses a JITLinkMemoryManager for the underlying implementation.

Hmm, feels a little odd to switch classes like jitlink::InProcessMemoryManager::IPInFlightAlloc over to use yet another JITLinkMemoryManager to implement... a JITLinkMemoryManager.

I'm suggesting use of a SimpleSegmentAlloc for users who need executable code and are currently using sys::Memory directly.

Support for the memfd approach should be restricted to a JITLinkMemoryManager or Mapper.

Going through the code base and getting rid of sys::Memory::protectMappedMemory() calls also feels like an orthogonal issue, as support for a strict W^X policy can very well be implemented at the sys::Memory layer without having to switch all users over to a memory manager based scheme. I'd rather defer that work to a later stage, as it's not strictly required to solve the W^X support issue and can be done one user at a time which should not only ease review but also prevents this PR becoming just one big patch with lots of loosely related changes.

The JIT linker only ever reads / writes memory, so the idea would be to create dual mappings -- one with RW- protections that can be used as the working memory for the linker, and a second with the requested permissions (which would simply have to be compliant with the system's policies) for use in JIT'd code.

I explicitly tried to avoid the dual-mapping approach as it causes way too much trouble for no real gain. Having two mappings for the same data implies having different addresses for the writable and executable parts, which, in turn, makes generating proper references for generated code more complicated than it needs to be (relative addresses used in the writable mapping will refer to a different memory address when executed from the other mapping, likely causing access faults).

ORC already handles this: it's required to support out-of-process JITing. Dual-mapping in the same process shouldn't require any additional effort.

There are several other issues with dual mappings. The probably obvious one is that one has to keep them in sync (what gets written to one mapping needs to be read back when executed from the other). The easiest way to achieve this is to created MAP_SHARED file-based mappings. The file can also be a memfd which simply gets mapped twice. However, using MAP_SHARED has the unfortunate side-effect that these mappings will not only be kept in-sync for the creating process but also for forked child processes which is actually bad as a child can tamper with mappings in the parent (MFD_CLOEXEC for the memfd won't help, as the mapping has a reference to the underlying file that won't be invalidated).

What do you mean by "tamper with mappings in the parent"? If you just mean the writes to shared memory in the child will be visible in the parent process I would consider that desirable behavior. The existing SharedMemoryMapper already works this way.

One may try to work around that by calling madvise(MADV_DONTFORK) on the mappings, making them disappear in the child. However, that would be bad as well as the code would vanish, possibly breaking existing use-cases.

Another reason to avoid dual-mappings is that with the writable mapping one can always tamper with the code, i.e. inject new code at arbitrary times. And, in fact, there is no need to do that, as current behaviour clearly shows. There's only a need to have a RW-R-X transition -- no need to modify once generated and finalized code later on. That simply vanishes the need to have two mappings as only one is used at a given time, allowing to simply replace the writable mapping with an executable one in place.

Being able to rewrite code opens up the possibility of improving performance. E.g. rewriting jumps to bypass stubs and go directly to function bodies once the function has been complied.

ORC's memory manager interface was designed to be flexible enough to accommodate all of these options. Keeping dual RW- / R-X mappings as an option would be a really good outcome here.

Using a dual-mapping approach also requires more syscalls, slowing down JIT code generation further. As the mapping has to be file-based, the size needs to be set in advance (unlike in my current version where it's implicitly updated by simply calling write() on the memfd).

I'm suggesting that it should be reasonably easy to make the dual-mapping optional, and that some clients will be willing to pay that cost for the combination of security and runtime flexibility that it offers.

A dual-mapping scheme therefore provides no benefits over the replacement-mapping approach, which would even better mirror the visible semantics of the current mprotect() based implementation -- simply change the protection bits for a given virtual address range.

I'd say that the current semantics of Memory aren't especially helpful, given that LLVM's JIT aims to work transparently out-of-process: The working assumption is that the address range being written isn't the same as the one being executed. That's why implementing this as a mapper makes it so much more useful.

I therefore would vote against a dual-mapping approach and keep the replacement-mapping scheme, reusing the same VA range.

It's possible that there's something that I'm missing, but so far I haven't heard a compelling reason not to do this with a JITLinkMemoryManager or Mapper, and there are several benefits to doing it at that level.

@minipli-oss
Copy link
Author

[...]

I'm suggesting use of a SimpleSegmentAlloc for users who need executable code and are currently using sys::Memory directly.

Support for the memfd approach should be restricted to a JITLinkMemoryManager or Mapper.

I'll comment on that separately, need to hack some more code to convince myself either way.

[...]

There are several other issues with dual mappings. The probably obvious one is that one has to keep them in sync (what gets written to one mapping needs to be read back when executed from the other). The easiest way to achieve this is to created MAP_SHARED file-based mappings. The file can also be a memfd which simply gets mapped twice. However, using MAP_SHARED has the unfortunate side-effect that these mappings will not only be kept in-sync for the creating process but also for forked child processes which is actually bad as a child can tamper with mappings in the parent (MFD_CLOEXEC for the memfd won't help, as the mapping has a reference to the underlying file that won't be invalidated).

What do you mean by "tamper with mappings in the parent"? If you just mean the writes to shared memory in the child will be visible in the parent process I would consider that desirable behavior. The existing SharedMemoryMapper already works this way.

No, it'll be a security hole, no less. Take the below example which implements the dual-mapping scheme. It populates the mapping with some shellcode which simply calls a function. The code gets initialized to call the function good(). Afterwards the program forks a child process that drops privileges and modifies its mapping to call bad() instead. But as the mapping is shared with the parent process, it'll call bad() now too.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <err.h>

static pid_t parent_pid;

#define msg(fmt,...) \
    printf("%6s (uid: %d): " fmt, getpid() == parent_pid ? "parent" : "child", \
           geteuid(), ## __VA_ARGS__)

static void good(void) {
    msg("all good...\n");
}

static void bad(void) {
    msg("!!! PWNED !!!\n");
}

static void call_code(const void *code) {
    msg("calling func@%p...\n", code);
    ((void (*)(void))code)();
}

static int pwn(void *code, long *fptr) {
    call_code(code);

    if (mprotect(code, 0x1000, PROT_READ | PROT_WRITE))
        err(1, "mprotect(rw-)");

    msg("patching code at %p...\n", fptr);
    *fptr = (long)bad;

    if (mprotect(code, 0x1000, PROT_READ | PROT_EXEC))
        err(1, "mprotect(r-x)");

    call_code(code);

    return 0;
}

int main(int argc, char *argv[]) {
    static char shellcode[] = {
#if defined(__x86_64__)
        // movabs $0, %rax
        0x48, 0xb8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        // jmp *%rax
        0xff, 0xe0
#elif defined(__i386__)
        // mov $0, %eax
        0xb8, 0x00, 0x00, 0x00, 0x00,
        // jmp *%eax
        0xff, 0xe0
#else
#error Unsupported architecture!
#endif
    };
    static const int sc_len = sizeof(shellcode);
    unsigned int fptr_offset = sizeof(long) == 8 ? 2 : 1;
    long *fptr = (long *)(shellcode + fptr_offset);
    enum { DUAL_MAP, DIRECT } mode = DIRECT;
    void *addr;
    int flags;
    int fd;

    // make shellcode[] call good()
    *fptr = (long)good;

    parent_pid = getpid();
    if (argc > 1 && strcmp(argv[1], "-d") == 0)
        mode = DUAL_MAP;

    fd = memfd_create("code", MFD_CLOEXEC);
    if (fd < 0)
        err(1, "memfd_create()");

    if (mode == DUAL_MAP) {
        // need to size file in advance
        if (ftruncate(fd, sizeof(shellcode)) < 0)
            err(1, "ftruncate()");

        flags = MAP_SHARED;
        addr = mmap(NULL, sc_len, PROT_READ | PROT_WRITE, flags, fd, 0);
        if (addr == MAP_FAILED)
            err(1, "mmap(rw-)");
        msg("created writable mapping at %p\n", addr);

        memcpy(addr, shellcode, sc_len);
    } else {
        flags = MAP_PRIVATE | MAP_POPULATE;

        if (write(fd, shellcode, sc_len) != sc_len)
            err(1, "write(shellcode)");
    }

    addr = mmap(NULL, sc_len, PROT_READ | PROT_EXEC, flags, fd, 0);
    if (addr == MAP_FAILED)
        err(1, "mmap(r-x)");
    msg("created %s executable mapping at %p\n",
        (flags & MAP_SHARED) ? "shared": "private", addr);

    call_code(addr);

    switch (fork()) {
    case -1: err(1, "fork()");
    case  0: setuid(getuid()); exit(pwn(addr, addr + fptr_offset));
    default: wait(NULL);
    }

    call_code(addr);

    return 0;
}

Compiling the code and making sure it has the SUID-bit set:

minipli@x1:~/tmp$ gcc -O2 -o shared_exec shared_exec.c
minipli@x1:~/tmp$ sudo chown 0:0 shared_exec
minipli@x1:~/tmp$ sudo chmod u+s shared_exec
minipli@x1:~/tmp$ ls -l shared_exec
-rwsr-xr-x 1 root root 16848 Okt 11 12:00 shared_exec

Calling the program with -d makes it use the dual-mapping scheme which requires using shared mappings to keep them in-sync. However, it also has the unfortunate side effect of sharing these among child processes, so we get the following behavior:

minipli@x1:~/tmp$ ./shared_exec -d
parent (uid: 0): created writable mapping at 0x7c0a706f8000
parent (uid: 0): created shared executable mapping at 0x7c0a706f7000
parent (uid: 0): calling func@0x7c0a706f7000...
parent (uid: 0): all good...
 child (uid: 1000): calling func@0x7c0a706f7000...
 child (uid: 1000): all good...
 child (uid: 1000): patching code at 0x7c0a706f7002...
 child (uid: 1000): calling func@0x7c0a706f7000...
 child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func@0x7c0a706f7000...
parent (uid: 0): !!! PWNED !!!

The parent process initially calls good() through the code mapping. The forked child process modifies the code to call bad() instead, which can be seen by the last message printed by the (unprivileged) child process. However, as it is a shared mapping, the parent process will now call bad() too. A clear breach of privilege boundaries.

Running the program without extra arguments makes it use private mappings instead -- like I'm proposing:

minipli@x1:~/tmp$ ./shared_exec 
parent (uid: 0): created private executable mapping at 0x76bbc26ad000
parent (uid: 0): calling func@0x76bbc26ad000...
parent (uid: 0): all good...
 child (uid: 1000): calling func@0x76bbc26ad000...
 child (uid: 1000): all good...
 child (uid: 1000): patching code at 0x76bbc26ad002...
 child (uid: 1000): calling func@0x76bbc26ad000...
 child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func@0x76bbc26ad000...
parent (uid: 0): all good...

Even though the code mapping was modified in the child process to call bad(), the parent's mapping was left as-is and is still calling good(). And this should very much be the by-default behavior to avoid the otherwise security hole potential.

[...] There's only a need to have a RW-R-X transition -- no need to modify once generated and finalized code later on. That simply vanishes the need to have two mappings as only one is used at a given time, allowing to simply replace the writable mapping with an executable one in place.

Being able to rewrite code opens up the possibility of improving performance. E.g. rewriting jumps to bypass stubs and go directly to function bodies once the function has been complied.

As long as all of these rewrites happen before the code gets used, i.e. before finalize(), that should be fine. The lazy evaluation of symbols should help in this regard, as an executable mapping, strictly speaking, is only needed on first use.
This also mirrors what I've observed during my tests -- executable mappings only created on the first use of a JIT'ed function.

ORC's memory manager interface was designed to be flexible enough to accommodate all of these options. Keeping dual RW- / R-X mappings as an option would be a really good outcome here.

No, not for the security hazard it creates. I can already hear all the CVEs rattling that this will cause. We can't do that -- not without users of that API explicitly opting-in for such a behavior.

Using a dual-mapping approach also requires more syscalls, slowing down JIT code generation further. As the mapping has to be file-based, the size needs to be set in advance (unlike in my current version where it's implicitly updated by simply calling write() on the memfd).

I'm suggesting that it should be reasonably easy to make the dual-mapping optional, and that some clients will be willing to pay that cost for the combination of security and runtime flexibility that it offers.

Given the security implications, I'm not sure we should provide that option at all -- given, that the status quo doesn't have it either.

A dual-mapping scheme therefore provides no benefits over the replacement-mapping approach, which would even better mirror the visible semantics of the current mprotect() based implementation -- simply change the protection bits for a given virtual address range.

I'd say that the current semantics of Memory aren't especially helpful, given that LLVM's JIT aims to work transparently out-of-process: The working assumption is that the address range being written isn't the same as the one being executed. That's why implementing this as a mapper makes it so much more useful.

I therefore would vote against a dual-mapping approach and keep the replacement-mapping scheme, reusing the same VA range.

It's possible that there's something that I'm missing, but so far I haven't heard a compelling reason not to do this with a JITLinkMemoryManager or Mapper, and there are several benefits to doing it at that level.

I need to refresh my memory here. I've tried to go that route before but reverted back to the much simpler solution which will cover all use cases. There were some road blocks but I can't remember them right now. I'll reevaluate and comment back on that part once I've all the details back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants