-
Notifications
You must be signed in to change notification settings - Fork 11.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[llvm][Support][Memory] Add memfd based fallback for strict W^X Linux systems #98538
base: main
Are you sure you want to change the base?
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write If you have received no comments on your PR for a week, you can request a review If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-llvm-support Author: None (minipli-oss) ChangesPaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent a transition from a once writable memory mapping to an executable one. This obviously breaks JIT code generation. As both of these are Linux specific and the Linux kernel gained memfd support almost 10 years ago in v3.17, making use of it to implement a fallback seems like a viable option to get JIT code generation fixed for such systems. Implement a detour through a memfd for systems that are detected to deny the W<->X transition of memory mappings. For PaX this can be easily achieved by evaluating the per-process PaX flags, if access to /proc/self/status is available. For others, and as a fallback for PaX, a runtime test is done once. This enables such systems to make use of JIT code generation without completely abandoning their W^X policy. Regarding the implementation, the runtime test for detecting PaX's MPROTECT shamelessly shows I'm a C developer at heart. This can -- and should! -- be made more C++ish by making use of a Also please excuse my clunky An example of what this change achieves, is to make use of LLVMpipe on such system out of the box by implementing a fallback to what normally would be denied by policy. This, in turn, will give users of such systems a working graphical desktop if there's no native video driver available, e.g. what is commonly the case in (wrongly configured) VMs. PS: First time contributor, suggestions for improvement heavily welcomed! Full diff: https://github.com/llvm/llvm-project/pull/98538.diff 3 Files Affected:
diff --git a/llvm/lib/Support/Unix/Memory.inc b/llvm/lib/Support/Unix/Memory.inc
index bac208a7d543c..f894e24ea2919 100644
--- a/llvm/lib/Support/Unix/Memory.inc
+++ b/llvm/lib/Support/Unix/Memory.inc
@@ -18,6 +18,10 @@
#include "llvm/Support/Process.h"
#include "llvm/Support/Valgrind.h"
+#ifdef __linux__
+#include "MemoryLinux.h"
+#endif
+
#ifdef HAVE_SYS_MMAN_H
#include <sys/mman.h>
#endif
@@ -177,6 +181,55 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M,
alignAddr((const uint8_t *)M.Address + M.AllocatedSize, PageSize);
bool InvalidateCache = (Flags & MF_EXEC);
+ bool SkipMprotect = false;
+
+#if defined(__linux__)
+ // Check for cases where the EXEC protection flag changes and a possible
+ // strict W^X policy cannot be bypassed via mprotect() alone, e.g. under
+ // PaX's MPROTECT or SELinux's deny_execmem.
+ //
+ // To support such systems, we need to create a fresh mapping with the
+ // target protection flags.
+ if ((M.Flags ^ Flags) & MF_EXEC && execProtChangeNeedsNewMapping()) {
+ class FDWrapper {
+ public:
+ FDWrapper(int fd) : fd(fd) {}
+ ~FDWrapper() { ::close(fd); }
+ operator int() const { return fd; }
+ private:
+ int fd;
+ } fd(memfd_create("llvm", MFD_CLOEXEC));
+
+ if (fd < 0)
+ return errnoAsErrorCode();
+
+ const char *data = reinterpret_cast<char *>(Start);
+ uintptr_t len = End - Start;
+ uintptr_t left = len;
+
+ while (left) {
+ ssize_t cnt = ::write(fd, data, left);
+ if (cnt < 0) {
+ if (errno == EINTR)
+ continue;
+
+ return errnoAsErrorCode();
+ }
+ left -= cnt;
+ data += cnt;
+ }
+
+ void *addr = ::mmap(reinterpret_cast<void *>(Start), len, Protect,
+ MAP_PRIVATE | MAP_FIXED, fd, 0);
+ if (addr == MAP_FAILED)
+ return errnoAsErrorCode();
+
+ // We created a new mapping with the final protection bits, therefore
+ // don't need to call mprotect() with the very same flags again -- unless
+ // we have to toggle PROT_READ for ARM.
+ SkipMprotect = true;
+ }
+#endif
#if defined(__arm__) || defined(__aarch64__)
// Certain ARM implementations treat icache clear instruction as a memory
@@ -190,13 +243,16 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M,
Memory::InvalidateInstructionCache(M.Address, M.AllocatedSize);
InvalidateCache = false;
+ SkipMprotect = false;
}
#endif
- int Result = ::mprotect((void *)Start, End - Start, Protect);
+ if (!SkipMprotect) {
+ int Result = ::mprotect((void *)Start, End - Start, Protect);
- if (Result != 0)
- return errnoAsErrorCode();
+ if (Result != 0)
+ return errnoAsErrorCode();
+ }
if (InvalidateCache)
Memory::InvalidateInstructionCache(M.Address, M.AllocatedSize);
diff --git a/llvm/lib/Support/Unix/MemoryLinux.h b/llvm/lib/Support/Unix/MemoryLinux.h
new file mode 100644
index 0000000000000..e1a642f121394
--- /dev/null
+++ b/llvm/lib/Support/Unix/MemoryLinux.h
@@ -0,0 +1,105 @@
+//===- Unix/MemoryLinux.h - Linux specific Helper Fuctions ------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines Linux specific helper functions for memory management.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_SUPPORT_UNIX_MEMORYLINUX_H
+#define LLVM_LIB_SUPPORT_UNIX_MEMORYLINUX_H
+
+#ifndef __linux__
+#error Linux only support header!
+#endif
+
+#include "llvm/Support/Process.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/mman.h>
+#include <sys/syscall.h>
+
+#ifndef MFD_CLOEXEC
+#define MFD_CLOEXEC 1U
+#endif
+
+namespace llvm {
+namespace sys {
+namespace {
+
+static inline bool isPermissionError(int err) {
+ // PaX uses EPERM, SELinux uses EACCES
+ return err == EPERM || err == EACCES;
+}
+
+// FIXME: Make this either more lol-level C'ish or C++'ish
+static inline bool execProtChangeNeedsNewMapping() {
+ static int status = -1;
+
+ if (status != -1)
+ return status;
+
+ // Try to get the status from /proc/self/status, looking for PaX flags.
+ FILE *f = fopen("/proc/self/status", "re");
+ if (f) {
+ char *buf = NULL;
+ size_t len;
+
+ while (getline(&buf, &len, f) != -1) {
+ if (strncmp(buf, "PaX:", 4))
+ continue;
+
+ // Look for 'm', indicating PaX MPROTECT is disabled.
+ status = !strchr(buf + 4, 'm');
+ break;
+ }
+
+ fclose(f);
+ free(buf);
+
+ if (status != -1)
+ return status;
+ }
+
+ // Create a temporary writable mapping and try to make it excecutable. If
+ // this fails, test 'errno' to ensure it failed because we were not allowed
+ // to create such a mapping and not because of some transient error.
+ size_t size = Process::getPageSizeEstimate();
+ void *addr = ::mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (addr == MAP_FAILED) {
+ // Must be low on memory or have too many mappings alrady, not much we can
+ // do here.
+ status = 0;
+ } else {
+ if (::mprotect(addr, size, PROT_READ | PROT_EXEC) < 0)
+ status = isPermissionError(errno);
+ else
+ status = 0;
+ ::munmap(addr, size);
+ }
+
+ return status;
+}
+
+static inline int memfd_create(const char *name, int flags) {
+#ifdef SYS_memfd_create
+ return syscall(SYS_memfd_create, name, flags);
+#else
+ return -1;
+#endif
+}
+
+} // anonymous namespace
+} // namespace sys
+} // namespace llvm
+
+#endif
diff --git a/llvm/unittests/Support/MemoryTest.cpp b/llvm/unittests/Support/MemoryTest.cpp
index 9daa6d0ff9e4d..dbee1b62b25d1 100644
--- a/llvm/unittests/Support/MemoryTest.cpp
+++ b/llvm/unittests/Support/MemoryTest.cpp
@@ -11,6 +11,10 @@
#include "gtest/gtest.h"
#include <cstdlib>
+#if defined(__linux__)
+#include "../lib/Support/Unix/MemoryLinux.h"
+#endif
+
#if defined(__NetBSD__)
// clang-format off
#include <sys/param.h>
@@ -40,6 +44,8 @@ bool IsMPROTECT() {
err(EXIT_FAILURE, "sysctl");
return !!(paxflags & CTL_PROC_PAXFLAGS_MPROTECT);
+#elif defined(__linux__)
+ return execProtChangeNeedsNewMapping();
#elif (defined(__APPLE__) && defined(__aarch64__)) || defined(__OpenBSD__)
return true;
#else
|
Should have mentioned, this fixes #7073. |
Ping, maybe @chandlerc, @andykaylor, @hubert-reinterpretcast can take a look? |
@tstellar, can you please take a look? This change makes LLVM's JIT code generation SELinux deny_execmem compatible, so might be valuable for RedHat based systems? |
2985ac0
to
bea701f
Compare
PaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent a transition from a once writable memory mapping to an executable one. This obviously breaks JIT code generation. As both of these are Linux specific and the Linux kernel gained memfd support almost 10 years ago in v3.17, making use of it to implement a fallback seems like a viable option to get JIT code generation fixed for such systems. Implement a detour through a memfd for systems that are detected to deny the W<->X transition of memory mappings. For PaX this can be easily achieved by evaluating the per-process PaX flags, if access to /proc/self/status is available. For others, and as a fallback for PaX, a runtime test is done once. This enables such systems to make use of JIT code generation without completely abandoning their W^X policy. Signed-off-by: Mathias Krause <[email protected]>
bea701f
to
3a22789
Compare
- set errno if SYS_memfd_create is unknown so error handling actually works in this case - try making use of MFD_EXEC as demanded by recent kernels
I did a review and fixes myself and added them to the branch as follol-up commits, to ease review. My plan is to squash these into the first commit for a merge, though. The changes fix error handling for old systems as well as supporting recent kernels better. I also added an explicit test for the W->X memory transition which is now supported through this change. |
✅ With the latest revision this PR passed the C/C++ code formatter. |
Ahh, that's the key! I was wondering myself why |
llvm only allows the "squash and merge" option, so yes, you're doing the right thing :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not my area of expertise but here are some general suggestions.
@@ -177,6 +181,78 @@ std::error_code Memory::protectMappedMemory(const MemoryBlock &M, | |||
alignAddr((const uint8_t *)M.Address + M.AllocatedSize, PageSize); | |||
|
|||
bool InvalidateCache = (Flags & MF_EXEC); | |||
bool SkipMprotect = false; | |||
|
|||
#if defined(__linux__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this all be sunk into a helper function?
#if defined(__linux__)
whatever = do_a_bunch_of_stuff();
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's much less code in the current version and moving it to a helper function would require passing quite some arguments. So, maybe, it's okay now?
I could move the memfd part to a helper class to squeeze out a few more lines, but that feels like overengineering just for the cause.
Remove the "private" header file and make execProtectionChangeNeedsNewMapping() a protected static member function of llvm::sys::Memory, allowing it to be accessed by the unit tests as well.
For very simple fd streams we don't need to know the current position nor the file's type. Support skipping these syscalls. Signed-off-by: Mathias Krause <[email protected]>
… it supports a lightweight mode Signed-off-by: Mathias Krause <[email protected]>
Quite some changes, but most important ones are:
Commit fc3e577 should maybe be its own PR and not squashed into this one? |
llvm/lib/Support/Unix/Memory.inc
Outdated
status = 0; | ||
} else { | ||
if (::mprotect(addr, size, PROT_READ | PROT_EXEC) < 0) | ||
status = isPermissionError(errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will trigger an AVC denial, which might kill the process depending on host configuration, but at the very least will cause log spam and confusion (does this executable require WX because of the jit, or because it was compromised?).
Ideally you would check for the selinux execmem
permission beforehand, here's an example:
#include <functional>
#include <iostream>
#include <selinux/selinux.h>
namespace {
int noop_log(int, const char *, ...) { return 0; };
bool selinux_can_execmem() {
if (is_selinux_enabled() != 1) {
return true;
}
// this is required as libselinux installs printf() as a default log callback, or because consumers may
// install their own
const auto previous_callback = selinux_get_callback(SELINUX_CB_LOG);
selinux_set_callback(SELINUX_CB_LOG, {.func_log = &noop_log});
struct ScopeGuard {
std::function<void()> func;
~ScopeGuard() { func(); }
};
const ScopeGuard scope_guard{
[previous_callback]() { selinux_set_callback(SELINUX_CB_LOG, previous_callback); }};
char *user_context_raw{};
if (getcon_raw(&user_context_raw) < 0) {
std::cerr << "error in getting my selinux context\n";
return true;
}
if (selinux_check_access(user_context_raw, user_context_raw, "process", "execmem", nullptr) == 0) {
return false;
}
return true;
}
} // namespace
int main() {
if (selinux_can_execmem()) {
std::cout << "I am allowed to execmem\n";
} else {
std::cout << "I am not allowed to execmem\n";
}
}
Note that this will require linking with -lselinux
. There is no freestanding way to check for selinux perms afaik.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I wasn't aware that there's a mode where an "execmem" denial would kill the involved process. During my tests on Fedora I only observed failing syscalls returning -EACCESS
(this very mprotect()
call).
While I'd like to have a better detection of SELinux's "deny_execmem", there are basically two reasons against it:
- If this sequence of syscalls (
addr = mmap(RW); mprotect(addr, RX)
) ends up killing the process instead of only making the second syscall fail, it probably shouldn't attempt to make use of a JIT anyways -- with a host configuration that strict. - Linking against
libselinux
by default just for this test is a deal breaker as we (a) shouldn't require all users oflibLLVM
to link againstlibselinux
(well, technically we could but) and (b) overriding library internal state -- even if only briefly -- will mess with the application also trying to make use oflibselinux
and that'll lead to unintentional misbehavior for applications linking againstlibLLVM
. A library shouldn't do that.
We can, however, add a mode (build time option) where we do the specific checks for distributions that do care and want to have such a narrow check. It should, however, try to make use of libselinux
without changing any of its internal state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmph, I didn't think of the multithreaded issue of "one thread is the consumer setting a selinux callback, the other is the JIT doing it's thing". I'm not sure how to best solve this then, as I'd like to avoid having llvm print selinux log messages.
I guess at worst we could simply not check for execmem
specifically and instead just bail out when we find selinux enabled? I'd argue that any sane selinux instance disabled execmem
anyways :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmph, I didn't think of the multithreaded issue of "one thread is the consumer setting a selinux callback, the other is the JIT doing it's thing". I'm not sure how to best solve this then, as I'd like to avoid having llvm print selinux log messages.
Yeah, me too. Having a library printing stuff at "arbitrary" times isn't nice ;)
I guess at worst we could simply not check for
execmem
specifically and instead just bail out when we find selinux enabled? I'd argue that any sane selinux instance disabledexecmem
anyways :P
From my experience, quite the opposite. You won't find a current desktop system that has SELinux's "deny_execmem" enabled because there's just so many JITs out there all trying to either directly mmap(RWX)
or do the transition RW-
-> R-X
.
So, no, just because SELinux is enabled, doesn't mean its policy will deny generating code at runtime. That's why the test in Memory::execProtectionChangeNeedsNewMapping()
has a generic fallback if a specific test (for PaX only, so far) fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I (and I bet everyone else who works on selinux policies) would still much prefer being overly cautious rather than causing an AVC denial to detect execmem
support.
It means we'd have to add a suppression for everything that uses the llvm jit, which means losing any information / audit logging on when a program unintentionally tries to use execmem
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I just want to stress, the status quo would be to generate the very same AVC denial and having the JIT code generation fail with EACCESS
or even killing the process, as you've hinted before.
My changes do not make it worse, in fact, they make it better for configurations that do not kill a process on an AVC denial. But yeah, I see that preventing the AVC denial is a sensible goal as well.
I'll try to see what I can do about it without making libLLVM mess with libselinux's global state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing a freestanding check was actually relatively easy. ...well, kind of. I was able to implement a C version in basically no time but I struggled to make that comply with LLVM's set of C++ helpers around file i/o. In fact, I failed to find something that would suit my needs for opening a file as a stream for reading and writing. That's why I went for std::fstream
instead. If that's a no-go, please, PLEASE, point me to some abstraction that's better suited. (Honestly, I even find the hops one has to go through just to make use of MemoryBuffer::getFileAsStream()
ugly but beauty lies in the eye of the beholder, they say....). Just for the record, that would be the C pendant:
#include <sys/stat.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
int is_selinux_enabled(void) __attribute__((weak));
#ifndef SEL_INO_MASK
#define SEL_INO_MASK 0x00ffffff
#endif
#ifndef SEL_CLASS_INO_OFFSET
#define SEL_CLASS_INO_OFFSET 0x04000000
#endif
#ifndef SEL_VEC_MAX
#define SEL_VEC_MAX 32
#endif
#ifndef SELINUX_MAGIC
#define SELINUX_MAGIC 0xf97cff8c
#endif
#define SEL_MNT "/sys/fs/selinux"
int selinux_can_execmem(void) {
if (is_selinux_enabled && !is_selinux_enabled())
return 1;
// get current context
char ctx[1024] = {};
FILE *f = fopen("/proc/self/attr/current", "r");
if (!f)
return -1;
if (!fgets(ctx, sizeof(ctx), f)) {
fclose(f);
return -1;
}
fclose(f);
// The inode encodes class and permission bits
struct stat sbuf;
if (stat(SEL_MNT "/class/process/perms/execmem", &sbuf))
return -1;
if (!(sbuf.st_ino & SEL_CLASS_INO_OFFSET))
return -1;
int clazz = (sbuf.st_ino & SEL_INO_MASK) / (SEL_VEC_MAX + 1);
int xperm = (sbuf.st_ino & SEL_INO_MASK) % (SEL_VEC_MAX + 1) - 1;
if (xperm < 0 || xperm > 31)
return -1;
unsigned int execmem = 1U << xperm;
f = fopen(SEL_MNT "/access", "r+");
if (!f)
return -1;
if (fprintf(f, "%s %s %d %x", ctx, ctx, clazz, execmem) < 0) {
fclose(f);
return -1;
}
unsigned int lo, hi;
if (fscanf(f, "%x %x", &lo, &hi) != 2) {
fclose(f);
return -1;
}
fclose(f);
uint64_t allowed = (uint64_t)hi << 32 | lo;
return !!(allowed & execmem);
}
int main(void) {
int status = selinux_can_execmem();
printf("I am %sallowed to execmem\n",
status == 0 ? "not " :
status == 1 ? "" :
"maybe ");
return 0;
}
Either way, the current version avoids the AVC denial by implementing a specific test which should address your concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to go this route too, but gave up trying to decipher the class encoding - reading the kernel source never occured to me :P
LGTM, no clue whether llvm has an alternative to std::fstream
so I'll leave this open.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
If MFD_CLOEXEC is defined, assume the libc function is so as well. Signed-off-by: Mathias Krause <[email protected]>
I'm out of the loop on Linux memory protections. If I understand correctly this works around W^X by copying the content of the memory to the memfd, then mapping the memfd back in at the target address with R-X flags? If so that's cool, but seems surprising behavior for What about adding a new W^X compliant |
That's correct. It tries to be clever about it by only using the memfd if needed, e.g. if there's a change from NX -> X or vice versa. Other protection changes will still be handled by making use of
Depends on what the user expects to happen, I guess. The API is to change the protection bits for a given memory range and using the memfd hack^Wworkaround for systems that need it fulfills that goal. IMHO, from an API user's perspective, it's an implementation detail how that gets done as long as the method does what the API demands, getting the protection bits modified. But yeah, changing the underlying mapping may be surprising. But, after all, there's no guaranties about how these are allocated in the first place, so there should also be no assumptions on how they're managed through calls to
That's what I wanted to do initially -- either change how What actually requires implementing the fallback to memfds at this level is, there are quite a few "raw" calls to So, unfortunately, implementing a new mapper or memory manager won't help for these. That's why I implemented the handling directly at the lowest level that needs it.
I explicitly tried to avoid the dual-mapping approach as it causes way too much trouble for no real gain. Having two mappings for the same data implies having different addresses for the writable and executable parts, which, in turn, makes generating proper references for generated code more complicated than it needs to be (relative addresses used in the writable mapping will refer to a different memory address when executed from the other mapping, likely causing access faults). Also, there's no need to have these dual mappings as there seems to be a streamlined approach to JIT code gen to write/modify as needed and when done, change protections to R-X. The latter happens even lazy, only on first use. |
They'll take a little work to adapt, but I believe we can fix these to use the
ORC already handles this: it's required to support out-of-process JITing. Dual-mapping in the same process shouldn't require any additional effort. |
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
Hmm, feels a little odd to switch classes like Going through the code base and getting rid of
There are several other issues with dual mappings. The probably obvious one is that one has to keep them in sync (what gets written to one mapping needs to be read back when executed from the other). The easiest way to achieve this is to created Another reason to avoid dual-mappings is that with the writable mapping one can always tamper with the code, i.e. inject new code at arbitrary times. And, in fact, there is no need to do that, as current behaviour clearly shows. There's only a need to have a Using a dual-mapping approach also requires more syscalls, slowing down JIT code generation further. As the mapping has to be file-based, the size needs to be set in advance (unlike in my current version where it's implicitly updated by simply calling A dual-mapping scheme therefore provides no benefits over the replacement-mapping approach, which would even better mirror the visible semantics of the current I therefore would vote against a dual-mapping approach and keep the replacement-mapping scheme, reusing the same VA range. |
I'm suggesting use of a Support for the memfd approach should be restricted to a
What do you mean by "tamper with mappings in the parent"? If you just mean the writes to shared memory in the child will be visible in the parent process I would consider that desirable behavior. The existing SharedMemoryMapper already works this way.
Being able to rewrite code opens up the possibility of improving performance. E.g. rewriting jumps to bypass stubs and go directly to function bodies once the function has been complied. ORC's memory manager interface was designed to be flexible enough to accommodate all of these options. Keeping dual RW- / R-X mappings as an option would be a really good outcome here.
I'm suggesting that it should be reasonably easy to make the dual-mapping optional, and that some clients will be willing to pay that cost for the combination of security and runtime flexibility that it offers.
I'd say that the current semantics of Memory aren't especially helpful, given that LLVM's JIT aims to work transparently out-of-process: The working assumption is that the address range being written isn't the same as the one being executed. That's why implementing this as a mapper makes it so much more useful.
It's possible that there's something that I'm missing, but so far I haven't heard a compelling reason not to do this with a |
I'll comment on that separately, need to hack some more code to convince myself either way.
No, it'll be a security hole, no less. Take the below example which implements the dual-mapping scheme. It populates the mapping with some shellcode which simply calls a function. The code gets initialized to call the function #define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <err.h>
static pid_t parent_pid;
#define msg(fmt,...) \
printf("%6s (uid: %d): " fmt, getpid() == parent_pid ? "parent" : "child", \
geteuid(), ## __VA_ARGS__)
static void good(void) {
msg("all good...\n");
}
static void bad(void) {
msg("!!! PWNED !!!\n");
}
static void call_code(const void *code) {
msg("calling func@%p...\n", code);
((void (*)(void))code)();
}
static int pwn(void *code, long *fptr) {
call_code(code);
if (mprotect(code, 0x1000, PROT_READ | PROT_WRITE))
err(1, "mprotect(rw-)");
msg("patching code at %p...\n", fptr);
*fptr = (long)bad;
if (mprotect(code, 0x1000, PROT_READ | PROT_EXEC))
err(1, "mprotect(r-x)");
call_code(code);
return 0;
}
int main(int argc, char *argv[]) {
static char shellcode[] = {
#if defined(__x86_64__)
// movabs $0, %rax
0x48, 0xb8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
// jmp *%rax
0xff, 0xe0
#elif defined(__i386__)
// mov $0, %eax
0xb8, 0x00, 0x00, 0x00, 0x00,
// jmp *%eax
0xff, 0xe0
#else
#error Unsupported architecture!
#endif
};
static const int sc_len = sizeof(shellcode);
unsigned int fptr_offset = sizeof(long) == 8 ? 2 : 1;
long *fptr = (long *)(shellcode + fptr_offset);
enum { DUAL_MAP, DIRECT } mode = DIRECT;
void *addr;
int flags;
int fd;
// make shellcode[] call good()
*fptr = (long)good;
parent_pid = getpid();
if (argc > 1 && strcmp(argv[1], "-d") == 0)
mode = DUAL_MAP;
fd = memfd_create("code", MFD_CLOEXEC);
if (fd < 0)
err(1, "memfd_create()");
if (mode == DUAL_MAP) {
// need to size file in advance
if (ftruncate(fd, sizeof(shellcode)) < 0)
err(1, "ftruncate()");
flags = MAP_SHARED;
addr = mmap(NULL, sc_len, PROT_READ | PROT_WRITE, flags, fd, 0);
if (addr == MAP_FAILED)
err(1, "mmap(rw-)");
msg("created writable mapping at %p\n", addr);
memcpy(addr, shellcode, sc_len);
} else {
flags = MAP_PRIVATE | MAP_POPULATE;
if (write(fd, shellcode, sc_len) != sc_len)
err(1, "write(shellcode)");
}
addr = mmap(NULL, sc_len, PROT_READ | PROT_EXEC, flags, fd, 0);
if (addr == MAP_FAILED)
err(1, "mmap(r-x)");
msg("created %s executable mapping at %p\n",
(flags & MAP_SHARED) ? "shared": "private", addr);
call_code(addr);
switch (fork()) {
case -1: err(1, "fork()");
case 0: setuid(getuid()); exit(pwn(addr, addr + fptr_offset));
default: wait(NULL);
}
call_code(addr);
return 0;
} Compiling the code and making sure it has the SUID-bit set: minipli@x1:~/tmp$ gcc -O2 -o shared_exec shared_exec.c
minipli@x1:~/tmp$ sudo chown 0:0 shared_exec
minipli@x1:~/tmp$ sudo chmod u+s shared_exec
minipli@x1:~/tmp$ ls -l shared_exec
-rwsr-xr-x 1 root root 16848 Okt 11 12:00 shared_exec Calling the program with minipli@x1:~/tmp$ ./shared_exec -d
parent (uid: 0): created writable mapping at 0x7c0a706f8000
parent (uid: 0): created shared executable mapping at 0x7c0a706f7000
parent (uid: 0): calling func@0x7c0a706f7000...
parent (uid: 0): all good...
child (uid: 1000): calling func@0x7c0a706f7000...
child (uid: 1000): all good...
child (uid: 1000): patching code at 0x7c0a706f7002...
child (uid: 1000): calling func@0x7c0a706f7000...
child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func@0x7c0a706f7000...
parent (uid: 0): !!! PWNED !!! The parent process initially calls Running the program without extra arguments makes it use private mappings instead -- like I'm proposing: minipli@x1:~/tmp$ ./shared_exec
parent (uid: 0): created private executable mapping at 0x76bbc26ad000
parent (uid: 0): calling func@0x76bbc26ad000...
parent (uid: 0): all good...
child (uid: 1000): calling func@0x76bbc26ad000...
child (uid: 1000): all good...
child (uid: 1000): patching code at 0x76bbc26ad002...
child (uid: 1000): calling func@0x76bbc26ad000...
child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func@0x76bbc26ad000...
parent (uid: 0): all good... Even though the code mapping was modified in the child process to call
As long as all of these rewrites happen before the code gets used, i.e. before
No, not for the security hazard it creates. I can already hear all the CVEs rattling that this will cause. We can't do that -- not without users of that API explicitly opting-in for such a behavior.
Given the security implications, I'm not sure we should provide that option at all -- given, that the status quo doesn't have it either.
I need to refresh my memory here. I've tried to go that route before but reverted back to the much simpler solution which will cover all use cases. There were some road blocks but I can't remember them right now. I'll reevaluate and comment back on that part once I've all the details back. |
PaX's MPROTECT feature as well as SELinux's deny_execmem policy prevent a transition from a once writable memory mapping to an executable one. This obviously breaks JIT code generation.
As both of these are Linux specific and the Linux kernel gained memfd support almost 10 years ago in v3.17, making use of it to implement a fallback seems like a viable option to get JIT code generation fixed for such systems.
Implement a detour through a memfd for systems that are detected to deny the W<->X transition of memory mappings. For PaX this can be easily achieved by evaluating the per-process PaX flags, if access to /proc/self/status is available. For others, and as a fallback for PaX, a runtime test is done once.
This enables such systems to make use of JIT code generation without completely abandoning their W^X policy.
Regarding the implementation, the runtime test for detecting PaX's MPROTECT shamelessly shows I'm a C developer at heart. This can -- and should! -- be made more C++ish by making use of a
MemoryBuffer
orStringRef
maybe coupled with something fancy fromlvm::sys::fs
. However, I failed to find something as simple as only opening a file withO_CLOEXEC
, reading it line by line, looking for the one I'm interested in. So there it is in plain C. 🫤Also please excuse my clunky
FDWrapper
class. Again, I failed to find a simple thing that does RAII for fds, so implemented my own to ease erroring out of the code.An example of what this change achieves, is to make use of LLVMpipe on such system out of the box by implementing a fallback to what normally would be denied by policy. This, in turn, will give users of such systems a working graphical desktop if there's no native video driver available, e.g. what is commonly the case in (wrongly configured) VMs.
PS: First time contributor, suggestions for improvement heavily welcomed!