Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify interoperability with fork and subprocess creation #474

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nspark
Copy link
Contributor

@nspark nspark commented Jul 6, 2021

This PR specifies interoperability with fork and subprocess creation via posix_spawn.

Feedback is very much requested.

Closes #223

@nspark
Copy link
Contributor Author

nspark commented Jul 12, 2021

In my testing, an application that holds to the following clause works as expected for both Cray SHMEM on Aries and OSHMEM/UCX on InfiniBand:

When fork is invoked before the OpenSHMEM library is initialized, only one of either the parent or child processes may initialize the OpenSHMEM library.

@wrrobin
Copy link
Collaborator

wrrobin commented Jul 14, 2021

In my testing, an application that holds to the following clause works as expected for both Cray SHMEM on Aries and OSHMEM/UCX on InfiniBand:

When fork is invoked before the OpenSHMEM library is initialized, only one of either the parent or child processes may initialize the OpenSHMEM library.

Yes, makes sense.
For the other clause, if fork is invoked within the OpenSHMEM region, then the behavior is defined and both the parent and child can invoke SHMEM operations until shmem_finalize(). If the child process is killed before the shmem_finalize, will that be allowed for both child and parent to invoke SHMEM calls?

@nspark
Copy link
Contributor Author

nspark commented Jul 14, 2021

For the other clause, if fork is invoked within the OpenSHMEM region, then the behavior is defined and both the parent and child can invoke SHMEM operations until shmem_finalize(). If the child process is killed before the shmem_finalize, will that be allowed for both child and parent to invoke SHMEM calls?

No, the second clause says (emphasis added):

When fork is invoked within the OpenSHMEM portion of the program or after the OpenSHMEM library has been finalized, the newly created child process shall not call any OpenSHMEM routines; otherwise, the behavior is undefined.

So, no; the child process could not invoke SHMEM operations.

@wrrobin
Copy link
Collaborator

wrrobin commented Jul 14, 2021

Thanks @nspark. The text mentioned system level interfaces, so I assume, user level threads / processes do not fall under this restriction.

After finalize, a call to any OpenSHMEM API would lead to undefined behavior, from both child and parent processes. I dont see the current spec has any such statement other than saying that finalize must be the last OpenSHMEM call.

@jdinan
Copy link
Collaborator

jdinan commented Jul 26, 2021

@nspark I realize we have deprecated atexit finalization of OpenSHMEM libraries, but I am wondering about the interactions. When a child process is forked, it inherits the atexit handlers of the parent process. The OpenSHMEM library can record the PID of the process that calls shmem_init and only finalize the library when the process with this PID invokes the atexit handler. However, this could break code that assumes the child process can continue to use OpenSHMEM until it exits. One solution is to remove start_pes and implicit finalization. However, I think some OpenSHMEM libraries may still use implicit finalization even though we have added an explicit finalize call.

@nspark
Copy link
Contributor Author

nspark commented Aug 13, 2021

@jdinan I wonder whether pthread_atfork could be relevant here as a way for a library to safely disable its atexit handler in the child process. Of course, it can't dequeue the handler itself, but it could atomically set a flag that gets queried in the handler before invoking the internal library finalization.

@nspark
Copy link
Contributor Author

nspark commented Sep 9, 2021

@jdinan Re: implicit finalization, what about adding something like the following under "Notes to implementers"?

OpenSHMEM implementations that support implicit library finalization for compatibility with start_pes should ensure that child processes created after library initialization do not implicitly call OpenSHMEM operations as part of exit handlers invoked during normal process termination.

@jdinan
Copy link
Collaborator

jdinan commented Sep 21, 2021

@nspark This is a useful note to implementors. The start_pes function and implicit finalization are still required for implementations to be conformant with the spec. Can we also remove these functions (in addition to adding this note about backward compatibility) so that implementors don't need to deal with this case?

@nspark
Copy link
Contributor Author

nspark commented Oct 27, 2021

@jdinan Just to clarify, by "remove these functions," you mean totally gone from the API and not just deprecated, correct?

@jdinan
Copy link
Collaborator

jdinan commented Nov 8, 2021

Note that this would require OpenSHMEM libraries using IB Verbs to call ibv_fork_init: https://linux.die.net/man/3/ibv_fork_init

\item When \FUNC{fork} is invoked within the \openshmem portion of the
program or after the \openshmem library has been finalized, the
newly created child process shall not call any \openshmem routines;
otherwise, the behavior is undefined.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two cases above are still problematic. The issue is that fork marks pages shared by the parent and child processes as copy on write. When either process touches the page, it causes a copy of that page to be mapped into that process' address space and this breaks memory registration because it can change the parent's mapping for memory registered with the NIC. The second bullet above is solved by ibv_fork_init, which marks any memory registered to the NIC with MADV_DONTFORK. This prevents the page being remapped to a copy in the parent's address space, but it also changes the behavior of fork since some pages don't propagate to the child process. In new kernels, Linux copies the pages immediately, which solves the memory registration problem without breaking fork, but at the expense of copy space/time overhead (especially if the application does fork+exec). I'm not exactly sure what will happen in the case where fork is called before ibv_fork_init on older Linux kernels. Hopefully the NIC driver will see that the page is marked copy on write during registration and duplicate it in the parent's address space.

To summarize, any support for fork is problematic. I know the bullet below says that we aren't required to support it, but the two bullets above describe two usage models as things that ought to work. Fork bugs are very hard to diagnose; here's an interesting read of one user's journey discovering that fork was broken: https://blog.nelhage.com/post/a-cursed-bug/. I would suggest that if we want to keep the above bullets, we also need to give users some way to ask the SHMEM library whether or not the above usage models are supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That post is a great write-up, and it touches on a behavioral aspect of CoW that I didn't know:

Following a fork, [...] the first process to attempt to write to a copy-on-write (“CoW”) page will make a copy, leaving the original page (still visible to one or more other processes) intact.

This (potentially) interacts poorly with RDMA; If a process using RDMA forks, any future attempts to write to pages used for RDMA would result in the parent copying the page, which would mean it would no longer see any remote writes to that page by RDMA peers.

I had always thought the child always gets the copy. It sounds like the behavior in Linux 5.12 is (or can be) much better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to, for now, limit subprocess creation to posix_spawn. We could then revisit documenting fork compatibility with a supplemental API (e.g., shmem_is_forksafe) later.

I think the documentation rationale for posix_spawn is sound. I didn't know posix_spawn existed until I found that I couldn't fork-exec in various SHMEM implementations. I imagine many Unix-y programmers are more used to fork-exec than posix_spawn.

Still, what's the concern with use of fork before shmem_init[_thread]? I'd generally think that, in general, we're still in "vanilla Linux memory land" during that time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern I had about fork before init is what NIC drivers will do when you try to register a page that's marked CoW. I suppose this is a common case, e.g. if a new allocation is registered before it's initialized (all pages mapped to zero page), so perhaps nothing to worry about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been digging around this trying to better understand the problem and a few things have changed since this thread was initiated. It is not clear to me that posix_spawn should better than fork() or vfork(). I believe both of those functions can be called depending on the setting of a few POSIX system environment variables. I would like to see this interface defined, but I know this is not a trivial task. Have any of your (@jdinan) opinions on the matter changed with updates to the kernel and networking middle layers options.

The main ambition is having a defined interface within a SHMEM program that is able to spawn off a subprocess safely. The process should have limitations on what it can do (e.g. no shmem routines).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change that comes to mind is Linux 5.12 copying the parent process pages into the child process at the time of the fork rather than marking them as CoW. Is having a copy of the parent process' data a usage model that's important for your apps? If not, we leave more freedom to implementations by not needing to support such a model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Interoperability with fork()
5 participants