Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: change p_job variable type from short to int #763

Merged
merged 5 commits into from
Jul 13, 2024

Conversation

vmihalko
Copy link

This PR addresses a problem where the number of jobs exceeds 2^15 - 1 (32767), the variable short p_job overflows and ksh hangs (here).

How to reproduce?

reproducer.ksh:

#!/bin/ksh

trap 'echo SIGINT $i' SIGINT

for (( i=0; i<=60000; i++ )); do
    sleep inf &
done

kill -SIGINT $$

and run ksh -lic '. ./reproducer.ksh' .

This bug was found during a debugging session with @lzaoral and @msekletar - thank you!

McDutchie and others added 4 commits July 10, 2024 12:37
This change addresses an issue where ksh hangs when the number of jobs exceeds `2^15 - 1` (32767), which is the maximum value for a `short`.
src/cmd/INIT/mamake.c:
- Option -V: Silence an annoying gcc warning by checking the return
  value of write(2). It's kind of pointless just for printing the
  version, but on gcc, (void)write will not suppress the warning.
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66425

src/cmd/ksh93/tests/arith.sh:
- The Gentoo ebuild explicitly unsets A, which is apparently
  exported in their environment, killing the arith.sh tests.
  So this unsets it in the script.
  https://gitweb.gentoo.org/repo/gentoo.git/tree/app-shells/ksh/ksh-1.0.9.ebuild

README.md, src/cmd/ksh93/README:
- Fix typos.
@McDutchie
Copy link

McDutchie commented Jul 13, 2024

I could not run the reproducer without freezing my entire system. Changing sleep inf & to /bin/sleep 120 & allowed me to see what's going on. (The external sleep program is a great deal smaller than ksh, so it takes less resources than forking ksh for the sleep builtin.)

In my version of the reproducer, ksh "hung" at 32767 jobs as you expected, but after the first sleep processes exited, it continued where it left off, restarting the job numbers at 1.

So I don't think this is a really a bug. ksh dutifully waits for some jobs to exit and make their job table entries available so it can create new jobs. This never happens in your reproducer, because your background jobs sleep for infinite time.

I am not necessarily opposed to making it possible to have more than 32767 simultaneous jobs, although I do wonder if there is an actual use case for that.

Either way, I want to research the possible implications more first. For instance, if the job table can now be up to 2^31-1 (2147483647) entries long, I imagine that could take up a heck of a lot of memory.

@McDutchie
Copy link

On second thought, it's a bug.

The overflow happens exactly on jobs.c line 1210. The job_alloc function happily returns an integer value > 32767. Assigning that to a short is a signed short integer overflow, which is undefined behaviour in C.

But on most current systems (all that I know of, anyway), it will simply wrap around to -1, which also happens to be the temporary-failure return code, so it waits. Plus, ksh will re-use low job numbers as they become available again. So, the correct thing accidentally happens and ksh recovers as soon as some jobs exit. But that doesn't make it not a bug.

Your PR cleanly fixes it. Thanks for that.

@McDutchie McDutchie merged commit 7219ffa into ksh93:dev Jul 13, 2024
McDutchie pushed a commit that referenced this pull request Jul 13, 2024
This change addresses an issue where ksh has undefined behavior
when the number of jobs exceeds `2^15 - 1` (32767), which is the
maximum value for a `short`. The short integer overflow happened in
jobs.c, line 1210.
McDutchie pushed a commit that referenced this pull request Jul 13, 2024
This change addresses an issue where ksh has undefined behavior
when the number of jobs exceeds `2^15 - 1` (32767), which is the
maximum value for a `short`. The short integer overflow happened in
jobs.c, line 1210.
@lzaoral
Copy link

lzaoral commented Jul 25, 2024

I could not run the reproducer without freezing my entire system. Changing sleep inf & to /bin/sleep 120 & allowed me to see what's going on. (The external sleep program is a great deal smaller than ksh, so it takes less resources than forking ksh for the sleep builtin.)

Actually, I'd say this a bug in ksh handling the inf paramater incorrectly. When run under strace, the process will not enter a sleep state and will be looping with invalid calls to clock_nanosleep instead (at least in an aarch64 Fedora 40 VM):

...
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2702722944, tv_nsec=3974490168}, 0xfffff507f7a0) = -1 EINVAL (Invalid argument)
...

edit: typos

@vmihalko vmihalko deleted the private-vmihalko-p_job-type-fix branch July 25, 2024 10:33
@McDutchie
Copy link

McDutchie commented Jul 25, 2024

I can't reproduce that on either macOS or Debian (both arm64).

On Debian arm64, I get this strace output for strace ksh -c 'sleep inf':

...
rt_sigaction(SIGWINCH, {sa_handler=0xffff86f9cf88, sa_mask=[], sa_flags=SA_INTERRUPT}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0xffff86f9cf88, sa_mask=[], sa_flags=SA_INTERRUPT}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2147483647, tv_nsec=4294967295}, 0xfffff95fd570) = -1 EINVAL (Invalid argument)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2264800440, tv_nsec=65535}, 

...and it sleeps, for 65535 seconds at least (presumably for 65535 seconds at a time but I'd have to wait 18.2 hours to test that) (edit: that was nonsense: tv_sec is the seconds, tv_nsec additional nanoseconds).

If you'd like to help debug this on Fedora, the relevant sources are at src/lib/libast/tm/tvsleep.c and src/cmd/ksh93/bltins/sleep.c.

@McDutchie
Copy link

Also, possibly related: #750

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants