Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forking Launcher #1138

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

johnnyshields
Copy link

@johnnyshields johnnyshields commented Mar 31, 2021

Forking Launcher

Status

<< READY FOR REVIEW >>

I will be running this in production at TableCheck for a few weeks in the meantime where we process 100,000+ jobs of all kinds per day (mailers, SMS, payments, search indexing, etc.) across many worker servers.

  • Test script/delayed_job (daemon mode) in production
  • Test script/delayed_job (forking mode) in production
  • Test Rake task (single worker) in production
  • Test Rake task (forking mode) in production

What does this PR do?

This PR makes two "Launcher" modes for Delayed Job:

  1. Current / default: use the Daemons gem to run as detached background progresses
  2. New: run with new --fork option to fork workers from parent process in the foreground.

"Fork mode" will only fork when using more than one worker, i.e. set by the --number-of-workers or --pools args. If only one worker, it runs a Delayed::Worker in the same process without forking.

This PR is a redux of #615.

Why?

It makes it much easier to use Delayed Job inside containers (Docker, etc.).
Using daemons has several undesirable behaviors when running inside a container:

  • The parent process spawns a background child process then exits immediately,
    which causes the container to shutdown.
  • The worker processes are detached from the parent, which prevents logging to STDOUT.

The new --fork option solves this by running workers in a foreground process tree.
When using --fork the daemons gem is not required.

Installing

I've added a new Delayed::Command#launch method. My intention is to switch to this in the generator for new installs in the next major version, because really forking should be the default and daemons should be a special option, especially in our increasingly container-based world. Here's what it looks like:

# uses daemonize by default, but allows `--fork` switch to override
Delayed::Command.new(ARGV).daemonize

# uses fork by default, but allows `-d` / `--daemonize` switch to override
Delayed::Command.new(ARGV).launch

Limitations

Fork mode does not yet support restart, stop etc. commands; it is always assumed to be start. You can gracefully stop it with SIGINT or SIGTERM. Commands can be added in a future PR.

About the Refactor

I split off the code in Delayed::Launcher::Daemonized from Delayed::Command. Following the SRP, Delayed::Command is now only responsible for parsing the CLI args, and the Launcher class is responsible for actually running the app. From there, I wrote a Delayed::Launcher::Forking class, but in the end I discovered much of the code can be shared with the Daemonized class, so I extracted out a Delayed::Launcher::Base parent class for both Launcher types.

A nice side effect of this PR is that we can now connect the Rake task to the Launcher rather than to the Worker class, and enable things like multiple workers and pools on Rake. (Rake and script/delayed are now unified; this is something which was always mysterious to me as a long time user.)

Extras

This PR includes #1139 (fix tests) and #1140 (cleanup README.md).

In addition, there are several quality of life improvements added to the Command script:

  • Command: Add --pools arg as an alias for --pool, and allow pipe-delimited pool definitions
  • Command: Add --num-worker arg as an alias for -n / --number-of-workers (and deprecate --number-of-workers)
  • Command: Pool parsing code has been extracted to PoolParser class
  • Command: Add -v / --verbose switch which sets quiet=false on workers
  • Command: Add -x switch as alias for --exit-on-complete
  • Command: Add STDERR warning num-workers less than 1
  • Command: Add STDERR warning if num-workers and pools both specified (pools overrides num-workers as per existing behavior)
  • Command: Add STDERR warning if queues and pools both specified (pools overrides queues as per existing behavior)
  • Command: Add STDERR warning if min-priority is greater than max-priority
  • Command: Add full RSpec coverage

The Rake script has also been enhanced:

  • Rake: Uses Forking launcher under the hood
  • Rake: Add support for NUM_WORKERS and POOLS args
  • Rake: Validate various parameters same as Command above
  • Rake: Add full RSpec coverage
  • Rake: Rename file from delayed/tasks.rb to delayed/tasks.rake to make testing easy. (Rake require_rake_file method doesn't play nice with .rb files)

I'd be happy to split some of these into extra PRs but given the amount of spec coverage I've added it's probably fine to merge as is.

@johnnyshields
Copy link
Author

@albus522 can you review the code on this PR? I'd like to ask about the general direction and whether you'd consider merging this.

Reason: The `script/delayed_job` process (described above) uses the
[Daemons](https://github.com/thuehlinger/daemons) gem which has several
undesirable behaviors when running inside a container (Docker, etc.):
- The parent process spawns a background child process then exits immediately,
which causes the container to shutdown.
- The worker processes are detached from the parent, which prevents logging to `STDOUT`.

The `--fork` option solves this by running workers in a foreground process tree.
When using `--fork` the `daemons` gem is not required.

The command script requires changing the following line to enable this:

# always daemonizes, never forks
Delayed::Command.new(ARGV).daemonize

must now be:

# daemonize by default unless --fork specified
Delayed::Command.new(ARGV).launch

In addition, there are several quality of life improvements added to the Command script:
- Command: Add --pools arg as an alias for --pool, and allow pipe-delimited pool definitions
- Command: Add --num-worker arg as an alias for -n / --number-of-workers (and deprecate --number-of-workers)
- Command: Pool parsing code has been extracted to PoolParser class
- Command: Add -v / --verbose switch which sets quiet=false on workers
- Command: Add -x switch as alias for --exit-on-complete
- Command: Add STDERR warning num-workers less than 1
- Command: Add STDERR warning if num-workers and pools both specified (pools overrides num-workers as per existing behavior)
- Command: Add STDERR warning if queues and pools both specified (pools overrides num-workers as per existing behavior)

The Rake script has also been enhanced:
- Rake: Uses Forking launcher under the hood
- Rake: Add support for NUM_WORKERS and POOLS args
@johnnyshields johnnyshields changed the title WIP: Forking launcher Forking Launcher Apr 4, 2021
@johnnyshields
Copy link
Author

@albus522 what steps can I take to get this merged?

Copy link
Member

@albus522 albus522 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the spare brain cycles at the moment to dive into the meat of this PR, but some cursory surface observations. Please leave build matrix and rubocop config changes out of content PRs. Your task file rename doesn't seem to have worked quite right. File renames often have to be only a rename with no content change in a commit for git to work with them correctly. Leave requires where they are used, there is no need for all the worker stuff to be loaded in non-worker instances. I don't see anything glaringly wrong, but this probably could have been several smaller easier to digest PRs. This is not going to be a fast merge.

@johnnyshields
Copy link
Author

@albus522 thank you for your feedback.

Can you please take a quick look at PRs #1139 (CI fix) and #1140 (Changelog)? From there I'll be happy to break this one into a series of smaller PRs.

@johnnyshields
Copy link
Author

johnnyshields commented Apr 17, 2021

Update: we are now running this branch at TableCheck in production on Kubernetes, serving 100,000+ jobs per day of various kinds (mailers, report generation, payment processing, cache computation, data warehousing, etc.)

@johnnyshields
Copy link
Author

Update: we've now run this branch continually for one month with no issues.

@johnnyshields
Copy link
Author

@albus522 if you are unavailable to help get this merged, would it be possible to designate someone else in the Delayed Job community to review? I'll be glad to break this in a series smaller PRs, but I need someone to give me a reasonable amount of attention to do that.

@hmistry
Copy link

hmistry commented May 24, 2021

@johnnyshields Will this new fork method also work for JRuby applications? I'm currently converting an app to use JRuby and cannot run script/delayed_job becausedaemonize does not work in JRuby.

If yes, then it would be nice if I can configure the paths for pids and log files using ENV or an external config file. My JRuby app will be packaged using warbler so any writing to files must be done outside the Rails project directory... hence the need for configurability. (Not sure if delayed_job already does this as I'm currently looking into it.)

@johnnyshields
Copy link
Author

@hmistry I don't know, you'll have to try it. If JRuby supports fork then yes.

@johnnyshields
Copy link
Author

johnnyshields commented May 25, 2021

@hmistry you can try this gem: https://github.com/headius/forkjoin.rb

However, if daemonize doesn't work (it uses fork) then this also probably won't work, though technically this PR is a "simpler" type of forking so it might work with the gem about. Again you'll have to try it; JRuby is beyond the scope here.

You may also wish to use Sidekiq rather than DelayedJob which is multi-threaded rather than multi-process and should work well on JRuby.

@hmistry
Copy link

hmistry commented May 25, 2021

@johnnyshields Got it. Thank you I'll take a look at the gem. JRuby doesn't support forking processes so based on what you said, doubt this will work. Sidekiq might be better option however it requires engineering effort... will consider it if we can't find a solution with DJ.

@johnnyshields
Copy link
Author

@albus522 do you think you might be able to spend some time on this (or designated someone else to help review)? Since we've added this patch at my company we've cut our AWS bill for Delayed Job by about $5,000 / month (running on Kubernetes). Think lots of folks will benefit from it.

@johnnyshields
Copy link
Author

@albus522 pinging on this. Do you think you'll be able to carve out some time to help review this? As stated previously I'm happy to break it into a series of smaller PRs that can be merged in sequence.

We've been using this in production for months and it has been working like a charm.

@johnnyshields
Copy link
Author

johnnyshields commented Dec 23, 2021

@albus522 ping on this. We've processed literally billions of jobs on this branch having very diverse workloads and its working without a hitch.

@martinstreicher
Copy link

This is super useful for debugging.

@chtrinh
Copy link

chtrinh commented Dec 27, 2022

@johnnyshields I'm a bit confused, how does one use --fork and have the process in the foreground with --pools? The forking code checks for worker_count > 1

run_loop if worker_count > 1
before it attempts to run a loop for the foreground.

But when --pools is passed as an option, setup_workers -> setup_pooled_workers

setup_pooled_workers
never incrementsworker_count only worker_index

In addition because worker_count is never incremented, Delayed::Worker.before_fork is never invoked, causing

$> bin/delayed_job --fork --pools=mailers:2 start 
undefined method `each' for nil:NilClass
/Development/delayed_job/lib/delayed/worker.rb:96:in `after_fork'
/Development/delayed_job/lib/delayed/launcher/base.rb:72:in `run_worker'
/Development/delayed_job/lib/delayed/launcher/forking.rb:74:in `block in fork_worker'
/Development/delayed_job/lib/delayed/launcher/forking.rb:74:in `fork'
/Development/delayed_job/lib/delayed/launcher/forking.rb:74:in `fork_worker'
/Development/delayed_job/lib/delayed/launcher/forking.rb:63:in `add_worker'
/Development/delayed_job/lib/delayed/launcher/base.rb:49:in `block (2 levels) in setup_pooled_workers'
/Development/delayed_job/lib/delayed/launcher/base.rb:49:in `times'
/Development/delayed_job/lib/delayed/launcher/base.rb:49:in `block in setup_pooled_workers'
/Development/delayed_job/lib/delayed/launcher/base.rb:47:in `each'
/Development/delayed_job/lib/delayed/launcher/base.rb:47:in `setup_pooled_workers'
/Development/delayed_job/lib/delayed/launcher/base.rb:36:in `setup_workers'
/Development/delayed_job/lib/delayed/launcher/forking.rb:12:in `launch'
/Development/delayed_job/lib/delayed/command.rb:21:in `launch'
/Development/delayed_job/lib/delayed/command.rb:26:in `daemonize'
bin/delayed_job:5:in `<top (required)>'

for each queue passed.

@johnnyshields
Copy link
Author

@chtrinh I think I never got around to completing pools and forking together, or else I never tested that. I think all other functionality besides that (queues, etc.) is well-tested. Would be glad to accept a PR to this branch.

@chtrinh
Copy link

chtrinh commented Dec 28, 2022

@johnnyshields I ended up going a different way. I changed the binstub by appending loop {} to keep the main process running in the foreground and added some Signal traps to control the daemons that were spawned.

@MaximilianMeister
Copy link

I've managed to run delayed_job in a container (on kubernetes) by just calling bin/delayed_job run instead of start

that way it does not fork into the background. wouldn't this be enough to do the trick?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants