
Question: managing worker memory #227

Open
hms opened this issue May 16, 2024 · 7 comments

Comments

@hms (Contributor) commented May 16, 2024

Because I'm locked into Heroku, I'm obliged to watch the memory utilization of the jobs "dyno", as I have a particular job that triggers R14 errors (soft OOM). Not only does this flood my logs with annoying but important-to-know messages, it slowly eats away at job performance. Of course, I could simply run a larger instance (Heroku bills have a very nasty way of growing when one is not paying attention) or wait for Heroku to restart things on their 24-hour clock.

Calls to GC didn't clean up enough memory to resolve the issue, and it looked like the worker was slowly growing its memory footprint over time, leaving me with the assumption that I was merely promoting my garbage or somehow slowly leaking.

My solution (at the moment) is an around_perform block in ApplicationJob with a memory check at the end of every job execution. If the worker is over the Heroku threshold, I ask it to commit seppuku via Process.kill('TERM', Process.pid). This "seems" to be working.
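Roughly, the hook looks like this (a minimal sketch: the 450 MB threshold and the get_process_mem gem are illustrative choices on my part, not anything SolidQueue provides):

```ruby
# app/jobs/application_job.rb
require "get_process_mem" # illustrative choice of gem for reading the process RSS

class ApplicationJob < ActiveJob::Base
  # Stay comfortably under Heroku's 512 MB standard-1x limit (illustrative threshold).
  MEMORY_THRESHOLD_MB = 450

  around_perform do |_job, block|
    block.call
  ensure
    # Check memory after every job, even if the job raised.
    if GetProcessMem.new.mb > MEMORY_THRESHOLD_MB
      # Ask this worker process to shut down; the supervisor notices the
      # missing worker and starts a replacement.
      Process.kill('TERM', Process.pid)
    end
  end
end
```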

I am OK with what seems to be a very small performance impact from restarting a Unix process, whatever Rails and YJIT overhead the restart incurs, and SolidQueue's time to notice the missing worker and start a new one, vs. the rapid performance degradation due to paging and the flood of log messages.

My worries are that this might bump into a SolidQueue design consideration that I didn't see or consider while reviewing the code, or that this is just wrong for other reasons.

Any advice or suggestions would be appreciated.

@hms (Contributor, Author) commented May 21, 2024

Follow up question:

It seems that asking the worker to self-terminate can leave SolidQueue in a less-than-correct state. For example, I have seen a record in the Semaphore table stick around for the full requested timeout limit. I don't know whether there are other issues, but that one impacts my application in less-than-optimal ways.
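(For anyone following along: a lingering record can be spotted with a quick console query. The snippet below assumes the standard SolidQueue::Semaphore model with its key, value, and expires_at columns; it's only a sketch.)

```ruby
# Rails console sketch: list semaphores that are still being held.
SolidQueue::Semaphore.where("expires_at > ?", Time.current).each do |semaphore|
  puts "#{semaphore.key} expires at #{semaphore.expires_at} (value: #{semaphore.value})"
end
```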

Question:

  • Is this the expected behavior or a bug?
  • Would I be better off self-terminating both the worker and dispatcher?
  • Or do I need to create a PR for an enhancement that signals the Supervisor with a request to terminate a specific worker, so that all of the proper shutdown and startup steps are honored?

Thanks in advance,
hms

@hms (Contributor, Author) commented May 22, 2024

I'm seeing some cleanup issues when the worker is asked to self-terminate. Examples include semaphores that seem to live for an abnormally long time.

This suggests that either my current solution isn't a very good one, or I have found something in the cleanup logic that isn't quite right (if so, I'm happy to see if I can track it down or even fix it), or there is a design decision that's not visible to me around who is responsible for killing whom.

Either way, I'm kinda stuck here.

@rosa (Member) commented May 22, 2024

Hey @hms! There's no good way right now to terminate a worker alone besides sending the TERM or QUIT signal as you're doing. Either of these will signal the worker to try to clean up as much as it can before SolidQueue.shutdown_timeout is over. Most things will be cleaned up just fine; in-progress jobs should be returned to the queue, but if they had a semaphore claimed, that will remain because they'll still be holding it (while back in the queue). This is expected behaviour. The semaphore shouldn't last an abnormally long time, but it can take up to the whole expiration time, depending on what you've set that to. As soon as the job is claimed back and executed, it should free the semaphore.

Would I be better off self-terminating both the worker and dispatcher?

No, the dispatcher shouldn't play any role here.

Besides the semaphores, what else are you finding when you do this termination? And could you share more specific examples from your use case?

@hms (Contributor, Author) commented May 22, 2024

@rosa

I'm calling terminate via an around_perform, so in theory the worker shouldn't be requesting new work and should be running through its cleanup protocol.

One thing I did see, and it's from memory at this point, was a semaphore that took a full SolidQueue restart to clear. Meaning, a new job that should have been fine to run wouldn't, because of the vestigial Semaphore. I watched via Mission Control as the job started, went into a blocked state, came out of the blocked state, and went back into the blocked state, enough times that I pulled the plug on SolidQueue and restarted it.

Question: Since the perform finished, shouldn't the Semaphore have been released?

I'll treat this more formally going forward and document issues, rather than continue this somewhat anecdotal conversation here (my fault).

@rosa (Member) commented May 23, 2024

I'm calling terminate via an around_perform, so in theory the worker shouldn't be requesting new work and should be running through its cleanup protocol.
...
Question: Since the perform finished, shouldn't the Semaphore have been released?

Yes! I was thinking of the case where your worker is running multiple threads, which I didn't ask about, just assumed. My bad! If there were multiple threads, other jobs that weren't the one that sent the SIGTERM could be killed before they got a chance to finish and free the semaphore, if they had one claimed 🤔 That was my guess, but I imagine this is not the case and you're running a single thread.

Meaning, a new job that should have been fine to run wouldn't, because of the vestigial Semaphore. I watched via Mission Control as the job started, went into a blocked state, came out of the blocked state, and went back into the blocked state

This is indeed very strange, especially the part where the job could come out of the blocked state, presumably because it was moved to ready, and then go back into the blocked state 😕 I'm not sure how this could happen if the job didn't fail. If it failed and was automatically retried, then it'd have to wait again, but I assume this is not the case.

@hms (Contributor, Author) commented May 23, 2024

@rosa

I think this might be a misreading of the documentation on my part. It clearly states that the supervisor has patience when it receives a SIGTERM. The docs didn't say whether the worker has any smarts about how to end itself gracefully.

To answer your questions before I get into mine:

Yes! I was thinking of the case where your worker is running multiple threads, which I didn't ask about, just assumed. My bad! If there were multiple threads, other jobs that weren't the one that sent the SIGTERM could be killed before they got a chance to finish and free the semaphore, if they had one claimed 🤔 That was my guess, but I imagine this is not the case and you're running a single thread.

  • My queues are set up based on quality-of-service objectives: [QoS15s, mailers, QoS60s, QoS600s, *] (see the config sketch after this list).
  • I am running 1 dispatcher (it turns out that accidentally running 2 takes up a fair amount of memory and is bad™).
  • I am running 1 worker with 2 threads. In a perfect world, I'd run a second worker for the QoS600s queue, as it's the one with most of the potential for triggering the OOM issues. But a second worker burns enough extra memory that it triggers the soft OOM far more frequently, so it's less optimal than a single worker in my current circumstances.
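For reference, that layout corresponds to something like this in the SolidQueue config (a sketch of config/solid_queue.yml; the queue names and thread count mirror the list above, while the polling intervals and batch size are just placeholder values):

```yaml
# config/solid_queue.yml (sketch; numbers other than queues/threads are placeholders)
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: [QoS15s, mailers, QoS60s, QoS600s, "*"]
      threads: 2
      polling_interval: 0.1
```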

I am triggering my SIGTERM via an around_perform, after the yield returns, so all of my application-level requirements, at least for the triggering thread, are complete. That being said, I haven't dug in enough yet to understand what SolidQueue's post-perform requirements might be.

This is indeed very strange, especially the part where the job could come out of the block state, presumably because it was moved to ready, and then back to the block state 😕 I'm not sure how this could happen if the job didn't fail. If it failed and was automatically retried, then it'd have to wait again but I assume this is not the case.

Again, my lack of clarity here might be causing some confusion. My QoS600s jobs have the potential to run for a while (depending on the size of the file they're processing). So my limits_concurrency duration is set to 15.minutes, about 2x my expected worst case -- until I get more production data, I'm just guessing. I really don't want another memory-hungry job becoming unblocked and taking me from a soft OOM to a hard OOM.
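Concretely, the declaration looks roughly like this (a sketch: the job class, key, and concurrency limit of 1 are made up for illustration; only the queue name and the 15-minute duration mirror what I described):

```ruby
# Hypothetical job class; only queue_as and duration reflect my actual setup.
class FileProcessingJob < ApplicationJob
  queue_as :QoS600s

  # One of these jobs runs at a time per file. If a worker dies while holding
  # the semaphore, the semaphore expires after `duration` at the latest.
  limits_concurrency to: 1, key: ->(file) { file }, duration: 15.minutes

  def perform(file)
    # ... process the file ...
  end
end
```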

Questions:

  • Did the SolidQueue design envision workers being asked to shut down in an orderly way outside of SolidQueue itself shutting down?
  • Due to the multi-threaded nature of the SolidQueue worker, would it make sense to grab another signal such as USR1/USR2 and use it as a "worker: finish whatever is in flight, but do not dequeue anything new, and die when the work count is zero"?
  • Does the worker have any shutdown / cleanup responsibilities that I should be paying attention to as I explore ways to make this work a little more cleanly?

Hal

@rosa (Member) commented Jul 18, 2024

Hey @hms, so sorry for the radio silence on this one! Your last comment totally slipped through the cracks, as it caught me travelling for two weeks, and then I went back to work on totally unrelated things.

Did the SolidQueue design envision workers being asked to orderly shutdown outside of SolidQueue shutting down?

Basically just the INT and TERM signals. See answer below.

Due to the multi-threaded nature of the SolidQueue worker, would it make sense to grab another Signal such as USR1/2 and use it as a "worker - finish whatever is inflight, but do not dequeue anything new and die when the work count is zero"?

This should already be supported by the INT and TERM signals, which do precisely this: finish whatever is ongoing but don't pick up more work. The "finish whatever is ongoing" does have a timeout, though, which is whatever you have configured in config.solid_queue.shutdown_timeout (the default is 5 seconds; you might want to increase it).
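For example, in an environment config (a sketch; the 30-second value is just an illustration):

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Give in-flight jobs more time to finish before the worker stops waiting for them.
  # 30 seconds is an example value, not a recommendation.
  config.solid_queue.shutdown_timeout = 30.seconds
end
```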

Does the worker have any shutdown / cleanup responsibilities that I should be paying attention to as I go exploring ways to make this work a little cleaner?

It just needs to stop its thread pool in an orderly way, like this, and then deregister itself from the DB, which also releases any claimed jobs back to the queue.

BTW, related to memory consumption, this also came up recently: #262
