Question: managing worker memory #227
Follow-up question: it seems that requesting the worker to self-terminate can leave Solid Queue in an inconsistent state. For example, I have seen a record in the Semaphore table stick around for the full requested duration. I don't know if there are other issues, but that one impacts my application in less than optimal ways. Question:
Thanks in advance,
I'm seeing some cleanup issues when the worker is asked to self-terminate. Examples include semaphores that seem to live an abnormally long time. This suggests that either my current solution isn't a very good one, or I have found something in the cleanup logic that isn't quite right -- if so, I'm happy to see if I can track it down or even fix it -- or there is a design decision that's not visible to me around who is responsible for killing whom. Either way, I'm kinda stuck here.
Hey @hms! There's not a good way right now to terminate a worker alone besides sending the TERM signal to that worker's process. No, the dispatcher shouldn't play any role here. Besides the semaphores, what else are you finding when you do this termination? And could you share more specific examples from your use case?
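For readers following along, here is a minimal sketch of what "sending the TERM signal to a single worker" could look like from a Rails console. It assumes Solid Queue's standard process registry; the kind value and the same-host requirement are assumptions, not something confirmed in this thread.

```ruby
# Hedged sketch: ask one specific worker (and only that worker) to terminate.
# Assumes the standard solid_queue_processes registry and that this console
# runs on the same host as the worker, so the PID is addressable.
worker = SolidQueue::Process.find_by(kind: "Worker")

if worker
  Process.kill("TERM", worker.pid)  # the worker drains in-flight work, then exits
end
```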
I'm calling terminate via an around_perform, so in theory the worker shouldn't be requesting new work and should be running through its cleanup protocol. One thing I did see, and it's from memory at this point, was a semaphore that took a full Solid Queue restart to clear. Meaning, my new job that should have been fine to run wouldn't, because of the vestigial semaphore. I was watching via Mission Control as the job started, went into a blocked state, came out of the blocked state, and went back into the blocked state, enough times that I pulled the plug on Solid Queue and restarted it. Question: since the perform finished, shouldn't the semaphore have been released? I'll treat this more formally going forward and document issues vs. this somewhat anecdotal conversation here (my fault).
Yes! I was thinking of the case where your worker is running multiple threads, which I didn't ask about, just assumed. My bad! If there were multiple threads, other jobs that weren't the one that sent the TERM signal could still be in flight when the worker shuts down.
This is indeed very strange, especially the part where the job could come out of the blocked state, presumably because it was moved to ready, and then back into the blocked state 😕 I'm not sure how this could happen if the job didn't fail. If it failed and was automatically retried, then it'd have to wait again, but I assume this is not the case.
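As an aside for anyone debugging a similar stuck-blocked loop, a small console sketch for inspecting the semaphore behind a concurrency key. It assumes Solid Queue's standard Semaphore and BlockedExecution models, and the key shown is hypothetical.

```ruby
# Hedged sketch: inspect the semaphore and any jobs blocked on it.
# The concurrency key format is illustrative; check your job's actual key.
key = "QoS600sJob/42"

semaphore = SolidQueue::Semaphore.find_by(key: key)
puts semaphore&.attributes  # value and expires_at show whether it looks stale

# How many executions are still waiting on that key?
puts SolidQueue::BlockedExecution.where(concurrency_key: key).count
```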
I think this might be a misreading of the documentation on my part. It clearly states the supervisor has patience when it receives a SIGTERM. The docs didn't say whether the worker has any smarts about how to end itself with grace. To answer your questions before I get into mine:
I am triggering my SIGTERM via an around_perform, after the yield returns, so all of my application-level requirements, at least for the triggering thread, are complete. That being said, I haven't dug in enough yet to understand what Solid Queue's post-perform requirements might be.
Again, my lack of clarity here might be causing some confusion. My QoS600s jobs have the potential to run for a while (based on the size of the file each is processing). So, my limits_concurrency duration is set to 15.minutes (about 2x my expected worst case -- because until I get more production data, I'm just guessing). I really don't want another memory-hungry job becoming unblocked and taking me from a soft OOM to a hard OOM. Questions:
Hal
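For context, here is a minimal sketch of the kind of setup described above. The job class, queue name, and key are hypothetical; limits_concurrency with a duration: option is Solid Queue's concurrency-control API.

```ruby
# Hypothetical long-running, memory-hungry job limited to one concurrent
# execution per file, with the semaphore expiring after 15 minutes
# (roughly 2x the expected worst-case runtime).
class QoS600sJob < ApplicationJob
  queue_as :qos600s

  limits_concurrency to: 1, key: ->(file) { file.id }, duration: 15.minutes

  def perform(file)
    # memory-heavy file processing happens here
  end
end
```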
Hey @hms, so sorry for the radio silence on this one! Your last comment totally slipped through the cracks, as it arrived while I was travelling for two weeks, and then I went back to work on totally unrelated things.
Basically just the
This should already be supported by the worker.
It just needs to stop its thread pool in an orderly way, like this, and then deregister itself from the DB, which would also release any claimed jobs back to the queue. BTW, related to memory consumption, this also came up recently: #262
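To make "stop its thread pool in an orderly way" concrete, here is a generic sketch of that shutdown pattern using concurrent-ruby. It is not Solid Queue's actual implementation, just an illustration of the sequence described: drain the pool, then deregister.

```ruby
require "concurrent"

# Generic orderly-shutdown pattern, not Solid Queue's real code.
pool = Concurrent::FixedThreadPool.new(3)

# ... job executions get posted to the pool while the worker runs ...

pool.shutdown                          # stop accepting new work
unless pool.wait_for_termination(30)   # let in-flight jobs finish (30s here)
  pool.kill                            # force-stop anything still running
end

# At this point the worker would deregister its process record; in Solid Queue
# that also releases any still-claimed executions back to the queue.
```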
Because I'm locked to Heroku, I'm obliged to watch the memory utilization for the jobs "dyno", as I have a particular job that triggers R14 errors (soft OOM). Not only does this flood my logs with annoying but important-to-know messages, it slowly eats away at job performance. Of course, I could simply run a larger instance (Heroku bills have a very nasty way of growing when one is not paying attention) or wait for Heroku to restart things on their 24-hour clock.
Calls to GC didn't clean up enough memory to resolve the issue, and it looked like the worker was slowly growing its memory footprint over time, leaving me with the assumption that I was merely promoting my garbage or was somehow slowly leaking.
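For reference, a sketch of the kind of manual GC pass described above. Reading RSS from /proc/self/status assumes a Linux dyno, and the helper name is made up.

```ruby
# Hedged sketch of a manual GC pass with before/after RSS logging.
def rss_mb
  # VmRSS is reported in kB on Linux (which Heroku dynos run).
  File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i / 1024
end

before = rss_mb
GC.start(full_mark: true, immediate_sweep: true)
GC.compact if GC.respond_to?(:compact)  # Ruby 2.7+
Rails.logger.info "Manual GC: RSS #{before}MB -> #{rss_mb}MB"
```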
My solution (at the moment) is an around_perform block in ApplicationJob with a memory check at the end of every job execution. If the worker is over the Heroku threshold, I ask the worker to commit seppuku via Process.kill('TERM', Process.pid). This "seems" to be working.
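A minimal sketch of that around_perform approach, assuming a hypothetical threshold and the same /proc-based RSS helper; the exact limit depends on the dyno size.

```ruby
class ApplicationJob < ActiveJob::Base
  # Hypothetical threshold a little below the dyno's R14 limit.
  MEMORY_LIMIT_MB = 450

  around_perform do |_job, block|
    block.call
    # After the job finishes, ask this worker process to terminate
    # gracefully; the supervisor will start a replacement worker.
    Process.kill("TERM", Process.pid) if current_rss_mb > MEMORY_LIMIT_MB
  end

  private

  def current_rss_mb
    File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i / 1024
  end
end
```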
I am OK with what seems to be a very small performance impact -- restarting a Unix process, whatever Rails and YJIT overhead comes with the restart, and Solid Queue's time to notice the missing worker and start a new one -- vs. the rapid performance degradation due to paging and the flood of log messages.
My worries are that this might bump into a Solid Queue design consideration I didn't see or consider while reviewing the code, or that this is just wrong for other reasons.
Any advice or suggestions would be appreciated.