I'm seeing fairly frequent instances in which a job appears to be running for 30 minutes -- but, as far as I can tell, is not.
One of the job types where I see this records a metric to DataDog when it completes successfully, and that metric is never above 90 seconds or so. I am seeing some job failures here and there, but those all finish well under 30 minutes as well.
Instead, I suspect this is related to another problem we're having: we're seeing bursts of dial errors from workers and other clients that fail to connect to Faktory. What would happen if a job completed (success or failure), but the worker was unable to report that status back to the server because it couldn't get a connection?
Looking at faktory_worker_go (https://github.com/contribsys/faktory_worker_go/blob/master/runner.go#L260), it appears that .with can return an error, but that error is routinely ignored.
I don't think calling panic is appropriate for this situation, but perhaps some combination of the following might be helpful:
1. Getting and holding a connection to both retrieve the job and report the result (and maybe, as a side effect, making that connection available to the worker via context so it can fire jobs without having to establish its own connection).
2. Introducing a pluggable logger mechanism so we can at least record that these failures are happening.
3. Having some sort of retry loop, specifically around reporting the results of a job.

(Rough sketches of what I mean by each are below.)
The first option would have some risks/challenges of its own, of course: you'd need to ensure the connection didn't time out, handle reconnecting if it did go away (either due to a timeout or a server failure), and so on. I'm sure you have more insight into how it would impact operational concerns in general, so forgive me if it's an Obviously Stupid Idea. That said, for situations involving relatively high job volume (mid-hundreds to low thousands of jobs per second), the many-transient-connections approach has proven to be a bit of a challenge (I'm paying attention to #219 / #222 for this reason, and we've had to be careful about tuning things like FIN_WAIT in our server configuration).
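To make the reconnect concern for the first option concrete, here's a generic sketch of a held connection that re-dials after a failure -- none of these names are real faktory_worker_go API, it's just the shape I'm imagining:

```go
package sketch // hypothetical package, illustration only

import (
	"net"
	"sync"
)

// connHolder keeps one long-lived connection and re-dials lazily if a call fails.
type connHolder struct {
	mu   sync.Mutex
	conn net.Conn
	dial func() (net.Conn, error)
}

func (h *connHolder) withConn(fn func(net.Conn) error) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.conn == nil {
		c, err := h.dial()
		if err != nil {
			return err
		}
		h.conn = c
	}
	if err := fn(h.conn); err != nil {
		// Assume the connection went bad (timeout, server restart, etc.);
		// drop it so the next call re-dials.
		h.conn.Close()
		h.conn = nil
		return err
	}
	return nil
}
```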
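For the second option, a pluggable logger could be as small as an interface like this (again, just a sketch; neither the interface nor the names exist in the library today):

```go
package sketch // hypothetical package, illustration only

import "log"

// Logger is a hook the worker could call instead of silently dropping the
// error returned by .with.
type Logger interface {
	Warnf(format string, args ...interface{})
}

// defaultLogger falls back to the standard library so reporting failures are
// at least visible somewhere.
type defaultLogger struct{}

func (defaultLogger) Warnf(format string, args ...interface{}) {
	log.Printf("WARN: "+format, args...)
}
```

We could then swap in an implementation that also emits a metric, so these silent failures at least show up in our dashboards.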
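And for the third option, a minimal sketch of a bounded retry around reporting the result; report would wrap whatever ACK/FAIL call the worker makes today:

```go
package sketch // hypothetical package, illustration only

import (
	"fmt"
	"time"
)

// reportWithRetry retries the result report a few times with a simple linear
// backoff, so a short burst of dial errors doesn't leave the job looking
// "running" until the 30-minute reservation expires.
func reportWithRetry(report func() error) error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = report(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("giving up reporting job result after 5 attempts: %w", err)
}
```

Even a small, capped retry like this would cover the transient dial-error bursts we're seeing without tying up the worker goroutine for long.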