Resque::Job instrumentation never flushing to OTLP when using BatchSpanProcessor #950

tlubz · 2024-04-23T19:03:49Z


Description of the bug

Resque::Job instrumentation never pushes spans to the Exporter when using BatchSpanProcessor.
The same setup with SimpleSpanProcessor does work.

Share details about your runtime

Operating system details: macOS Sonoma 14.3.1 (23D60)
RUBY_ENGINE: "ruby"
RUBY_VERSION: "2.7.6"
RUBY_DESCRIPTION: "ruby 2.7.6p219 (2022-04-12 revision c9c2245c0a) [x86_64-darwin22]"

Share a simplified reproduction if possible

# example.rb
require 'bundler/inline'
require 'bundler'

gemfile(true) do
  source 'https://rubygems.org'

  gem 'opentelemetry-api'
  gem 'opentelemetry-sdk'

  gem 'resque', '2.0.0'
  gem 'redis', '4.6.0'
  gem 'opentelemetry-instrumentation-resque', '0.3.1'
end

require 'opentelemetry-api'
require 'opentelemetry-sdk'

class SummarizingExporter < OpenTelemetry::SDK::Trace::Export::ConsoleSpanExporter
  def export(spans, timeout: nil)
    return OpenTelemetry::SDK::Trace::Export::FAILURE if @stopped

    Array(spans).each { |s| summarize_span(s) }

    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end

  def summarize_span(span)
    puts "Span(#{span.name}, #{span.attributes.inspect})"
  end
end

exporter = SummarizingExporter.new

use_batch = ENV['PROCESSOR_TYPE'] != 'simple' # default to batched

OpenTelemetry::SDK.configure do |c|
  c.use('OpenTelemetry::Instrumentation::Resque', span_naming: :job_class)

  if use_batch
    c.add_span_processor(
      OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
        exporter,
        schedule_delay: 500 #ms
      )
    )
  else 
    c.add_span_processor(
      OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessor.new(
        exporter
      )
    )
  end
end

MainTracer = OpenTelemetry.tracer_provider.tracer('example')

class ExampleJob
  @queue = 'example'

  def self.perform(arg)
    MainTracer.in_span('foo') { puts "performing ExampleJob with #{arg}" }
  end
end

require 'redis'
require 'resque'

Redis.silence_deprecations = true

Resque.enqueue(ExampleJob, 12345)

Resque::Worker.new('example').work_one_job

sleep(1) # wait for more than batch processor interval

OpenTelemetry.tracer_provider.shutdown

puts "Finished."

Expected behavior:

→ PROCESSOR_TYPE=simple ruby example.rb 
...
I, [2024-04-23T11:45:27.080842 #88592]  INFO -- : Instrumentation: OpenTelemetry::Instrumentation::Resque was successfully installed with the following options {:span_naming=>:job_class, :propagation_style=>:link}
Span(ExampleJob send, {"messaging.system"=>"resque", "messaging.destination"=>"example", "messaging.destination_kind"=>"queue", "messaging.resque.job_class"=>"ExampleJob"})
performing ExampleJob with 12345
Span(foo, {})
Span(ExampleJob process, {"messaging.system"=>"resque", "messaging.destination"=>"example", "messaging.destination_kind"=>"queue", "messaging.resque.job_class"=>"ExampleJob"})
Finished.

Note that 3 spans are exported, including foo and ExampleJob process.

Actual behavior:

→ PROCESSOR_TYPE=batch ruby example.rb 
...
I, [2024-04-23T11:45:27.080842 #88592]  INFO -- : Instrumentation: OpenTelemetry::Instrumentation::Resque was successfully installed with the following options {:span_naming=>:job_class, :propagation_style=>:link}
performing ExampleJob with 12345
Span(ExampleJob send, {"messaging.system"=>"resque", "messaging.destination"=>"example", "messaging.destination_kind"=>"queue", "messaging.resque.job_class"=>"ExampleJob"})
Finished.

Note that only the ExampleJob send span is exported.

Notes

I also saw similar behavior when using SimpleSpanProcessor with InMemorySpanExporter, which also uses mutexes to lock resources 🤔 ... Perhaps this bug is related to thread or multiprocess safety. Resque does some strange things with concurrency in its workers; at least in the version above, the default is to fork for every job.

Setting FORK_PER_JOB=false in the env actually gives the correct result, but that's not generally desirable, as forking gives you better job isolation, protecting against memory space pollution.

The same behavior happens in the latest version of resque as of this writing (v2.6.0).

I wasn't able to test on Ruby 3 due to my local environment being a little borked, but we are stuck on Ruby 2.7.6 for now anyway.

It would be nice for this to Just Work out of the box, but in the meantime, if there is a workaround that allows this to export correctly even with FORK_PER_JOB enabled, that would be appreciated.

Thanks!

Answered by tlubz

Apr 23, 2024


tlubz (Author) · 2024-04-23T20:00:24Z

Digging into this a little more, it appears that when running in FORK_PER_JOB mode, Resque::Worker#work_one_job doesn't have any way to flush the batch (or wait for a successful flush) after the ExampleJob process span finishes but before the forked process terminates.

Adding e.g. Resque.after_perform { sleep 2 } causes at least the internal foo span above to be emitted, but the ExampleJob process span is still swallowed.

This might be fixed by shutting down any BatchSpanProcessors right before the forked process terminates, which would attempt a flush.

I believe, for example, that calling OpenTelemetry.tracer_provider.shutdown inside the ensure block of OpenTelemetry::Instrumentation::Resque::Patches::ResqueJob#perform when the worker is in a forked process could accomplish this. Is there a precedent for this in other instrumentation?
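
To make the failure mode concrete, here is a minimal stdlib-only sketch of what I think is happening. ToyBatchProcessor and run_job_in_fork are hypothetical names made up for illustration; this is not the SDK's BatchSpanProcessor, only a stand-in that buffers "spans" and exports them from a background thread on a timer. The forked child ends with exit!, which is my assumption about how a fork-per-job child terminates.

```ruby
# Toy stand-in for a batching span processor. NOT the OpenTelemetry SDK:
# the class and method bodies here are hypothetical.
class ToyBatchProcessor
  def initialize(sink, schedule_delay:)
    @sink = sink                     # IO that receives "exported" span names
    @schedule_delay = schedule_delay # seconds between background exports
    @buffer = []
    @mutex = Mutex.new
    @thread = nil
  end

  # Buffer a finished span; a background thread exports it later.
  def on_finish(name)
    @mutex.synchronize { @buffer << name }
    @thread ||= Thread.new do
      loop do
        sleep @schedule_delay
        force_flush
      end
    end
  end

  # Export whatever is buffered right now.
  def force_flush
    batch = @mutex.synchronize { @buffer.shift(@buffer.size) }
    batch.each { |name| @sink.puts(name) }
    @sink.flush
  end
end

# Run a "job" in a forked child that ends with exit! and return the span
# names that actually reached the parent through the pipe.
def run_job_in_fork(flush_before_exit:)
  reader, writer = IO.pipe
  $stdout.flush # avoid duplicating buffered parent output in the child
  pid = fork do
    reader.close
    processor = ToyBatchProcessor.new(writer, schedule_delay: 5)
    begin
      processor.on_finish('foo')
      processor.on_finish('ExampleJob process')
    ensure
      processor.force_flush if flush_before_exit
    end
    exit!(0) # skips at_exit hooks and does not wait for the export thread
  end
  writer.close
  exported = reader.read.lines.map(&:chomp)
  Process.wait(pid)
  exported
ensure
  reader&.close
end

without_flush = run_job_in_fork(flush_before_exit: false)
with_flush = run_job_in_fork(flush_before_exit: true)
puts "without flush: #{without_flush.inspect}"
puts "with flush:    #{with_flush.inspect}"
```

Without the explicit flush the child is gone long before the timer fires, so nothing is exported; with the flush in the ensure block both spans make it out.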


arielvalentin (Maintainer) · 2024-04-23T21:20:20Z

Have you tried adding an at_exit handler and shutting down the SDK?

This will attempt to flush any pending spans when the resque job exits.


tlubz (Author) · 2024-04-23T23:00:01Z

> Have you tried adding an at_exit handler and shutting down the SDK?

I see what you are saying. This workaround seems to get span batching to work as intended with forking resque workers:

at_exit do
  OpenTelemetry.tracer_provider.shutdown
end

worker = Resque::Worker.new('example')
worker.run_at_exit_hooks = true
worker.work_one_job

Note the necessity of telling the worker to actually run at_exit hooks (also configurable via the env var RUN_AT_EXIT_HOOKS=true), as the default is to skip them by exiting with exit!.
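
For anyone wondering why the hook needs that extra setting, here is a small stdlib-only sketch of the difference between Kernel#exit and Kernel#exit! in a forked child. The helper name at_exit_hook_ran? is hypothetical and nothing here touches Resque or OpenTelemetry; the hook writes to a pipe where the real workaround would call shutdown.

```ruby
# Fork a child that registers an at_exit hook, end the child with the given
# Kernel method, and report whether the hook ran.
def at_exit_hook_ran?(exit_method)
  reader, writer = IO.pipe
  $stdout.flush # avoid duplicating buffered parent output in the child
  pid = fork do
    reader.close
    at_exit do
      # Stand-in for OpenTelemetry.tracer_provider.shutdown
      writer.write('flushed')
      writer.flush
    end
    # exit runs at_exit hooks; exit! terminates immediately without them.
    exit_method == :exit ? exit(0) : exit!(0)
  end
  writer.close
  output = reader.read
  Process.wait(pid)
  output == 'flushed'
ensure
  reader&.close
end

hook_ran_with_exit = at_exit_hook_ran?(:exit)
hook_ran_with_exit_bang = at_exit_hook_ran?(:exit!)
puts "exit:  at_exit hook ran? #{hook_ran_with_exit}"
puts "exit!: at_exit hook ran? #{hook_ran_with_exit_bang}"
```

The hook only runs in the exit case, which matches what I see: the at_exit shutdown has no effect until the worker is told to run at_exit hooks.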

Is putting OpenTelemetry.tracer_provider.shutdown in an at_exit hook best practice across opentelemetry-ruby instrumentation, or is this a special-case workaround?


arielvalentin (Maintainer) · 2024-04-23T23:27:03Z

Generally speaking, yes.

I'm not aware of any hooks that run on workers when they terminate. If you know of one, please share it in the README of the instrumentation.

If you have a preferred method of graceful shutdown then register it there instead.

The other option is to set force_flush: true as part of the instrumentation options. This will force the BSP to export spans when the job completes.

However, that will not gracefully shut down the SDK when the worker exits, so YMMV.


tlubz (Author) · 2024-04-24T16:35:52Z

Sounds good

I see other instrumentation libraries calling force_flush in an ensure block around the instrumented call. e.g. in OpenTelemetry::Instrumentation::Rake::Patches::Task#execute:

def execute(args = nil)
  tracer.in_span('rake.execute', attributes: { 'rake.task' => name }) do
    super
  end
ensure
  force_flush
end

Does missing this call qualify as a bug? Or do you want to keep the option to flush on each job open to the user?
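
To spell out the pattern I mean, here is a stdlib-only sketch of a prepended patch that flushes in an ensure block. FakeJob, FlushAfterPerform and FLUSH_LOG are hypothetical stand-ins, not code from any instrumentation gem; the point is only that the ensure block runs whether the wrapped call returns or raises, and that it does not change the return value.

```ruby
# Hypothetical stand-ins: FakeJob plays the instrumented class and
# FLUSH_LOG records each time a flush would have been requested.
FLUSH_LOG = []

class FakeJob
  def initialize(should_fail)
    @should_fail = should_fail
  end

  def perform
    raise 'job failed' if @should_fail

    :ok
  end
end

# Patch in the style of the Rake example: call the original method and
# flush in an ensure block so it runs on success and on failure.
module FlushAfterPerform
  def perform
    super
  ensure
    FLUSH_LOG << :force_flush # a real patch would ask the provider to flush here
  end
end

FakeJob.prepend(FlushAfterPerform)

result = FakeJob.new(false).perform
error = begin
  FakeJob.new(true).perform
  nil
rescue RuntimeError => e
  e.message
end

puts "result=#{result.inspect} error=#{error.inspect} flushes=#{FLUSH_LOG.size}"
```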


arielvalentin (Maintainer) · 2024-04-24T16:40:32Z

As I mentioned, it is an option you may set globally so the choice is yours:

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/resque/lib/opentelemetry/instrumentation/resque/instrumentation.rb#L75


arielvalentin (Maintainer) · 2024-04-24T16:42:11Z

Hmm, looking at this more closely, though, it is a little confusing. It should be force flushing forked jobs by default:

opentelemetry-ruby-contrib/instrumentation/resque/lib/opentelemetry/instrumentation/resque/patches/resque_job.rb

Line 58 in c642531

if (config[:force_flush] == :ask_the_job && worker&.fork_per_job?) ||

What happens when you set it to always?
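
For reference, setting it would look roughly like the configuration sketch below. This assumes a version of the Resque instrumentation that has the force_flush option, and it assumes the value is the symbol :always, by analogy with the :ask_the_job symbol in the source line quoted above; check the instrumentation's README for the accepted values.

```ruby
require 'opentelemetry-sdk'
require 'opentelemetry-instrumentation-resque'

OpenTelemetry::SDK.configure do |c|
  c.use(
    'OpenTelemetry::Instrumentation::Resque',
    span_naming: :job_class,
    # Assumption: :always is an accepted value next to the :ask_the_job
    # default; it should request a flush after every job, forked or not.
    force_flush: :always
  )
end
```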


tlubz (Author) · 2024-04-24T18:15:05Z

OK, I see where I was confused now. The unexpected behavior is happening on an older version of the Resque instrumentation (0.3.1).

It looks like the force_flush option was added when support for Ruby 2.7 was dropped, so our code running on Ruby 2.7.6 doesn't have access to it.

It sounds like we will have to use the at_exit workaround until we finish upgrading our codebase to Ruby 3?


arielvalentin (Maintainer) · 2024-04-24T18:50:28Z

Indeed! Thanks for taking the extra time to dig into this.
