Cherry pick chanks/que#166 #59

isaacseymour · 2020-04-22T11:06:57Z

This should help with using a secondary database connection pool inside Que jobs. Original issue is que-rb/que#166.

matthieuprat · 2020-04-23T09:29:16Z

On second look, it seems that the upstream version of Que only wraps the logic to retrieve the database connection in the executor. I don't think this is going to address the issue we're having here.

I reckon we'd need to wrap the whole job in the executor (that is, the Que::Job#run call).

Basically, the underlying issue is that when we check out a connection from the replica connection pool, we don't check it back in when the job finishes. This is because clear_active_connections! only clears active connections from the primary connection pool.

We could work around this issue by amending the cleanup! method:

- ::ActiveRecord::Base.clear_active_connections!
+ ActiveRecord::Base.connection_handlers.each_value do |handler|
+   handler.clear_active_connections!
+ end

Ideally though, we would wrap the job in the executor (among other things, the executor checks all connections back into the pool for us, which means we wouldn't need the above). This is what ActiveJob does. (ActiveJob actually wraps jobs in the reloader, which calls the executor. We could probably wrap Que jobs in the reloader as well, but we don't need to do that to fix the connection issue.)
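
To make the leak concrete, here is a toy model in plain Ruby. It is only a sketch: ToyPool, ToyHandler, HANDLERS and the helper methods are hypothetical stand-ins, not ActiveRecord classes, and the real pools behave in more subtle ways. It only shows why cleaning up the primary handler alone leaves the replica connection checked out.

```ruby
# Toy model only: ToyPool and ToyHandler are hypothetical stand-ins,
# not ActiveRecord classes.
class ToyPool
  attr_reader :checked_out

  def initialize
    @checked_out = []
  end

  # Hand out a "connection" and remember that it is in use.
  def checkout
    conn = Object.new
    @checked_out << conn
    conn
  end

  # Return every in-use connection to the pool.
  def clear_active_connections!
    @checked_out.clear
  end
end

ToyHandler = Struct.new(:pool) do
  def clear_active_connections!
    pool.clear_active_connections!
  end
end

HANDLERS = {
  primary: ToyHandler.new(ToyPool.new),
  replica: ToyHandler.new(ToyPool.new),
}.freeze

# Simulates a job that touches both databases.
def run_job
  HANDLERS.each_value { |handler| handler.pool.checkout }
end

# Models the current cleanup: only the primary handler is cleared.
def cleanup_primary_only!
  HANDLERS[:primary].clear_active_connections!
end

# Models the workaround: clear every handler.
def cleanup_all!
  HANDLERS.each_value(&:clear_active_connections!)
end

# Number of connections still checked out, per handler.
def leaked
  HANDLERS.transform_values { |handler| handler.pool.checked_out.size }
end

run_job
cleanup_primary_only!
p leaked # the replica pool still holds a connection

cleanup_all!
p leaked # nothing is left checked out
```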

matthieuprat · 2020-05-06T10:50:30Z

On third look, we do wrap the job in Que.adapter.checkout (I originally thought that this checkout method was only used by Que when retrieving/updating/destroying the job).

So, I actually believe this would fix our issue 😄

matthieuprat · 2020-05-06T10:59:28Z

For future reference, here is the error we're hoping to get fixed with this PR:

NoMethodError: undefined method `owner' for nil:NilClass
  from active_record/connection_adapters/abstract/connection_pool.rb:883:in `remove_connection_from_thread_cache'
  from active_record/connection_adapters/abstract/connection_pool.rb:623:in `block in remove'
  from monitor.rb:235:in `mon_synchronize'
  from active_record/connection_adapters/abstract/connection_pool.rb:622:in `remove'
  from lib/database_replica.rb:20:in `rescue in block in with_connection'
  from lib/database_replica.rb:6:in `block in with_connection'
  ...
  from que/worker_group.rb:17:in `block (2 levels) in start'

Caused by PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

  from active_record/connection_adapters/postgresql_adapter.rb:672:in `exec_params'
  ...
  from app/workers/update_creditor_balance_cache.rb:130:in `run_on_replica'
  from app/workers/update_creditor_balance_cache.rb:119:in `block in run'
  from lib/database_replica.rb:6:in `block in with_connection'
  ...
  from que/worker_group.rb:17:in `block (2 levels) in start'

And here is the original investigation (copy-pasted from Sentry):

What is the problem?

The root cause of the issue is that e.cause.connection here and there is sometimes nil. As a result, we're essentially calling pool.remove(nil) which results in the following error being raised here in ActiveRecord:

NoMethodError: undefined method `owner' for nil:NilClass
  from active_record/connection_adapters/abstract/connection_pool.rb:883:in `remove_connection_from_thread_cache'
  ...
  from active_record/connection_adapters/abstract/connection_pool.rb:622:in `remove'
  from banking_integrations/common/error_handling_helpers.rb:33:in `rescue in handle_error'
  ...

How to reproduce the problem?

In a Rails console:

pry> require "database_replica"
pry> DatabaseReplica.with_connection { ActiveRecord::Base.connection.execute("select pg_sleep(10)"); }

In another shell:

$ pkill -9 postgres

How to fix it?

A quick and dirty fix could be to fall back on ActiveRecord::Base.connection if e.cause.connection is nil (should work fine if we only checked out one connection from the pool).
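
As a rough sketch of that fallback, in plain Ruby with hypothetical stand-ins (ToyDriverError, ToyReplicaPool, discard_bad_connection and failing_query are invented for illustration and are not part of pg, ActiveRecord or our codebase):

```ruby
# Hypothetical stand-in for a driver error that may or may not carry
# the connection it failed on.
class ToyDriverError < StandardError
  attr_reader :connection

  def initialize(message, connection: nil)
    super(message)
    @connection = connection
  end
end

# Hypothetical stand-in for a pool that refuses to remove nil.
class ToyReplicaPool
  attr_reader :removed

  def initialize
    @removed = []
  end

  def remove(conn)
    raise ArgumentError, "cannot remove nil" if conn.nil?

    @removed << conn
  end
end

# Prefer the connection attached to the underlying error; fall back on
# the current connection when the error does not carry one.
def discard_bad_connection(pool, error, current_connection)
  bad = error.cause&.connection || current_connection
  pool.remove(bad)
  bad
end

# Raises a wrapper error whose #cause is a ToyDriverError.
def failing_query(connection_on_error)
  raise ToyDriverError.new("server closed the connection", connection: connection_on_error)
rescue ToyDriverError
  raise "statement invalid" # Ruby sets #cause to the rescued error
end

pool = ToyReplicaPool.new
begin
  failing_query(nil)
rescue => e
  discard_bad_connection(pool, e, :current_connection)
end
p pool.removed # the fallback connection was removed instead of nil
```

The caveat above still applies: the fallback is only a safe guess if exactly one connection was checked out.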

A better fix would be to remove our custom logic to handle connection errors and wrap jobs in the Rails executor like the upstream version of Que does. (Doing so will ensure that all connections are checked back into the pool when a job has finished running — whereas we're only checking in connections of the primary connection handler at the moment. Checking in connections helps in that context because when a connection is subsequently checked out, Rails checks whether it's "active" — and discards it if it's not.)
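
To illustrate the idea, here is a much-simplified toy executor in plain Ruby. ToyExecutor is a hypothetical stand-in, not the Rails class; in a Rails app the equivalent call would presumably be Rails.application.executor.wrap { ... } around the job. The point of the sketch is only that completion hooks run even when the job raises, so every pool gets a chance to check its connections back in.

```ruby
# Toy model only: a hypothetical, much-simplified stand-in for the
# Rails executor.
class ToyExecutor
  def initialize
    @on_complete = []
  end

  # Register a hook to run after every wrapped unit of work.
  def to_complete(&hook)
    @on_complete << hook
  end

  # Run the block, then run the completion hooks even if the block raised.
  def wrap
    yield
  ensure
    @on_complete.each(&:call)
  end
end

# Count of connections checked out per (hypothetical) pool.
checked_out = { primary: 0, replica: 0 }

executor = ToyExecutor.new
# Models "check every connection back in", for all pools at once.
executor.to_complete { checked_out.transform_values! { 0 } }

begin
  executor.wrap do
    checked_out[:primary] += 1
    checked_out[:replica] += 1
    raise "server closed the connection unexpectedly"
  end
rescue RuntimeError
  # The job failed, but the completion hooks have already run.
end

p checked_out # both counters are back to zero
```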

danielroseman · 2020-05-11T12:40:25Z

@isaacseymour can we merge this? The exception happens quite a lot.

isaacseymour · 2020-05-11T13:33:28Z

@danielroseman would love to, just want to be cautious about deploying it in staging for a bit to make sure it doesn't do Terrible Things in the real world. If you have time to do that and monitor it, would be awesome :)

jasonlafferty · 2020-06-05T16:03:51Z

@isaacseymour how is this going? Was going through MIXP Sentry's and came across this.

isaacseymour · 2020-06-05T16:11:35Z

Someone who has some spare time should give it a go in staging I think, I haven't had time (or remembered about this) recently. You're more than welcome to if you have time!

matthieuprat · 2020-06-05T17:34:20Z

Just want to set the right expectations on this PR for whoever wants to pick it up: it won't fix the underlying issue, which is about the database being unreachable and which we, unfortunately, can't do much about.

What this fix will give us though is:

A "better" exception in Sentry (instead of the cryptic NoMethodError: undefined method 'owner' for nil:NilClass error, we'll get a nice and clean ActiveRecord::StatementInvalid error, like this one).
Fewer PG::UnableToSend: no connection to the server exceptions like this one (because currently, bad connections are not properly cleaned up, which means they get reused in subsequent jobs, which eventually results in PG::UnableToSend errors).

(@isaacseymour to confirm that I'm not saying utter nonsense!)

Cherry pick [que-rb/que#166][166]

571fb18

isaacseymour changed the title ~~Cherry pick [chanks/que#166][166]~~ Cherry pick chanks/que#166 Apr 22, 2020

matthieuprat approved these changes May 6, 2020
