Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry pick chanks/que#166 #59

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Cherry pick chanks/que#166 #59

wants to merge 1 commit into from

Conversation

isaacseymour
Copy link

This should help with using a secondary database connection pool inside
Que jobs.

Original issue is chanks/que#166

This should help with using a secondary database connection pool inside
Que jobs.

Original issue is [que-rb/que#166][166]

[166]: que-rb/que#166
@isaacseymour isaacseymour changed the title Cherry pick [chanks/que#166][166] Cherry pick chanks/que#166 Apr 22, 2020
@matthieuprat
Copy link
Contributor

On second look, it seems that the upstream version of Que only wraps the logic to retrieve the database connection in the executor. I don't think this is going to address the issue we're having here.

I reckon we'd need to wrap the whole job in the executor (that is, the Que::Job#run call).

Basically, the underlying issue is that when we're checking out a connection from the replica connection pool, we're not checking it back in when the job finishes. This is because clear_active_connections only clears active connections from the primary connection pool.

We could work around this issue by amending the cleanup! method:

- ::ActiveRecord::Base.clear_active_connections!
+ ActiveRecord::Base.connection_handlers.each_value do |handler|
+   handler.connection_pool.clear_active_connections!
+ end

Ideally though, we would wrap the job in the executor (among other things, the executor checks back all connections into the pool for us, which means we wouldn't need the above). This is what ActiveJob does. (ActiveJob actually wraps jobs in the reloader, which calls the executor. We could probably wrap Que jobs in the reloader as well but we don't need to do that to fix the connection issue.)

@matthieuprat
Copy link
Contributor

On third look, we do wrap the job in Que.adapter.checkout (I originally thought that this checkout method was only used by Que when retrieving/updating/destroying the job).

So, I actually believe this would fix our issue 😄

@matthieuprat
Copy link
Contributor

matthieuprat commented May 6, 2020

For future reference, here is the error we're hoping to get fixed with this PR:

NoMethodError: undefined method `owner' for nil:NilClass
  from active_record/connection_adapters/abstract/connection_pool.rb:883:in `remove_connection_from_thread_cache'
  from active_record/connection_adapters/abstract/connection_pool.rb:623:in `block in remove'
  from monitor.rb:235:in `mon_synchronize'
  from active_record/connection_adapters/abstract/connection_pool.rb:622:in `remove'
  from lib/database_replica.rb:20:in `rescue in block in with_connection'
  from lib/database_replica.rb:6:in `block in with_connection'
  ...
  from que/worker_group.rb:17:in `block (2 levels) in start'

Caused by PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

  from active_record/connection_adapters/postgresql_adapter.rb:672:in `exec_params'
  ...
  from app/workers/update_creditor_balance_cache.rb:130:in `run_on_replica'
  from app/workers/update_creditor_balance_cache.rb:119:in `block in run'
  from lib/database_replica.rb:6:in `block in with_connection'
  ...
  from que/worker_group.rb:17:in `block (2 levels) in start'

And here is the original investigation (copy-pasted from Sentry):

What is the problem?

The root cause of the issue is that e.cause.connection here and there is sometimes nil. As a result, we're essentially calling pool.remove(nil) which results in the following error being raised here in ActiveRecord:

NoMethodError: undefined method `owner' for nil:NilClass
  from active_record/connection_adapters/abstract/connection_pool.rb:883:in `remove_connection_from_thread_cache'
  ...
  from active_record/connection_adapters/abstract/connection_pool.rb:622:in `remove'
  from banking_integrations/common/error_handling_helpers.rb:33:in `rescue in handle_error'
  ...

How to reproduce the problem?

In a Rails console:

pry> require "database_replica"
pry> DatabaseReplica.with_connection { ActiveRecord::Base.connection.execute("select pg_sleep(10)"); }

In another shell:

$ pkill -9 postgres

How to fix it?

A quick and dirty fix could be to fallback on ActiveRecord::Base.connection if e.cause.connection is nil (should work fine if we only checked out one connection from the pool).

A better fix would be to remove our custom logic to handle connection errors and wrap jobs in the Rails executor like the upstream version of Que does. (Doing so will ensure that all connections are checked back into the pool when a job has finished running — whereas we're only checking in connections of the primary connection handler at the moment. Checking in connections helps in that context because when a connection is subsequently checked out, Rails checks whether it's "active" — and discards it if it's not.)

@danielroseman
Copy link

@isaacseymour can we merge this? The exception happens quite a lot.

@isaacseymour
Copy link
Author

@danielroseman would love to, just want to be cautious about deploying it in staging for a bit to make sure it doesn't do Terrible Things in the real world. If you have time to do that and monitor it, would be awesome :)

@jasonlafferty
Copy link

@isaacseymour how this this going? Was going through MIXP Sentry's and came across this.

@isaacseymour
Copy link
Author

Someone who has some spare time should give it a go in staging I think, I haven't had time (or remembered about this) recently. You're more than welcome to if you have time!

@matthieuprat
Copy link
Contributor

matthieuprat commented Jun 5, 2020

Just want to set the right expectations on this PR for whoever wants to pick it up: it won't fix the underlying issue, which is about the database being unreachable and which we, unfortunately, can't do much about.

What this fix will give us though is:

  • A "better" exception in Sentry (instead of the cryptic NoMethodError: undefined method 'owner' for nil:NilClass error, we'll get a nice and clean ActiveRecord::StatementInvalid error, like this one).

  • Fewer PG::UnableToSend: no connection to the server exceptions like this one (because currently, bad connections are not properly cleaned up, which means they get reused in subsequent jobs, which eventually results in PG::UnableToSend errors).

(@isaacseymour to confirm that I'm not saying utter nonsense!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants