Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: idsse-795: PublishConfirm setup timeout #67

Merged
merged 4 commits into from
Jul 16, 2024

Conversation

mackenzie-grimes-noaa
Copy link
Contributor

@mackenzie-grimes-noaa mackenzie-grimes-noaa commented Jul 16, 2024

Linear Issue

IDSSE-795

Changes

  • Add optional timeout to PublishConfirm _wait_for_channel_to_be_ready(). Default is 6 seconds
  • If RabbitMQ setup did not work, or the PublishConfirm thread did not report back after 6 seconds, give up trying to publish the message, and immediately return to the caller to let them know.

Explanation

My current theory is that PublishConfirm hanging/deadlocking bug was due to the child Thread failing, for some reason, to connect to RabbitMQ and establish its exchange/queue... but because it's not the main thread, no logs were appearing in the console/stdout to help us troubleshoot. At least this is what I saw locally, and adding debug breakpoints helped me find that my exchange/queue parameters did not match the existing exchanges and queues on the RabbitMQ server.

These changes may not resolve any issues, but it will surface them more clearly, by creating a Python Future which the RabbitMQ setup callbacks will either resolve to True if everything worked, or an exception if RabbitMQ throws one, or timeout after 6 seconds.

Copy link
Contributor

@Geary-Layne Geary-Layne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a timeout for setup is a good idea. See the linear task for a post that might help with logs from threads. I also put on comment that I think would be good for original bug, it follows the logic in the RPC class. Also see PublisherSync for the code I added to re-connect, though you might be able it improve on the implementation.

if callback is not None:
self._on_ready_callback = callback # to be invoked after all pika setup is done
if is_ready is not None:
self._is_ready_future = is_ready # to be invoked after all pika setup is done
self._thread.start()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should put an if self._thread.is_alive around this, but that is more for the original problem not what is covered in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back on forth on this, and either way your comment points to how messy and hard-to-read this class has gotten.

I put an is_alive() check in the public method start(), which in turn calls this private _start() method. However, the methods aren't identical: public start doesn't accept a callback/Future argument, which the private _start does. And _wait_for_channel_to_be_ready calls the private start, not the public one.

Should I just combine them to be one consistent function? Or is there a better organization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put the 'is_alive' check right before the thread.start(), that means it is always checked. As of right now I don't see the need for a public and private start, so I'd get rid of the private, which is how rpc_client.py is written.


Returns:
bool: True if channel is ready. False if timed out waiting for RabbitMQ to connect
"""

# validate that PublishConfirm thread has been set up and connected to RabbitMQ
logger.info('DEBUG _wait_for_channel_to_be_ready state')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like debugs not info logs

@mackenzie-grimes-noaa mackenzie-grimes-noaa merged commit c918996 into main Jul 16, 2024
2 checks passed
@mackenzie-grimes-noaa mackenzie-grimes-noaa deleted the bug/pub-conf-timeout branch July 16, 2024 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants