Intermittent deadlocks in zaza #580

Open
fnordahl opened this issue Oct 11, 2022 · 2 comments

@fnordahl
Collaborator

Since commit b393baa, models intermittently get stuck while initializing vault. They appear to hang here:

return await model.applications[application_name].get_config()

@fnordahl
Collaborator Author

@ajkavanagh
Collaborator

Looking into the linked issue, it appears it got stuck in async_block_until_all_units_idle(..) (see the traceback further down). That check calls all_units_idle(), which lives in libjuju. The zaza/model.py code:

async def async_block_until_all_units_idle(model_name=None, timeout=2700):
    """Block until all units in the given model are idle.

    An example accessing this function via its sync wrapper::

        block_until_all_units_idle('modelname')

    :param model_name: Name of model to query.
    :type model_name: str
    :param timeout: Time to wait for status to be achieved
    :type timeout: float
    """
    model = await get_model(model_name)
    await block_until_auto_reconnect_model(
        lambda: units_with_wl_status_state(
            model, 'error') or model.all_units_idle(),
        model=model,
        timeout=timeout)
    errored_units = units_with_wl_status_state(model, 'error')
    if errored_units:
        raise UnitError(errored_units)

The relevant code in libjuju:

    def all_units_idle(self):
        """Return True if all units are idle.
        """
        for unit in self.units.values():
            unit_status = unit.data['agent-status']['current']
            if unit_status != 'idle':
                return False
        return True

This check runs in an async loop that includes an await asyncio.sleep(...) so that the other coroutines get a chance to make progress. I'm guessing that something is getting hung up in libjuju and it's not making progress on the status values. For example, in libjuju a coroutine updates the unit data structures from messages received from the juju controller; if that coroutine stalls, the check keeps reading stale data.
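
To illustrate that guess, here is a minimal sketch of the polling pattern. It is not the zaza or libjuju implementation: block_until, demo, the updater coroutine and the unit_data dict are hypothetical stand-ins. It only shows that when whatever updates the status stops running, the poller never sees 'idle' and asyncio.wait_for ends in a TimeoutError.

```python
import asyncio


async def block_until(condition, timeout, interval=0.1):
    """Poll a synchronous condition until it is true or the timeout expires."""
    async def _block():
        while not condition():
            # Yield to the event loop so other coroutines can run.
            await asyncio.sleep(interval)
    await asyncio.wait_for(_block(), timeout)


async def demo(updater_alive):
    # Hypothetical stand-in for the unit data that libjuju keeps up to date.
    unit_data = {'agent-status': {'current': 'executing'}}

    async def updater():
        # Simulates a status message arriving from the controller.
        await asyncio.sleep(0.05)
        unit_data['agent-status']['current'] = 'idle'

    task = asyncio.ensure_future(updater()) if updater_alive else None
    try:
        await block_until(
            lambda: unit_data['agent-status']['current'] == 'idle',
            timeout=0.5, interval=0.01)
        return 'idle'
    except asyncio.TimeoutError:
        # Nothing updated the status, so the poll ran until the timeout.
        return 'timeout'
    finally:
        if task is not None:
            await task
```

With the updater running, demo(True) sees the status change; with no updater, demo(False) times out, which is the same shape as the failure in the traceback.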


2022-10-11 19:05:20.852261 | focal-medium |   File "/home/ubuntu/src/review.opendev.org/x/charm-ovn-chassis/src/.tox/func-target/lib/python3.8/site-packages/zaza/model.py", line 1753, in async_block_until_all_units_idle
2022-10-11 19:05:20.852271 | focal-medium |     await block_until_auto_reconnect_model(
2022-10-11 19:05:20.852281 | focal-medium |   File "/home/ubuntu/src/review.opendev.org/x/charm-ovn-chassis/src/.tox/func-target/lib/python3.8/site-packages/zaza/model.py", line 414, in block_until_auto_reconnect_model
2022-10-11 19:05:20.852294 | focal-medium |     await asyncio.wait_for(_block(), timeout)
2022-10-11 19:05:20.852307 | focal-medium |   File "/usr/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
2022-10-11 19:05:20.852317 | focal-medium |     raise exceptions.TimeoutError()
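
The traceback only says that the wait timed out, not which unit was still non-idle. A small helper along these lines could be logged when the timeout fires to narrow that down. This is a sketch under assumptions: non_idle_units is a hypothetical name, the unit names are made up, and plain dicts stand in for the unit.data of libjuju unit objects, mirroring the 'agent-status' lookup in all_units_idle() above.

```python
def non_idle_units(units):
    """Return {unit_name: agent_status} for every unit that is not idle.

    `units` maps unit names to dicts shaped like the unit data read by
    all_units_idle(); plain dicts are used here instead of libjuju objects.
    """
    return {
        name: data['agent-status']['current']
        for name, data in units.items()
        if data['agent-status']['current'] != 'idle'
    }


# Hypothetical example data, not taken from the failing run.
example_units = {
    'vault/0': {'agent-status': {'current': 'idle'}},
    'ovn-chassis/0': {'agent-status': {'current': 'executing'}},
}
stuck = non_idle_units(example_units)
```

If the reported statuses never change between runs, that would support the theory that the data is no longer being updated rather than the units being genuinely busy.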
