neutron: set a failure-timeout on neutron-ha-tool #2063

Open
dirkmueller wants to merge 1 commit into master

Conversation

@dirkmueller (Contributor) commented Mar 18, 2019

We don't want the l3 ha tool service to be stopped after 3 weeks of weekly
patching and rebooting of the rabbitmq cluster. Set a failure-timeout so that
a failure expires if it happened more than 10 minutes ago.

@@ -154,6 +154,9 @@
agent "systemd:neutron-l3-ha-service"
op node[:neutron][:ha][:neutron_l3_ha_resource][:op]
action :update
meta ({
Lint/ParenthesesAsGroupedExpression: (...) interpreted as grouped expression. (https://github.com/bbatsov/ruby-style-guide#parens-no-spaces)
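
For orientation, here is a minimal sketch of what the completed resource block might look like once the truncated "meta ({" line is filled in. The pacemaker_primitive wrapper, the resource name, and the "600s" value (10 minutes, per the commit message) are assumptions about context the hunk does not show; writing the hash without the extra space (or without the parentheses) would also address the Lint/ParenthesesAsGroupedExpression warning quoted above.

    # Hypothetical reconstruction; everything outside the quoted hunk lines
    # is an assumption, not the actual cookbook code.
    pacemaker_primitive "neutron-l3-ha-service" do
      agent "systemd:neutron-l3-ha-service"
      op node[:neutron][:ha][:neutron_l3_ha_resource][:op]
      action :update
      # Expire a recorded failure once it is more than 10 minutes old, so a
      # transient outage does not leave the resource stopped for good.
      meta("failure-timeout" => "600s")
    end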

@aspiers (Member) left a comment

The commit message references the l3 agent but the change affects neutron-l3-ha-service. It's not clear to me what the exact problem is or why timing out a failure of neutron-l3-ha-service would address it. I'm guessing there is some missing detail regarding the interaction between the two - please can you clarify in the commit message?

Updated commit message:

We don't want the neutron-ha-tool service to be stopped after 3 weeks of weekly
patching and rebooting of the rabbitmq cluster. Set a failure-timeout so that
a failure expires if it happened more than 10 minutes ago.
@dirkmueller (Contributor, Author) commented

@aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors.

For example, recently somebody broke keystone for about 15 minutes, and that caused pacemaker to stop the service due to repeated failure. This is not helpful for achieving high availability when pacemaker just kills the service that should take care of availability.
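
For context on the pacemaker behaviour being described: once a resource's fail-count on a node reaches its migration-threshold, pacemaker marks that node ineligible to host the resource, and without a failure-timeout the fail-count only goes away after a manual cleanup. A hedged sketch of the two meta attributes side by side; the migration-threshold=3 value is taken from the CI output quoted below, and combining them in one hash here is purely illustrative:

    # Illustrative pairing of the existing threshold with the proposed timeout.
    meta(
      # After 3 recorded failures the resource is banned on the node ...
      "migration-threshold" => "3",
      # ... but failures older than 10 minutes are forgotten, so a short
      # keystone or rabbitmq outage cannot keep the service down for good.
      "failure-timeout"     => "600s"
    )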

@aspiers (Member) commented Mar 31, 2019

@aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors.

For example, recently somebody broke keystone for about 15 minutes, and that caused pacemaker to stop the service due to repeated failure. This is not helpful for achieving high availability when pacemaker just kills the service that should take care of availability.

OK thanks, that makes sense now. Ideally I would prefer that info to be in the commit message too, since the commit message doesn't feel entirely self-explanatory yet. But the main problem seems to be that the CI is currently failing:

+(qa_crowbarsetup.sh:3967) oncontroller_check_crm_failcounts(): [[ 1 = 1 ]]
+(qa_crowbarsetup.sh:3967) oncontroller_check_crm_failcounts(): [[ disallowskipfailcount = \d\i\s\a\l\l\o\w\s\k\i\p\f\a\i\l\c\o\u\n\t ]]
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): crm_mon --failcounts -1
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): grep fail-count=
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): complain 55 'Cluster resources'\'' failures detected'
+(mkcloud-common.sh:114) complain(): local ex=55
+(mkcloud-common.sh:114) complain(): shift
   neutron-l3-ha-service: migration-threshold=3 fail-count=3 last-failure='Mon Mar 25 21:27:23 2019'
+(mkcloud-common.sh:115) complain(): printf 'Error (55): %s\n' 'Cluster resources'\'' failures detected'
Error (55): Cluster resources' failures detected
+(mkcloud-common.sh:116) complain(): [[ 55 = - ]]
+(mkcloud-common.sh:116) complain(): exit 55

I guess that's probably related to this change somehow.

@aspiers (Member) commented Mar 31, 2019

@aspiers commented on March 31, 2019 1:26 PM:

But the main problem seems to be that the CI is currently failing:

[snipped]

I'm going to see if logreduce can help with this...
