[orchagent] RouteOrch cannot consume new routes if there are enough routes being tried in the m_toSync #3027

Open
goomadao opened this issue Jan 29, 2024 · 1 comment

goomadao commented Jan 29, 2024

Description

When many routes in the ROUTE TABLE consumer's m_toSync are being retried continuously (blocked, for example, by a non-existent neighbor), the Consumer can no longer pop any new routes through Consumer::execute(). The number of retrying routes needed to trigger this issue depends on the shortest interval among the Timers whose priority is higher than that of the ROUTE TABLE Consumer. The priority of the ROUTE TABLE Consumer is 5.

Steps to reproduce the issue

  1. Distribute routes referencing NHG 5822, which does not exist or was deleted earlier
  2. Deliver NHG 16518
  3. Update all the routes to reference NHG 16518

Describe the results you received

The old routes are retried indefinitely and the new routes are never consumed. RouteOrch gets stuck in this state.

Describe the results you expected

New routes are able to be consumed and processed by route orch properly.

Root cause of this issue

In OrchDaemon::start(), one Selectable is selected and its execute() function is called. After that, doTask() of every orch is triggered, which retries all the remaining tasks. Therefore, if enough routes are being retried, and a Timer with a higher priority than the ROUTE TABLE Consumer has an interval shorter than the time one retry pass takes, the Timer is ready again on every pass and the ROUTE TABLE Consumer is never selected. In other words, new routes are never consumed.
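
The starvation described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the orchagent code: `simulate`, `SimResult`, and the millisecond parameters are hypothetical, time is simulated with a counter, and the consumer is assumed to always have new routes waiting.

```cpp
#include <cstdint>

// Hypothetical toy model of the select loop; not the sonic-swss implementation.
struct SimResult {
    int timerSelected = 0;     // passes in which the higher-priority timer won
    int consumerSelected = 0;  // passes in which the route consumer was picked
};

// Each pass: (1) pick the highest-priority ready selectable, then
// (2) spend `retryMs` retrying the pending tasks of all orchs.
// The timer outranks the consumer and becomes ready `timerIntervalMs`
// after it was last serviced.
SimResult simulate(int iterations, std::int64_t timerIntervalMs, std::int64_t retryMs)
{
    SimResult result;
    std::int64_t now = 0;
    std::int64_t nextTimerFire = timerIntervalMs;

    for (int i = 0; i < iterations; ++i) {
        if (now >= nextTimerFire) {
            // The timer is ready and has higher priority, so it always wins.
            ++result.timerSelected;
            nextTimerFire = now + timerIntervalMs;
        } else {
            // Only reached when no higher-priority selectable is ready.
            ++result.consumerSelected;
        }
        now += retryMs;  // cost of retrying every remaining task
    }
    return result;
}
```

With a 50 ms timer and an 80 ms retry pass, `simulate(100, 50, 80)` picks the consumer only on the very first pass, before the timer has fired once. With a 10 ms retry pass the consumer is picked on most passes.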

Additional information you deem important (e.g. issue happens only occasionally):

This was triggered occasionally in our testbed, where BGP was flapping and some interfaces were being shut down and brought back up. It may also contribute to the issue that we have an additional Timer with a 50 ms interval.

Possible solution

Modify the retry mechanism. For example, the retry could be performed only every second loop. The change could also be limited to the route orch to narrow its scope of impact.
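
A minimal sketch of the "retry every N loops" idea, assuming a small counter consulted once per loop pass. `RetryThrottle` and `shouldRetry` are hypothetical names that do not exist in orchagent; this is an illustration, not a proposed patch.

```cpp
#include <cstddef>

// Hypothetical helper: tells the main loop whether this pass should
// retry the pending tasks or skip the retry to keep the pass short.
class RetryThrottle {
public:
    // A period of 0 is treated as 1, meaning "retry on every pass".
    explicit RetryThrottle(std::size_t period)
        : m_period(period == 0 ? 1 : period) {}

    // Returns true on every m_period-th call and false otherwise.
    bool shouldRetry()
    {
        m_count = (m_count + 1) % m_period;
        return m_count == 0;
    }

private:
    std::size_t m_period;
    std::size_t m_count = 0;
};
```

With `RetryThrottle throttle(2);`, the loop would run the retry only when `throttle.shouldRetry()` returns true, that is, on every second pass. The skipped passes finish quickly, which gives lower-priority selectables a chance to be picked before the timer fires again.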

@goomadao
Author

Another problem is that the priority currently does not take effect. As the attached screenshot shows, the priority of the ROUTE TABLE Consumer is 0 at runtime, not 5 as defined. In this situation, the issue above does not occur.

To make the priority take effect, the following change can be applied.

--- a/orchagent/orch.h
+++ b/orchagent/orch.h
@@ -96,7 +96,8 @@ class Executor : public swss::Selectable
 {
 public:
     Executor(swss::Selectable *selectable, Orch *orch, const std::string &name)
-        : m_selectable(selectable)
+        : Selectable(selectable->getPri())
+        , m_selectable(selectable)
         , m_orch(orch)
         , m_name(name)
     {
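
Why the one-line change matters can be shown with toy stand-ins. This sketch assumes the base class defaults its priority to 0 when no argument is passed, as the observed value suggests. All `Toy*` names are hypothetical and only mimic the shape of swss::Selectable and Executor.

```cpp
// Toy stand-in for the selectable base class: priority defaults to 0.
class ToySelectable {
public:
    explicit ToySelectable(int pri = 0) : m_priority(pri) {}
    virtual ~ToySelectable() = default;
    int getPri() const { return m_priority; }

private:
    int m_priority;
};

// Before the change: the wrapper's own base is default-constructed,
// so the wrapped object's priority is lost.
class ToyExecutorBefore : public ToySelectable {
public:
    explicit ToyExecutorBefore(ToySelectable *inner) : m_inner(inner) {}
    ToySelectable *inner() const { return m_inner; }

private:
    ToySelectable *m_inner;
};

// After the change: the wrapped priority is forwarded to the base class.
class ToyExecutorAfter : public ToySelectable {
public:
    explicit ToyExecutorAfter(ToySelectable *inner)
        : ToySelectable(inner->getPri()), m_inner(inner) {}
    ToySelectable *inner() const { return m_inner; }

private:
    ToySelectable *m_inner;
};
```

Wrapping a `ToySelectable` constructed with priority 5, `ToyExecutorBefore::getPri()` reports 0 while `ToyExecutorAfter::getPri()` reports 5. Since the select loop sees the wrapper rather than the wrapped object, only the second variant lets the configured priority influence selection.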
