Preserve results when NLM_F_DUMP_INTR is set #1018

robmry · 2024-09-09T15:57:04Z

When the kernel detects a change in a set of resources being dumped (walked-over) by a netlink call, it sets flag NLM_F_DUMP_INTR to indicate that the results may be incomplete or inconsistent.

Before #925 (in v1.2.1), the flag was ignored and results were returned without an error. With that change, response handling is aborted, results are discarded, and unix.EINTR is returned.

Depending on what the caller is looking for, the incomplete/inconsistent results may be usable - and that may be preferable to retrying the whole request (which could take unbounded time on a system that's seeing a lot of changes in a lot of data).

This PR introduces a new error, netlink.ErrDumpInterrupted that's specific to the NLM_F_DUMP_INTR case. Response handling completes as normal, and data is returned along with the error. So, the caller can use something like errors.Is(err, netlink.ErrDumpInterrupted), and take appropriate action.

If any callers are checking for errors.Is(err, unix.EINTR), that'll continue to work. But, this is a breaking change for any callers checking the error with a simple err == unix.EINTR.

robmry · 2024-09-09T17:13:48Z

@adrianchiris, @aboch ... could you take a look?

adrianchiris · 2024-09-10T07:13:18Z

nl/nl_linux.go

+// ErrDumpInterrupted is an instance of errDumpInterrupted, used to report that
+// a netlink function has set the NLM_F_DUMP_INTR flag in a response message,
+// indicating that the results may be incomplete or inconsistent.
+var ErrDumpInterrupted = errDumpInterrupted{}


you could use:
errors.New("results may be incomplete or inconsistent")

It is not equivalent: errDumpInterrupted implements interface { Is(error) bool }. errors.Is(errors.New(""), unix.EINTR) == false while errors.Is(ErrDumpInterrupted, unix.EINTR) == true.

yes you are right.

adrianchiris · 2024-09-10T07:13:57Z

@robmry just as an FYI, the check was introduced to align with how libmnl is executing netlink commands. see [1]

can you provide more info on the use-case where its ok and needed to rely partial results instead of retry ? (except the ci failure u linked).

TBH i favor sticking with how iproute2/libmnl implements execution of these cmds unless there is a good reason to deviate

[1] https://github.com/justmirror/libmnl/blob/f14732339a77a2e6ba9ed4ca99b347ce1dc60801/src/callback.c#L70

robmry · 2024-09-10T10:54:56Z

@robmry just as an FYI, the check was introduced to align with how libmnl is executing netlink commands. see [1]

can you provide more info on the use-case where its ok and needed to rely partial results instead of retry ? (except the ci failure u linked).

TBH i favor sticking with how iproute2/libmnl implements execution of these cmds unless there is a good reason to deviate

[1] https://github.com/justmirror/libmnl/blob/f14732339a77a2e6ba9ed4ca99b347ce1dc60801/src/callback.c#L70

Thanks @adrianchiris - the moby (dockerd) maintainers were concerned about unbounded retries on the new EINTR. It's not quite like a retry on a normal EINTR from a system call as it's (perhaps) more likely to happen again - and the kernel interface gives the option of using the partial data.

So, the preference is to be able to retry a limited number of times and, in the unlikely event the retries fail, fall-back to using the partial data - on the basis that it'll be no worse than it was before the change, and no problems have been noticed/reported.

As an example of where partial data might be used in a more considered way - in this package, LinkByName can fall back to using linkByNameDump, which searches a dump for the name. If there's no result with the name, there's no way to tell if the link exists without a retry. But, if there's at-least one entry with the name, it might be what the caller wants ... this package can't tell. Not-discarding the result and returning a Link with the error means the caller can decide - it might want to retry, use the Link as-is, or it might want to get a more up-to-date version using the returned Index in LinkByIndex (which doesn't do a dump).

Many callers are still likely to see the error is non-nil and just take an error path without looking at the data.

iproute2 does something similar here https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/lib/libnetlink.c#L901 ... it doesn't retry on NLM_F_DUMP_INTR, but it's not a daemon and the user can choose to retry when they get an error.

aboch · 2024-09-10T17:31:15Z

Yes, the goal of this library is to be in sync with iproute2 behavior as much as possible.
If we start to add exceptions to accommodate this or this other project, we will end up introducing a lot of oddities in the code which will hinder the use of this library in future.

Per go guidelines, but in general a good practice for any language, if the the function returns an non nil error, any other returned variable must not be trusted.

The fact that you have each list method to still return the original INTR error is good, but if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

robmry · 2024-09-10T18:17:02Z

Yes, the goal of this library is to be in sync with iproute2 behavior as much as possible. If we start to add exceptions to accommodate this or this other project, we will end up introducing a lot of oddities in the code which will hinder the use of this library in future.

The iproute2 link I pasted above shows it handling NLM_F_DUMP_INTR as it's handled by this change.

There's nothing moby-specific about the change, or requirement ... it's discussed in my comment because I was responding to the request for a use-case, and moby is one. But, the LinkByName example is based on how it could be used by a function in this package (or its caller).

Per go guidelines, but in general a good practice for any language, if the the function returns an non nil error, any other returned variable must not be trusted.

I can't find anything like that in https://go.dev/doc/effective_go#errors - I also thought it was unusual, but io.ReadAll is a counter-example.

In this case, the error is very explicit that the results aren't to be trusted. The idea is to give the caller the choice (like the kernel and iproute2 do).

The fact that you have each list method to still return the original INTR error is good, but if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

I don't think that's possible - individual results from a collection are copied into the netlink response with locks held, so they'll be fine. The flag indicates that a generation number for the collection changed as those results were being accumulated. (A reference to a position in the collection is stored between netlink responses to a request, but that position can get out of date. Even with a single netlink response message, collections can be changed between reads of elements of the collection.)

corhere · 2024-09-10T18:33:44Z

if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

Before v1.2.1, NLM_F_DUMP_INTR flag was not checked when parsing messages. If it was possible for the kernel to return a truncated list, someone would have seen and complained about parsing errors in the same circumstances where we are currently complaining about EINTR. This library itself is very strong evidence that the kernel does not pass ill-formed messages to userspace when a dump is interrupted.

aboch · 2024-09-10T20:35:33Z

I guess you convinced me.
@robmry can you please add a comment line on top of each exposed method saying the in case of EINTR returned error the collection of elements is still returned but may not be current?

adrianchiris · 2024-09-11T09:27:07Z

Hi all,

Adding some more points:

kernel recommends to retry inconsistent dump commands [1]
i dont think that iproute makes a habbit of using inconsistent results of dump commands. from grepping a bit in iproute code
i found the example[2] listed in above comments used only in ip address flush[3]

ipaddr_flush() --- uses ---> rtnl_dump_filter_nc() --- uses ---> rtnl_dump_filter_l()

NLM_F_DUMP_INTR is relevant only for dump commands (NLM_F_DUMP flag set ) so for non dump commands no need to do the check IMO.
LinkByName in this library does a fallback to dump command only when kernel does not support getting the specific interface according to comments [4]

That said, i have no objection for the change as long as we still fail if results are inconsistent.

Thanks @adrianchiris - the moby (dockerd) maintainers were concerned about unbounded retries on the new EINTR. It's not quite like a retry on a normal EINTR from a system call as it's (perhaps) more likely to happen again - and the kernel interface gives the option of using the partial data.

please note that the kernel refers to this data as "inconsistent" rather then partial.

[1] https://github.com/torvalds/linux/blob/8d8d276ba2fb5f9ac4984f5c10ae60858090babc/Documentation/userspace-api/netlink/intro.rst#dump-consistency
[2] https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/lib/libnetlink.c#L901
[3] https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/ip/ipaddress.c#L1996
[4]

netlink/link_linux.go

Line 1871 in a018296

// older kernels don't support looking up via IFLA_IFNAME

robmry · 2024-09-11T10:10:44Z

I guess you convinced me. @robmry can you please add a comment line on top of each exposed method saying the in case of EINTR returned error the collection of elements is still returned but may not be current?

Done, except the error is ErrDumpInterrupted. Generally, the comment is ...

// If the returned error is [ErrDumpInterrupted], results may be inconsistent
// or incomplete.

robmry · 2024-09-11T10:57:25Z

kernel recommends to retry inconsistent dump commands [1]

Yes, indeed. But, if the data is returned, the caller has the option of choosing to do something else.

i dont think that iproute makes a habbit of using inconsistent results of dump commands. from grepping a bit in iproute code
i found the example[2] listed in above comments used only in ip address flush[3]

ipaddr_flush() --- uses ---> rtnl_dump_filter_nc() --- uses ---> rtnl_dump_filter_l()

Also macro rtnl_dump_filter, which is used to process results for a lot of functions that make NLM_F_DUMP requests.

NLM_F_DUMP_INTR is relevant only for dump commands (NLM_F_DUMP flag set ) so for non dump commands no need to do the check IMO.

Agreed - but this PR already only changes functions that do an NLM_F_DUMP.

LinkByName in this library does a fallback to dump command only when kernel does not support getting the specific interface according to comments [4]

I think that's what I described in the example? ("in this package, LinkByName can fall back to using linkByNameDump, which searches a dump for the name"), and now in the godoc comment describing the ErrDumpInterrupted return. (Ditto for LinkByAlias.)

aboch · 2024-09-12T19:44:30Z

@robmry This project follows a "One change/fix/feature per pull request" and the "Squash and Merge" option is not enabled.
Please squash your commits into one and push again.

robmry · 2024-09-15T10:59:43Z

@robmry This project follows a "One change/fix/feature per pull request" and the "Squash and Merge" option is not enabled. Please squash your commits into one and push again.

Oh, sure - done ...

I got rid of the gofmt changes, as they weren't directly related (only really there to stop my editor messing with code in the other commits). The other three commits were split to try to make reviews a bit easier:

introduce the new error type
mechanical changes to use the new type, and return results where EINTR would have been returned before
catch some NLM_F_DUMP_INTR cases missed by the original change

(I've still got those changes ... happy to re-raise as separate PRs if preferred?)

adrianchiris · 2024-09-19T06:59:44Z

ipset_linux.go

@@ -250,7 +250,7 @@ func (h *Handle) IpsetListAll() ([]IPSetResult, error) {
 		result[i].unserialize(msg)
 	}

-	return result, nil
+	return result, err


no need to check for ErrDumpInterrupted at L244?

Ah, thank you - I think I modified this one as it's a ...ListAll() - then realised it doesn't set NLM_F_DUMP, and didn't fully undo the change. I'll change the return back to nil.

adrianchiris · 2024-09-19T07:13:19Z

link_linux.go

@@ -2440,6 +2464,9 @@ func linkSubscribeAt(newNs, curNs netns.NsHandle, ch chan<- LinkUpdate, done <-c
 				continue
 			}
 			for _, m := range msgs {
+				if m.Header.Flags&unix.NLM_F_DUMP_INTR != 0 && cberr != nil {


even if for initial list would return dump interrupted there would be additional notifications from kernel (i think).
not sure we want this here.

Possibly ... what additional notifications are you thinking of, and where would they go?

adrianchiris · 2024-09-19T07:26:12Z

socket_linux.go

@@ -157,6 +157,9 @@ func (u *UnixSocket) deserialize(b []byte) error {
 }

 // SocketGet returns the Socket identified by its local and remote addresses.
+//
+// If the returned error is [ErrDumpInterrupted], the search for a result may
+// be incomplete and the caller should retry.


ATM this would return nil as Socket if any error happened at L219 (req.Execute()) is this intended ?

i.e in case of ErrDumpInterrupted we dont get the partial results

This one returns specific errors if there's not exactly one result and, even if there is one socket, either of those is a possibility if the dump was interrupted.

So, if there's not exactly one result, ErrDumpInterrupted should be returned rather than one of those other errors to indicate that a retry is needed.

But, I guess it'd be better to explicitly check for that case ... I'll update it.

This one returns specific errors if there's not exactly one result ...

Hmm - looking at it more closely, that might have been what was intended, but it's not what it does ...

if len(msgs) == 0 { return nil, errors.New("no message nor error from netlink") } if len(msgs) > 2 { return nil, fmt.Errorf("multiple (%d) matching sockets", len(msgs)) }

So, if there are 2 results, it'll use the first and discard the second. I guess the second test should probably be > 1 or >= 2.

Assuming that's a bug, fixing it doesn't belong in this PR - so I'll leave it alone here ... just returning nil, rather than making things more confusing.

Add a specific error to report that a netlink response had NLM_F_DUMP_INTR set, indicating that the set of results may be incomplete or inconsistent. unix.EINTR was previously returned (with no results) when the NLM_F_DUMP_INTR flag was set. Now, errors.Is(err, unix.EINTR) will still work. But, this will be a breaking change for any code that's checking for equality with unix.EINTR. Return results with ErrDumpInterrupted. Results may be incomplete or inconsistent, but give the caller the option of using them. Look for NLM_F_DUMP_INTR in more places: - linkSubscribeAt, neighSubscribeAt, routeSubscribeAt - can do an initial dump, which may report inconsistent results -> if there's an error callback, call it with ErrDumpInterrupted - socketDiagXDPExecutor - makes an NLM_F_DUMP request, without using Execute() -> give it the same behaviour as functions that do use Execute() Signed-off-by: Rob Murray <[email protected]>

adrianchiris

lgtm

…93d350 - Preserve results when NLM_F_DUMP_INTR is set vishvananda/netlink#1018 - Fix SetSendTimeout/SetReceiveTimeout vishvananda/netlink#1012 full diff: vishvananda/netlink@v1.3.0...084abd9 Signed-off-by: Rob Murray <[email protected]>

thaJeztah · 2024-10-08T20:16:44Z

@adrianchiris @aboch any chance on tagging a new release now that our fixes went in?

According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <[email protected]>

According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <[email protected]>

[ upstream commit 5d1951b ] According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <[email protected]>

robmry mentioned this pull request Sep 9, 2024

Retry on EINTR from netlink dump calls moby/moby#48407

Merged

robmry force-pushed the nlm_f_dump_intr_handling branch 3 times, most recently from 3b85385 to fae5970 Compare September 9, 2024 17:08

robmry marked this pull request as ready for review September 9, 2024 17:11

adrianchiris reviewed Sep 10, 2024

View reviewed changes

robmry force-pushed the nlm_f_dump_intr_handling branch from fae5970 to 9c357a0 Compare September 11, 2024 10:08

robmry force-pushed the nlm_f_dump_intr_handling branch from 9c357a0 to 1f76c75 Compare September 15, 2024 10:56

adrianchiris reviewed Sep 19, 2024

View reviewed changes

robmry force-pushed the nlm_f_dump_intr_handling branch from 1f76c75 to 5da8257 Compare September 19, 2024 14:00

adrianchiris approved these changes Sep 22, 2024

View reviewed changes

aboch merged commit 084abd9 into vishvananda:main Sep 22, 2024
2 checks passed

jrajahalme mentioned this pull request Oct 13, 2024

datapath: Retry netlink.LinkList when interrupted cilium/cilium#35259

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve results when NLM_F_DUMP_INTR is set #1018

Preserve results when NLM_F_DUMP_INTR is set #1018

robmry commented Sep 9, 2024 •

edited

Loading

robmry commented Sep 9, 2024

adrianchiris Sep 10, 2024

corhere Sep 10, 2024

adrianchiris Sep 16, 2024

adrianchiris commented Sep 10, 2024 •

edited

Loading

robmry commented Sep 10, 2024

aboch commented Sep 10, 2024

robmry commented Sep 10, 2024

corhere commented Sep 10, 2024

aboch commented Sep 10, 2024

adrianchiris commented Sep 11, 2024 •

edited

Loading

robmry commented Sep 11, 2024

robmry commented Sep 11, 2024

aboch commented Sep 12, 2024

robmry commented Sep 15, 2024

adrianchiris Sep 19, 2024 •

edited

Loading

robmry Sep 19, 2024

adrianchiris Sep 19, 2024

robmry Sep 19, 2024

adrianchiris Sep 19, 2024

robmry Sep 19, 2024

robmry Sep 19, 2024 •

edited

Loading

adrianchiris left a comment

thaJeztah commented Oct 8, 2024

Preserve results when NLM_F_DUMP_INTR is set #1018

Preserve results when NLM_F_DUMP_INTR is set #1018

Conversation

robmry commented Sep 9, 2024 • edited Loading

robmry commented Sep 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adrianchiris commented Sep 10, 2024 • edited Loading

robmry commented Sep 10, 2024

aboch commented Sep 10, 2024

robmry commented Sep 10, 2024

corhere commented Sep 10, 2024

aboch commented Sep 10, 2024

adrianchiris commented Sep 11, 2024 • edited Loading

robmry commented Sep 11, 2024

robmry commented Sep 11, 2024

aboch commented Sep 12, 2024

robmry commented Sep 15, 2024

adrianchiris Sep 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robmry Sep 19, 2024 • edited Loading

Choose a reason for hiding this comment

adrianchiris left a comment

Choose a reason for hiding this comment

thaJeztah commented Oct 8, 2024

robmry commented Sep 9, 2024 •

edited

Loading

adrianchiris commented Sep 10, 2024 •

edited

Loading

adrianchiris commented Sep 11, 2024 •

edited

Loading

adrianchiris Sep 19, 2024 •

edited

Loading

robmry Sep 19, 2024 •

edited

Loading