Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to destroy the network because of empty cached cni result. #1055

Closed
sofat1989 opened this issue Jan 4, 2024 · 4 comments · Fixed by #1072
Closed

Failed to destroy the network because of empty cached cni result. #1055

sofat1989 opened this issue Jan 4, 2024 · 4 comments · Fixed by #1072

Comments

@sofat1989
Copy link

failed to "KillPodSandbox" for "1b77faa7-5bc3-4b87-a56b-97918872b6f4" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04\": failed to get network \"generic-veth\" cached result: decoding version from network config: unexpected end of JSON input"


276776240 -rw------- 1 root root    0 Nov 27 23:16 generic-veth-5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04-eth1
276779765 -rw------- 1 root root    0 Nov 27 23:16 istio-cni-5bae61e683f7e59902f458514e9af1515d1c1c71ff069e9d4a2008a011307d04-eth2
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:20 - 00:10 (30+00:49)
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:16 - 00:10 (30+00:53)
reboot   system boot  5.15.0-26-generi Mon Nov 27 23:13 - 00:10 (30+00:56)

Recently we met an issue. When the containerd is adding the networks for pods, but at that time, the system is rebooted. It will leave some empty files under /var/lib/cni/results. And after the system is up again, but the pods are deleted by the users. containerd will try to read the cached result to destroy the networks. It will fail directly.

For the cached file, the data will be put in the prevResult field. It seems that a lot of plugins only care about the previous result of the other plugins chained together. And it doesn't care about the previous result of itself.

Can we delay reporting the error of cached result and try to delete the network with cachedResult=nil ?

// DelNetwork executes the plugin with the DEL command
 func (c *CNIConfig) DelNetwork(ctx context.Context, net *NetworkConfig, rt *RuntimeConf) error {
        var cachedResult types.Result
+       var cachedResultErr error

        // Cached result on DEL was added in CNI spec version 0.4.0 and higher
        if gtet, err := version.GreaterThanOrEqualTo(net.Network.CNIVersion, "0.4.0"); err != nil {
@@ -567,12 +571,14 @@ func (c *CNIConfig) DelNetwork(ctx context.Context, net *NetworkConfig, rt *Runt
        } else if gtet {
                cachedResult, err = c.getCachedResult(net.Network.Name, net.Network.CNIVersion, rt)
                if err != nil {
-                       return fmt.Errorf("failed to get network %q cached result: %w", net.Network.Name, err)
+                       cachedResultErr = fmt.Errorf("failed to get network %q cached result: %w", net.Network.Name, err)
                }
        }
-
        if err := c.delNetwork(ctx, net.Network.Name, net.Network.CNIVersion, net, cachedResult, rt); err != nil {
-               return err
+               if cachedResultErr != nil {
+                       return fmt.Errorf("plugin %s failed (delete): %w; %w", pluginDescription(net.Network), cachedResultErr, err)
+               }
+               return fmt.Errorf("plugin %s failed (delete): %w", pluginDescription(net.Network), err)
        }
        _ = c.cacheDel(net.Network.Name, rt)
        return nil
@dims
Copy link

dims commented Feb 14, 2024

xref: containerd/containerd#8197

@squeed
Copy link
Member

squeed commented Feb 26, 2024

Oops! That's not good!

Delete needs to be extremely robust to failure. We clearly need to fix this.

@henry118
Copy link
Member

I'll be happy to contribute a fix. IMHO there are a few options, some of them were mentioned in #1066 discussions too.

  1. Introduce an optional syncfs syscall to flush the FS on every cache file writes. But it would potentially degrade the overall system performance;
  2. Take the proposal in this issue description, try to delete the network with empty cachedResult as the best effort.
  3. I'm open to other suggestions too.

Thanks!

@henry118
Copy link
Member

henry118 commented Mar 7, 2024

Discussed during community meeting. The decision was 2nd option, "try to delete the network with empty cachedResult as the best effort". PR #1072.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants