Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reordered net context connect operation #69145 #69146

Conversation

matt-wood-ct
Copy link

Moved socket offload connect ahead of non offload connect implementation to fix issue with multiple network interfaces
See #69145

Copy link

Hello @matt-wood-ct, and thank you very much for your first pull request to the Zephyr project!
Our Continuous Integration pipeline will execute a series of checks on your Pull Request commit messages and code, and you are expected to address any failures by updating the PR. Please take a look at our commit message guidelines to find out how to format your commit messages, and at our contribution workflow to understand how to update your Pull Request. If you haven't already, please make sure to review the project's Contributor Expectations and update (by amending and force-pushing the commits) your pull request if necessary.
If you are stuck or need help please join us on Discord and ask your question there. Additionally, you can escalate the review when applicable. 😊

@matt-wood-ct
Copy link
Author

Renamed commit to follow guidelines

Moved socket offload connect ahead of non offload connect
implementation to fix issue with multiple network interfaces

Signed-off-by: Matt Wood <[email protected]>
Replace the 4 space scheme with
tab for compliance

Signed-off-by: Matt Wood <[email protected]>
Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments:

  • please squash the patches together, we normally do not do fix up commits in the same PR
  • could you elaborate in the commit message why/how the change solves the issue, otherwise it is difficult at some later date to figure out from the commit why this change fixed the issue

Overall, I am not convinced about the change. The checks in the original code were done before the offloaded call in order to check the addresses etc. Now we are not doing those checks for offloaded interface.
Where/why exactly the crash happens, could we fix things there instead of this shortcut?

@jukkar
Copy link
Member

jukkar commented Feb 19, 2024

Forgot to mention earlier, you should add Fixes #69145 line to commit message body so that the issue gets linked to the commit (helps when generating release notes).

@rlubos
Copy link
Contributor

rlubos commented Feb 19, 2024

Overall, I am not convinced about the change. The checks in the original code were done before the offloaded call in order to check the addresses etc. Now we are not doing those checks for offloaded interface.
Where/why exactly the crash happens, could we fix things there instead of this shortcut?

Agreed, plus when it comes to NET_OFFLOAD and mutliple interfaces it's not really supported well, there's a more serious issue on the table:
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L456
So any crashes in such setup could be as well a consequence of that.

@matt-wood-ct
Copy link
Author

such setup

Thanks for the feedback, I think you are both right. I've been doing more testing and my "fix" does indeed just make other things worse, I am surprised it worked at all. I will continue my testing and see if I can find a better fix.
I had not spotted that oddity and you may be right that might be the root cause of the issues I am encountering.

Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢

@rlubos
Copy link
Contributor

rlubos commented Feb 19, 2024

Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢

Check this comment where the user describes how they handle this limitation - as a workaround, they call net_if_set_default() before creating socket. I have to admit it's not perfect, but might help.

@matt-wood-ct
Copy link
Author

Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢

Check this comment where the user describes how they handle this limitation - as a workaround, they call net_if_set_default() before creating socket. I have to admit it's not perfect, but might help.

Hi Robert,
Unfortunately we are doing that and we still get crashes from inside the stack, the scenario current works as follows:

  1. Wifi comes up (socket offload driver)
  2. sockets are working fine
  3. GSM comes up (PPP driver)
  4. wifi set as default
  5. socket connect causes bus fault

So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)

@jukkar
Copy link
Member

jukkar commented Feb 19, 2024

So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)

It would be really helpful if you could pin point the exact line where the crash happens, without this info it is difficult to suggest anything.

@matt-wood-ct
Copy link
Author

matt-wood-ct commented Feb 19, 2024

Network ifaces (to show interface numbering):

uart:~$ net iface

Interface eth0 (0x80092260) (Ethernet) [1]
===================================
Interface is down.

Interface net0 (0x800922e0) (IP Offload) [2]
=====================================
Interface is down.

Interface ppp0 (0x80092360) (PPP) [3]
==============================
Interface is down.

Log when wifi is enabled and comes up first

[00:00:16.162,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:16.262,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:16.500,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:00:21.742,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:21.842,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:22.128,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:00:25.459,000] <inf> lib_sim7070: Bringing interfaces up! Successfully started PPP
[00:00:25.471,000] <inf> lib_sim7070: AT channel 2 connected to GSM_2
[00:00:27.370,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:27.470,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:27.571,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:36382 iface 3 (0x80092360)

So in the above log the wifi is up and tested (testing consists of creating a socket, opening it then closing it to 1.1.1.1).
Before each test net_if_set_default is called with the relevant iface. But you can see it attempted to bind using the PPP interface even though the wifi was set as default...

Hope this log helps to explain the series of events more clearly

Edit:
Here is the code for the testInterface function:

bool testInterface(supportedNetworkInterface_t *interface)
{
    bool retVal = false;
    int sock;
    int err;
    struct sockaddr_in addr4;
    struct net_if* originalDefault = net_if_get_default();
    LOG_DBG("Test %s (%p)", interface->name, interface->interface);
    net_if_set_default(interface->interface);
    k_msleep(100);
    addr4.sin_family = AF_INET;
    addr4.sin_port = htons(NET_TEST_IPV4_PEER_PORT);
    inet_pton(AF_INET, NET_TEST_IPV4_PEER,
              &addr4.sin_addr);

    sock = socket( addr4.sin_family,
                   SOCK_STREAM, IPPROTO_TCP );
    LOG_DBG("Create socket %d", sock);
    k_msleep(100);
    if (0 <= sock)
    {
        err = connect( sock, (struct sockaddr*) &addr4, sizeof( addr4 ) );
        LOG_DBG("Connect socket %d", err);
        close( sock );
        if (err >= 0)
        {
            /* socket connected, connection is OK */
            retVal = true;
        }
    }

    net_if_set_default(originalDefault);

    return retVal;
}

@matt-wood-ct
Copy link
Author

matt-wood-ct commented Feb 19, 2024

So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)

It would be really helpful if you could pin point the exact line where the crash happens, without this info it is difficult to suggest anything.

Sorry just saw your question, the actual crash is here:
image
conn going into tcp_in is null and it tries to access a member of conn which I believe causes the usage fault

@jukkar
Copy link
Member

jukkar commented Feb 19, 2024

conn going into tcp_in is null and it tries to access a member of conn which I believe causes the usage fault

Hmm, but we access the data behind conn pointer in previous line and it does not crash there?

@matt-wood-ct
Copy link
Author

conn going into tcp_in is null and it tries to access a member of conn which I believe causes the usage fault

Hmm, but we access the data behind conn pointer in previous line and it does not crash there?

I don't think it does, that's just how the PC is rendered by the debugger, I interpret that to mean it crashed on the line before.

As a confirmation I added in a null trap and it did indeed stop the crash (but of course the network didn't work)

@matt-wood-ct
Copy link
Author

Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect literally changes to iface 3 (GSM PPP) after running the function bind_default, see the log below:

[00:00:29.274,000] <dbg> net_ctx: net_context_connect: CONNECT IFACE: 2
[00:00:29.274,000] <dbg> net_ctx: net_context_connect: 1200 CONNECT IFACE: 2
[00:00:29.274,000] <dbg> net_ctx: net_context_connect: 1282 CONNECT IFACE: 2
[00:00:29.275,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:61425 iface 3 (0x80092360)
[00:00:29.275,000] <dbg> net_ctx: net_context_connect: 1287 CONNECT IFACE: 3
[00:00:29.275,000] <dbg> net_ctx: net_context_connect: 1299 CONNECT IFACE: 3

snippet of code from the function to show where those prints are:

		rport = addr4->sin_port;

                LOG_DBG("1282 CONNECT IFACE: %d", context->iface);
		ret = bind_default(context);
		if (ret) {
			goto unlock;
		}
                LOG_DBG("1287 CONNECT IFACE: %d", context->iface);
		net_sin_ptr(&context->local)->sin_family = AF_INET;

so net_context_connect prepare the wifi interface for this connection then near the end binds it to the PPP connection which of course since its partually configured does not work (and crashes if I remove my NULL trap)

@rlubos
Copy link
Contributor

rlubos commented Feb 19, 2024

Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect literally changes to iface 3 (GSM PPP) after running the function bind_default, see the log below:

I've been investigating the case and was going to ask about this, it seems that with current implementation if net_context_bind() for whatever reason decides that non-offloaded interface is a better choice to communicate with the remote, it'd overwrite the interface set on the context. And that is bad, as we don't get the TCP context allocated then.

Could you check if the following patch helps?

diff --git a/subsys/net/ip/net_context.c b/subsys/net/ip/net_context.c
index ae7a4a4e848..fb6e1a51237 100644
--- a/subsys/net/ip/net_context.c
+++ b/subsys/net/ip/net_context.c
@@ -542,7 +542,7 @@ int net_context_get(sa_family_t family, enum net_sock_type type, uint16_t proto,
                        return ret;
                }
 
-               net_context_set_iface(*context, net_if_get_default());
+               net_context_bind_iface(*context, net_if_get_default());
        }
 
        return 0;

The change just makes sure that the current net_context is permanently bound to the interface, which should be the case of offloaded contexts.

@matt-wood-ct
Copy link
Author

Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect literally changes to iface 3 (GSM PPP) after running the function bind_default, see the log below:

I've been investigating the case and was going to ask about this, it seems that with current implementation if net_context_bind() for whatever reason decides that non-offloaded interface is a better choice to communicate with the remote, it'd overwrite the interface set on the context. And that is bad, as we don't get the TCP context allocated then.

Could you check if the following patch helps?

diff --git a/subsys/net/ip/net_context.c b/subsys/net/ip/net_context.c
index ae7a4a4e848..fb6e1a51237 100644
--- a/subsys/net/ip/net_context.c
+++ b/subsys/net/ip/net_context.c
@@ -542,7 +542,7 @@ int net_context_get(sa_family_t family, enum net_sock_type type, uint16_t proto,
                        return ret;
                }
 
-               net_context_set_iface(*context, net_if_get_default());
+               net_context_bind_iface(*context, net_if_get_default());
        }
 
        return 0;

The change just makes sure that the current net_context is permanently bound to the interface, which should be the case of offloaded contexts.

Hi Robert,
You are a true legend, this fix does appear to work!!!!
See log below

[00:00:57.028,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:57.128,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:57.426,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:02.668,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:02.768,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:03.066,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:08.308,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:08.408,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:08.706,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:09.506,000] <inf> lib_sim7070: Bringing interfaces up! Successfully started PPP
[00:01:09.515,000] <inf> lib_sim7070: AT channel 2 connected to GSM_2
[00:01:13.948,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:14.048,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:14.346,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:14.588,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:01:14.688,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:14.788,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:45924 iface 3 (0x80092360)
[00:01:15.819,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:15.832,000] <inf> COR_NET_MAIN: Interface 'Cellular' has started working
[00:01:17.908,000] <dbg> net_ctx: net_context_unref: (tcp_work): Context 0x80025480 released
[00:01:20.832,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:20.932,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:21.258,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:21.500,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:01:21.600,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:21.700,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:45845 iface 3 (0x80092360)
[00:01:22.635,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:24.490,000] <dbg> net_ctx: net_context_unref: (tcp_work): Context 0x80025480 released

@matt-wood-ct
Copy link
Author

However I did just test that patch with ethernet and plugging that in stops PPP from working and ethernet does not work:

[00:07:37.071,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:37.172,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:37.428,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:07:37.670,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:07:37.770,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:37.870,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:34145 iface 3 (0x80092360)
[00:07:38.646,000] <inf> eth_mcux: ETH_0 enabled 10M full-duplex mode.
[00:07:38.649,000] <dbg> net_ctx: net_context_unref: (rx_q[0]): Context 0x800253e0 released
[00:07:38.649,000] <dbg> net_ctx: net_context_bind: (rx_q[0]): Context 0x800253e0 binding to UDP 0.0.0.0:42058 iface 3 (0x80092360)
[00:07:38.651,000] <inf> net_dhcpv4: Received: 10.0.4.99
[00:07:38.652,000] <inf> net_config: IPv4 address: 10.0.4.99
[00:07:38.652,000] <inf> net_config: Lease time: 172800 seconds
[00:07:38.652,000] <inf> net_config: Subnet: 255.255.255.0
[00:07:38.652,000] <inf> net_config: Router: 10.0.4.254
[00:07:40.469,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:40.469,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:40.469,000] <wrn> COR_NET_MAIN: Interface 'Cellular' is up but has stopped working
[00:07:45.469,000] <dbg> COR_NET_MAIN: testInterface: Test Ethernet (0x80092260)
[00:07:45.570,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:45.670,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:34096 iface 3 (0x80092360)
[00:07:48.107,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:48.108,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:48.108,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:48.208,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:48.456,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:07:48.698,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:07:48.798,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:48.898,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:58356 iface 3 (0x80092360)
[00:07:51.766,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:51.766,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:56.766,000] <dbg> COR_NET_MAIN: testInterface: Test Ethernet (0x80092260)
[00:07:56.867,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:56.967,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:56685 iface 3 (0x80092360)
[00:07:59.494,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:59.495,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:59.495,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:59.595,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:59.964,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:08:00.205,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:08:00.306,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:08:00.406,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:62564 iface 3 (0x80092360)
[00:08:03.333,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1

Looks like it is trying to bind PPP when it tries to use ETH, so similar issue but slightly different

@rlubos
Copy link
Contributor

rlubos commented Feb 19, 2024

Looks like it is trying to bind PPP when it tries to use ETH, so similar issue but slightly different

Does the IP address remain registered on the PPP interface when it goes down? As before, on binding net stack will try to select best interface based on remote/local addresses, so if the "best match" remains on the PPP interface, it'd be kept being selected.

Or, you can also enforce that a specific "native" (non-offloaded) socket is bound to a specific interface by calling setsockopt() with SO_BINDTODEVICE socket option before connecting.

I did just test that patch with ethernet and plugging that in stops PPP from working

That sounds quite fishy, I don't see any obvious reason why would that be happening...

@matt-wood-ct
Copy link
Author

setsockopt

Now this is interesting, thank you so much for pointing out this socket option, setting this does seem to make the setup work even without the patch you described. Still testing but early signs are promising...

@matt-wood-ct
Copy link
Author

Can confirm setting the default interface and using the socket option SO_BINDTODEVICE selecting between my 3 network interfaces appears to work reliably, thank you so much for your help Robert and Jukka

@jukkar
Copy link
Member

jukkar commented Feb 19, 2024

Can confirm setting the default interface and using the socket option SO_BINDTODEVICE selecting between my 3 network interfaces appears to work reliably, thank you so much for your help Robert and Jukka

You are welcome, I am glad the issue is sorted. If everything is working ok, please close the issue and the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants