Fixed various TCP-related deadlocks and improved disconnection mechanism #102
base: develop
Conversation
1eedab4 to 628ed77
(1) can happen if the client closes the connection before WSAPoll() or poll() is invoked. Under that condition POLLERR or POLLHUP is set as the socket status, and since these status bits are not recognized by the agent, the socket descriptor becomes permanently lost. (2) started happening after (1) was solved: if the sending or receiving operation closes the connection, the other operation could deadlock because the non-active connection state did not result in the error flag being set. (3) revolves around the fact that send() does not guarantee that the remote side receives the data. Calling close() from the remote end before all data has been received can result in a TCP RST being sent in response to the FIN-ACK from the side initiating the closure (typically the client). Waiting also prevents loss of data by ensuring that both ends have entered the disconnection process, since the agent side must shut down its sending too before the client calls close(). I hope this explanation is better than my original attempt.
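As an illustration of point (1), here is a minimal sketch (assumed code, not the actual agent implementation) of a poll()-based check that treats POLLERR/POLLHUP as a disconnection so the descriptor is reclaimed instead of leaked; disconnect_client() and read_from_client() are hypothetical helpers:

#include <poll.h>
#include <vector>

// Hypothetical helpers standing in for whatever the server does to tear
// down or service a connection.
void disconnect_client(int fd);
void read_from_client(int fd);

// Check all client descriptors once. POLLERR/POLLHUP means the peer closed
// or reset the connection before or while we were polling, so the
// descriptor must be released rather than silently ignored.
void check_connections(std::vector<struct pollfd>& fds, int timeout_ms)
{
    if (poll(fds.data(), static_cast<nfds_t>(fds.size()), timeout_ms) > 0)
    {
        for (struct pollfd& pfd : fds)
        {
            if (pfd.revents & (POLLERR | POLLHUP))
            {
                disconnect_client(pfd.fd);   // peer gone: reclaim the slot
            }
            else if (pfd.revents & POLLIN)
            {
                read_from_client(pfd.fd);    // data available
            }
        }
    }
}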
I found that I had made a mistake in the Linux TCP server code, which rendered it uncompilable. I have fixed that mistake and also split the two sets of changes into two commits.
I have also added one more patch for setting SO_REUSEADDR on the server listening socket, to allow the Agent to bind to its listening port after being restarted. I am not sure if your team prefers this behaviour, but it is something my team desires (to have near-zero downtime). If it is not desired, please let me know and I shall drop the commit.
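For reference, a rough sketch of how SO_REUSEADDR is applied before bind() so a restarted process can rebind while old connections linger in TIME_WAIT (assumptions: plain POSIX sockets, error handling omitted, make_listener() is a made-up name):

#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>

// Sketch only: the option must be set before bind() for the rebind to
// succeed after a restart.
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    int reuse = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
               reinterpret_cast<const char*>(&reuse), sizeof(reuse));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}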
The build error was "ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job."
@sp193 thanks, we will look at this.
{
    UXR_AGENT_LOG_ERROR(
        UXR_DECORATE_RED("socket opt error"),
        "Port: {}",
        listener_poll_.fd);
    return false;
}
Please move it down to maintain the single-return rule.
fd_set read_fdset;
struct timeval timeout = {
    10,
    0
};
FD_ZERO(&read_fdset);
FD_SET(connection_platform.poll_fd->fd, &read_fdset);
if (select(connection_platform.poll_fd->fd + 1, &read_fdset, NULL, NULL, &timeout) > 0)
{
    char dummy;
    while (recv(connection_platform.poll_fd->fd, &dummy, sizeof(dummy), 0) > 0) {}
}
Is it possible to refactor this to use poll instead of select?
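For reference, one possible poll()-based equivalent of the select() drain above, with the same 10-second timeout. This is only a sketch, not necessarily how the change was made; it assumes the same surrounding context as the fragment above (poll.h available and connection_platform in scope):

// Wait up to 10 seconds for the peer's remaining data/FIN, then drain the
// receive buffer before the socket is closed.
struct pollfd drain_poll;
drain_poll.fd = connection_platform.poll_fd->fd;
drain_poll.events = POLLIN;
drain_poll.revents = 0;

if (poll(&drain_poll, 1, 10000) > 0)   // timeout in milliseconds
{
    char dummy;
    while (recv(drain_poll.fd, &dummy, sizeof(dummy), 0) > 0) {}
}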
if (-1 == setsockopt(listener_poll_.fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&reuse, sizeof(reuse)))
{
    UXR_AGENT_LOG_ERROR(
        UXR_DECORATE_RED("socket opt error"),
        "Port: {}",
        listener_poll_.fd);
    return false;
}
Same comments as in Linux.
fd_set read_fdset;
struct timeval timeout = {
    10,
    0
};
FD_ZERO(&read_fdset);
FD_SET(connection_platform.poll_fd->fd, &read_fdset);
if (select(connection_platform.poll_fd->fd + 1, &read_fdset, NULL, NULL, &timeout) > 0)
{
    char dummy;
    while (recv(connection_platform.poll_fd->fd, &dummy, sizeof(dummy), 0) > 0) {}
}
Same comments as in Linux.
@@ -78,6 +78,17 @@ bool TCPv4Agent::init()
Please rebase it onto the develop branch to solve the conflicts. Thanks!
Hi @sp193. We want to merge this work into our develop branch. Did you check @julianbermudez's comments?
1. Fixed TCPServer::listener_loop() not disconnecting failed/disconnected sockets, leaving the socket unusable. 2. Fixed TCPServer::recv_locking() and TCPServer::send_locking() not indicating the non-active connection state as an error, which previously resulted in the caller entering an infinite loop.
… client before closing the socket, to avoid possibly losing bytes in flight.
…the listening port after being restarted.
Hi, thanks for reviewing my work. I have been very overwhelmed by work, which is why I have been postponing this. Please review the similar patch to the Client as well; otherwise, the disconnection procedure would not really be improved: eProsima/Micro-XRCE-DDS-Client#97 The logic here is that the client should wait for the agent to actively start the shutdown process before it closes the TCP connection. Otherwise, the connection may come down while bytes are still in flight, which may cause the OS to deem the connection as having an error and send a TCP RST.
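As a rough illustration of that ordering (assumed code, not the actual Client implementation; client_graceful_close() is a made-up name):

#include <sys/socket.h>
#include <unistd.h>

// The client stops sending first, waits for the agent to finish its side of
// the shutdown (recv() returning 0 means the agent's FIN arrived), and only
// then closes the socket, so no in-flight bytes provoke a TCP RST.
void client_graceful_close(int fd)
{
    shutdown(fd, SHUT_WR);              // send our FIN, keep receiving

    char dummy;
    while (recv(fd, &dummy, sizeof(dummy), 0) > 0)
    {
        // Drain anything the agent still had in flight.
    }
    // recv() returned 0 (peer closed) or an error; safe to release the fd.
    close(fd);
}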
Hi @sp193, thanks so much for helping us to improve Micro XRCE. I've just reviewed eProsima/Micro-XRCE-DDS-Client#97.
(3) revolves around the fact that send() does not guarantee that the receiver will receive all data. By waiting for the other side (the agent in this case) to also close the connection, we can be sure that all data that should have been received was received. This also solves another problem: under such a condition, calling close() on the socket would result in the remote end sending a TCP RST in response to the FIN-ACK.