
Fixed various TCP-related deadlocks and improved disconnection mechanism #102

Open
sp193 wants to merge 3 commits into develop from tcp_fix

Conversation

@sp193 commented Jul 3, 2019

  1. Fixed TCPServer::listener_loop() not disconnecting failed/disconnected sockets, which left the socket unusable.
  2. Fixed TCPServer::recv_locking() and TCPServer::send_locking() not indicating the non-active connection state as an error, which previously resulted in the caller entering an infinite loop.
  3. Synchronized the TCP stream with the client before closing the socket, to avoid losing bytes in flight.

(3) stems from the fact that send() does not guarantee that the receiver will receive all data. By waiting for the other side (the agent, in this case) to also close the connection, we can be sure that all data that should have been received was received. This also solves another problem: under such a condition, calling close() on the socket will result in the remote end sending a TCP RST in response to the FIN-ACK.

@sp193 sp193 force-pushed the tcp_fix branch 3 times, most recently from 1eedab4 to 628ed77 Compare July 3, 2019 11:38
@sp193 (Author) commented Jul 3, 2019

(1) can happen if the client closes the connection before WSAPoll() or poll() is invoked. Under such a condition, POLLERR or POLLHUP is set in the socket status; since these status bits were not recognized by the agent, the socket descriptor became permanently lost.
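To illustrate fix (1), here is a minimal POSIX sketch (the function name is hypothetical, not the Agent's actual code) of recognizing the error/hang-up bits that poll() reports when the client has already closed the connection:

#include <poll.h>
#include <unistd.h>

// Sketch only: close and reclaim a descriptor whose peer has failed or hung up.
void check_connection(struct pollfd& poll_fd)
{
    if (poll_fd.revents & (POLLERR | POLLHUP))
    {
        close(poll_fd.fd);   // disconnect instead of leaving the fd unusable
        poll_fd.fd = -1;     // mark the slot as free so it can be reused
    }
    else if (poll_fd.revents & POLLIN)
    {
        // normal read path
    }
}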

(2) started happening after (1) was solved. If the sending or the receiving operation closes the connection, the other operation could then deadlock, because the non-active connection state did not cause the error flag to be set.

(3) stems from the fact that send() does not guarantee that the remote side receives the data. If the remote end calls close() before all data has been received, a TCP RST can be sent in response to the FIN-ACK message from the side initiating the closure (typically the client). Synchronizing the shutdown also prevents data loss by ensuring that both ends have entered the disconnection process, since the agent side must shut down its sending direction too before the client calls close().
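For illustration, a synchronized close along these lines might look like the following POSIX sketch (an assumption about the shape of the change, not the exact patch):

#include <sys/socket.h>
#include <unistd.h>

// Sketch only: shut down our sending direction first, then wait for the
// peer's FIN before closing, so no in-flight bytes trigger a TCP RST.
void graceful_close(int fd)
{
    shutdown(fd, SHUT_WR);          // our FIN goes out after any queued data

    char dummy;
    while (recv(fd, &dummy, sizeof(dummy), 0) > 0)
    {
        // drain until recv() returns 0, i.e. the peer has also closed
    }

    close(fd);                      // both directions are now shut down
}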

I hope this explanation is better than my original attempt.

@sp193 (Author) commented Aug 25, 2019

I found that I had made a mistake in the Linux TCP server code, which rendered it uncompilable. I have fixed that mistake and also split the two sets of changes into two commits.

@sp193 (Author) commented Aug 25, 2019

I have also added one more patch, which sets SO_REUSEADDR on the server's listening socket, to allow the Agent to bind to its listening port immediately after being restarted. I am not sure whether your team prefers this behaviour, but it is something mine desires (to have near-zero downtime).

If this is not desired, please feel free to let me know and I shall drop the commit.
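For reference, the patch amounts to something like this sketch (listener_fd stands in for the Agent's listening descriptor):

// Sketch only: allow the listening port to be rebound immediately after a
// restart, even while the old socket lingers in TIME_WAIT.
int reuse = 1;
if (-1 == setsockopt(listener_fd, SOL_SOCKET, SO_REUSEADDR,
        (const char*)&reuse, sizeof(reuse)))
{
    // report the error and abort initialization
}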

@richiware (Member)

Build status:

  • Linux Build Status
  • Windows Build Status

@sp193 (Author) commented Oct 1, 2019

The build error was "ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job."

@BorjaOuterelo (Contributor)

@sp193 thanks, we will look at this.

Comment on lines 84 to 90
{
    UXR_AGENT_LOG_ERROR(
        UXR_DECORATE_RED("socket opt error"),
        "Port: {}",
        listener_poll_.fd);
    return false;
}
Contributor:

Please move it down to maintain the one-return rule.

Comment on lines 408 to 414
fd_set read_fdset;
struct timeval timeout = {
    10,
    0
};
FD_ZERO(&read_fdset);
FD_SET(connection_platform.poll_fd->fd, &read_fdset);
if (select(connection_platform.poll_fd->fd + 1, &read_fdset, NULL, NULL, &timeout) > 0)
{
    char dummy;
    while (recv(connection_platform.poll_fd->fd, &dummy, sizeof(dummy), 0) > 0) {}
}
Contributor:

Is it possible to refactor this to use poll instead of select?
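For reference, a poll()-based equivalent might look like this sketch (it reuses the names from the diff above; the 10000 ms timeout mirrors the 10-second select() timeout):

// Sketch only: drain pending bytes using poll() instead of select().
struct pollfd drain_poll;
drain_poll.fd = connection_platform.poll_fd->fd;
drain_poll.events = POLLIN;
if (0 < poll(&drain_poll, 1, 10000))   // timeout in milliseconds
{
    char dummy;
    while (recv(drain_poll.fd, &dummy, sizeof(dummy), 0) > 0) {}
}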

Comment on lines 73 to 80
if (-1 == setsockopt(listener_poll_.fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&reuse, sizeof(reuse)))
{
    UXR_AGENT_LOG_ERROR(
        UXR_DECORATE_RED("socket opt error"),
        "Port: {}",
        listener_poll_.fd);
    return false;
}
Contributor:

Same comments as in Linux.

Comment on lines 380 to 386
fd_set read_fdset;
struct timeval timeout = {
    10,
    0
};
FD_ZERO(&read_fdset);
FD_SET(connection_platform.poll_fd->fd, &read_fdset);
if (select(connection_platform.poll_fd->fd + 1, &read_fdset, NULL, NULL, &timeout) > 0)
{
    char dummy;
    while (recv(connection_platform.poll_fd->fd, &dummy, sizeof(dummy), 0) > 0) {}
}
Contributor:

Same comments as in Linux.

@@ -78,6 +78,17 @@ bool TCPv4Agent::init()

Contributor:

Please rebase it over the develop branch to solve conflicts. Thanks!

@richiware (Member)

Build status:

  • Linux Build Status
  • Windows Build Status

@BorjaOuterelo (Contributor)

Hi @sp193. We want to merge this work into our develop branch. Did you check @julianbermudez's comments?

1. TCPServer::listener_loop() not disconnecting failed/disconnected sockets, leaving the socket unusable.
2. Fixed TCPServer::recv_locking() and TCPServer::send_locking() not indicating the non-active connection state as an error, which previously resulted in the caller entering an infinite loop.
3. Synchronize the TCP stream with the client before closing the socket, to avoid possibly losing bytes in flight.
@richiware (Member)

Build status:

  • Linux Build Status
  • Windows Build Status

@sp193 (Author) commented Nov 13, 2019

Hi,

Thanks for reviewing my work. I have been overwhelmed with work, which is why I have been postponing this.
I have attempted to make the changes as requested.

Please also review the corresponding patch to the Client; otherwise the disconnection procedure would not really be improved: eProsima/Micro-XRCE-DDS-Client#97

The logic here is that the client should wait for the agent to actively begin the shutdown process before it closes the TCP connection. Otherwise, the connection may come down while bytes are still in flight, which may cause the OS to deem the connection erroneous and send a TCP RST.

@julionce (Contributor)

Hi @sp193,

Thank you so much for helping us to improve Micro XRCE.
We need more people like you xD.

I've just reviewed the eProsima/Micro-XRCE-DDS-Client#97.

@jamoralp jamoralp closed this Nov 19, 2020
@jamoralp jamoralp reopened this Nov 19, 2020
@richiware (Member)

Build status:

  • Linux Build Status
  • Windows Build Status
