Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Abnormal disconnection caused by asynchronous Ping message #3331

Closed
elderhammer opened this issue Jun 25, 2024 · 5 comments · Fixed by #3332 or #3418
Closed

[Bug] Abnormal disconnection caused by asynchronous Ping message #3331

elderhammer opened this issue Jun 25, 2024 · 5 comments · Fixed by #3332 or #3418
Assignees
Labels
bug Incorrect or unexpected behavior

Comments

@elderhammer
Copy link
Contributor

elderhammer commented Jun 25, 2024

Steps to Reproduce

1.Receive a Pong message from the peer and create an asynchronous task to send a Ping message to the peer after PING_SLEEP_IN_SECS seconds.

2024-06-25T07:25:55.802800Z TRACE snarkos_node_router::inbound: Received 'Pong' from '127.0.0.1:4132'

/// Sleeps for a period and then sends a `Ping` message to the peer.
fn pong(&self, peer_ip: SocketAddr, _message: Pong) -> bool {
// Spawn an asynchronous task for the `Ping` request.
let self_ = self.clone();
tokio::spawn(async move {
// Sleep for the preset time before sending a `Ping` request.
tokio::time::sleep(Duration::from_secs(Self::PING_SLEEP_IN_SECS)).await;
// Check that the peer is still connected.
if self_.router().is_connected(&peer_ip) {
// Retrieve the block locators.
match self_.sync.get_block_locators() {
// Send a `Ping` message to the peer.
Ok(block_locators) => self_.send_ping(peer_ip, Some(block_locators)),
Err(e) => error!("Failed to get block locators - {e}"),
}
}
});
true
}

2.Periodic refresh of peers.

2024-06-25T07:26:07.870256Z  INFO snarkos_node_router::heartbeat: Disconnecting from '127.0.0.1:4132' (periodic refresh of peers)
2024-06-25T07:26:07.870268Z TRACE snarkos_node_router::outbound: Sending 'Disconnect' to '127.0.0.1:4132'

3.Receive a connection request from a peer

2024-06-25T07:26:15.775172Z DEBUG tcp{name="0"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:51296

4.Complete the handshake, and add the peer to the router.

2024-06-25T07:26:15.775550Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:51296 as the Responder
2024-06-25T07:26:15.775572Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:51296'
2024-06-25T07:26:15.789750Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:51296'
2024-06-25T07:26:15.790940Z TRACE snarkos_node_router::handshake: Sending 'ChallengeResponse' to '127.0.0.1:51296'
2024-06-25T07:26:15.791066Z TRACE snarkos_node_router::handshake: Sending 'ChallengeRequest' to '127.0.0.1:51296'
2024-06-25T07:26:15.801582Z TRACE snarkos_node_router::handshake: Received 'ChallengeResponse' from '127.0.0.1:51296'
2024-06-25T07:26:15.804149Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: successfully handshaken with 127.0.0.1:51296

// Add the peer to the router.
self.insert_connected_peer(Peer::new(peer_ip, &peer_request), peer_addr);

5.Create asynchronous tasks for processing, reading, and writing

2024-06-25T07:26:15.804489Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for processing messages from 127.0.0.1:51296

6.Asynchronous task sleep ends, send Ping message

2024-06-25T07:26:15.804961Z TRACE snarkos_node_router::outbound: Sending 'Ping' to '127.0.0.1:4132'

7.Because the peer has been added to the router, can_send returns true and the Ping message is sent. However, because the writing asynchronous task has not been created, the sending fails.

2024-06-25T07:26:15.804980Z  WARN snarkos_node_router::outbound: Failed to send 'Ping' to '127.0.0.1:4132': not connected
2024-06-25T07:26:15.804988Z DEBUG snarkos_node_router::outbound: Disconnecting from '127.0.0.1:4132' (unable to send)
2024-06-25T07:26:15.805060Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 127.0.0.1:51296
2024-06-25T07:26:15.805072Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

// Determine whether to send the message.
if !self.can_send(peer_ip, &message) {
return None;
}
// Resolve the listener IP to the (ambiguous) peer address.
let peer_addr = match self.router().resolve_to_ambiguous(&peer_ip) {
Some(peer_addr) => peer_addr,
None => {
warn!("Unable to resolve the listener IP address '{peer_ip}'");
return None;
}
};

8.The reading and writing asynchronous tasks are created later.

2024-06-25T07:26:15.805089Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for reading messages from 127.0.0.1:51296
2024-06-25T07:26:15.805254Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::writing: spawned a task for writing messages to 127.0.0.1:51296

This will cause two problems:

  • Asynchronous tasks such as processing, reading, and writing are not terminated correctly
  • Because the peer believes that the connection is complete, it will reject subsequent connections from the current node
2024-06-25T07:26:32.872315Z DEBUG tcp{name="1"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:57797
2024-06-25T07:26:32.872716Z DEBUG tcp{name="1"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:57797 as the Responder
2024-06-25T07:26:32.872755Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:57797'
2024-06-25T07:26:32.873158Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:57797'
2024-06-25T07:26:32.873211Z ERROR tcp{name="1"}: snarkos_node_tcp::protocols::handshake: handshake with 127.0.0.1:57797 failed: Dropping connection request from '127.0.0.1:4202' (already connected)
2024-06-25T07:26:32.873252Z ERROR tcp{name="1"}: snarkos_node_tcp::tcp: Failed to connect with 127.0.0.1:57797: Dropping connection request from '127.0.0.1:4202' (already connected)

Expected Behavior

This shouldn't happen

Your Environment

snarkOS Version: 09aa62b

@elderhammer elderhammer added the bug Incorrect or unexpected behavior label Jun 25, 2024
@elderhammer
Copy link
Contributor Author

@ljedrz might be of interest to you

@ljedrz
Copy link
Collaborator

ljedrz commented Jun 25, 2024

Very cool find; I think I know how to tackle this.

@ljedrz ljedrz self-assigned this Jun 25, 2024
@ljedrz
Copy link
Collaborator

ljedrz commented Jun 25, 2024

Ah yes, it seems that #2690 will need to be reopened; a workaround where we would abort ping tasks on disconnect would also work, but the real solution is to delay a peer's promotion to the connected status.

@elderhammer
Copy link
Contributor Author

On testnet beta, this issue occurs frequently, caused by the UnconfirmedTransaction and UnconfirmedSolution messages.

2024-07-04T10:41:42.069597Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '89.116.29.150:4130': not connected
2024-07-04T10:41:42.069659Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 89.116.29.150:58433
2024-07-04T10:41:42.069667Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T13:19:33.127551Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '211.171.42.223:4130': not connected
2024-07-04T13:19:33.127618Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 211.171.42.223:55251
2024-07-04T13:19:33.127629Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T19:04:50.829665Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '157.66.54.66:4130': not connected
2024-07-04T19:04:50.829711Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 157.66.54.66:56847
2024-07-04T19:04:50.829719Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T23:09:06.395996Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '118.140.26.162:4130': not connected
2024-07-04T23:09:06.396038Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 118.140.26.162:47970
2024-07-04T23:09:06.396046Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-05T01:29:44.276617Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '104.237.2.224:4130': not connected      
2024-07-05T01:29:44.276679Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 104.237.2.224:4130                
2024-07-05T01:29:44.276691Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)  

2024-07-05T01:36:28.543874Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '34.17.53.129:4130': not connected
2024-07-05T01:36:28.544005Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 34.17.53.129:4130
2024-07-05T01:36:28.544012Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:36:32.919219Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:32.919232Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:37.919752Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:37.919782Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:42.920965Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:42.920990Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:47.921713Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:47.921728Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists

@ljedrz
Copy link
Collaborator

ljedrz commented Jul 10, 2024

#3332 should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect or unexpected behavior
Projects
None yet
2 participants