Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.0.3] P2P: Resolve on reconnect #2408

Merged
merged 3 commits into from
Sep 26, 2024
Merged

[5.0.3] P2P: Resolve on reconnect #2408

merged 3 commits into from
Sep 26, 2024

Conversation

heifner
Copy link
Member

@heifner heifner commented Sep 25, 2024

Resolve address on reconnect.

See AntelopeIO/spring#525

Instead of:

info  2024-09-25T21:50:50.686 net-1     net_plugin.cpp:2797           operator()           ] connection failed to p2p.telos-bad.zenblocks.io:9876, Element not found
info  2024-09-25T21:50:50.686 net-1     net_plugin.cpp:1461           _close               ] ["p2p.telos-bad.zenblocks.io:9876" - 1 <unknown>:<unknown>] closing

Now you get:

warn  2024-09-25T21:42:48.642 net-0     net_plugin.cpp:4524           operator()           ] Unable to resolve p2p.telos-bad.zenblocks.io:9876 Host not found (authoritative)

@heifner heifner added the OCI Work exclusive to OCI team label Sep 25, 2024
@heifner heifner added this to the Leap v5.0.3 milestone Sep 25, 2024
plugins/net_plugin/net_plugin.cpp Outdated Show resolved Hide resolved
plugins/net_plugin/net_plugin.cpp Outdated Show resolved Hide resolved
plugins/net_plugin/net_plugin.cpp Outdated Show resolved Hide resolved
Copy link
Member

@linh2931 linh2931 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a test be added?

fc_wlog( logger, "Unable to resolve ${host}:${port} ${error}",
("host", host)("port", port)( "error", err.message() ) );
c->set_state(connection::connection_state::closed);
++(c->consecutive_immediate_connection_close);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does increasing this do? I thought it was going to cause some sort of cool down when reaching def_max_consecutive_immediate_connection_close but doesn't seem to matter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to look into it, but I think the change that broke this re-connect also broke this back off of consecutive_immediate_connection_close

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually this a reason to reuse the connection object. I guess I should maintain the reuse of the connection object for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like now (on 51fd8fa) after failing to resolve a host def_max_consecutive_immediate_connection_close times it just never tries it again. But if the resolve completes and simply unable to connect to the remote host, it will try retry indefinitely. I can't tell if this discrepancy is intentional or not. It makes me wonder if we should not be increasing this counter here for consistency between the two, but realistically if someone has a completely bad hostname it's probably not fixing itself.

fc_wlog( logger, "Unable to resolve ${host}:${port} ${error}",
("host", host)("port", port)( "error", err.message() ) );
c->set_state(connection::connection_state::closed);
++(c->consecutive_immediate_connection_close);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like now (on 51fd8fa) after failing to resolve a host def_max_consecutive_immediate_connection_close times it just never tries it again. But if the resolve completes and simply unable to connect to the remote host, it will try retry indefinitely. I can't tell if this discrepancy is intentional or not. It makes me wonder if we should not be increasing this counter here for consistency between the two, but realistically if someone has a completely bad hostname it's probably not fixing itself.

if( !err ) {
c->connect( results );
} else {
fc_wlog( logger, "Unable to resolve ${host}:${port} ${error}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw this is a warning log, where failing to connect is an info log. Not sure if you want to make consistent or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the async_connect fails it calls c->close(false) which calls _close() which increments consecutive_immediate_connection_close.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a warn is appropriate if unable to resolve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the async_connect fails it calls c->close(false) which calls _close() which increments consecutive_immediate_connection_close.

There is something different between the two though. For example when I run nodeos with --p2p-peer-address fdsfdsfds.invalid:4444 --p2p-peer-address fddsfdsfdsfds.invalid:4444 --p2p-peer-address 127.0.0.1:2121 --p2p-peer-address www.google.com:4444 you'll see how the latter 2 will try to reconnect forever where the first 2 won't.

There is also some log oddness at shutdown,

info  2024-09-26T02:45:43.993 nodeos    net_plugin.cpp:4578           close_all            ] close all 4 connections
info  2024-09-26T02:45:43.993 net-1     net_plugin.cpp:1460           _close               ] ["127.0.0.1:2121" - 1 <unknown>:<unknown>] closing
info  2024-09-26T02:45:43.993 net-2     net_plugin.cpp:1460           _close               ] ["fddsfdsfdsfds.invalid:4444" - 2 <unknown>:<unknown>] closing
info  2024-09-26T02:45:43.993 nodeos    net_plugin.cpp:4385           plugin_shutdown      ] exit shutdown

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This different in backoff behavior seems to match Leap 4.0 behavior.

@heifner heifner merged commit 2afa2a4 into release/5.0 Sep 26, 2024
29 checks passed
@heifner heifner deleted the resolve-on-reconnect branch September 26, 2024 17:45
@ericpassmore ericpassmore added the bug Something isn't working label Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants