[BUG] segfault in freeswitch/load_balancer module after connection loss #3468

spacetourist · 2024-09-12T14:31:56Z

OpenSIPS version you are running

version: opensips 3.2.10 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 15:42:44 Dec 20 2022 with gcc 4.8.5

Describe the bug

In my setup the load_balancer module has a list of FreeSWITCH destinations in the database. Each instance sends a HEARTBEAT every second, according to this config:

modparam("freeswitch", "event_heartbeat_interval", 1)
modparam("load_balancer", "db_table", "load_balancer_fr")
modparam("load_balancer", "probing_interval", 1)
modparam("load_balancer", "fetch_freeswitch_stats", 1)
modparam("load_balancer", "initial_freeswitch_load", 5000)

The initial fault is that when the connection to the FS server is severed abruptly, no TCP keepalives or heartbeats are sent and OpenSIPS does not reconnect when the server comes back online. In this state the data is stale but the destination remains active, so once the server is back online calls are delivered to it. Without the heartbeat data the destination's load is not counted, which results in too many calls being delivered to the instance.

The segfault occurs when using opensips-cli to attempt to restore the connection. Attempting mi lb_reload does nothing, as OpenSIPS continues to believe the connection is OK. In an effort to clear this, I remove the DB record for the impacted destination and issue mi lb_reload, then reinstate the DB record and issue mi lb_reload again. This causes OpenSIPS to crash with the following output:

2024-09-10T09:12:42.007871+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_reconnects: failed to connect to FS sock '192.168.151.229:8021'
2024-09-10T09:12:42.008202+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:io_watch_del: [FS Manager] trying to delete already erased entry 0 in the hash(0, 0, (nil)) )
2024-09-10T09:12:42.008454+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: del failed for sock 0
2024-09-10T09:12:42.008652+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: disconnect error 1 on FS sock 192.168.151.229:8021
2024-09-10T09:12:42.008950+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_io: failed to destroy FS evs!
2024-09-10T09:12:42.009161+01:00 FR-P-SIPSBC-1 opensips[40744]: CRITICAL:core:sig_usr: segfault in process pid: 40744, id: 1

To Reproduce

This can be reproduced simply by:

Set up OpenSIPS load_balancer with a FreeSWITCH destination and a 1s HEARTBEAT
Use tcpdump to verify that heartbeats arrive every second
Power off the FS server and then start it back up
Use tcpdump to verify that heartbeats do not resume when the server comes back online
Remove the destination from the load_balancer table and issue mi lb_reload
Add the destination back in and issue mi lb_reload
Observe segfault

Analysis

When a remote FS server restarts gracefully, OpenSIPS receives a TCP FIN and issues SYN packets every second until the server comes back online, at which point the ESL connection is automatically re-established.

When the FS server is instead halted abruptly (power off), it does not send the FIN and OpenSIPS does not start its SYN polling, so when the FS server comes back online this is never detected and OpenSIPS does not reconnect. In this state the load_balancer module is left with stale data for the impacted host. Because the load balancer has not been told to disable the destination, calls continue to be allocated to the instance throughout this period. While the server is offline these fail over gracefully to another instance. Once the server is back online, however, it starts to receive calls while OpenSIPS never receives its heartbeat data, causing the instance to become overloaded (I have multiple OpenSIPS instances feeding into the same pool of FreeSWITCH servers).
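
For illustration, here is a minimal sketch of how TCP keepalive lets the kernel notice a peer that disappeared without sending a FIN. It is written in Python rather than the module's C, and it is not how the freeswitch module currently behaves. The function names and timer values are assumptions chosen for the example, and the TCP_KEEP* options are Linux-specific.

```python
import socket


def enable_keepalive(sock, idle=5, interval=1, probes=3):
    """Ask the kernel to probe an idle TCP connection so a dead peer is noticed.

    idle:     seconds of silence before the first probe (hypothetical default)
    interval: seconds between unanswered probes (hypothetical default)
    probes:   unanswered probes before the connection is reported dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These three options exist on Linux; skip any the platform does not define.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


def detection_window(idle=5, interval=1, probes=3):
    """Rough upper bound, in seconds, before a silent peer is declared dead."""
    return idle + interval * probes


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print("dead peer noticed after roughly", detection_window(), "seconds")
```

With keepalive enabled, the next read or write on the socket fails once the probes go unanswered, which would give the reconnect logic the same trigger it gets today from a FIN.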

Expected behavior

The lb_reload operation should handle the reconnection process gracefully: stale connections should be detected and replaced without crashing.

The system should detect a stale destination which has failed to send an ESL HEARTBEAT and either disable the host or attempt reconnection until it returns.

Additionally, HB arrival is not currently tracked or exposed via opensips-cli, so I have no way to implement effective monitoring of this scenario. Exposing this data would be really helpful.
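
To make the requested behaviour concrete, here is a minimal sketch of per-destination heartbeat tracking. It is written in Python rather than the module's C and is not existing OpenSIPS code. The class name, the method names and the three-missed-heartbeats threshold are all hypothetical.

```python
import time


class HeartbeatTracker:
    """Records the last HEARTBEAT time per destination and flags stale ones."""

    def __init__(self, interval=1.0, missed_allowed=3, clock=time.monotonic):
        self.interval = interval              # expected seconds between heartbeats
        self.missed_allowed = missed_allowed  # hypothetical tolerance before "stale"
        self.clock = clock                    # injectable so it can be tested
        self.last_seen = {}

    def record(self, dest):
        """Call whenever a HEARTBEAT event arrives from dest."""
        self.last_seen[dest] = self.clock()

    def age(self, dest):
        """Seconds since the last heartbeat, or None if none was ever seen."""
        seen = self.last_seen.get(dest)
        return None if seen is None else self.clock() - seen

    def is_stale(self, dest):
        """True if dest never sent a heartbeat or has missed too many."""
        age = self.age(dest)
        return age is None or age > self.interval * self.missed_allowed


if __name__ == "__main__":
    tracker = HeartbeatTracker(interval=1.0, missed_allowed=3)
    tracker.record("192.168.151.229:8021")
    print(tracker.is_stale("192.168.151.229:8021"))
```

A stale result could drive either action described above (disable the destination or force a reconnect), and the value returned by age() is the kind of figure that could be exposed for monitoring.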

OS/environment information

Operating System: AlmaLinux 9
OpenSIPS installation: Manual packages
other relevant information:
- Running a custom load_balancer module as per PR #3351, "new operational mode - percent with CPU" (I do not believe that is relevant to the fault described)


spacetourist changed the title ~~[BUG] segault in freeswitch/load_balancer module after connection loss~~ [BUG] segfault in freeswitch/load_balancer module after connection loss Sep 12, 2024
