-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support zero-knowledge scaling of RTPEngine #892
base: master
Are you sure you want to change the base?
Conversation
This will contains the IP address RTPEngine advertised for this call.
…address instead of generating an address from a the static configuration
I forgot to note the expected configuration required for this setup to work: The configuration should have:
Example:
|
Ok, so to summarise: Normally rtpengine stores the name (label) and the ID of the local interface used for each local port into Redis, and then uses that information to restore the same status when a call is restored. This patch adds a plain interface address to the call structure, and when present, rtpengine uses that address instead of the actual interface address when communicating to the outside world, while the actual UDP socket is still bound to whatever local interface is configured. Did I get it right so far? |
Yes, that's basically it. This works well for me where my nodes are running behind a 1-to-1 NAT where what we store and advertise is an IP address not listed on the local interface. I don't think it matters for this behavior change (I assume that without the NAT public IP address, The main issue is that the stored interface address belongs to another node (it is only used for "foreign calls"), so the RTP socket is actually bound on another machine. |
Ok. I don't have a problem adding such a feature, but obviously it would have to be optional and configurable. Also I'd like to have it in a more generic way, meaning without the requirement for a special case config. This can probably be achieved by having the "connection address" stored not per call, but rather per socket (or per media), and perhaps add a flag to configured interfaces to signal that this is an interface that can receive non-local network traffic. |
Sorry for taking the time to respond. I want to advance this feature request, but am running into difficulties:
|
Re 1. Yes all streams go through the same server, but a single call can have ports/sockets allocated from multiple different interfaces. This feature currently requires that only a single interface is configured. I'd like to lift this restriction and make it work also in scenarios with multiple interfaces present. SRTP should not really be affected by this. If it works without this feature, it should work with it. At least I see no reason why it shouldn't 😄 |
8ff568b
to
5f62bfd
Compare
@rfuchs - I'd appreciate your input on a problem I'm with this feature: when rtpengine loads a foreign call from Redis, as part of loading the data structure into memory, it also tries to open all the ports described in the sfds. I've been getting error sometimes with that process as when multiple servers are handling calls, they can open the same ports numbers for different calls (on different servers), so when a foreign call is loaded, it might describe ports that are already in use by local calls. I tried to figure out how the static configuration - with multiple interfaces configured on the local machine (describing interfaces on remote machines), but I can't figure out how you avoid that problem in the static configuration either. I'd appreciate any pointers. The last commit - that tries to just not allocate the ports - was causing later code to SIGSEGV or SIGABRT, so I removed it. |
Possibly related to my last question, you said:
Would that mean that |
Ok, I've been spending a lot of time with this issue the last week, as I still can't get this feature to work, so here's a summary of the status - what I'm looking for and what I understood rtpengine does. Hopefully it will inform our communications. Current clustering methodI believe I failed to understand what the current cluster/redis support in rtpengine does. The way I understand it now is that the cluster setup is supposed to be as follows:
When a new While concurrent calls with different primaries may re-use port numbers, because each logical interface has its own port pool, it works fine because "conflicting" ports are bound on different IP addresses and the OS network stack handles that nicely.
Suggested "zero-knowledge" clustering method.What I'm trying to achieve is a dynamic cluster that supports:
Problems:
The setup I'm looking for will be like this:
When a new When an alternate receives a command (because the load balancer is dumb), it generates the correct SDP for continuing to route the media through the primary, because it has all the information: it knows all the port numbers and the controlling (foreign) network interface's familty and address.
The problems
Any pointers, screams or oy-veys are welcome :-) |
So each media can have different network interfaces
ports owned by another server are not bound locally, we just know what they are
Currently only missing endpoint addresses, but I need more
I was mistaken about the alternates not updating Redis - they do. So I'm trying to implement "don't ignore foreign updates, try to integrate them" for primaries. The issues I'm having are: |
For the underlying RTP sockets, I see two distinct cases:
In both cases, the logical interface name should still be honoured, so that multiple interfaces can be supported, even though in the first case it will likely be of little consequence. For the Redis part of the puzzle, I can't provide much insight since this sort of active/active failover is contributed code and we're not using rtpengine in this mode ourselves. I would assume that the distinction between "foreign" and owned call must remain so that standby instances don't act on these calls (e.g. doing timeouts or writing to Redis) by default. I would try to avoid sending signalling to instances that don't own the call if at all possible as that would complicate things considerably. Instead, receiving signalling (or RTP for that matter) should switch a call from "foreign" to "owned," giving that instance ownership of that call, presuming that whichever instance owned the call previously is not there any more. This switching of ownership could also happen explicitly instead, e.g. through some sort of command to rtpengine "take over all calls owned by instance X." This would require giving each instance a unique identifier and storing the owner ID for each call into Redis. |
Very interesting - I wasn't aware of that capability.
Yea, I'm looking at that complication now :-/ . One option we were looking at is "just" writing the session aware signaling load-balancer, to make sure all signaling work well, but we concluded that writing a reliable load-balancer for auto scaling is harder then utilizing the existing Redis active/active support. This assessment may have been incorrect.
AFAIK, each instance already have a unique ID: it adds it to the SDPs it replies with. I think storing it in Redis would be a good idea, but I didn't find where it is stored in memory. |
That's |
…own, saner logic with separation of data, policy and mechanism
Constant across restart does not help with my "immutable replaceable instances" scenario, but I guess I can make this as a hash on configured interfaces hardware IDs? |
…dis.c and into their own reusable objects
…to_params_sdes from JSON
…nt number the first param set, the second set should be numbered start from 1 (the first set is 0)
I dont need the extra warnings on read failure every time, especially when expected
9aa77c5
to
e767e59
Compare
otherwise, lets try to avoid parsing "" as "0.0.0.0:0", as that doesnt make sense.
by having codec.c create the encoded format in the same way it will later parse it, then just use it in redis.c
…ementation in json_media()
we need better resolution to properly ignore high speed redis updates being sent on the shared channel (now that updates are sent both ways)
refactor the handling of the redis "set" event so that if we answer for a foreign call and send an update, we can also ignore our update. In the future we may be able to completely drop the json_restore_call() path and use the more flexible redis-json implementation for all reading
So the last push (and I'll fix the merge conflicts in a bit) works for me: in a cluster of rtpengine instances that share a Redis database, all servers keep an up to date copy of the the call data structure for all calls, and when a non-owner of a call gets a command, they can answer it truthfully and if any new data needs to be generated - it is propagated correctly back to the owner so the owner can handle RTP streams with full knowledge of what answers where provided by alternates. There are still things that can be improved: even after 14abc07 there are some problems with payload descriptor encoding/decoding; I want to rebuild the JSON structure so it is hierarchical like the actual call structure and use numeric encoding for numbers instead of a set of string arrays; a couple of places where I had to copy code from But that being said, I believe this feature is ready to merge (after I fix the conflicts) |
Well this has grown into a rather large PR. Are you able to refactor/squash/rebase this into commits that are easier to review? |
Sorry for the late reply - I was sick at home for a couple of weeks and then covid19 happened :-/ I'll rebase the commits for something more manageable in the next few days. |
Also, @rfuchs , we encountered another edge case that breaks this work, that I would appreciate your feedback on: Looking at call_interfaces.c:975 - this gets triggered if we get a new This is obviously not what the original code was intended to handle, but I've added some code in the last commit to have other servers honor an RTPEngine that "takes ownership" of a call like that. My question is - shouldn't we do this always (why only in cluster mode)? If a new offer is received for an existing call, why not always destroy the old call and start a new call? |
Well, no, because rtpengine keeps call state intact across re-invites, and handles changes in the media descriptions gracefully (or at least attempts to). This is things like local ports allocated, packet counters, protocols used, codecs, crypto info, etc etc. I can't really comment on this particular code (nor anything else from the active/active mechanism) as it's been contributed by @smititelu and @inf265 and so I don't know why in this case the call is destroyed and re-created. |
In that case, I believe the correct approach would be to remove the code that has special handling of "offer for a known foreign call" added by @lbalaceanu - with the code in this PR, the reasoning in the comment is no longer needed: the "fallback server" will just return the same reply the original server sent and there is no loss of data and no need to restart anything. I'm going to do that and run some tests before pushing the change here. |
…n existing call id" This reverts commit 9b9a165. After removing "call restart in case of repeat offer", this is no longer needed.
Reverts commit e043670. If we get a second offer, and it is a retry, new redis data allows us to reply with the original server reply, so no need to restart the session.
I haven't been able to browse through the actual commits proposed by this pull request, but as far as I remember the original code you mention is related to this scenario: the primary rtpengine (owner) receives an offer from a SIP proxy, but its response doesn't reach this proxy in time. The proxy then sends the same offer to another rtpengine (alternate). Obviously there must be a way to re-conciliate the 2 rtpengines. The code in the original Redis fixed keyspace setup might seem at times cumbersome, but this is because over time it got tested and repaired and obviously different corner cases have been addressed. I consider that this progress should be kept and new features come as extensions or additions to working code. Thank you |
Hi @lbalaceanu - sorry it took me so long to respond. I've been dealing with other things that took attention away from this project. Regarding your comments, to the best of my knowledge an ability, all the new behavior I introduced up till now isn't supposed to change the original behavior - though this is based only on my reading of the code and guesses as to the intent of the original author, as the system architecture for which the "Redis fixed keyspace setup" was built is not documented anywhere and I have no way to actually test that the original behavior is intact. I would love to be able to re-produce such a set up to allow me to test such assumptions. Are you the original author and/or have a relevant setup that we can discuss? |
The current model for creating RTPEngine clusters, as documented in the Redis keyspace notification wiki page, require a static configuration where the exact topology of the cluster is fully known ahead of time, and a Redis keyspace is allocated for each cluster member.
Here we introduce a new way to handle RTPEngine clusters that allows a cluster to scale out and in automatically without pre-configuring each cluster member with all the known addresses of all other members. This configuration supports a completely stateless message distribution (for example as that offered by a layer 3 network load balancer) by remembering the network identity of a call owner in the Redis-distributed data structure and allowing a node that receives a command for a session created by another node to answer with the owner's network identity without knowing it ahead of time.
This patch, though minimal, is probably a bit too hackish - and is not configurable at all - so it is presented here as a basis for discussion. Please let me know what you think.