XEP-0198 SM support #17
Actually, it seems to be working fine for resumption as well, just one minor tweak
With SM, Wocky changed the API a bit by introducing new signals and changing the wocky-connector API/properties. The new code should conditionally enable SM if the Wocky API is >= 0.2. SM support should not require GLib/D-Bus API changes and works transparently with existing connection/channel interfaces.
Hi @rufferson, I'm testing this feature on my Sailfish phone, and it seems to work fine.
Thanks for testing. I do remember adding some setting to suppress/disable the NM disconnect, but for some reason neither my mc-tool nor even Empathy can see my Gabble accounts now (not sure what has changed), though I do see the accounts via a D-Bus viewer.
OK, I think it was because of the tp-glib version bump: I didn't relink Gabble against it (while being busy with Meson). Now that I can rebuild it with Meson, I see all accounts are back. And no, I didn't find any account parameter to disable it, but I remembered I tried to use gsettings to switch it off:
Either way, I assume neither of us spotted any regression, so it should be fairly safe to merge?
Sorry, I did not dive too deep into your code, but what happens to the queued stanzas when reconnection fails? Will the application get a notification?
So reconnection is retried indefinitely. It is retried on each heartbeat, but the heartbeat is skipped if a connection attempt is in progress (e.g. waiting for a timeout). There are only two conditions under which reconnection stops:
As long as reconnection is pending, the SM context is kept and the queued stanzas live in the SM context. When reconnection stops, the context is destroyed; the context is part of the porter and is destroyed with the porter (and, as it currently happens, the sending queue is discarded on porter destruction).
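The retry behaviour described above can be modelled roughly as follows. This is a minimal Python sketch of the described lifecycle, not the actual Wocky C code; all names are illustrative:

```python
class Resumer:
    """Toy model of heartbeat-driven resumption: retry indefinitely,
    skip a heartbeat while a connection attempt is in flight, and
    drop the SM context (with its queued stanzas) once retrying stops."""

    def __init__(self):
        self.sm_context = {"unacked": []}   # queued stanzas live here
        self.connecting = False
        self.resumed = False

    def on_heartbeat(self, connect_ok):
        if self.sm_context is None or self.resumed:
            return                  # stopped or already back: nothing to do
        if self.connecting:
            return                  # attempt in progress: skip this beat
        self.connecting = True
        self.resumed = connect_ok   # outcome of this resumption attempt
        self.connecting = False

    def stop(self):
        # One of the stop conditions hit: the context, and with it the
        # send queue, is discarded along with the porter.
        self.sm_context = None
```

The key point the comment makes is visible here: the queued stanzas only survive as long as `sm_context` does.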
I guess that's what happens in my Sailfish network-change moments. I'll need to investigate that further.
I tested on my Sailfish device with
I'm not 100% sure if the heartbeat mechanism is broken because of my backport, though. |
OK, 40 min is long. In my tests it was giving up at ~10 min, sometimes 7, sometimes 12, so I raised the resumption window on my test server from 5 to 10 min and tested the success and failure cases against that. And that's also the reason the default is set to 600 s (normally it takes the value from the server).
The phone was unlocked most of the time. I was also following the logs via ssh the whole time, so networking was active. Couldn't we expect the server to send a reply to
That would be too eager: in my tests the TCP stack recovers fully and instantly even after several minutes of blackout. But I've already introduced a counter for the number of unacked Rs, so now we just need to agree on when to give up. I was targeting giving up at the 10th, which with a 30 s heartbeat makes 5 min, but on SFOS the heartbeat is 150 s, so that would make 25 min. So perhaps 3 (7.5 min on SFOS, 1.5 min otherwise) would be a reasonable middle ground?
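The give-up counter being proposed might look something like this. A hedged sketch: the threshold of 3 and all names are taken from this discussion, not from Wocky's actual code:

```python
MAX_MISSED_ACKS = 3   # proposed threshold from the discussion above

class HeartbeatMonitor:
    """Count <r/> ack requests that never received an <a/> answer;
    abort the connection (which triggers resumption) once too many
    heartbeats in a row have gone unanswered."""

    def __init__(self):
        self.unanswered_r = 0
        self.aborted = False

    def on_heartbeat_send_r(self):
        self.unanswered_r += 1
        if self.unanswered_r >= MAX_MISSED_ACKS:
            self.aborted = True   # abort -> porter starts resumption

    def on_ack_received(self):
        self.unanswered_r = 0     # any <a/> proves the link is alive

def seconds_to_abort(heartbeat_interval_s):
    # 3 missed beats: 1.5 min with a 30 s beat, 7.5 min on SFOS (150 s)
    return MAX_MISSED_ACKS * heartbeat_interval_s
```

The `seconds_to_abort` helper just reproduces the arithmetic from the comment: 3 × 30 s = 90 s, 3 × 150 s = 450 s.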
You could also make it a configuration parameter. But 3 seems fine for now; at least we'll keep the maximum value under 10 minutes.
Can you try my gir branch? The last commit adds a connection abort on the 3rd missed heartbeat. The abort then triggers resumption.
I keep getting errors like
That part seems to work fine!
So the server acks more than we (think we) have sent. It might be that some stanzas are injected outside of the porter (using the connection object instead), which are then not counted. Can you try
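For context, an overrun like this can be detected by comparing the server's `h` value against the local outbound count; XEP-0198 counters are modulo 2^32. A hypothetical sketch (function and parameter names are invented for illustration):

```python
MOD = 2 ** 32  # XEP-0198 counters wrap at 2^32

def handle_ack(sent_count, unacked_queue, h):
    """Drop acknowledged stanzas from the unacked queue; raise if the
    server acks more than we (think we) have sent, e.g. because some
    stanza was written behind the porter's back and never counted."""
    already_acked = sent_count - len(unacked_queue)
    newly_acked = (h - already_acked) % MOD
    if newly_acked > len(unacked_queue):
        raise ValueError(f"ack overrun: h={h} but only "
                         f"{len(unacked_queue)} stanzas await an ack")
    return unacked_queue[newly_acked:]
```

The overrun branch is exactly the mismatch being debugged here: the server's `h` runs ahead of the locally counted sends.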
It seems the first disco request is not being counted (no
No, the XEP explicitly calls out the no-wait approach: after bind the stream is async, and no order can be expected or enforced whatsoever (except the RFC's order of processing). And I designed it to follow the spec: start counting after sending enable, and reset the inbound counter after getting enabled. But now I think that when I moved the SM context out into a separate struct, I'm likely resetting both counters on

Edit: actually no, I'm not resetting the context at all on
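The counter lifecycle the spec mandates, and which the comment above suspects is being violated, is roughly the following. A minimal sketch with hypothetical names:

```python
class SmCounters:
    """XEP-0198 counter lifecycle: start the outbound count when
    <enable/> is sent, reset only the inbound count when <enabled/>
    arrives; neither counter is reset again for the lifetime of the
    (possibly resumed) session."""

    def __init__(self):
        self.out_count = None   # None until <enable/> has been sent
        self.in_count = 0

    def on_enable_sent(self):
        self.out_count = 0      # count every outbound stanza from here on

    def on_enabled_received(self):
        self.in_count = 0       # reset inbound only; outbound must survive

    def on_stanza_sent(self):
        if self.out_count is not None:
            self.out_count += 1

    def on_stanza_received(self):
        self.in_count += 1
```

The bug suspected above would correspond to `on_enabled_received` zeroing `out_count` as well, which would make the local count lag behind the server's.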
Ah, I get it now. "Queuing message" is in the send callback, meaning it is executed after a successful send operation, and it queues the sent stanza to the unacked queue. So the reason we see "Queuing message" before the second send is just a coincidence; the second disco is queued later, which is visible in my example (second "Queueing" after disco, before enabled):
So it must be something else that injects the uncounted stanza.
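The ordering effect described above, where queueing happens in the send callback after the write completes, could be modelled like this. A hypothetical sketch, not the actual porter code:

```python
import collections

class TinyPorter:
    """Sends one stanza at a time; the stanza is appended to the
    unacked queue only in the *send callback*, i.e. after the write
    has finished, so "queuing" log lines trail the sends they
    belong to rather than preceding them."""

    def __init__(self):
        self.unacked = collections.deque()
        self.log = []

    def send(self, stanza):
        self.log.append(f"send {stanza}")
        self._send_callback(stanza)      # runs after the write finishes

    def _send_callback(self, stanza):
        self.log.append(f"queuing {stanza}")
        self.unacked.append(stanza)
```

With two back-to-back sends, each "queuing" entry follows its own "send" entry, matching the log interleaving discussed above.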
I got a clue: the sending operation of some stanza fails with an

Most of the time it happens after a reconnection, when my MAM implementation fetches many messages in a row. MAM simply uses

Wocky is unmodified from your sm branch. It happens on both PC and phone.
Hm, OK, but this error happens only when something injects a stanza outside of the porter. The porter serializes stanzas via a queue, so it won't start sending one until it has finished with another. This error (sending in progress), though, is an indication of data being sent outside the porter. I tried doing that once (before I got familiar with Wocky) and learned it the hard way :)
I did search for |
This is odd; any sidecars which could be injecting, then? The thing is, when this error is generated the stanza cannot be sent. The error is generated before anything is tried (before serialisation, before the write; simply the first check, then abort). And as I said, the porter itself cannot trip this error, as it is sequenced via the queue. Can you send a debug output sample surrounding this error? With
And this is super weird, because the stanza reaches the server and gets counted (hence the SM error afterwards).
Does this help?
Is this fragment of debug pre-filtered, or full? I just wonder why R suddenly kicks in; it kicks in under two conditions: too many unacked stanzas in the queue (more than 5, actually) or on the heartbeat timer. But I don't see heartbeat debug output here. BTW, is it happening only on the phone? I'm just thinking: Sailfish actually uses iphb (while non-SFOS uses normal timers), and iphb runs its own thread. But the queue is a normal queue, which is not thread-safe (unlike async_queue).
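The two triggers for sending an `<r/>` mentioned here can be captured in a small predicate. Purely illustrative; the threshold of 5 comes from the comment above:

```python
UNACKED_THRESHOLD = 5  # "more than 5 actually"

def should_send_r(unacked_len, heartbeat_fired):
    """An <r/> (ack request) goes out either when the unacked queue
    has grown past the threshold or when the heartbeat timer fires;
    under any other condition no <r/> should appear in the logs."""
    return unacked_len > UNACKED_THRESHOLD or heartbeat_fired
```

The point of the comment is that an `<r/>` appearing without either condition holding (no heartbeat debug, small queue) points at something abnormal, such as a cross-thread race on the non-thread-safe queue.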
No, it's also happening on my local machine, without any additional modifications to Wocky itself. I just added additional debug output for the heartbeat call, but it might take a while to catch an occurrence.
It seems the "r" stanza is being sent twice:
Something like
Just thinking aloud now...
Assuming

P.S.: good catch and useful debug :)
But that shouldn't affect ack counting because
It's been a while, but I figured out one stable hack and noticed another issue.

I'm currently using a workaround for the resend issue which is hacky, but might be okay to demonstrate the issue: noonien-d/wocky@085fcf8

In a cleaner approach, it might be nicer to remove the stanza from the queue when it gets sent, and to use another reference in the callback. That way resending is impossible.

A completely unrelated issue is that (at least on Sailfish) messages received on resumption don't get through to the GUI. They get lost somewhere in Gabble. I will try to reproduce that error with a bit more logging.
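The "cleaner approach" suggested above, popping the stanza off the send queue before the write and keeping a separate reference for the ack bookkeeping, could look roughly like this in a Python model (all names hypothetical):

```python
import collections

class SendQueue:
    """Remove the stanza from the pending queue *before* writing it
    and keep a separate reference in the unacked list for the ack
    bookkeeping, so a re-entrant or racing send can never pick the
    same stanza up twice."""

    def __init__(self):
        self.pending = collections.deque()
        self.unacked = []
        self.sent_log = []

    def enqueue(self, stanza):
        self.pending.append(stanza)

    def send_next(self):
        if not self.pending:
            return None
        stanza = self.pending.popleft()  # gone from the queue pre-write
        self.sent_log.append(stanza)     # stand-in for the actual write
        self.unacked.append(stanza)      # callback holds its own ref
        return stanza
```

Because the stanza leaves `pending` before the write even starts, a second `send_next` call cannot observe it again, which is exactly the resend the workaround commit is trying to suppress.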
Ehm, let me try to get my head around this commit; it doesn't make sense to me (but it still smells like a race condition). But with this change you don't get any ack overrun, right?
I really cannot reproduce it. I've made a synthetic test in Twisted, and I've recovered my old laptop and installed Plasma there to make a real-life test: I joined the Conversations MUC, which generated 1100 incoming and 400 outgoing messages within several seconds, and everything was smooth. If you're getting a double send, I need the traceback from when you hit it, as I've reviewed all the possible entry points and cannot see how it could happen.
Weirdly enough, I was not able to get a record of this error either.
If you are really not getting the double send after the code modification, that actually adds more weight to the race-condition case (the code modification caused a timing shift and hence avoided the race). Then again, either your other patches add some concurrency (another thread) which may trigger a send while the main loop is sending, or the client may be making some unusual call patterns (although the client is supposed to use D-Bus, and D-Bus runs in the main loop). Am I making it up, or did you mention once that you can reproduce it with other TP clients (not only commhistoryd)? I've used it so far with KDE and Empathy (and a Twisted Python test script); neither of those is able to trigger the error.
I had this issue with Empathy at least once. But I just checked my logs on the phone (which don't cover 100% of the last month):
This seems to be the cause: I disabled status signaling on resume and have had no problems since. I didn't manage to debug commhistoryd though...
I finally managed to get a debug trace of commhistoryd: it seems the first messages don't get their
This ends up in a null pointer on the messages

The root cause seems to be in TelepathyQt ... and might affect other clients as well.
Add support for Wocky Stream Management to the Gabble connection.
Wocky can do SM alone by storing the connector and its channel managers for future reconnections. However, since resumption may fail non-fatally (while still allowing it to continue with a new connection), it requires separate handling within Gabble.
This PR implements minimal support which, however, does not require any changes in the underlying GLib or D-Bus APIs. When Wocky signals a connection resumption, GabbleConnection lets it continue and merely signals the CONNECTING state upstream. This allows the client to update the UI to indicate a limited-connectivity state. In addition, the signal handler increases IQ timeouts (VCard and Disco for now) to 30 min. When resumed is signaled, it merely forwards it upstream as the CONNECTED state. And finally, when resume-done is signaled (which indicates the queue has been flushed), it resets the IQ timeouts back to their normal values.
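The handler flow described in this paragraph can be summarised in a schematic model. This is not the actual GabbleConnection code; the state names follow Telepathy's connection statuses, and the timeout values are illustrative except for the 30 min figure taken from the description:

```python
NORMAL_IQ_TIMEOUT_S = 180        # illustrative normal IQ timeout
RESUMING_IQ_TIMEOUT_S = 30 * 60  # 30 min while resuming, per the PR

class ConnectionSm:
    """Map Wocky's SM signals onto Telepathy-style connection states
    and stretch IQ timeouts (VCard/Disco) while the stream resumes."""

    def __init__(self):
        self.status = "CONNECTED"
        self.iq_timeout = NORMAL_IQ_TIMEOUT_S

    def on_resuming(self):
        self.status = "CONNECTING"             # UI: limited connectivity
        self.iq_timeout = RESUMING_IQ_TIMEOUT_S

    def on_resumed(self):
        self.status = "CONNECTED"              # stream is back up

    def on_resume_done(self):
        self.iq_timeout = NORMAL_IQ_TIMEOUT_S  # queue flushed: normal again
```

Note the asymmetry the PR describes: the state goes back to CONNECTED on resumed, but the IQ timeouts are only restored later, on resume-done, once the queued stanzas have been flushed.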
closes #16