
XEP-0198 SM support #17

Open · wants to merge 2 commits into master

Conversation

@rufferson (Collaborator) commented Dec 16, 2020

Add support for wocky Stream Management (XEP-0198) in the gabble connection.
Wocky can handle SM on its own by keeping the connector and its channel managers around for future reconnections; however, since resumption may fail non-fatally (while still allowing the session to continue over a new connection), it requires separate handling within Gabble.

This PR implements minimal support, which does not require any changes to the underlying GLib or D-Bus APIs. When wocky signals that connection resumption has started, GabbleConnection lets it continue and merely reports the CONNECTING state upstream. This allows the client to update its UI to indicate a limited-connectivity state. In addition, the signal handler raises the IQ timeouts (VCard and Disco for now) to 30 minutes. When resumed is signalled it is simply forwarded upstream as the CONNECTED state. And finally, when resume-done is signalled (which indicates the queue has been flushed), the IQ timeouts are reset back to their normal values.

closes #16
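
For illustration, a minimal standalone sketch of the mapping described above (all names and the 180 s "normal" timeout are made-up placeholders, not the actual gabble/wocky symbols):

/* toy model of the behaviour described above; not the gabble code */
#include <stdio.h>

static unsigned iq_timeout_secs = 180;          /* assumed normal default */

static void report_status (const char *status)
{
  printf ("connection status -> %s\n", status);
}

/* wocky signalled that stream resumption has started */
static void on_resuming (void)
{
  iq_timeout_secs = 30 * 60;                    /* stretch IQ timeouts to 30 min */
  report_status ("CONNECTING");                 /* UI can show limited connectivity */
}

/* the stream was resumed over a new TCP connection */
static void on_resumed (void)
{
  report_status ("CONNECTED");
}

/* the unacked queue has been flushed on the new connection */
static void on_resume_done (void)
{
  iq_timeout_secs = 180;                        /* back to the normal value */
  printf ("iq timeout restored to %u s\n", iq_timeout_secs);
}

int main (void)
{
  on_resuming ();
  on_resumed ();
  on_resume_done ();
  return 0;
}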

@rufferson (Collaborator, Author)

Actually it seems to be working fine for resumption as well, just one minor tweak:

 With SM, Wocky changed its API a bit by introducing new signals
 and changing the wocky-connector API/properties. The new code
 conditionally enables SM if the wocky API is >= 0.2. SM support
 should not require glib/dbus API changes and works transparently
 with the existing connections/channels interfaces.
@noonien-d (Contributor)

Hi @rufferson, I'm testing this feature on my Sailfish phone, and it seems to work fine.
Unfortunately, resumption does not happen that often, because mission-control (?) disconnects the connection when, for example, switching between wifi and mobile.

@rufferson (Collaborator, Author) commented Jan 24, 2021

Thanks for testing. I do remember adding some setting to suppress/disable the NM disconnect, but for some reason neither my mc-tool nor even empathy can see my gabble accounts now (not sure what has changed) - I do see the accounts via the D-Bus viewer though :/

@rufferson (Collaborator, Author)

Ok, I think it was because of the tp-glib version bump and I hadn't relinked gabble against it (while being busy with meson). Now that I can rebuild it with meson, I see all the accounts are back. And no, I didn't find any account params to disable it, but I remembered I had tried to use gsettings to switch it off:

$ gsettings get im.telepathy.MissionControl.FromEmpathy use-conn
false

Either way, I assume neither of us spotted any regression, so it should be fairly safe to merge?

@noonien-d (Contributor)

Sorry, I did not dive too deep into your code, but what happens to the queued stanzas when reconnection fails? Will the application get a notification?
In my implementation I had something that explicitly reports TP_DELIVERY_STATUS_PERMANENTLY_FAILED for all leftover messages in the queue, but I'm not sure if there is some hidden magic in your implementation 😅

@rufferson (Collaborator, Author) commented Feb 11, 2021

So re-connection is retried indefinitely. It is retried on each heartbeat, but the heartbeat is skipped if a connection attempt is already in progress (e.g. waiting for a timeout). There are only two conditions under which reconnection stops:

  • Successfully connected but got an XMPP error (not a network error)
  • The connection is explicitly closed (cancellable->cancel) - which is again a non-network error.

As long as reconnection is pending, the SM context is kept and the queued stanzas live in the SM context. When reconnection stops, the context is destroyed - the context is part of the porter and is destroyed with it (and, as it currently stands, the sending queue is discarded on porter destruction).
About notifications: wocky signals all the SM state changes (resuming, resumed, resume-done, resume-failed) upstream, and some of them are forwarded further by gabble (as far as possible within the existing signals; no new signals are introduced). But when the porter dies it is a standard 'connection failed' from the app's perspective (state change to 0, reason 'error'). SM-based state changes are connected->connecting, reason none, and connecting->connected, reason none (iirc).

@noonien-d (Contributor) commented Feb 11, 2021

Connection is explicitly closed (cancellable->cancel) - which is again non-network error.

I guess that's what happens at my Sailfish network-change moments. I'll need to investigate that further.
From a user's perspective there is no regression due to this behaviour, so it shouldn't keep us from merging the branch.

@noonien-d (Contributor)

I tested on my Sailfish device with use-conn=false (I needed to recompile mission-control for this) and made some observations:

  • I switched to another wifi, telepathy does not react because of use-conn=false
  • gabble keeps sending Writing xml: <r xmlns="urn:xmpp:sm:3"/> every 30s without reply, but doesn't timeout
  • After about 40 minutes, connection breaks with Error receiving stanza: Error receiving data: Invalid argument
  • SM resume fails, because the time limit has been set to 10 minutes -> reconnection, sent messages are lost

I'm not 100% sure if the heartbeat mechanism is broken because of my backport, though.

@rufferson (Collaborator, Author)

Ok, 40 min is big. In my tests it was giving up at ~10 min, sometimes 7, sometimes 12, so I raised the resumption window on my test server from 5 to 10 min and tested the success and failure cases with that. That's also the reason the default is set to 600 s (normally it takes the value from the server).
Either way, I think it's time to add a reconnect tracker to the porter, since relying on the TCP stack - especially under non-monotonic timer conditions such as the sailfish sleep state - is a long shot. Is the hb still 30 sec in sleep state though? I thought it was increased when sleeping.

@noonien-d (Contributor)

The phone was unlocked most of the time. I was also following the logs over ssh the whole time, so networking was active.

Couldn't we expect the server to send a reply to <r /> within the heartbeat interval, and use that as a criterion?

@rufferson (Collaborator, Author)

That would be too aggressive - in my tests the TCP stack recovers fully and instantly even after several minutes of blackout. But I've already introduced a counter for the number of unacked Rs, so now we just need to agree on when to give up. I was aiming to give up at the 10th, which with a 30 sec hb makes 5 min, but on SFOS the hb is 150 sec, so that would make 25 min. So perhaps 3 (7.5 min on SFOS, 1.5 min otherwise) would be a reasonable middle ground?
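
For illustration, a rough sketch of that give-up logic (names and structure are made up, not the wocky code): count outstanding <r/> requests, reset on any <a/> from the server, and abort once the threshold is exceeded so resumption kicks in.

#include <stdio.h>
#include <stdbool.h>

#define MAX_MISSED_R 3   /* proposed middle ground from the discussion */

static unsigned unacked_r = 0;

/* called from the heartbeat: send <r xmlns='urn:xmpp:sm:3'/> and count it */
static bool heartbeat_send_r (void)
{
  if (++unacked_r > MAX_MISSED_R)
    {
      printf ("no <a/> after %u requests - aborting, resumption will follow\n",
              unacked_r - 1);
      return false;             /* caller aborts the connection */
    }
  printf ("sending <r/>, %u request(s) outstanding\n", unacked_r);
  return true;
}

/* called whenever an <a h='...'/> arrives from the server */
static void on_ack_received (void)
{
  unacked_r = 0;
}

int main (void)
{
  heartbeat_send_r ();          /* 1st request */
  on_ack_received ();           /* server answered, counter resets */
  for (int i = 0; i < 4; i++)   /* network is now silent */
    if (!heartbeat_send_r ())
      break;
  return 0;
}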

@noonien-d (Contributor)

You could also make it a configuration parameter.

But 3 seems fine for now; at least we'll keep the maximum under 10 minutes.

@rufferson (Collaborator, Author)

Can you try my gir branch? The last commit adds a connection abort on the 3rd missed hb; the abort then triggers resumption.

@noonien-d (Contributor)

I keep getting errors like Invalid acknowledgement 23, must be between 22 and 22 on my Sailfish build; I guess I'll need to test a little further. The current build is heavily patched, maybe I should try with a cleaner version...

Can you try my gir branch? last commit adds connection abort on 3d missed hb. Abort then triggers resumption.

That part seems to work fine!

@rufferson (Collaborator, Author)

So the server acks more than we (think we) have sent. It might be that some stanzas are injected outside of the porter (using the connection object directly), which are then not counted. Can you try grep wocky_xmpp_connection_send_stanza_async -R src/ in the SFOS-patched gabble source tree?
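
For context, that error presumably comes from checking the server's <a h='N'/> against our own outbound counter. A toy version of such a check (variable names are illustrative, not the actual wocky ones):

#include <stdio.h>

/* XEP-0198: the client counts the stanzas it has sent since <enable/>;
 * a server ack may neither go backwards nor exceed that count. */
static int validate_ack (unsigned acked, unsigned last_acked, unsigned sent)
{
  if (acked < last_acked || acked > sent)
    {
      fprintf (stderr, "Invalid acknowledgement %u, must be between %u and %u\n",
               acked, last_acked, sent);
      return -1;
    }
  return 0;
}

int main (void)
{
  /* the reported case: the server acks 23 while we only counted 22 sends,
   * e.g. because one stanza bypassed the porter and was never counted */
  return validate_ack (23, 22, 22) == 0 ? 0 : 1;
}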

@noonien-d (Contributor)

It seems the first disco request is not being counted (there is no Queueing stanza for it in the log).
Shouldn't we wait for the reply to enable before announcing the connected state to gabble?

(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.215: _write_node_tree: Serializing tree:
* enable xmlns='urn:xmpp:sm:3' max='600' resume='true'
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.215: Writing xml: <enable max="600" resume="true" xmlns="urn:xmpp:sm:3"/>
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.217: connector_connected (connection.c:2116): connected (jid: [email protected]/resource)
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.217: wocky_ping_set_property: wocky-ping.c:91: updated ping interval to 150
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.218: wocky_heartbeat_source_wait: wocky-heartbeat-source.c:117: requested 150 as maximum interval; using the recommended 150 interval
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.218: connector_connected (connection.c:2130): 
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.218: gabble_roster_porter_available_cb (roster.c:1748): adding callbacks
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.218: porter_available_cb (im-factory.c:638): adding callbacks
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: connector_connected (connection.c:2197): Created self handle 1, our JID is [email protected]/resource
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_connection_refresh_capabilities (connection.c:2734): incorporating caps for org.freedesktop.Telepathy.Client.voicecall:
  --begin--
  Feature: http://jabber.org/protocol/jingle/description/audio
  Feature: urn:xmpp:jingle:apps:rtp:1
  Feature: urn:xmpp:jingle:apps:rtp:audio
  Feature: urn:xmpp:jingle:apps:rtp:rtp-hdrext:0
  Feature: urn:xmpp:jingle:apps:rtp:rtcp-fb:0
  --end--

(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_presence_set_capabilities (presence.c:335): about to add caps to resource resource with serial 3
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_presence_set_capabilities (presence.c:346): found resource resource
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_presence_set_capabilities (presence.c:351): new serial 3, old 2, clearing caps
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_presence_set_capabilities (presence.c:359): updating caps for resource resource
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_connection_refresh_capabilities (connection.c:2757): nothing to do
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.220: gabble_disco_request_with_timeout (disco.c:498): Creating disco request 0x7aa620 for domain.com
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.220: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='domain.com' id='afa38b88-f6a3-4b9d-9db0-cb8d17b2c113'
    * query xmlns='http://jabber.org/protocol/disco#info'
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.220: Writing xml: <iq type="get" to="domain.com" id="afa38b88-f6a3-4b9d-9db0-cb8d17b2c113"><query xmlns="http://jabber.org/protocol/disco#info"/></iq>
(telepathy-gabble:15877): gabble-DEBUG: 19:50:18.221: gabble_disco_request_with_timeout (disco.c:498): Creating disco request 0resource2b80 for [email protected]
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.221: Parsing chunk: <enabled id='ea8d59a3-6df1-40d2-8b35-a139e39d6982' max='300' resume='true' xmlns='urn:xmpp:sm:3'/>
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.221: _end_element_ns: Received stanza
* enabled xmlns='urn:xmpp:sm:3' id='ea8d59a3-6df1-40d2-8b35-a139e39d6982' max='300' resume='true'
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.222: wocky_porter_sm_handle: wocky-c2s-porter.c:1717: SM on connection 0x1c411a8 is enabled
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.222: send_stanza_cb: wocky-c2s-porter.c:738: Queueing stanza iq
(telepathy-gabble:15877): wocky-DEBUG: 19:50:18.222: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='[email protected]' id='feade433-e0bf-4f56-ae92-5205ca65510a'
    * query xmlns='http://jabber.org/protocol/disco#info'

@rufferson (Collaborator, Author) commented Feb 21, 2021

No, the XEP explicitly calls out a no-wait approach: after bind the stream is asynchronous, so no ordering can be expected or enforced whatsoever (except the RFC's order of processing). And I designed it to follow the spec - start counting after sending enable, and reset the inbound counter after getting enabled. But now I think that when I moved the SM context out into a separate struct I'm likely resetting both counters on enabled.

Edit: actually no, I'm not resetting the context at all on enabled, only on failed. And in my earlier debug traces I can see a Queueing message for the initial disco, but now that I test I don't have it - and neither do I have the ack overflow error. This needs a more detailed look.

(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_presence_set_capabilities (../src/presence.c:335): about to add caps to resource 58d0073e with serial 3
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_presence_set_capabilities (../src/presence.c:346): found resource 58d0073e
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_presence_set_capabilities (../src/presence.c:350): new serial 3, old 2, clearing caps
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_presence_set_capabilities (../src/presence.c:359): updating caps for resource 58d0073e
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_connection_refresh_capabilities (../src/connection.c:2729): nothing to do
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_disco_request_with_timeout (../src/disco.c:497): Creating disco request 0x555555774180 for xmpp.zone
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='xmpp.zone' id='02f10fe2-76fb-41b8-a9b0-3e0643426214'
    * query xmlns='http://jabber.org/protocol/disco#info'
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: Writing xml: <iq type="get" to="xmpp.zone" id="02f10fe2-76fb-41b8-a9b0-3e0643426214"><query xmlns="http://jabber.org/protocol/disco#info"/></iq>
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_disco_request_with_timeout (../src/disco.c:497): Creating disco request 0x55555575a700 for [email protected]
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: send_stanza_cb: ../subprojects/wocky/wocky/wocky-c2s-porter.c:737: Queueing stanza iq
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='[email protected]' id='5f6570aa-92e1-4f61-b181-0216fbc62230'
    * query xmlns='http://jabber.org/protocol/disco#info'
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: Writing xml: <iq type="get" to="[email protected]" id="5f6570aa-92e1-4f61-b181-0216fbc62230"><query xmlns="http://jabber.org/protocol/disco#info"/></iq>

@rufferson (Collaborator, Author)

Ah, I get it now: the Queueing message is in the send callback, meaning it is executed after a successful send operation and queues the sent stanza onto the unacked queue. So the fact that we see the Queueing message before the second send is just a coincidence; the second disco is queued later - which is visible in my example (the second Queueing comes after the disco, before enabled):

(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_disco_request_with_timeout (../src/disco.c:497): Creating disco request 0x555555774180 for xmpp.zone
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='xmpp.zone' id='02f10fe2-76fb-41b8-a9b0-3e0643426214'
    * query xmlns='http://jabber.org/protocol/disco#info'
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: Writing xml: <iq type="get" to="xmpp.zone" id="02f10fe2-76fb-41b8-a9b0-3e0643426214"><query xmlns="http://jabber.org/protocol/disco#info"/></iq>
(telepathy-gabble:388156): gabble-DEBUG: 08:47:32.473: gabble_disco_request_with_timeout (../src/disco.c:497): Creating disco request 0x55555575a700 for [email protected]
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: send_stanza_cb: ../subprojects/wocky/wocky/wocky-c2s-porter.c:737: Queueing stanza iq
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: _write_node_tree: Serializing tree:
* iq xmlns='jabber:client' type='get' to='[email protected]' id='5f6570aa-92e1-4f61-b181-0216fbc62230'
    * query xmlns='http://jabber.org/protocol/disco#info'
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: Writing xml: <iq type="get" to="[email protected]" id="5f6570aa-92e1-4f61-b181-0216fbc62230"><query xmlns="http://jabber.org/protocol/disco#info"/></iq>
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.473: send_stanza_cb: ../subprojects/wocky/wocky/wocky-c2s-porter.c:737: Queueing stanza iq
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.508: Parsing chunk: <enabled resume='true' max='300' id='g2gCbQAAAAg1OGQwMDczZWgDYgAABk1iAA2i1GIAQzTD' xmlns='urn:xmpp:sm:3'/>
(telepathy-gabble:388156): wocky-DEBUG: 08:47:32.508: _end_element_ns: Received stanza
* enabled xmlns='urn:xmpp:sm:3' resume='true' max='300' id='g2gCbQAAAAg1OGQwMDczZWgDYgAABk1iAA2i1GIAQzTD'

So it must be something else which injects the uncounted stanza.

@noonien-d (Contributor)

I got a clue: the sending operation for some stanza fails with an Another send operation is pending error. While the stanza actually ends up at the server, the porter doesn't count it.

Most of the time it happens after a reconnection, when my MAM implementation fetches many messages in a row. MAM simply uses _gabble_connection_send_with_reply, though.

Wocky is unmodified from your sm branch. It happens on both PC and phone.

@rufferson (Collaborator, Author)

Hm, ok, but this error only happens when something injects a stanza outside of the porter. The porter serializes stanzas via its queue, so it won't start sending one until it has finished with another. This error (sending in progress) is therefore an indication of data being sent outside the porter. I tried doing that once (before I got familiar with wocky) and learned it the hard way :)

@noonien-d (Contributor)

I did search for wocky_xmpp_connection_send_stanza_async in the gabble source tree, but there was no result.

@rufferson (Collaborator, Author)

This is odd - any sidecars which could be injecting stanzas then? The thing is, when this error is generated the stanza cannot be sent: the error is raised before anything is attempted (before serialisation, before the write - simply the first check, then abort). And as I said, the porter cannot trip this error itself as it is sequenced via the queue. Can you send a debug output sample surrounding this error? With G_MESSAGES_DEBUG=all WOCKY_DEBUG=all GABBLE_DEBUG=all

@noonien-d (Contributor)

This is odd - any sidecars which could be injecting stanzas then? The thing is, when this error is generated the stanza cannot be sent: the error is raised before anything is attempted (before serialisation, before the write - simply the first check, then abort).

And this is super weird, because the stanza reaches the server and gets counted (hence the SM error afterwards).

@noonien-d (Contributor)

Does this help?
I added a DEBUG_STANZA at wocky_xmpp_connection_send_stanza_async and some other output in send_stanza_cb. It seems to fail to send the r stanza at 00:17:56.056.

(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:55.264: CLIENT[0x5561866285f0]: claiming operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:55.264: CLIENT[0x5561866285f0]: claiming operation OP_WRITE succeeded
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:55.264: CLIENT[0x5561866285f0]: yielding operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:55.264: CLIENT[0x5561866285f0]: successfully write 139 bytes to TLS connection
(telepathy-gabble:17688): wocky-DEBUG: 00:17:55.264: send_stanza_cb: wocky-c2s-porter.c:764: Queueing stanza 1720 : message (null)
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.054: _write_node_tree: Serializing tree:
* message xmlns='jabber:client' type='chat' to='[email protected]/resource'
    * active xmlns='http://jabber.org/protocol/chatstates'
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.054: Writing xml: <message type="chat" to="[email protected]/resource"><active xmlns="http://jabber.org/protocol/chatstates"/></message>
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.054: CLIENT[0x5561866285f0]: starting to write 136 bytes to TLS connection
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.054: CLIENT[0x5561866285f0]: claiming operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.054: CLIENT[0x5561866285f0]: claiming operation OP_WRITE succeeded
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.054: CLIENT[0x5561866285f0]: yielding operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.054: CLIENT[0x5561866285f0]: successfully write 136 bytes to TLS connection
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.055: send_stanza_cb: wocky-c2s-porter.c:764: Queueing stanza 1721 : message (null)
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.055: _write_node_tree: Serializing tree:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.055: Writing xml: <r xmlns="urn:xmpp:sm:3"/>
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.055: CLIENT[0x5561866285f0]: starting to write 26 bytes to TLS connection
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.055: CLIENT[0x5561866285f0]: claiming operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.055: CLIENT[0x5561866285f0]: claiming operation OP_WRITE succeeded
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.055: CLIENT[0x5561866285f0]: yielding operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.056: CLIENT[0x5561866285f0]: successfully write 26 bytes to TLS connection
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.056: wocky_xmpp_connection_send_stanza_async: Failed: Another send operation is pending
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.056: _write_node_tree: Serializing tree:
* message xmlns='jabber:client' type='chat' to='[email protected]/resource' id='2a1bd003-dc76-4617-ad1f-3bf6e0e828d2'
    * body
        "meeh"
    * active xmlns='http://jabber.org/protocol/chatstates'
    * request xmlns='urn:xmpp:receipts'
    * markable xmlns='urn:xmpp:chat-markers:0'
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.057: Writing xml: <message type="chat" to="[email protected]/resource" id="2a1bd003-dc76-4617-ad1f-3bf6e0e828d2"><body>meeh</body><active xmlns="http://jabber.org/protocol/chatstates"/><request xmlns="urn:xmpp:receipts"/><markable xmlns="urn:xmpp:chat-markers:0"/></message>
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.057: CLIENT[0x5561866285f0]: starting to write 274 bytes to TLS connection
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.057: CLIENT[0x5561866285f0]: claiming operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.057: CLIENT[0x5561866285f0]: claiming operation OP_WRITE succeeded
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.057: CLIENT[0x5561866285f0]: yielding operation OP_WRITE
(telepathy-gabble:17688): GLib-Net-DEBUG: 00:17:56.057: CLIENT[0x5561866285f0]: successfully write 274 bytes to TLS connection
(telepathy-gabble:17688): wocky-DEBUG: 00:17:56.057: send_stanza_cb: wocky-c2s-porter.c:721: Canceling stanza Another send operation is pending

@rufferson (Collaborator, Author) commented Mar 5, 2021

Is this fragment of the debug log pre-filtered or complete? I just wonder why the R suddenly kicks in; it kicks in under two conditions - too many unacked stanzas in the queue (more than 5, actually) and on the hb timer. But I don't see any hb debug messages here.
The HB is actually the only other place which calls xmpp_connection_send_stanza - the first is the sending queue processing chain. The HB should also be skipped if the queue is not empty. However, if there's a mistake in calculating the queue length it may misfire and corrupt the queue.

BTW, is it happening only on the phone? I'm just thinking that sailfish actually uses iphb (while non-SFOS uses normal timers), and iphb runs its own thread. But the queue is a normal queue which is not thread-safe (unlike async_queue).

@noonien-d (Contributor) commented Mar 5, 2021

BTW, is it happening only on the phone? I'm just thinking that sailfish actually uses iphb (while non-SFOS uses normal timers), and iphb runs its own thread. But the queue is a normal queue which is not thread-safe (unlike async_queue).

No, it's also happening on my local machine, without any additional modifications to wocky itself.

I just added additional debug output for the heartbeat call, but it might take a while to catch an occurrence.

@noonien-d (Contributor)

It seems the "r" stanza is being sent twice:

(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: wocky_porter_sm_handle: wocky-c2s-porter.c:1704: send 'r' while reqs=0 - 771 776 10
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: send_head_stanza: send head:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: _write_node_tree: Serializing tree:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: Writing xml: <r xmlns="urn:xmpp:sm:3"/>
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.391: CLIENT[0x55e4f795a800]: starting to write 26 bytes to TLS connection
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.391: CLIENT[0x55e4f795a800]: claiming operation OP_WRITE
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.391: CLIENT[0x55e4f795a800]: claiming operation OP_WRITE succeeded
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.391: CLIENT[0x55e4f795a800]: yielding operation OP_WRITE
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.391: CLIENT[0x55e4f795a800]: successfully write 26 bytes to TLS connection
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: send_head_stanza: send head:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.391: wocky_xmpp_connection_send_stanza_async: Failed: Another send operation is pending:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.392: send_stanza_cb: Sent stanza:
* r xmlns='urn:xmpp:sm:3'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.392: send_head_stanza: send head:
* message xmlns='jabber:client' type='chat' to='[email protected]/resource' id='84008e61-da5c-4609-83f2-ca4746532a63'
    * body
        "Text"
    * active xmlns='http://jabber.org/protocol/chatstates'
    * request xmlns='urn:xmpp:receipts'
    * markable xmlns='urn:xmpp:chat-markers:0'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.392: _write_node_tree: Serializing tree:
* message xmlns='jabber:client' type='chat' to='[email protected]/resource' id='84008e61-da5c-4609-83f2-ca4746532a63'
    * body
        "Text"
    * active xmlns='http://jabber.org/protocol/chatstates'
    * request xmlns='urn:xmpp:receipts'
    * markable xmlns='urn:xmpp:chat-markers:0'
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.392: Writing xml: <message type="chat" to="[email protected]/resource" id="84008e61-da5c-4609-83f2-ca4746532a63"><body>Text</body><active xmlns="http://jabber.org/protocol/chatstates"/><request xmlns="urn:xmpp:receipts"/><markable xmlns="urn:xmpp:chat-markers:0"/></message>
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.392: CLIENT[0x55e4f795a800]: starting to write 347 bytes to TLS connection
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.392: CLIENT[0x55e4f795a800]: claiming operation OP_WRITE
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.392: CLIENT[0x55e4f795a800]: claiming operation OP_WRITE succeeded
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.392: CLIENT[0x55e4f795a800]: yielding operation OP_WRITE
(telepathy-gabble:143351): GLib-Net-DEBUG: 20:55:16.392: CLIENT[0x55e4f795a800]: successfully write 347 bytes to TLS connection
(telepathy-gabble:143351): wocky-DEBUG: 20:55:16.392: send_stanza_cb: wocky-c2s-porter.c:722: Canceling stanza Another send operation is pending

Something like send_head_stanza gets triggered before the callback removes the stanza from the queue?

@rufferson (Collaborator, Author) commented Mar 8, 2021

Just thinking aloud now...
send_head_stanza is called from three places:

  • At send_async, when the queue is empty (straight after making it non-empty)
  • At the send callback, when the queue is not empty (to continue the sending sequence)
  • At the send heartbeat callback, if the queue is not empty

Assuming 771 776 10 is acked/sent/window, the r is an early notification (at half the window), so it is triggered from inside the callback.
That calls send_stanza_async which, since the queue is empty at that point, calls send_head_stanza (and adds the r to the queue).
Now, when send_async returns, there is an r in the queue; the callback then sees that the queue is not empty and calls send_head_stanza again to flush it - writing the same r a second time.
I wonder why I never hit this >_>

P.S.: good catch and useful debug :)
P.P.S.: Can you check with latest gir branch?
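
For illustration, a standalone toy model of that re-entrancy (a heavily simplified queue, not the actual porter code): the <r/> is queued from inside the send callback, send_head_stanza runs once from send_stanza_async and once more when the callback continues, so the head is written again while the first write is still pending.

#include <stdio.h>
#include <stdbool.h>

static const char *queue[8];
static int q_len = 0;
static bool send_pending = false;   /* stands in for the xmpp connection's state */

static void send_head_stanza (void)
{
  if (q_len == 0)
    return;
  if (send_pending)
    {
      /* what the xmpp connection layer reports in the real logs */
      printf ("Failed: Another send operation is pending: %s\n", queue[0]);
      return;
    }
  send_pending = true;
  printf ("writing %s\n", queue[0]);
}

static void send_stanza_async (const char *stanza)
{
  bool was_empty = (q_len == 0);
  queue[q_len++] = stanza;
  if (was_empty)
    send_head_stanza ();            /* trigger #1 */
}

/* completion callback for the stanza that was just written */
static void send_stanza_cb (void)
{
  send_pending = false;
  for (int i = 1; i < q_len; i++)   /* the written stanza leaves the send queue */
    queue[i - 1] = queue[i];
  q_len--;
  /* early-ack notification: queue an <r/> from inside the callback */
  send_stanza_async ("<r xmlns='urn:xmpp:sm:3'/>");
  if (q_len > 0)
    send_head_stanza ();            /* trigger #2: picks up the same <r/> again */
}

int main (void)
{
  send_stanza_async ("<message/>");
  send_stanza_cb ();                /* completing <message/> reproduces the double send */
  return 0;
}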

@rufferson (Collaborator, Author)

But that shouldn't affect the ack counting, because the r is not counted? And this condition only triggers when the queue is empty but the unacked count is high (e.g. at the end of a sending burst). So I might have hit this and just didn't notice, as I presume such a situation always happens during the initial setup sequence (disco, presence, pep, mam, etc.).
The problem is two send_head calls in a single callback; any stanza added after the callback has finished should follow the normal sequence. If stanzas are injected during the callback (from another thread) that may break the porter regardless of SM (by damaging the non-thread-safe queue).
There might be some racy case where a stanza is added right after the faulty callback and before the callback's callback. Then the misfired send_head's callback may remove the wrong element from the queue - but then again that should not break the counting (though it may break higher-level logic in gabble). Anyway, let's see, maybe it does something dodgy as a side effect. I'm still unable to reproduce this IRL; I may need to create a unit test for it.

@noonien-d (Contributor)

It's been a while, but I've figured out one stable hack and noticed another issue:

I'm currently using a workaround for the resend issue which is hacky, but it might be okay as a demonstration of the problem: noonien-d/wocky@085fcf8

In a cleaner approach, it might be nicer to remove the stanza from the queue when it gets sent, and to use a separate reference in the callback. That way resending would be impossible.
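
For illustration, a toy version of that cleaner approach (same simplified model as the earlier sketch, not the porter code): pop the head before writing it and hand the popped reference to the completion callback, so a re-entrant trigger finds an empty queue instead of the same stanza.

#include <stdio.h>
#include <stdbool.h>

static const char *queue[8];
static int q_len = 0;
static bool send_pending = false;

/* pop the head before writing it; the popped pointer is what the
 * completion callback gets, so a re-entrant call cannot resend it */
static const char *send_head_stanza (void)
{
  if (q_len == 0 || send_pending)
    return NULL;
  const char *head = queue[0];
  for (int i = 1; i < q_len; i++)
    queue[i - 1] = queue[i];
  q_len--;
  send_pending = true;
  printf ("writing %s\n", head);
  return head;
}

int main (void)
{
  queue[q_len++] = "<r xmlns='urn:xmpp:sm:3'/>";
  const char *sent = send_head_stanza ();   /* first trigger takes the stanza */
  const char *again = send_head_stanza ();  /* re-entrant trigger finds nothing */
  printf ("first: %s, second: %s\n", sent ? sent : "(null)", again ? again : "(null)");
  return 0;
}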

A completely unrelated issue is that (at least on Sailfish) messages received on resumption don't get through to the GUI; they get lost somewhere in gabble. I will try to reproduce that with a little more logging.

@rufferson (Collaborator, Author)

Ehm, let me try to get my head around this commit - it doesn't make sense to me yet (but it still smells like a race condition). But with this change you don't get any ack overrun, right?
The GUI issue may have something to do with the signal sequence - commhistoryd may just go nuts over the signal+reason combination I'm using (connecting+error, connected+none).

@rufferson (Collaborator, Author)

I really cannot reproduce it. I've made a synthetic test in twisted, and I've recovered my old laptop and installed plasma there for a real-life test - I joined the conversations MUC, which generated 1100 incoming and 400 outgoing messages within several seconds - everything was smooth. If you're getting the double send I need a traceback from when you hit it, as I've reviewed all the possible entry points and cannot see how it could happen.

@noonien-d (Contributor)

Weirdly enough, I was not able to get a record of this error either.
I still use my workaround though, so I wouldn't notice it without watching the logs.

@rufferson (Collaborator, Author)

If you are really not getting the double send after the code modification, that actually adds more weight to the race-condition theory (the code modification shifted the timing and hence avoided the race). Then again, either your other patches add some concurrency (another thread) which may trigger a send while the main loop is sending, or the client may be producing some unusual call pattern (although the client is supposed to use dbus, and dbus runs in the main loop). Am I making it up, or did you mention once that you can reproduce it with other TP clients (not only commhistoryd)? I've used it so far with kde and empathy (and a twisted python test script); neither of those is able to trigger the error.

@noonien-d (Contributor)

I had this issue with empathy at least once.

But I just checked my logs on the phone (which don't cover 100% of the last month):
My last build was on March 15, and the last entry in the log is from March 19. Nothing since then. Super weird.

@noonien-d (Contributor)

[…]The GUI - it may have something to do with signal sequence, could be commhistoryd just goes nuts from signal+reason combination I'm using (connecting+error, connected+none)

This seems to be the cause: I disabled status signalling on resume and there have been no problems since. I didn't manage to debug commhistoryd though ...

@noonien-d (Contributor)

I finally managed to get a debug trace of commhistoryd: it seems the first messages don't get their sender property set, probably because of this:

[W] unknown:0 - tp-qt 0.9.8 WARN: Gathering contacts failed: "org.freedesktop.Telepathy.Error.NotAvailable": "Connection::FeatureCore is not ready"

This ends up as a null pointer in the message's sender property, crashing commhistoryd. The crash is fixable, but due to the missing sender property we don't even know whether the message was sent by ourselves (from another device).

The root cause seems to be in TelepathyQt ... and might affect other clients as well.
