
Acceptor stops working after session end #763

Open

andrewrimmer opened this issue Mar 21, 2023 · 13 comments

andrewrimmer commented Mar 21, 2023

We have an intermittent issue in production, which causes quickfixn to stop working around the end of the session. The FIX server does not respond to connection attempts, and the FIX sessions are not created the following day. We have to restart our application for things to start working again.

I am not seeing any clues in our logs, and we are not getting any exceptions reported.

At the end of the session we get the following messages/events:

  • LOGOUT
  • Session FIXT.1.1:[redacted]->[redacted] disconnecting: Resetting...
  • Session reset: Out of SessionTime (Session.Next())

Then, on a day when it is working, we will see repeated connection/logon attempts until the sessions open again.

However, if it is not working we will see nothing else in the logs after end time. At this stage we have to restart the application to get things working again.

We are using a version of quickfixn from the master branch of September 2022.

Any idea how we would go about troubleshooting an issue like this?

Is there any more logging we can enable, to perhaps see what might be occurring internally?

If quickfixn throws an exception, how would we log and surface this?
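
For reference, one way to surface more of the engine's internal activity is a custom log factory. This is only a minimal sketch, assuming the ILog/ILogFactory interfaces as they appear in the 2022-era master branch (names and members may differ in newer releases); ForwardingLog is a placeholder, not code from this thread.

```csharp
// Sketch of a custom QuickFIX/n log that forwards internal events to your own sink.
// Assumes the ILog/ILogFactory shapes from the ~2022 master branch.
using System;
using QuickFix;

public class ForwardingLog : ILog
{
    private readonly SessionID _sessionId;
    public ForwardingLog(SessionID sessionId) { _sessionId = sessionId; }

    public void Clear() { }
    public void OnIncoming(string msg) => Write("IN ", msg);
    public void OnOutgoing(string msg) => Write("OUT", msg);
    public void OnEvent(string s)      => Write("EVT", s);
    public void Dispose() { }

    private void Write(string kind, string text)
    {
        // Replace with your own sink (file, Elasticsearch, etc.).
        Console.WriteLine($"{_sessionId} {kind} {text}");
    }
}

public class ForwardingLogFactory : ILogFactory
{
    public ILog Create(SessionID sessionId) => new ForwardingLog(sessionId);
}
```

The factory can then be passed as the log-factory argument when constructing the acceptor. Note that exceptions caught inside the engine's accept loop may only be written to the console (see the later comments in this thread), so capturing console output is worth doing as well.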

We are using a pretty standard setup of quickfix, with the in-built SSL, on .NET Framework running on Windows Server. We haven't encountered this problem in dev or test environments, but we have less variety of connections coming in.

Any help, greatly appreciated.


andrewrimmer commented May 11, 2023

Whilst this issue is often observed at session end, it looks like it isn't directly related to the session end itself.

It looks like Quickfixn gets into trouble during the day, and stops accepting new connections.

When netstat is run there are perhaps 150-200 CLOSE_WAIT connections. Could this be what is preventing new connections?
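
For what it's worth, the CLOSE_WAIT count can also be sampled from inside the process with standard .NET APIs rather than running netstat by hand. A small sketch (the count is host-wide, not limited to the FIX process):

```csharp
// Counts CLOSE_WAIT TCP connections on the host via System.Net.NetworkInformation.
using System;
using System.Linq;
using System.Net.NetworkInformation;

class CloseWaitMonitor
{
    static void Main()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        int closeWait = connections.Count(c => c.State == TcpState.CloseWait);
        Console.WriteLine($"CLOSE_WAIT sockets: {closeWait}");

        // A steadily growing count suggests accepted sockets are not being closed
        // after the remote end hangs up, which can eventually starve the acceptor.
    }
}
```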

When we restart the FIX application, it starts behaving correctly until the next time.

Any suggestions on how we could debug the issue?

We run a pretty vanilla configuration, and use the SSL support built into quickfixn. When we used the old C++ wrapper, we used stunnel for SSL. We could try this to rule out any of the SSL pipeline/processing.

Any thoughts @gbirchmeier @mgatny?


ririvas commented May 18, 2023

@gbirchmeier - We also run into this issue often

@andrewrimmer

@ririvas may I ask what version of quickfixn you are using? Is it heavily locked down with a firewall?


ririvas commented May 18, 2023

@andrewrimmer

<PackageReference Include="QuickFix.Net.NETCore" Version="1.8.1" />

[Edit from @gbirchmeier: The above package is an unauthorized release created by a third party. Official QF/n packages start with "QuickFIXn."]

DataDictionary=./spec/fix/FIX44.xml

Our engine sits separately from our internal network but does have strict networking rules. It does sit on an Azure VM.

@andrewrimmer

@ririvas thanks a lot for sharing more info. We ourselves have recently placed further firewall restrictions on how accessible our FIX server is. We are now waiting to see if that has any effect. If you are pretty locked down yourselves, then maybe it won't really help.

Do you use the SSL layer in quickfixn or is yours separate?

We are using the library in a pretty standard & simple way, and the stability issues have only occurred since we ported from the old quickfix .NET wrapper (over the C++ version) to quickfixn. We did use stunnel as part of that legacy solution, which worked fine.

It would be fantastic to get to the bottom of the issues.


ririvas commented May 18, 2023

@andrewrimmer - We use the SSL layer in quickfixn but our counterparty uses stunnel on their end. And we're having a hard time reproducing the issue internally.

@andrewrimmer

@ririvas we have only observed this behaviour in production, and cannot reproduce internally.


ririvas commented May 30, 2023

@andrewrimmer Have you tried using stunnel instead of the built-in SSL support? We may consider that next.

Other notes

  • Most of the connection issues occur on a Sunday.
  • We have some use of threads on our end.
  • We use the MemoryStore instead of the FileStore
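
For context on that last point, the store choice is just the factory passed when wiring up the acceptor. A hedged sketch follows; "acceptor.cfg" and the no-op application are placeholders, and the factory class locations may differ between QuickFIX/n versions.

```csharp
// Sketch of acceptor wiring showing where MemoryStoreFactory vs FileStoreFactory comes in.
using QuickFix;

class NoOpApp : IApplication
{
    // Minimal no-op application callbacks, just to make the sketch self-contained.
    public void OnCreate(SessionID sessionId) { }
    public void OnLogon(SessionID sessionId) { }
    public void OnLogout(SessionID sessionId) { }
    public void ToAdmin(Message message, SessionID sessionId) { }
    public void FromAdmin(Message message, SessionID sessionId) { }
    public void ToApp(Message message, SessionID sessionId) { }
    public void FromApp(Message message, SessionID sessionId) { }
}

class Program
{
    static void Main()
    {
        var settings = new SessionSettings("acceptor.cfg");      // placeholder config path
        IApplication app = new NoOpApp();

        IMessageStoreFactory store = new MemoryStoreFactory();           // in-memory store
        // IMessageStoreFactory store = new FileStoreFactory(settings);  // file-backed alternative

        ILogFactory logs = new FileLogFactory(settings);
        var acceptor = new ThreadedSocketAcceptor(app, store, settings, logs);
        acceptor.Start();
    }
}
```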

@andrewrimmer

Thanks @ririvas.

Yeah, our next step would be switching back to stunnel to rule out the SSL layer.

We use the FileStore and our own elasticsearch log store.

We haven't noticed the issue reoccurring since we improved the security around the endpoint availability over the internet. That could be a coincidence as we sometimes go for weeks without issues, and then may get several consecutive days of issues.

To improve health reporting, we also have a monitor continually checking that you can ping/telnet to the FIX server port(s). We have observed that the FIX server will stop accepting new connections (cannot telnet/psping) while any existing sessions/connections remain fine. The problem then tends to occur when we get a new connection attempt, or when the session end/start causes reconnections; at that point it is completely broken. A full application restart fixes the issue, until the next time.
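
A minimal sketch of that kind of port-level probe, assuming placeholder host/port values:

```csharp
// Attempts a TCP connect to the FIX acceptor port with a short timeout.
// Host, port and timeout are illustrative placeholders.
using System;
using System.Net.Sockets;

class FixPortProbe
{
    static void Main()
    {
        bool up = CanConnect("fix.example.internal", 9876, timeoutMs: 3000);
        Console.WriteLine(up
            ? "Port is accepting connections"
            : "Port is NOT accepting connections - alert");
    }

    static bool CanConnect(string host, int port, int timeoutMs)
    {
        try
        {
            using (var client = new TcpClient())
            {
                // ConnectAsync + Wait gives a bounded connect attempt.
                var connect = client.ConnectAsync(host, port);
                return connect.Wait(timeoutMs) && client.Connected;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```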


ririvas commented Jul 9, 2023

Hey @andrewrimmer - we saw that the quickfixn code will catch errors and output them to the console. We weren't seeing these messages until we redirected console output. Now we can see the following:

Error accepting connection: Received an unexpected EOF or 0 bytes from the transport stream.

We assume it's coming from the line below:

[screenshot of the relevant catch block in the QuickFIX/n source]

Now we'll see if we can identify a cause from this message.
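
For anyone in the same position, a hedged sketch of the console redirection described above (the log file path is a placeholder):

```csharp
// Redirects Console output/error to a file so messages that QuickFIX/n writes to the
// console are captured even when the process has no attached console.
using System;
using System.IO;

class ConsoleRedirect
{
    static void Main()
    {
        var writer = new StreamWriter("quickfix-console.log", append: true) { AutoFlush = true };
        Console.SetOut(writer);
        Console.SetError(writer);

        // ... start the acceptor here; messages such as
        // "Error accepting connection: Received an unexpected EOF or 0 bytes from
        // the transport stream." will now end up in quickfix-console.log.
    }
}
```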

@gbirchmeier

Did either of you get closer to root-causing this?


andrewrimmer commented Jun 6, 2024 via email


gbirchmeier commented Jun 6, 2024

So it sounds like this issue is specific to Acceptors that use the built-in SSL connectivity. Thanks, that is very helpful.
