
Acceptor stops working after session end #763

Open

andrewrimmer opened this issue Mar 21, 2023 · 13 comments

andrewrimmer commented Mar 21, 2023

We have an intermittent issue in production, which causes quickfixn to stop working around the end of the session. The FIX server does not respond to connection attempts, and the FIX sessions are not created the following day. We have to restart our application for things to start working again.

I am not seeing any clues in our logs, and we are not getting any exceptions reported.

At the end of the session we get the following messages/events:

  • LOGOUT
  • Session FIXT.1.1:[redacted]->[redacted] disconnecting: Resetting...
  • Session reset: Out of SessionTime (Session.Next())

Then, on a day when it is working, we will see repeated connection/logon attempts until the sessions open again.

However, if it is not working we will see nothing else in the logs after end time. At this stage we have to restart the application to get things working again.

We are using a version of quickfixn from the master branch of September 2022.

Any idea how we would go about troubleshooting an issue like this?

Is there any more logging we can enable, to perhaps see what might be occurring internally?

If quickfixn throws an exception, how would we log and surface this?
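
For reference, one way to surface more of the engine's internal activity is a custom log factory. This is only a minimal sketch, assuming the ILog/ILogFactory interfaces as they appear in the 2022-era master branch (names and members may differ in newer releases); ForwardingLog is a placeholder, not code from this thread.

```csharp
// Sketch of a custom QuickFIX/n log that forwards internal events to your own sink.
// Assumes the ILog/ILogFactory shapes from the ~2022 master branch.
using System;
using QuickFix;

public class ForwardingLog : ILog
{
    private readonly SessionID _sessionId;
    public ForwardingLog(SessionID sessionId) { _sessionId = sessionId; }

    public void Clear() { }
    public void OnIncoming(string msg) => Write("IN ", msg);
    public void OnOutgoing(string msg) => Write("OUT", msg);
    public void OnEvent(string s)      => Write("EVT", s);
    public void Dispose() { }

    private void Write(string kind, string text)
    {
        // Replace with your own sink (file, Elasticsearch, etc.).
        Console.WriteLine($"{_sessionId} {kind} {text}");
    }
}

public class ForwardingLogFactory : ILogFactory
{
    public ILog Create(SessionID sessionId) => new ForwardingLog(sessionId);
}
```

The factory can then be passed as the log-factory argument when constructing the acceptor. Note that exceptions caught inside the engine's accept loop may only be written to the console (see the later comments in this thread), so capturing console output is worth doing as well.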

We are using a pretty standard setup of quickfix, with the in-built SSL, on .NET Framework running on Windows Server. We haven't encountered this problem in dev or test environments, but we have less variety of connections coming in.

Any help, greatly appreciated.


andrewrimmer commented May 11, 2023

Whilst this issue is often observed at session end, it looks like it isn't directly related to the session end itself.

It looks like Quickfixn gets into trouble during the day, and stops accepting new connections.

When netstat is run there are perhaps 150-200 CLOSE_WAIT connections. Could this be what is preventing new connections?
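
For what it's worth, the CLOSE_WAIT count can also be sampled from inside the process with standard .NET APIs rather than running netstat by hand. A small sketch (the count is host-wide, not limited to the FIX process):

```csharp
// Counts CLOSE_WAIT TCP connections on the host via System.Net.NetworkInformation.
using System;
using System.Linq;
using System.Net.NetworkInformation;

class CloseWaitMonitor
{
    static void Main()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        int closeWait = connections.Count(c => c.State == TcpState.CloseWait);
        Console.WriteLine($"CLOSE_WAIT sockets: {closeWait}");

        // A steadily growing count suggests accepted sockets are not being closed
        // after the remote end hangs up, which can eventually starve the acceptor.
    }
}
```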

When we restart the FIX application, it starts behaving correctly until the next time.

Any suggestions on how we could debug the issue?

We run a pretty vanilla configuration, and use the SSL support built into quickfixn. When we used the old C++ wrapper, we used stunnel for SSL. We could try this to rule out any of the SSL pipeline/processing.

Any thoughts @gbirchmeier @mgatny?


ririvas commented May 18, 2023

@gbirchmeier - We also run into this issue often

@andrewrimmer

@ririvas may I ask what version of quickfixn you are using? Is it heavily locked down with a firewall?


ririvas commented May 18, 2023

@andrewrimmer

<PackageReference Include="QuickFix.Net.NETCore" Version="1.8.1" />

[Edit from @gbirchmeier: The above package is an unauthorized release created by a third party. Official QF/n packages start with "QuickFIXn."]

DataDictionary=./spec/fix/FIX44.xml

Our engine sits separately from our internal network but does have strict networking rules. It does sit on an Azure VM.

@andrewrimmer

@ririvas thanks a lot for sharing more info. We ourselves have recently placed further firewall restrictions on how accessible our FIX server is. We are now waiting to see if that has any effect. If you are pretty locked down yourselves, then maybe it won't really help.

Do you use the SSL layer in quickfixn or is yours separate?

We are using the library in a pretty standard & simple way, and the stability issues have only occurred since we ported from the old quickfix .NET wrapper (over the C++ version) to quickfixn. We did use stunnel as part of that legacy solution, which worked fine.

It would be fantastic to get to the bottom of the issues.


ririvas commented May 18, 2023

@andrewrimmer - We use the SSL layer in quickfixn but our counterparty uses stunnel on their end. And we're having a hard time reproducing the issue internally.

@andrewrimmer

@ririvas we have only observed this behaviour in production, and cannot reproduce internally.


ririvas commented May 30, 2023

@andrewrimmer Have you tried using stunnel instead of the built-in SSL support? We may consider that next.

Other notes

  • Most of the connection issues occur on a Sunday.
  • We have some use of threads on our end.
  • We use the MemoryStore instead of the FileStore
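
For context on that last point, the store choice is just the factory passed when wiring up the acceptor. A hedged sketch follows; "acceptor.cfg" and the no-op application are placeholders, and the factory class locations may differ between QuickFIX/n versions.

```csharp
// Sketch of acceptor wiring showing where MemoryStoreFactory vs FileStoreFactory comes in.
using QuickFix;

class NoOpApp : IApplication
{
    // Minimal no-op application callbacks, just to make the sketch self-contained.
    public void OnCreate(SessionID sessionId) { }
    public void OnLogon(SessionID sessionId) { }
    public void OnLogout(SessionID sessionId) { }
    public void ToAdmin(Message message, SessionID sessionId) { }
    public void FromAdmin(Message message, SessionID sessionId) { }
    public void ToApp(Message message, SessionID sessionId) { }
    public void FromApp(Message message, SessionID sessionId) { }
}

class Program
{
    static void Main()
    {
        var settings = new SessionSettings("acceptor.cfg");      // placeholder config path
        IApplication app = new NoOpApp();

        IMessageStoreFactory store = new MemoryStoreFactory();           // in-memory store
        // IMessageStoreFactory store = new FileStoreFactory(settings);  // file-backed alternative

        ILogFactory logs = new FileLogFactory(settings);
        var acceptor = new ThreadedSocketAcceptor(app, store, settings, logs);
        acceptor.Start();
    }
}
```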

@andrewrimmer

Thanks @ririvas.

Yeah, our next step would be switching back to stunnel to rule out the SSL layer.

We use the FileStore and our own elasticsearch log store.

We haven't noticed the issue reoccurring since we improved the security around the endpoint availability over the internet. That could be a coincidence as we sometimes go for weeks without issues, and then may get several consecutive days of issues.

To improve health reporting, we also have a monitor continually checking that you can ping/telnet to the FIX server port(s). We have observed that the FIX server will stop accepting new connections (cannot telnet/psping) while any existing sessions/connections remain fine. The problem then tends to occur when we get a new connection attempt, or when the session end/start causes reconnections; at that point it is completely broken. A full application restart fixes the issue, until the next time.
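
A minimal sketch of that kind of port-level probe, assuming placeholder host/port values:

```csharp
// Attempts a TCP connect to the FIX acceptor port with a short timeout.
// Host, port and timeout are illustrative placeholders.
using System;
using System.Net.Sockets;

class FixPortProbe
{
    static void Main()
    {
        bool up = CanConnect("fix.example.internal", 9876, timeoutMs: 3000);
        Console.WriteLine(up
            ? "Port is accepting connections"
            : "Port is NOT accepting connections - alert");
    }

    static bool CanConnect(string host, int port, int timeoutMs)
    {
        try
        {
            using (var client = new TcpClient())
            {
                // ConnectAsync + Wait gives a bounded connect attempt.
                var connect = client.ConnectAsync(host, port);
                return connect.Wait(timeoutMs) && client.Connected;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```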


ririvas commented Jul 9, 2023

Hey @andrewrimmer - we saw that the quickfixn code will catch errors and output them to the console. We weren't seeing these messages until we redirected console output. Now we can see the following:

Error accepting connection: Received an unexpected EOF or 0 bytes from the transport stream.

We assume it's coming from the line below:

[screenshot of the relevant catch block in the QuickFIX/n source]

Now we'll see if we can identify a cause from this message.
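
For anyone in the same position, a hedged sketch of the console redirection described above (the log file path is a placeholder):

```csharp
// Redirects Console output/error to a file so messages that QuickFIX/n writes to the
// console are captured even when the process has no attached console.
using System;
using System.IO;

class ConsoleRedirect
{
    static void Main()
    {
        var writer = new StreamWriter("quickfix-console.log", append: true) { AutoFlush = true };
        Console.SetOut(writer);
        Console.SetError(writer);

        // ... start the acceptor here; messages such as
        // "Error accepting connection: Received an unexpected EOF or 0 bytes from
        // the transport stream." will now end up in quickfix-console.log.
    }
}
```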

@gbirchmeier

Did either of you get closer to root-causing this?


andrewrimmer commented Jun 6, 2024 via email


gbirchmeier commented Jun 6, 2024

So it sounds like this issue is specific to Acceptors that use the built-in SSL connectivity. Thanks, that is very helpful.
