
Fix race condition crash by removing unnecessary retain/release #75

Merged 1 commit into instacart:master on Jul 10, 2019

Conversation

@miketsprague (Contributor) commented Jun 25, 2019

What did you change and why?

  • Remove unnecessary retain/release logic in NTPConnection that was crashing after being released too many times in some race conditions.
  • Retaining here isn’t necessary. It was originally implemented as part of fixing an issue where HostResolver didn’t retain itself while resolving connections after the initial attempt timed out. From what I can tell, the retain in NTPConnection was just added to be safe. @msanders if you’re still active and somehow remember, feel free to correct me if I’m wrong!

Potential risks introduced?

  • If I'm missing something, or if NTPConnection is used elsewhere in your app, it's possible that a socket callback could cause a crash now that the socket isn't retaining the connection.
  • If query(addresses: [SocketAddress], host: String) is called when connections is not cleared, that could cause the same crash as above. I cannot find a case where this happens, and it's in a private extension.

What tests were performed (include steps)?

  • See steps to reproduce below. I've performed those, and additionally I will deploy my own fork in my codebase and will report back the stability results.

Checklist

  • Unit/UI tests have been written (if necessary)
  • Manually tested

Additional Information

Why don’t we need to increment the retain count of NTPConnection for the socket’s callback?

NTPConnections are used by NTPClient, and NTPClient is used by the singleton TrueTimeClient. TrueTimeClient maintains a strong reference to NTPClient, and NTPClient maintains a strong reference to [NTPConnection]. When the connections are removed, each connection is synchronously closed (which removes it from the socket callback), and then the array is cleared. At no point is it necessary to increment the retain count from the sockets, as long as we’re guaranteed that close will be called before we release our strong reference, which it is.
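
To make that ownership argument concrete, here is a minimal sketch of the close-before-release ordering; the type and method names are illustrative, not the actual TrueTime implementation:

```swift
// Minimal sketch (simplified, assumed names; not the real TrueTime types).
final class Connection {
    private var socketCallbackInstalled = false

    func start() {
        // The old code also did an Unmanaged.passRetained(self) here so the
        // socket callback could keep the connection alive; with the teardown
        // ordering below, that extra +1 is unnecessary.
        socketCallbackInstalled = true
    }

    func close() {
        // Runs before the owner drops its strong reference, so the socket can
        // never call back into a deallocated connection.
        socketCallbackInstalled = false
    }
}

final class Client {
    private var connections: [Connection] = []

    func finish() {
        // Synchronously close every connection (removing its socket callback),
        // *then* drop the only strong references by clearing the array.
        connections.forEach { $0.close() }
        connections.removeAll()
    }
}
```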

History of this bug

How do you reproduce the new crash?

  • Create a sample project where you start the TrueTimeClient instance and call fetchIfNeeded (a sketch of this follows the list).
  • Edit the timeout of the NTPConnection objects to be really small (e.g. 0.0001) in NTPConnection.swift line 28. If you feel more ambitious, you could fire up the Network Link Conditioner and set the latency to 8 seconds or more.
  • In NTPConnection’s complete, after release() is called, block the thread for roughly the length of time it’ll take the network response to come back. I just added Thread.sleep(forTimeInterval: 0.5) on line 207.
  • Run the code. It will crash, assuming the timeoutError is hit.
  • Sample NTPConnection.swift if you want to try for yourself.
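
If it helps, the sample-project side of step 1 looks roughly like this; treat it as a sketch, since the exact fetchIfNeeded signature can differ between TrueTime versions:

```swift
import TrueTime

// Start the client and kick off a fetch. With the NTPConnection timeout
// shrunk and the Thread.sleep added in complete() (steps 2 and 3), the
// timeout path fires first and the late socket callback arrives afterwards.
let client = TrueTimeClient.sharedInstance
client.start()

client.fetchIfNeeded { result in
    // Under the race described below we never get a clean result: the app
    // crashes once the over-released NTPConnection is touched again.
    print("TrueTime result:", result)
}
```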

Okay but why does this happen?

There are actually two stack traces: one in NTPClient, and one in NTPConnection. The latter is the more prevalent one by about 10x, so we’ll start with that. Note: this is probably pretty hard to follow, and I'm not 100% sure about the correctness. All of the logic below is made moot by this PR, though, and these race conditions no longer exist. A stripped-down code sketch of the over-release follows the list of steps.

Assume that we are running on a slow internet where the roundtrip time of the requests is about 8 seconds, conveniently the same as the default timeout :P

  • NTPConnections are created.
  • When they are started, the reference count of each is incremented by 1 with passRetained. The socket completion callback (dataCallback) is then enqueued to be executed on the main thread.
  • The 8-second timeout hits (timeoutError), and complete(.failure(error)) is called. This timer callback happens on the lock queue (the instacart ntp connection queue).
  • complete() calls close(), but notice that in the implementation of close, all of the code is executed asynchronously on the lockQueue. Calling lockQueue.async while already on lockQueue defers that block until after the current block finishes. This is important because it means the socket callback isn't actually removed yet.
  • Back in complete(), since this is specifically a timeout error, onComplete is called on the callbackQueue (the instacart ntp client queue, which is a different thread).
  • Finally, release() is called if callbackPending is true (which it is here).
  • While the above is happening, the NTPConnection completion handler is still active, because it’s been appended to the end of the current run loop.
  • The socket success completion comes in at some point during this time and is run on the main thread (dataCallback). release() is then called again. Not good. We’ve now accidentally released this twice.
  • The next connection in the queue will eventually then call onComplete.
  • Our ref count for the original connection is one less than what it’s supposed to be. Go look at the definition for onComplete on line 35. Since there’s both the reference to connections in NTPClient and the connection here, the ref count for the NTPConnection is still one. We can call progress(connection, result) without any issue.
  • However, once throttleConnections is called, and the above extra reference in the closure is released, $0.canRetry now causes a crash because the reference count is 0.
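
Here is the stripped-down, hypothetical sketch of the over-release described above; the real NTPConnection is more involved, and the names and structure here are simplified for illustration:

```swift
import Foundation

// Not the real NTPConnection: a compressed illustration of how one
// passRetained (+1) can be balanced by release() from two different paths.
final class RacyConnection {
    private let lockQueue = DispatchQueue(label: "ntp.connection.lock")
    private var callbackPending = false

    func start() {
        callbackPending = true
        _ = Unmanaged.passRetained(self)   // +1 so the socket callback stays valid
    }

    // Fires on lockQueue when the timer hits the 8-second timeout.
    func timeoutError() {
        complete()
    }

    // Fires on the main thread when the socket finally delivers data.
    func dataCallback() {
        complete()
    }

    private func complete() {
        // close() only *enqueues* the callback removal on lockQueue, so a
        // dataCallback already scheduled on main can still run after the
        // timeout path has completed.
        lockQueue.async { /* CFSocket teardown, remove callback, ... */ }

        // The flag is not checked-and-cleared atomically across the two
        // threads, so both paths can observe `true` and release, leaving the
        // retain count one lower than it should be.
        if callbackPending {
            callbackPending = false
            Unmanaged.passUnretained(self).release()   // -1, possibly twice
        }
    }
}
```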

I’m less sure about the NTPClient crash. I think it might be the same as above, except in that race condition the NTPConnection has managed to be released twice (maybe if it was also cancelled?) and the reference count is 0 for the progress callback. We’d expect it to crash then on line 186 of NTPClient.

tl;dr: several different threads handle several connections at the same time, each enqueuing different work, with logic that retains and then releases at different times depending on state (and potentially on different threads). This is hard to follow and leads to race conditions. I do not see a reason for an extra retain and release here at all, since connections are retained by NTPClient and only released after synchronously closing each connection and removing its callback.

@fddecc (Contributor) commented Jun 25, 2019

@miketsprague thanks for looking into this - we've had limited time to devote to this issue. I'd like @msanders to also take a look if possible, since he will have most context.

Our apologies for taking so long to resolve this issue.

@miketsprague (Contributor, Author) commented

Just reporting back here to share results of what I've seen in our app.

With 5.0.2, we had ~3,000 sessions crash from TrueTime over a two-week period. In the latest version of our app with the changes from this PR included, we've seen 0 TrueTime crashes, and haven't seen an increase in OOM sessions. Assuming there are no weird side effects that I haven't caught, this has fixed the issue and is working as intended.

@Ariandr commented Jul 2, 2019

Hi @miketsprague, that's great! The report is very promising, thank you :)

Hi @00FA9A @msanders
Have you reviewed the PR? I believe it should be merged if everything is okay.

@fddecc (Contributor) left a comment


Hi @miketsprague, and once again thank you for the PR and detailed description! I was hoping to get more clarity on why this was used in the first place, but as you suggest, it seems like it was just to be extra safe. Thank you for providing a solution! I'll go ahead and approve this, merge it, and create 5.0.3 shortly.

@fddecc merged commit ab9066c into instacart:master on Jul 10, 2019
@msanders (Contributor) commented Jul 12, 2019

Sorry for missing this; it looks like the mention notification got routed to the wrong email. From what I can tell this was just precautionary, although it's been a while since I looked at the code. The only thing I would confirm is that this isn't causing any memory leaks in Instruments, but it sounds like you already did that with your OOM monitoring. Thank you for the great contribution @miketsprague!
