
Fix race condition crash by removing unnecessary retain/release #75

Merged 1 commit into instacart:master on Jul 10, 2019

Conversation

@miketsprague (Contributor) commented Jun 25, 2019

What did you change and why?

  • Remove unnecessary retain/release logic in NTPConnection that was crashing after being released too many times in some race conditions.
  • Retaining here isn’t necessary. It was originally implemented as part of fixing an issue where HostResolver didn’t retain itself while resolving connections after the initial attempt timed out. From what I can tell, the retain in NTPConnection was just added to be safe. @msanders if you’re still active and somehow remember, feel free to correct me if I’m wrong!

Potential risks introduced?

  • If I'm missing something, or if NTPConnection is used elsewhere in your app, it's possible that a socket callback could cause a crash now that the socket isn't retaining the connection.
  • If query(addresses: [SocketAddress], host: String) is called when connections is not cleared, that could cause the same crash as above. I cannot find a case where this happens, and it's in a private extension.

What tests were performed (include steps)?

  • See steps to reproduce below. I've performed those, and additionally I will deploy my own fork in my codebase and will report back the stability results.

Checklist

  • Unit/UI tests have been written (if necessary)
  • Manually tested

Additional Information

Why don’t we need to increment the retain count of NTPConnection for the socket’s callback?

NTPConnections are used by NTPClient, and NTPClient is used by the singleton TrueTimeClient. TrueTimeClient maintains a strong reference to NTPClient, and NTPClient maintains a strong reference to [NTPConnection]. When the connections are removed, each connection is synchronously closed (which removes it from the socket callback), and then the array is cleared. At no point is it necessary to increment the retain count from the sockets, as long as we’re guaranteed that close will be called before we release our strong reference, which it is.
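
To make that ownership argument concrete, here is a minimal sketch of the close-before-release ordering; the type and method names are illustrative, not the actual TrueTime implementation:

```swift
// Minimal sketch (simplified, assumed names; not the real TrueTime types).
final class Connection {
    private var socketCallbackInstalled = false

    func start() {
        // The old code also did an Unmanaged.passRetained(self) here so the
        // socket callback could keep the connection alive; with the teardown
        // ordering below, that extra +1 is unnecessary.
        socketCallbackInstalled = true
    }

    func close() {
        // Runs before the owner drops its strong reference, so the socket can
        // never call back into a deallocated connection.
        socketCallbackInstalled = false
    }
}

final class Client {
    private var connections: [Connection] = []

    func finish() {
        // Synchronously close every connection (removing its socket callback),
        // *then* drop the only strong references by clearing the array.
        connections.forEach { $0.close() }
        connections.removeAll()
    }
}
```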

History of this bug

How do you reproduce the new crash?

  • Create a sample project where you start the TrueTimeClient instance and call fetchIfNeeded (a sketch of this follows the list).
  • Edit the timeout of the NTPConnection objects to be really small (e.g. 0.0001) in NTPConnection.swift line 28. If you feel more ambitious, you could fire up the Network Link Conditioner and set the latency to 8 seconds or more.
  • In NTPConnection’s complete, after release() is called, block the thread for roughly the length of time it’ll take the network response to come back. I just added Thread.sleep(forTimeInterval: 0.5) on line 207.
  • Run the code. It will crash, assuming the timeoutError is hit.
  • Sample NTPConnection.swift if you want to try for yourself.
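
If it helps, the sample-project side of step 1 looks roughly like this; treat it as a sketch, since the exact fetchIfNeeded signature can differ between TrueTime versions:

```swift
import TrueTime

// Start the client and kick off a fetch. With the NTPConnection timeout
// shrunk and the Thread.sleep added in complete() (steps 2 and 3), the
// timeout path fires first and the late socket callback arrives afterwards.
let client = TrueTimeClient.sharedInstance
client.start()

client.fetchIfNeeded { result in
    // Under the race described below we never get a clean result: the app
    // crashes once the over-released NTPConnection is touched again.
    print("TrueTime result:", result)
}
```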

Okay but why does this happen?

There are actually two stack traces: one in NTPClient, and one in NTPConnection. The latter is the more prevalent one by about 10x, so we’ll start with that. Note: this is probably pretty hard to follow, and I'm not 100% sure about the correctness. All of the logic below is made moot by this PR, though, and these race conditions no longer exist. A stripped-down code sketch of the over-release follows the list of steps.

Assume that we are running on a slow internet where the roundtrip time of the requests is about 8 seconds, conveniently the same as the default timeout :P

  • NTPConnections are created.
  • When they are started, the reference count of each is incremented by 1 with passRetained. The socket completion callback (dataCallback) is then enqueued to be executed on the main thread.
  • The 8-second timeout hits (timeoutError), and complete(.failure(error)) is called. This timer callback happens on the lock queue (the instacart ntp connection queue).
  • complete() calls close(), but notice that in the implementation of close, all of the code is executed asynchronously on the lockQueue. Calling lockQueue.async while already on lockQueue defers that block until after the current block finishes. This is important because it means the socket callback isn't actually removed yet.
  • Back in complete(), since this is specifically a timeout error, onComplete is called on the callbackQueue (the instacart ntp client queue, which is a different thread).
  • Finally, release() is called if callbackPending is true (which it is here).
  • While the above is happening, the NTPConnection completion handler is still active, because it’s been appended to the end of the current run loop.
  • The socket success completion comes in at some point during this time and is run on the main thread (dataCallback). release() is then called again. Not good. We’ve now accidentally released this twice.
  • The next connection in the queue will eventually then call onComplete.
  • Our ref count for the original connection is one less than what it’s supposed to be. Go look at the definition for onComplete on line 35. Since there’s both the reference to connections in NTPClient and the connection here, the ref count for the NTPConnection is still one. We can call progress(connection, result) without any issue.
  • However, once throttleConnections is called, and the above extra reference in the closure is released, $0.canRetry now causes a crash because the reference count is 0.
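
Here is the stripped-down, hypothetical sketch of the over-release described above; the real NTPConnection is more involved, and the names and structure here are simplified for illustration:

```swift
import Foundation

// Not the real NTPConnection: a compressed illustration of how one
// passRetained (+1) can be balanced by release() from two different paths.
final class RacyConnection {
    private let lockQueue = DispatchQueue(label: "ntp.connection.lock")
    private var callbackPending = false

    func start() {
        callbackPending = true
        _ = Unmanaged.passRetained(self)   // +1 so the socket callback stays valid
    }

    // Fires on lockQueue when the timer hits the 8-second timeout.
    func timeoutError() {
        complete()
    }

    // Fires on the main thread when the socket finally delivers data.
    func dataCallback() {
        complete()
    }

    private func complete() {
        // close() only *enqueues* the callback removal on lockQueue, so a
        // dataCallback already scheduled on main can still run after the
        // timeout path has completed.
        lockQueue.async { /* CFSocket teardown, remove callback, ... */ }

        // The flag is not checked-and-cleared atomically across the two
        // threads, so both paths can observe `true` and release, leaving the
        // retain count one lower than it should be.
        if callbackPending {
            callbackPending = false
            Unmanaged.passUnretained(self).release()   // -1, possibly twice
        }
    }
}
```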

I’m less sure about the NTPClient crash. I think it might be the same as above, except in that race condition the NTPConnection has managed to be released twice (maybe if it was also cancelled?) and the reference count is 0 for the progress callback. We’d expect it to crash then on line 186 of NTPClient.

tl;dr: several different threads handle several connections at the same time, each enqueuing different work, with logic that retains and then releases at different times depending on state (and potentially on different threads). This is hard to follow and leads to race conditions. I do not see a reason for an extra retain and release here at all, since connections are retained by NTPClient and only released after synchronously closing each connection and removing its callback.

@fddecc (Contributor) commented Jun 25, 2019

@miketsprague thanks for looking into this - we've had limited time to devote to this issue. I'd like @msanders to also take a look if possible, since he will have most context.

Our apologies for taking so long to resolve this issue.

@miketsprague (Contributor, Author) commented

Just reporting back here to share results of what I've seen in our app.

With 5.0.2, we had ~3,000 sessions crash from TrueTime over a two-week period. In the latest version of our app with the changes from this PR included, we've seen 0 TrueTime crashes, and haven't seen an increase in OOM sessions. Assuming there are no weird side effects that I haven't caught, this has fixed the issue and is working as intended.

@Ariandr commented Jul 2, 2019

Hi @miketsprague, that's great! The report is very promising, thank you :)

Hi @00FA9A @msanders
Have you reviewed the PR? I believe it should be merged if everything is okay.

@fddecc (Contributor) left a comment


Hi @miketsprague, and once again thank you for the PR and detailed description! I was hoping to get more clarity on why this was used in the first place, but as you suggest, it seems like it was just to be extra safe. Thank you for providing a solution! I'll go ahead and approve this, merge it, and create 5.0.3 shortly.

@fddecc merged commit ab9066c into instacart:master on Jul 10, 2019
@msanders (Contributor) commented Jul 12, 2019

Sorry for missing this; it looks like the mention notification got routed to the wrong email. From what I can tell this was just precautionary, although it's been a while since I looked at the code. The only thing I would confirm is that this isn't causing any memory leaks in Instruments, but it sounds like you already did that with your OOM monitoring. Thank you for the great contribution @miketsprague!
