Crash : SIGSEGV /SEGV_MAPERR #1131
Comments
The most likely cause here, based on where and how it's crashing, is some kind of native heap corruption, either in the app, okhttp, or the version of Conscrypt shipping as a Mainline module. That module is identical across Android 11 through 14, so if you're only seeing crashes on 13 then again it points to heap corruption, as the native allocator changed between 12 and 13. If you can consistently reproduce this, and are willing to share the code, then the best way forward is to open an Android bug at https://issuetracker.google.com/issues/new?component=190923&template=841312 and then we can try to debug it.
Is this repo accepting contributions or not?
Sure :)
Closing this issue here though as it is now on the Android issue tracker.
This issue is happening on both Android 12 and 13 devices. It does not seem to be related to changes in the native heap allocator.
Do you have any idea if the library is getting into some bad state during a network call? Where can we see the Conscrypt releases? This issue started happening in March 2023. Can we debug this better? I am not able to get a repro locally though.
Right, but the root cause of that could be any kind of heap corruption. Typically that happens if there is a concurrency bug between threads, or a native pointer gets re-used after its memory has been freed. The same platform version of Conscrypt runs on Android 11 through 14, so the fact you're only seeing crashes on 12 and 13 is unexpected. The fact that the issue started in March makes me think some other component is corrupting the heap, as we didn't ship any Conscrypt changes in February or March. However the fact it's consistently crashing in
Without any kind of repro steps it's going to be very difficult.
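To make the "pointer re-used after its memory has been freed" failure mode concrete, here is a minimal sketch in plain Java. It is only an analogy: `ToyHeap` and its methods are hypothetical names invented for this illustration, not Conscrypt or BoringSSL code, and a real native use-after-free is undefined behaviour rather than a tidy wrong answer. The point it tries to show is why such bugs tend to crash far away from the code that caused them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy allocator, for illustration only: handles are slot indices and freed slots are recycled. */
final class ToyHeap {
    private final List<String> slots = new ArrayList<>();
    private final Deque<Integer> freeList = new ArrayDeque<>();

    /** Stores a value and returns its handle, reusing a freed slot when one exists. */
    int alloc(String value) {
        Integer reused = freeList.poll();
        if (reused != null) {
            slots.set(reused, value);
            return reused;
        }
        slots.add(value);
        return slots.size() - 1;
    }

    /** Releases a slot. Any handle still held by a caller is now stale. */
    void free(int handle) {
        slots.set(handle, null);
        freeList.push(handle);
    }

    /** Reads through a handle without checking whether it is still valid. */
    String read(int handle) {
        return slots.get(handle);
    }

    public static void main(String[] args) {
        ToyHeap heap = new ToyHeap();
        int ctx = heap.alloc("ssl-context");
        heap.free(ctx);                     // the owner frees the object...
        heap.alloc("unrelated-buffer");     // ...and the allocator recycles the slot
        System.out.println(heap.read(ctx)); // the stale handle now sees someone else's data
    }
}
```

In this toy model the stale read succeeds silently and returns unrelated data, which is roughly why the eventual crash site says little about which component did the corrupting.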
Thanks for responding. Is there any other way we could debug this? Any thoughts? There is one change that is specific to Android 13. There was a change in the garbage collection algorithm. Could this be causing it? But we also observed huge volumes of this crash in Android 12 as well.
@prbprbprb Could this crash be a manifestation of OOMs in native heap?
Thanks for the update! I believe the userfaultfd GC is active on (at least some) Android 12 devices now, so what you're seeing does suggest it's related to low memory conditions and aggressive GCing. What's interesting is that Another possibility is a good old-fashioned concurrency bug exacerbated by the device being slowed down by GCs. When a TLS connection is established the certificate is set up in a convoluted way... The native TLS code calls back into Java to select a certificate, and that callback calls further JNI code to set the certificate on the native
Chatted to ART and it doesn't seem to be related to userfaultfd GC, or the stack trace would be different. But the finalizer for
We don't have a definitive root cause for google#1131 but it seems like either use-after-free (e.g. finalizer ordering) or concurrency issue, so:
1. Make the native pointer private and move all accesses into AbstractSessionContext
2. Zero it out on finalisation
3. Add locking.
Note we only need a read lock for the sslNew() path as this is thread safe and doesn't modify the native SSL_CTX aside from atomic refcounts.
The above change is broadly equivalent to turning the native pointer into a NativeRef, which would mean its finalizer shouldn't run until after the AbstractSessionContext object is unreachable, but (currently) NativeRefs don't zero out the native address on finalization.
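The three steps in the commit message above can be sketched roughly as follows. This is not the actual Conscrypt change: `GuardedNativeContext`, `newSsl()`, `free()` and the choice of `IllegalStateException` are all hypothetical stand-ins, and the "native" work is faked with arithmetic so the example runs without JNI. It only aims to show the shape of the pattern: a private handle, zeroed on release, with a read lock on the path that merely uses the handle and a write lock on the path that invalidates it.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative owner of a native handle. All names are hypothetical, not Conscrypt's. */
final class GuardedNativeContext {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Step 1: the handle is private, so no other class can cache or reuse it.
    private long nativePtr;

    GuardedNativeContext(long nativePtr) {
        this.nativePtr = nativePtr;
    }

    /** Read path (analogous to sslNew()): many threads may hold the read lock at once. */
    long newSsl() {
        lock.readLock().lock();
        try {
            if (nativePtr == 0) {
                // Assumption: fail with a Java exception instead of passing 0 to native code.
                throw new IllegalStateException("native context already freed");
            }
            // A real implementation would call into JNI here; we fake a derived handle.
            return nativePtr + 1;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Write path: releases the handle at most once and zeroes it (steps 2 and 3). */
    void free() {
        lock.writeLock().lock();
        try {
            if (nativePtr != 0) {
                // A real implementation would release the native object here.
                nativePtr = 0;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        GuardedNativeContext ctx = new GuardedNativeContext(41L);
        System.out.println(ctx.newSsl()); // uses the live handle
        ctx.free();
        try {
            ctx.newSsl();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape, a use of the handle after release surfaces as a catchable Java exception rather than a SIGSEGV inside native code, because the write lock prevents `free()` from running while any reader is still inside `newSsl()`.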
Thanks for merging a possible fix for the above issue. @prbprbprb Assuming that the crash is happening due to the concurrency issue of AbstractSessionContext, I am curious to know why this would be specific to Android 12 and 13 devices. Could it be related to the userfaultfd GC algo in any way, or to high memory usage?
@prbprbprb Gentle ping on the above query
Oh, sorry, I missed the previous comment! #1154 (and also #1157 and #1164) are planned to go out in the November Mainline build, that is they'll start reaching devices at the start of November and should be fully rolled out by the end of that month. Non-Mainline devices (e.g. Android Go) won't get the fix then, but the next time their vendor sends an OTA... But the fixes apart from #1164 are already in AOSP for them and #1164 should land in AOSP today. For the second part of your question (root cause), I'm frankly not sure because we haven't managed to reproduce the issues. However, I suspect these bugs have been causing crashes forever, just at a frequency low enough that nobody noticed, and then recent ART changes (e.g. GC patterns) meant we started seeing them more often. The long term fix here is to use
Since we are closely tracking this fix, could you please help me in monitoring the rollout of the November mainline build? How could we find the devices / the dates on which the mainline build is released? @prbprbprb
I'm not sure there's a public source of that information but I'll try and find out. Very approximately though, the first few weeks are taken up with "canary" rollouts to detect issues, then there's a progressive rollout to 50%, 99% and the last percent only get updated towards the very end of the month.
Hi @prbprbprb, Can we consider that the mainline build with the fix was merged and is available on at least the Android 13 phones? The crash does not seem to be showing a downward trend in Play console at least. The article below mentions updates to only 2 Mainline components: https://source.android.com/docs/security/bulletin/2023-11-01 Does that mean the conscrypt lib update was not included? It would be great if you could share any resource around the mainline build release timeline/notes. Thanks
Ah, that note is a bit confusing. You're linking the release notes for the November Security Bulletin, which goes out as an OTA update because it needs to be able to update components anywhere in the Android platform. But what the release notes are saying is that the fixes for those two CVEs are going out as part of a Mainline update, rather than with the security bulletin OTA[1]. There are no security fixes for Conscrypt in the November bulletin, so it isn't mentioned. Meanwhile, it appears that the November Mainline train is still in its canary phase due to the US Thanksgiving holidays, which means it is on less than 2% of Mainline devices (maybe even less than that), which I wasn't expecting... It looks to me like it's supposed to ramp up to 99% by the end of this week, so if you don't hear any more from me by Friday then please ping the issue again. [1] It's not really feasible for OTAs to update Mainline modules, or for Mainline updates to update non-Mainline components.
Hi @prbprbprb, Thanks a lot for the detailed explanation to my queries. Could you please confirm if the mainline build rollout is 100% now? Is there any way we could check if devices have received this update? (Any document?) One update: The native crash is translating to a Java crash now (which we could catch with a try-catch). Please let me know if there is a fix for this that you are aware of or any possible cause.
Thanks
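For readers wondering what "catch with a try-catch" might look like in practice, here is a minimal sketch. The thread does not say which exception type is thrown, so this catches both `IOException` and `RuntimeException` as an assumption; `IoCall`, `GuardedCall` and `withRetry` are hypothetical names, and retrying is only one possible reaction, not something recommended in the thread.

```java
import java.io.IOException;

/** Hypothetical functional interface for a network call that may throw IOException. */
@FunctionalInterface
interface IoCall<T> {
    T run() throws IOException;
}

/** Illustrative wrapper, not part of okhttp or Conscrypt. */
final class GuardedCall {
    private GuardedCall() {}

    /**
     * Runs the call, retrying immediately (no backoff) if it throws.
     * Assumption: the failure may surface as either IOException or RuntimeException.
     */
    static <T> T withRetry(IoCall<T> call, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.run();
            } catch (IOException | RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IOException("call failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 2) {
                throw new IllegalStateException("simulated TLS setup failure");
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A wrapper like this only masks the symptom at the call site; it does not address whatever leaves the session context in a bad state.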
@prbprbprb Gentle reminder on the above query.
Hi @prbprbprb Can you please help with this?
@prbprbprb Gentle reminder on the above query.
Sorry, got lost in the Christmas backlog! On the plus side, we did indeed fix the code path causing the native crashes, and now we have a Java stack trace to work with. On the minus side this situation shouldn't be possible...
The root cause exception is because a socket factory is trying to create a new SSL session for a new socket but the native pointer to its SSL session context is 0.
Every
=> There is no way to create an (if the code creating the
Since #1154 the native pointer is never shared outside the class and all accesses are synchronized.
=> There is no way for it to become 0 due to concurrency bugs
=> The only way for the native pointer to become 0 is through finalisation.
The
=> As the crash happens during socket creation there should be no way the
There's probably a flaw in my reasoning but I'm failing to see it. :/ The crash is too consistent for an ART bug, and so far this is the only report of it that I'm aware of... Is it possible your app is doing anything unusual with reflection around
The app crashes with the exception below.
Using okhttp version: 3.12.13
The crash happens only on Android 13 devices while making a network call.
Any pointers on why this is happening only on specific devices?
Please let me know if any additional details are required to debug this further.
pid: 0, tid: 3268 >>> com.example.app <<<
backtrace:
#00 pc 0x0000000000038600 /apex/com.android.conscrypt/lib64/libssl.so (bssl::ssl_cert_dup(bssl::CERT*)+68)
#1 pc 0x000000000003f984 /apex/com.android.conscrypt/lib64/libssl.so (SSL_new+484)
#2 pc 0x000000000002212c /apex/com.android.conscrypt/lib64/libjavacrypto.so (NativeCrypto_SSL_new(_JNIEnv*, _jclass*, long, _jobject*)+24)
#3 pc 0x0000000000461554 /apex/com.android.art/lib64/libart.so (art_quick_generic_jni_trampoline+148)
#4 pc 0x0000000000209a9c /apex/com.android.art/lib64/libart.so (nterp_helper+1948)
#5 pc 0x0000000000024644 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.NativeSsl.newInstance+12)
#6 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#7 pc 0x000000000001983c /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngine.newSsl)
#8 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#9 pc 0x000000000001b0e6 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngine.+94)
#10 pc 0x000000000020a254 /apex/com.android.art/lib64/libart.so (nterp_helper+3924)
#11 pc 0x0000000000018822 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngineSocket.newEngine+54)
#12 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#13 pc 0x0000000000018d68 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngineSocket.+52)
#14 pc 0x000000000020a958 /apex/com.android.art/lib64/libart.so (nterp_helper+5720)
#15 pc 0x0000000000021814 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.Java8EngineSocket.)
#16 pc 0x000000000020a958 /apex/com.android.art/lib64/libart.so (nterp_helper+5720)
#17 pc 0x00000000000360ec /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.Platform.createEngineSocket+16)
#18 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#19 pc 0x0000000000031c8c /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.OpenSSLSocketFactoryImpl.createSocket+84)
#20 pc 0x00000000026cc1c4 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.connectTls+164)
#21 pc 0x00000000026cd3f8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.establishProtocol+440)
#22 pc 0x00000000026cdedc /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.connect+1884)
#23 pc 0x00000000025bae44 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.findConnection+1812)
#24 pc 0x00000000025bb44c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.findHealthyConnection+92)
#25 pc 0x00000000025bbbf8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.newStream+280)
#26 pc 0x00000000026cb940 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.ConnectInterceptor.intercept+224)
#27 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#28 pc 0x00000000026d2618 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+104)
#29 pc 0x00000000026cb03c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.cache.CacheInterceptor.intercept+1468)
#30 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#31 pc 0x00000000026d2618 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+104)
#32 pc 0x00000000026d0820 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.BridgeInterceptor.intercept+4288)
#33 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#34 pc 0x00000000026d4e5c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept+700)
#35 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#36 pc 0x00000000026c97b8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.RealCall.getResponseWithInterceptorChain+3528)
#37 pc 0x00000000026c6ea0 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.RealCall$AsyncCall.execute+128)
#38 pc 0x00000000025ad87c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.NamedRunnable.run+124)
#39 pc 0x0000000000588960 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor.runWorker+976)
#40 pc 0x0000000000585b48 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor$Worker.run+72)
#41 pc 0x00000000003fe840 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.lang.Thread.run+80)
#42 pc 0x0000000000457b6c /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+556)
#43 pc 0x0000000000484e54 /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+156)
#44 pc 0x0000000000484b20 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithJValuesart::ArtMethod*(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, jvalue const*)+400)
#45 pc 0x00000000005ce334 /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1684)
#46 pc 0x00000000000b6668 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208)
#47 pc 0x00000000000532cc /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)