Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext is flaky #11678

Open
ejona86 opened this issue Nov 8, 2024 · 10 comments
Labels
highly flaky Issue is for a test that is crying wolf
Milestone

Comments

@ejona86
Copy link
Member

ejona86 commented Nov 8, 2024

@erm-g
https://github.com/grpc/grpc-java/actions/runs/11745673719/job/32723647320?pr=11673

XdsSecurityClientServerTest > tlsClientServer_useSystemRootCerts_validationContext[enableSpiffe=true] FAILED
    io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
        at app//io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
        at app//io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
        at app//io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
        at app//io.grpc.testing.protobuf.SimpleServiceGrpc$SimpleServiceBlockingStub.unaryRpc(SimpleServiceGrpc.java:340)
        at app//io.grpc.xds.XdsSecurityClientServerTest.unaryRpc(XdsSecurityClientServerTest.java:668)
        at app//io.grpc.xds.XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext(XdsSecurityClientServerTest.java:238)

        Caused by:
        javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
            at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1927)
            at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:848)
            at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
            at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1131)
            at io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:975)
            at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1511)
            at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
            at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
            at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
            at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
            at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
            at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
            at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799)
            at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501)
            at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
            at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
            at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
            at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
            at java.base/java.lang.Thread.run(Thread.java:840)

            Caused by:
            sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
                at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
                at java.base/sun.security.validator.Validator.validate(Validator.java:264)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:285)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
                at io.netty.handler.ssl.EnhancingX509ExtendedTrustManager.checkServerTrusted(EnhancingX509ExtendedTrustManager.java:69)
                at io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:235)
                at io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:797)
                at io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
                at io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
                at io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1533)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1505)
                at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1651)
                at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1497)
                ... 19 more

                Caused by:
                sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
                    at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
                    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:434)

3079 tests completed, 1 failed, 7 skipped
                    ... 34 more

Given this code just went in, seems highly flaky.

@ejona86 ejona86 added the highly flaky Issue is for a test that is crying wolf label Nov 8, 2024
@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

@ejona86 how these runs are executed (how to reproduce it locally)?
I just ran the whole test class and then the failing test itself locally (cloudtop) few thousand times and didn't observe any issues.

@ejona86
Copy link
Member Author

ejona86 commented Nov 9, 2024

This run was in Github Actions, which runs gradle build.

Testing with blaze for the --runs_per_test, I saw a few different flakes with a total flake rate around 1%, all with useSystemRootCerts but a mix of enableSpiffe of true and false. Importantly, with 1000 runs I didn't see any flakes without #11627 nor any flakes with --test_filter='XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext'. So somehow the tests are interacting with each other or causing a different execution speed.

(While it is a 1% flake rate with blaze, those are different machines than Github Actions, so it may be a much higher flake rate on Github Actions.)

http://sponge2/45c92135-104b-49fd-92cb-02899959ef7b
http://sponge2/510a9a83-1283-47a4-adda-06ae17ce6918

I'm suspected @Parameterized is the trigger more than the new spiffe code. To confirm, I added parameterized to the test without PR #11627:

--- grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java	2024-10-29 09:51:25.000000000 -0700
+++ grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java	2024-11-09 06:59:07.000000000 -0800
@@ -96,12 +96,13 @@
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
+import org.junit.runners.Parameterized;
 
 /**
  * Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
  * modes.
  */
-@RunWith(JUnit4.class)
+@RunWith(Parameterized.class)
 public class XdsSecurityClientServerTest {
 
   @Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
@@ -115,6 +116,14 @@
   private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
   private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
 
+  @Parameterized.Parameter
+  public boolean bool;
+
+  @Parameterized.Parameters(name = "thebool={0}")
+  public static List<Boolean> data() {
+    return Arrays.asList(new Boolean[] {true, false});
+  }
+
   @After
   public void tearDown() throws IOException {
     if (fakeNameResolverFactory != null) {

That gave me a .5% flake rate in 1000 runs. http://sponge2/3a854fc8-f335-4388-ac75-d1e7d9138cb4 . So it's possible to reproduce without the spiffe code.

@kannanjgithub

@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

Hmmm, if it's some side effect of @Parameterized I can try to isolate the new tests using subclasses like I did here - https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/SpiffeUtilTest.java#L48

Trying to reproduce...

@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

Ok caught it using
./gradlew build :grpc-xds:jacocoTestReport -PtargetArch=x86_64 -Pcheckstyle.ignoreFailures=false -PfailOnWarnings=false -PerrorProne=true -Dorg.gradle.parallel=true -PskipAndroid=true -PskipCodegen=true
and

@Parameters(name = "enableSpiffe={0}")
public static Collection<Boolean> data() {
    return IntStream.range(0, 1000)
        .mapToObj(i -> i%2 == 0)
        .collect(toList());
  }

trick

@erm-g
Copy link
Contributor

erm-g commented Nov 10, 2024

Ok, so the exception actually happens if we don't set trustStore System Property (just by commenting out https://github.com/grpc/grpc-java/blob/master/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java#L226)
So I suspect that the rootcause is concurrent execution of the tests where one thread calls clearTrustStoreSystemProperties immediately after another calls setTrustStoreSystemProperties. However I don't see any gradle properties enabling parallel execution of test methods.

I see 3 possible options:

  1. Add synchronization technique for the test. I used trustStoreFilePath as an object to sync on for setTrustStoreSystemProperties, clearTrustStoreSystemProperties, and tlsClientServer_useSystemRootCerts_validationContext - 1000 runs didn't produce the exception, 5k sometimes fail with non-related to this issue timeouts and OOM. So I can create a PR with that.
  2. I introduced Parameterized here to avoid directl manipulation with System properties (GRPC_EXPERIMENTAL_SPIFFE_TRUST_BUNDLE_MAP flag), but since we do it anyway in this test, I can drop Parameterized in favor of a similar manipulation
  3. Mark the test as 'known flaky' (aka do nothing) - once the flag is removed, we can remove Parameterized and the test should be fine.

@ejona86 WDYT?

@ejona86
Copy link
Member Author

ejona86 commented Nov 11, 2024

It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake.

@ejona86
Copy link
Member Author

ejona86 commented Nov 11, 2024

@ejona86 ejona86 added this to the 1.69 milestone Nov 12, 2024
@matthewstevenson88
Copy link
Contributor

@rockspore is taking a look at this.

@rockspore
Copy link
Collaborator

I have to admit I don't have much expertise here. It seems the flakiness goes away if we comment out the following two lines in all 3 relevant tests.:

      Files.deleteIfExists(trustStoreFilePath.toAbsolutePath());
      clearTrustStoreSystemProperties();

Interestingly, both had to be commented out, as either could not remove the flakiness, during my local test runs.

So I was also suspecting something related to SSL wasn't shutting down soon enough before the properties or file is erased, but haven't found any clue.

@erm-g
Copy link
Contributor

erm-g commented Nov 13, 2024

It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake.

In that case, the following addition (imho very ugly) to the failing test works:

      setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
      for (int i = 0; i < 5; i++) {
        if (System.getProperty("javax.net.ssl.trustStore") == null) {
          Thread.sleep(200L);
          setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
        } else {
          break;
        }
      }

The number 5 here comes from 3 (tests manipulating System.properties) * 2 (spiffe Param values) - 1 (current test), and 200ms is max time per test I observed locally.
Also 0 failures on 1000 local runs, some OOMs on 5000
MB just single sleep for 200ms * total number of tests in this class would also work.

I guess that making this test class execute the test methods in the particular order would also work - Junit4 supports it (but very limited) - FixMethodOrder

WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
highly flaky Issue is for a test that is crying wolf
Projects
None yet
Development

No branches or pull requests

4 participants