-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext is flaky #11678
Comments
@ejona86 how these runs are executed (how to reproduce it locally)? |
This run was in Github Actions, which runs Testing with blaze for the --runs_per_test, I saw a few different flakes with a total flake rate around 1%, all with (While it is a 1% flake rate with blaze, those are different machines than Github Actions, so it may be a much higher flake rate on Github Actions.) http://sponge2/45c92135-104b-49fd-92cb-02899959ef7b I'm suspected --- grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java 2024-10-29 09:51:25.000000000 -0700
+++ grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java 2024-11-09 06:59:07.000000000 -0800
@@ -96,12 +96,13 @@
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
+import org.junit.runners.Parameterized;
/**
* Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
* modes.
*/
-@RunWith(JUnit4.class)
+@RunWith(Parameterized.class)
public class XdsSecurityClientServerTest {
@Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
@@ -115,6 +116,14 @@
private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
+ @Parameterized.Parameter
+ public boolean bool;
+
+ @Parameterized.Parameters(name = "thebool={0}")
+ public static List<Boolean> data() {
+ return Arrays.asList(new Boolean[] {true, false});
+ }
+
@After
public void tearDown() throws IOException {
if (fakeNameResolverFactory != null) { That gave me a .5% flake rate in 1000 runs. http://sponge2/3a854fc8-f335-4388-ac75-d1e7d9138cb4 . So it's possible to reproduce without the spiffe code. |
Hmmm, if it's some side effect of Trying to reproduce... |
Ok caught it using @Parameters(name = "enableSpiffe={0}")
public static Collection<Boolean> data() {
return IntStream.range(0, 1000)
.mapToObj(i -> i%2 == 0)
.collect(toList());
} trick |
Ok, so the exception actually happens if we don't set trustStore System Property (just by commenting out https://github.com/grpc/grpc-java/blob/master/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java#L226) I see 3 possible options:
@ejona86 WDYT? |
It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake. |
@rockspore is taking a look at this. |
I have to admit I don't have much expertise here. It seems the flakiness goes away if we comment out the following two lines in all 3 relevant tests.:
Interestingly, both had to be commented out, as either could not remove the flakiness, during my local test runs. So I was also suspecting something related to SSL wasn't shutting down soon enough before the properties or file is erased, but haven't found any clue. |
In that case, the following addition (imho very ugly) to the failing test works: setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
for (int i = 0; i < 5; i++) {
if (System.getProperty("javax.net.ssl.trustStore") == null) {
Thread.sleep(200L);
setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
} else {
break;
}
} The number 5 here comes from 3 (tests manipulating I guess that making this test class execute the test methods in the particular order would also work - Junit4 supports it (but very limited) - WDYT? |
@erm-g
https://github.com/grpc/grpc-java/actions/runs/11745673719/job/32723647320?pr=11673
Given this code just went in, seems highly flaky.
The text was updated successfully, but these errors were encountered: