
[BUG] : sudden cessation of message consumption using ServiceBusProcessorClient in multiple Java applications #40496

Closed
Poseithon opened this issue Jun 5, 2024 · 13 comments
Labels: Client, customer-reported, needs-team-attention, question, Service Bus, tracking-external-issue

Comments


Poseithon commented Jun 5, 2024

Describe the bug
As part of an integration project that has recently been deployed to production and is still in the pilot phase, we have several Java applications that connect to Azure Service Bus to consume messages from various topics and subscriptions. For message consumption, we are using the ServiceBusProcessorClient.

However, we have frequently observed that, within the same timeframe, all our consumers stop consuming messages despite there being messages remaining in the various subscriptions. A week or even 10 days can go by without interruption, and sometimes all our clients stop consuming two or three times a day.

To resolve this issue and resume message consumption, we are forced to restart our containers. A few days ago we also put in place a dirty fix, which I describe below.

Environment:
We have 6 environments, including production. We have three service bus servers and three gateways. The production environment has its own dedicated service bus server and application gateway.

Service Bus configuration / Namespace environment
Namespace: Premium
Message entity: Topics; some use sessions and some do not.
Average message size: 80 KB - 100 KB

Machine Spec:
Our applications run in Docker using Docker Swarm. The JVM used is openjdk:17-jdk-debian with the following JVM options:
-XX:InitialRAMPercentage=25 -XX:MinRAMPercentage=75 -XX:MaxRAMPercentage=75

The resource specifications for our Docker containers, defined in the Docker Compose YAML file, are as follows:

resources:
  limits:
    cpus: "2"
    memory: 1500M
  reservations:
    cpus: "0.05"
    memory: 700M

Additionally, some applications run with a single replica, while others run with two replicas to handle varying load and ensure high availability.

ServiceBusProcessorClient Configuration:
Following the standards and recommendations of the documentation, we configure our clients as follows:

public ServiceBusProcessorClient getOrCreateClaimProcessorClient() {
    return new ServiceBusClientBuilder()
            .fullyQualifiedNamespace(XXXX.getFullyQualifiedName())
            .credential(new ClientSecretCredentialBuilder()
                    .tenantId(XXXX.getTenantId())
                    .clientId(XXXX.getStId())
                    .clientSecret(XXXX.getStSecret())
                    .build())
            .customEndpointAddress(XXXX.getApplicationGatewayEndpointUrl())
            .transportType(AmqpTransportType.AMQP_WEB_SOCKETS)
            .processor()
            .topicName(YYYY.getTopic())
            .subscriptionName(YYYY.getSubscription())
            .receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
            .disableAutoComplete()
            .processMessage(YYYY.processMessage())
            .processError(YYYY.processError())
            .buildProcessorClient();
}

As you can see, we are not using auto-complete, and there is no prefetch configured. In our case, we are not using sessions, but other applications are using sessions.
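
Since we disabled auto-complete, our handler settles each message explicitly. As a rough sketch of the kind of handler YYYY.processMessage() returns (handleBusinessLogic below is a placeholder, not our actual code):

import com.azure.messaging.servicebus.ServiceBusReceivedMessageContext;

import java.util.function.Consumer;

// Rough sketch of a processMessage handler for PEEK_LOCK with auto-complete disabled.
public static Consumer<ServiceBusReceivedMessageContext> processMessage() {
    return context -> {
        try {
            handleBusinessLogic(context.getMessage().getBody().toString());
            context.complete();   // settle explicitly on success
        } catch (Exception e) {
            context.abandon();    // release the lock so the message is redelivered
        }
    };
}

// Placeholder for the actual business logic.
private static void handleBusinessLogic(String body) {
}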

Traffic Pattern:
Since we are in the pilot phase, we experience varying traffic patterns. There are periods with several messages per minute and long periods without messages. Sometimes, we receive only one message every 12 hours. This traffic pattern is due to the pilot phase, and we expect a significant increase in message volume after the pilot phase.

To Reproduce / Exception or Stack Trace:
We have attempted multiple times to reproduce this issue, which we have named the "zombie mode"—a state where our applications are up and running but not consuming messages. However, we have been unable to replicate it. In the other 5 environments, we have never encountered the zombie mode.

Regarding logs, there are no errors or crashes reported. As a result, we do not have logs to provide for this issue.

To troubleshoot, we have implemented a workaround. This "dirty fix" involves closing and restarting the processor if the client has not processed any messages for 5 minutes. This forces the closure of the connection, the session and the links. We are unsure if this solution is effective.
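
As a minimal sketch of the idea behind this workaround (the class and names below are illustrative, not our exact code):

import com.azure.messaging.servicebus.ServiceBusProcessorClient;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative watchdog: rebuilds the processor when no message has been
// processed for five minutes. The factory would wrap the builder code shown
// below; the processMessage handler must call onMessageSeen() on every delivery.
public class ProcessorWatchdog {
    private final AtomicReference<Instant> lastMessageAt = new AtomicReference<>(Instant.now());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Supplier<ServiceBusProcessorClient> factory;
    private volatile ServiceBusProcessorClient processor;

    public ProcessorWatchdog(Supplier<ServiceBusProcessorClient> factory) {
        this.factory = factory;
    }

    public void start() {
        processor = factory.get();
        processor.start();
        scheduler.scheduleAtFixedRate(this::checkLiveness, 1, 1, TimeUnit.MINUTES);
    }

    public void onMessageSeen() {
        lastMessageAt.set(Instant.now());
    }

    private void checkLiveness() {
        if (Duration.between(lastMessageAt.get(), Instant.now()).toMinutes() >= 5) {
            processor.close();            // forces the connection, session and links to close
            processor = factory.get();    // rebuild and restart the processor
            processor.start();
            onMessageSeen();              // reset the timer so we do not restart on every tick
        }
    }
}

With our sparse traffic this also restarts perfectly healthy processors, which is part of why we consider it a dirty fix.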

What we have done to attempt to reproduce the Zombie mode:
We have taken several steps to try to reproduce the zombie mode:

  1. We lowered the timeout of the application gateway to its minimum value, but the consumers continued to consume messages without issues.
  2. We placed our consumers behind a forward proxy, which we then disconnected for several minutes. After reconnecting, our consumers were able to recover the connection and continue consuming messages.

Despite these efforts, we have not been able to consistently reproduce the zombie mode.

We need your support to resolve this issue.

Thank you for your assistance.

PS: A side question: if we create two ServiceBusProcessorClient instances from the same ServiceBusClientBuilder, the documentation says they will share the same connection. If we close one of them by calling close() and then restart it, does that close and recreate the connection for both?

github-actions bot added labels: Client, customer-reported, needs-team-attention, question, Service Bus (Jun 5, 2024)

github-actions bot commented Jun 5, 2024

@anuchandy @conniey @lmolkova


Poseithon commented Jun 6, 2024

Hello,

We have understood where the zombie mode comes from. It occurs when we update properties of the application gateway. The new configuration is first loaded on the passive node and then on the node where our connections are active. However, it seems this is not detected by the mechanism in the ServiceBusProcessorClient responsible for checking whether the connection is active; that check appears to run on a boundedElastic scheduler. The connection ID remains unchanged.

We still need your help to understand why the disconnection is not detected.

anuchandy (Member) commented

Hello @Poseithon, is it possible to capture the TCP-level traffic when the application transitions to zombie mode? A tool I'm aware of is Wireshark, where we can filter traffic by IP address (of the gateway in this case). What we're looking for is whether any TCP-level disconnect signal arrives at Docker's network stack from the gateway peer. The result of the processor-level connection-active check depends on whether the underlying network stack reported a connection drop.


Poseithon commented Jun 7, 2024

Hello @anuchandy ,

Yes, we'll try to capture TCP packets during an application gateway property update in a non-production environment. We are still validating this hypothesis and trying to reproduce zombie mode in a non-production environment, but we'll be sure to make a capture.

Having said that, I still have the following question: if we create two ServiceBusProcessorClient instances from the same ServiceBusClientBuilder, the documentation says they will share the same connection. If we close one of them by calling close() and then restart it, does that close and recreate the connection for both?

Thank you again for taking the time to address our concerns.

anuchandy (Member) commented

Hi @Poseithon, closing a client instance will not close the connection if there are other client instances actively using that connection.

If you are expecting a fair amount of traffic going forward, after the piloting stage, it would be better to use a dedicated builder per client rather than a shared builder. This ensures each client instance gets a dedicated underlying async engine, so a sudden peak in operations or time-consuming IO activity in one client will not stall the other clients. It also narrows the investigation scope if any application issues arise. Reference.

In a microservice setup, I've seen 1-container:1-processor as a more common pattern than running multiple processors in one container. Another learning from customer cases is that low core allocation in a microservice environment can prevent the SDK from performing certain time-sensitive internal activities on time, leading to loss or dead-lettering of messages. The OpenJDK team at Microsoft recommends (based on their study of thousands of containerized Java apps in Azure) two or more cores, and strongly discourages selecting anything less than one core. See the sketch below.
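
For illustration, a minimal sketch of the dedicated-builder pattern (the parameter names and wiring here are placeholders, following the snippet earlier in the thread):

import com.azure.core.credential.TokenCredential;
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusErrorContext;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;
import com.azure.messaging.servicebus.ServiceBusReceivedMessageContext;
import com.azure.messaging.servicebus.models.ServiceBusReceiveMode;

import java.util.function.Consumer;

// One builder per processor: each client gets its own connection and async
// engine, so closing or restarting one processor never affects the others.
public static ServiceBusProcessorClient buildDedicatedProcessor(
        String fullyQualifiedNamespace,
        TokenCredential credential,
        String topic,
        String subscription,
        Consumer<ServiceBusReceivedMessageContext> onMessage,
        Consumer<ServiceBusErrorContext> onError) {
    return new ServiceBusClientBuilder()           // a fresh builder per call -> a dedicated connection
            .fullyQualifiedNamespace(fullyQualifiedNamespace)
            .credential(credential)
            .processor()
            .topicName(topic)
            .subscriptionName(subscription)
            .receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
            .disableAutoComplete()
            .processMessage(onMessage)
            .processError(onError)
            .buildProcessorClient();
}

Because each call creates its own ServiceBusClientBuilder, each processor owns its connection; closing one (as in the five-minute workaround) leaves the others untouched.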

Poseithon (Author) commented

Thank you very much.

The reason I ask is that we all decided to implement the following workaround: if no message is received for five minutes, we close the processor and start it again. It worked for everyone except one team whose builder was shared by two processors.

So you've just confirmed what we thought and given us a clear explanation. Thank you very much.

We're also going to follow your advice and use one dedicated builder per processor. For the Docker configuration, we'll make sure there are at least 2 CPUs.

The next step is to understand why we get these zombie modes. I'm going to do what you suggested and try to capture TCP packets when an application gateway is updated. We're going to set up an environment that mirrors production, with a dedicated Service Bus namespace and a dedicated application gateway.


Poseithon commented Jun 11, 2024

Hello @anuchandy

We have finally been able to reproduce the zombie mode in a non-production environment and have identified the root cause. It is related to the SSL policy: the version in production was lower than in the other environments.

When the App Gateway is set to TLSv1.0 and a property is modified, we do receive a connection termination, but it seems this is not propagated to the application level.

Here are the logs. Please note that the update was made at 11:42:09 AM and that nothing is received after 11:42:27. We kept the application running until 12:30 PM and nothing happened.

Wireshark capture: [image]

SDK logs: [image]

Apologies for not providing raw logs directly; they appear disordered when I copy and paste them here, so I'm attaching an image instead.

We will correct the App Gateway to upgrade the TLS version.

However, we can see that even though the connection termination was acknowledged at the TCP level, nothing is logged by the application. Shouldn't we get an onConnectionShutDown event?


anuchandy commented Jun 11, 2024

Hi @Poseithon, great job on the TLSv1.0 root cause analysis❤️.

In my App Gateway setup, I can see that with TLSv1.0 (AppGwSslPolicy20150501), the lower-level ProtonJ library (the open-source Apache AMQP library that the Azure Service Bus library uses) never signals termination when the FIN + ACK (+ RST) traffic from the peer arrives in the networking layer. Because of this, the Service Bus library's ProtonJ hooks for detecting the connection termination are never notified. The traffic I captured looks like what you shared. Like you, I modified one of the properties (backend request timeout for port 443) to trigger this FIN + ACK (+ RST) traffic.

[TLSv1.0 traffic capture image]

In fact, last month I noticed this (but never correlated it to TLSv1.0) and debugged it. I opened a changeset against the Apache ProtonJ AMQP library. Now that we have correlated this to TLSv1.0, I'm not sure whether the problem is in the Java TLS layer when dealing with TLSv1.0 and its associated cipher suites. The interesting thing is that ProtonJ detects that the outbound side is closed but never detects the inbound closure, leaving the transport half-closed. Only after a full closure will ProtonJ invoke the hooks that the Service Bus SDK registered.

I tried using App Gateway with TLSv1.2 (AppGwSslPolicy20220101) and triggered the FIN + ACK (+ RST) traffic; this time the ProtonJ library went through a full closure and the Service Bus library recovered successfully. The traffic looks like the capture below. It took around 40 seconds for the networking layer to signal this to ProtonJ and successfully complete the close.

[TLSv1.2 traffic capture image]

I agree with your decision to upgrade App Gateway to TLSv1.2+. Azure services are phasing out TLSv1.0, and support will end on October 31st, 2024 (announcement). Moving to 1.2 now will help identify potential issues, not only for the Service Bus SDK but for all Azure services the application relies on, and prevent disruption in six months.

As far as TLSv1.0 + Apache ProtonJ is concerned, I'll update the Apache Jira ticket with these details. Please note that the ProtonJ library is maintained by the Apache community, not by the Azure SDK team. I'm unsure when the community experts will get to that Jira ticket, or whether they will lower its priority given the general shift towards higher TLS versions. But I believe we've now identified the solution (the TLS upgrade) to unblock the work.


Poseithon commented Jun 11, 2024

Hello @anuchandy ,

Thank you very much for this exchange. I forgot to mention it in my ticket, but this problem was also reported in ticket #40020; a colleague of mine was working to solve the same issue.

You had already mentioned a potential bug in the proton-j library there. However, we realized that that ticket was too vague: it mentioned a timeout issue, but in reality that was just an insignificant side effect. So we decided to open this one.

Thank you again for your support and help.

anuchandy (Member) commented

I see, thanks for the clarification 👍. I’ll go ahead and close the other ticket.

anuchandy (Member) commented

We've published an official troubleshooting guide section about this - https://learn.microsoft.com/en-us/azure/developer/java/sdk/troubleshooting-messaging-service-bus-overview#clients-halt-when-using-application-gateway-custom-endpoint - so that it's easier to find for anyone who faces a similar problem.

anuchandy added the tracking-external-issue label (Jun 12, 2024)
Poseithon (Author) commented

Hello @anuchandy

Thanks a lot, I'm sure it will help the community. Thank you.

anuchandy (Member) commented

I'm closing this, given that the non-recovery problem is resolved with TLSv1.2+, the troubleshooting guide has been updated with details on TLSv1.0, and a PR to ProtonJ has been opened (for TLSv1.0), which is external to the azure-sdk repo.

github-actions bot locked and limited conversation to collaborators Sep 19, 2024