fixed the alarm event handler performance issue #3136

Conversation

@jeonghanlee (Contributor) commented Sep 13, 2024

Hi Kay @kasemir and Kunal @shroffk

While intensively testing alarm performance, we noticed that the current alarm server cannot handle simultaneous alarm events (50 PVs sharing the exact process record time via a reference PV's TSEL, i.e. 50 simultaneous alarm events per second, sustained for 10 minutes).

Our first suspicion was the Flowable&lt;VType&gt; buffer size, but increasing it is not a real solution. While debugging the performance problem, we noticed that the alarm server calls producer.send(record).get() intensively in sendStateUpdate() in ServerModel.java without ever using the returned metadata.

We investigated this get() call in more detail and concluded that it is the main cause of the alarm performance problem: get() is synchronous and blocking, whereas send() itself is asynchronous and non-blocking.

So here we propose the code without the get() call. We have verified that the alarm server on this branch can handle simultaneous alarm events (now 100 PVs sharing the exact process record time via the reference PV's TSEL, i.e. 100 simultaneous alarm events per second, for 10 minutes).

We have seen no use case elsewhere in the code for handling the return value of get(), so we removed these calls everywhere. This may not be the right solution if you want to receive metadata after send().
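
To make the change concrete, here is a minimal before/after sketch of the producer call pattern (the surrounding sendStateUpdate() body is omitted; producer and record stand for the KafkaProducer and ProducerRecord already used there):

```java
// Before: send() returns a Future<RecordMetadata>; calling get() blocks the
// alarm server thread until the broker acknowledges the record, so state
// updates are serialized one broker round-trip at a time.
producer.send(record).get();

// After: fire-and-forget. send() only enqueues the record and returns
// immediately; the producer batches and transmits on its own I/O thread.
producer.send(record);
```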

For the case where you do want to handle the metadata, there is also a solution, which is as follows:

producer.send(record, callback), with the callback overriding onCompletion(RecordMetadata, Exception).

We have not checked which Kafka version is in use, so please consult the documentation for the appropriate Kafka version.

Detailed information can be found at

https://kafka.apache.org/35/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html#send(org.apache.kafka.clients.producer.ProducerRecord,org.apache.kafka.clients.producer.Callback)
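
For reference, a minimal sketch of that callback-based form, using the KafkaProducer send(ProducerRecord, Callback) overload (the callback body, logger, and message text here are illustrative assumptions, not code from the alarm server):

```java
// Assumes org.apache.kafka.clients.producer.Callback / RecordMetadata
// and a java.util.logging.Logger named "logger" are available.
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // Invoked asynchronously on the producer's I/O thread once the broker
        // has responded, so the sending thread is never blocked.
        if (exception != null) {
            logger.log(Level.WARNING, "Failed to send alarm state update", exception);
        } else {
            logger.fine(() -> "Sent to " + metadata.topic() + "-"
                    + metadata.partition() + " @ offset " + metadata.offset());
        }
    }
});
```

This keeps the send path non-blocking while still surfacing delivery metadata or failures if they are ever needed.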

We are now extending our tests to 500, 1000, and 5000 alarm events to see where the story ends. If all goes well, we will finish with a test of at most 5000 simultaneous alarm events per second (same TSEL-based setup with the exact process record time) for 10 minutes.

We want to be very clear about the limits of the alarm services we intend to rely on.

If our system ever produces more than 5000 alarm events per second, something is badly wrong with the control system anyway. 🥲

@jeonghanlee, Soo Ryu, and @Sangil-Lee at ALS-U Controls, LBNL.

 - cannot handle 50 alarm events / sec
 - with this, can handle 100 alarm events / sec
 - remove all .get methods from everywhere
@georgweiss (Collaborator) commented:
I guess with 5000 alarms per second you need to run rather than check alarm messages.

@kasemir (Collaborator) commented Sep 13, 2024

Makes sense to me, thanks for digging into this!

@shroffk added the FAQ label Sep 13, 2024
@shroffk (Member) commented Sep 13, 2024

Great find,
Makes sense to me to remove the get; I appreciate that you have also found an alternative in case we need the metadata at some later date.
I have tagged this as FAQ to hopefully retain it as a reference.

@shroffk merged commit ff78e8a into ControlSystemStudio:master Sep 13, 2024
1 check passed
@jeonghanlee (Contributor, Author) commented:
We can now see all 5000 alarm events / sec in the alarm server and the alarm logger without a single loss. I know there are more issues we will run into, but the alarm trio of services works well, and we are now going to prepare in-depth technical documentation for this study.
