fixed the alarm event handler performance issue #3136

Conversation

@jeonghanlee (Contributor) commented Sep 13, 2024

Hi Kay @kasemir and Kunal @shroffk

While intensively testing alarm performance, we noticed that the current alarm server cannot handle simultaneous alarm events (50 PVs sharing the exact process record time via a reference PV's TSEL, i.e. 50 simultaneous alarm events per second, sustained for 10 minutes).

Our first suspicion was the Flowable&lt;VType&gt; buffer size, but increasing it is not a real solution. While debugging the performance problem, we noticed that the alarm server calls producer.send(record).get() intensively in sendStateUpdate() in ServerModel.java without ever using the returned metadata.

We investigated this get() call in more detail and concluded that it is the main cause of the alarm performance problem: get() is synchronous and blocking, whereas send() itself is asynchronous and non-blocking.

So here we propose the code without the get() call. We have verified that the alarm server on this branch can handle simultaneous alarm events (now 100 PVs sharing the exact process record time via the reference PV's TSEL, i.e. 100 simultaneous alarm events per second, for 10 minutes).

We have seen no use case elsewhere in the code for handling the return value of get(), so we removed these calls everywhere. This may not be the right solution if you want to receive metadata after send().
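
To make the change concrete, here is a minimal before/after sketch of the producer call pattern (the surrounding sendStateUpdate() body is omitted; producer and record stand for the KafkaProducer and ProducerRecord already used there):

```java
// Before: send() returns a Future<RecordMetadata>; calling get() blocks the
// alarm server thread until the broker acknowledges the record, so state
// updates are serialized one broker round-trip at a time.
producer.send(record).get();

// After: fire-and-forget. send() only enqueues the record and returns
// immediately; the producer batches and transmits on its own I/O thread.
producer.send(record);
```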

For the case where you do want to handle the metadata, there is also a solution, which is as follows:

producer.send(record, callback), with the callback overriding onCompletion(RecordMetadata, Exception).

We have not checked which Kafka version is in use, so please consult the documentation for the appropriate Kafka version.

Detailed information can be found at

https://kafka.apache.org/35/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html#send(org.apache.kafka.clients.producer.ProducerRecord,org.apache.kafka.clients.producer.Callback)
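
For reference, a minimal sketch of that callback-based form, using the KafkaProducer send(ProducerRecord, Callback) overload (the callback body, logger, and message text here are illustrative assumptions, not code from the alarm server):

```java
// Assumes org.apache.kafka.clients.producer.Callback / RecordMetadata
// and a java.util.logging.Logger named "logger" are available.
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // Invoked asynchronously on the producer's I/O thread once the broker
        // has responded, so the sending thread is never blocked.
        if (exception != null) {
            logger.log(Level.WARNING, "Failed to send alarm state update", exception);
        } else {
            logger.fine(() -> "Sent to " + metadata.topic() + "-"
                    + metadata.partition() + " @ offset " + metadata.offset());
        }
    }
});
```

This keeps the send path non-blocking while still surfacing delivery metadata or failures if they are ever needed.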

We are now extending our tests to 500, 1000, and 5000 alarm events to see where the story ends. If all goes well, we will finish with a test of at most 5000 simultaneous alarm events per second (same TSEL-based setup with the exact process record time) for 10 minutes.

We want to be very clear about the limits of the alarm services we intend to rely on.

If our system ever produces more than 5000 alarm events per second, something is badly wrong with the control system anyway. 🥲

@jeonghanlee, Soo Ryu, and @Sangil-Lee at ALS-U Controls, LBNL.

 - cannot handle 50 alarm events / sec
 - with this, can handle 100 alarm events / sec
 - remove all .get methods from everywhere
@georgweiss (Collaborator) commented:
I guess with 5000 alarms per second you need to run rather than check alarm messages.

@kasemir (Collaborator) commented Sep 13, 2024

Makes sense to me, thanks for digging into this!

@shroffk added the FAQ label Sep 13, 2024
@shroffk (Member) commented Sep 13, 2024

Great find,
Makes sense to me to remove the get; I appreciate that you have also found an alternative in case we need the metadata at some later date.
I have tagged this as FAQ to hopefully retain it as a reference.

@shroffk merged commit ff78e8a into ControlSystemStudio:master Sep 13, 2024
1 check passed
@jeonghanlee (Contributor, Author) commented:
We can now see all 5000 alarm events / sec in the alarm server and the alarm logger without a single loss. I know there are more issues we will run into, but the alarm trio of services works well, and we are now going to prepare in-depth technical documentation for this study.
