Collect destination status through rabbitmq #17

Open
pauldg opened this issue Nov 6, 2024 · 6 comments

Comments

@pauldg
Collaborator

pauldg commented Nov 6, 2024

In order to collect destination metrics (available CPU/memory) from Pulsar destinations in the network, we can re-purpose the RabbitMQ connection to send back HTCondor status metrics, which can then be used for scheduling.

  1. Producer:
    1. We write a Python script that is deployed by the admin on the Pulsar side.
    2. The Python script collects the metrics from the scheduler and pushes them to its queue in the MQ using kombu (see the producer sketch after this list).
    3. The script should have access to the Pulsar conf file (to access the MQ credentials).
  2. Consumer:
    1. There will be another Python script on the consumer side, running as a Telegraf task, which acknowledges and fetches the metrics from the queue in the MQ (see the consumer sketch after this list).
    2. It compares the timestamp of the last entry in the InfluxDB to its local server time, pushes the new metrics, and sets a field/tag for online/offline. If there is no metric in the queue, the Telegraf task will automatically set the endpoint to offline by pushing empty/null metrics to the InfluxDB, and the TPV API can eliminate that destination from its list of candidate destinations.
    3. The consumer will run on the Galaxy side (for example on the maintenance node in the EU), which has access to the job conf and can therefore read the MQ credentials of each queue.
    4. The consumer Telegraf task/script will be parallelized (using multiprocessing) to talk to multiple queues in the MQ.
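A minimal sketch of what the producer script could look like, assuming HTCondor as the scheduler and kombu for publishing. The AMQP URL, queue name, and metric field names below are placeholders, not final values; the real script would read the credentials from the Pulsar conf file as described above.

```python
import socket

from kombu import Connection, Exchange, Queue


def collect_condor_status():
    """Placeholder: query the HTCondor scheduler (e.g. via `condor_status`
    or the htcondor Python bindings) and return the available resources."""
    return {
        "destination": socket.getfqdn(),
        "cpus_available": 16,          # dummy value
        "memory_available_mb": 64000,  # dummy value
    }


def publish_metrics(amqp_url, queue_name):
    """Push one metrics message to this destination's queue in the MQ."""
    exchange = Exchange("destination-status", type="direct")
    queue = Queue(queue_name, exchange=exchange, routing_key=queue_name)
    with Connection(amqp_url) as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            collect_condor_status(),
            exchange=exchange,
            routing_key=queue_name,
            declare=[queue],  # make sure the queue exists before publishing
            retry=True,
        )


if __name__ == "__main__":
    # Placeholder values; the real script would read these from the Pulsar conf.
    publish_metrics(
        "amqp://user:password@mq.example.org:5672//",
        "destination-status." + socket.getfqdn(),
    )
```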
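On the consumer side, one option is a script run periodically by Telegraf's exec input: it drains the queue, acknowledges the messages, and prints InfluxDB line protocol on stdout; if nothing was received, the destination is reported as offline. A sketch under the same assumptions (queue, destination, measurement, and field names are illustrative):

```python
import time
from queue import Empty

from kombu import Connection

AMQP_URL = "amqp://user:password@mq.example.org:5672//"  # placeholder
QUEUE_NAME = "destination-status.pulsar-example"         # placeholder
DESTINATION = "pulsar-example"                           # placeholder


def drain_queue():
    """Fetch and acknowledge all pending metric messages from the queue."""
    payloads = []
    with Connection(AMQP_URL) as conn:
        queue = conn.SimpleQueue(QUEUE_NAME)
        while True:
            try:
                message = queue.get(block=False)
            except Empty:
                break
            payloads.append(message.payload)
            message.ack()
        queue.close()
    return payloads


def to_line_protocol(metric, online):
    """Format one metric as InfluxDB line protocol for Telegraf's exec input."""
    fields = "cpus_available={cpus}i,memory_available_mb={mem}i,online={online}".format(
        cpus=metric.get("cpus_available", 0),
        mem=metric.get("memory_available_mb", 0),
        online="true" if online else "false",
    )
    return "destination_status,destination={dest} {fields} {ts}".format(
        dest=metric.get("destination", DESTINATION),
        fields=fields,
        ts=int(time.time() * 1e9),
    )


if __name__ == "__main__":
    metrics = drain_queue()
    if not metrics:
        # No message since the last run: mark the destination as offline.
        print(to_line_protocol({"destination": DESTINATION}, online=False))
    for metric in metrics:
        print(to_line_protocol(metric, online=True))
```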
@sanjaysrikakulam
Member

Here is where the scripts are collected at the moment: https://github.com/pauldg/bh2024-metric-collection-scripts/

@abdulrahmanazab
Contributor

And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?

@sanjaysrikakulam
Member

> And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?

Yes, the Pulsar destinations will push the metrics (the interval still needs to be determined) to the queue.

@sebastian-luna-valero
Contributor

Many thanks for working on this!

My question is whether this tpv-metascheduler-api repository is the right one to collect destination status.

I was wondering whether we would need to have two separate repositories instead:

  • The existing tpv-metascheduler-api repository will execute the meta-scheduling algorithms.
  • A new metrics-collector repository to collect destination status and send this back to Galaxy via RabbitMQ (e.g. reusing the scripts proposed in Add ansible example playbook for deployment #14)

Each destination will have a metric-collector installed alongside Pulsar/ARC and HTCondor/SGE/Slurm (by the way, should we design this as an opt-in service, in case some Pulsar destinations don't want to share this information?). However, there will only be one instance of the tpv-metascheduler-api service running per Galaxy instance.

What do you think?

@sanjaysrikakulam
Member

sanjaysrikakulam commented Nov 11, 2024

Yes, we will move the producer script to the pulsar-deployment repo (and/or create a dedicated Ansible role) and add an optional variable; based on this, the Ansible tasks will decide whether to copy the script and set up a cron job. The consumer will have an Ansible role, so admins can easily install it. This consumer role will also include the Telegraf task deployment tasks.

Since this is a PoC, we didn't implement SLURM and Kubernetes metrics collection in the producer script; that will follow later. We want to create a rank function that can be added to a user's TPV conf (see the sketch below) and run some tests on EU to see how this works together.
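To make the intent concrete, here is a hedged sketch of what the body of such a `rank: |` block in the TPV config could look like, assuming (as in TPV's documented examples) that the block is given `candidate_destinations` and that its last expression is used as the result. `get_latest_status`, the module it comes from, and the field names are assumptions for illustration only, not part of TPV or of the current scripts.

```python
# Contents of a `rank: |` block in the TPV config. Hypothetical sketch only:
# `get_latest_status` is an assumed helper (e.g. shipped alongside the
# consumer role) that returns the latest destination_status entry from
# InfluxDB as a dict like {'online': True, 'cpus_available': 16}.
from tpv_metrics_helpers import get_latest_status  # assumed module

def free_cpus(dest):
    status = get_latest_status(dest.id) or {}
    # Push offline destinations to the end of the ranking.
    return status.get("cpus_available", 0) if status.get("online") else -1

final_destinations = sorted(candidate_destinations, key=free_cpus, reverse=True)
final_destinations
```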

The Ansible role to deploy the API itself is already available here. I made it for the deployment on the ESG instance, where it is currently being used. I will extract it as an individual role into its own dedicated repo.

@sanjaysrikakulam
Member

xref:

  1. Updated api to handle new influx data #19
  2. Add fuzzy matchmaking algorithm to rank and sort destinations #20
