Collect destination status through rabbitmq #17

Open
pauldg opened this issue Nov 6, 2024 · 6 comments

Comments

@pauldg
Collaborator

pauldg commented Nov 6, 2024

In order to collect destination metrics (available CPU/memory) from Pulsar destinations in the network, we can re-purpose the RabbitMQ connection to send back HTCondor status metrics, which can then be used for scheduling.

  1. Producer:
    1. We write a Python script that is deployed by the admin on the Pulsar side.
    2. The Python script collects the metrics from the scheduler and pushes them to its queue in the MQ using kombu (see the producer sketch after this list).
    3. The script should have access to the Pulsar conf file (to access the MQ credentials).
  2. Consumer:
    1. There will be another Python script on the consumer side, running as a Telegraf task, which acknowledges and fetches the metrics from the queue in the MQ (see the consumer sketch after this list).
    2. It compares the timestamp of the last entry in the InfluxDB to its local server time, pushes the new metrics, and sets a field/tag for online/offline. If there is no metric in the queue, the Telegraf task will automatically set the endpoint to offline by pushing empty/null metrics to the InfluxDB, and the TPV API can eliminate that destination from its list of candidate destinations.
    3. The consumer will run on the Galaxy side (for example on the maintenance node in the EU), which has access to the job conf and can therefore read the MQ credentials of each queue.
    4. The consumer Telegraf task/script will be parallelized (using multiprocessing) to talk to multiple queues in the MQ.
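A minimal sketch of what the producer script could look like, assuming HTCondor as the scheduler and kombu for publishing. The AMQP URL, queue name, and metric field names below are placeholders, not final values; the real script would read the credentials from the Pulsar conf file as described above.

```python
import socket

from kombu import Connection, Exchange, Queue


def collect_condor_status():
    """Placeholder: query the HTCondor scheduler (e.g. via `condor_status`
    or the htcondor Python bindings) and return the available resources."""
    return {
        "destination": socket.getfqdn(),
        "cpus_available": 16,          # dummy value
        "memory_available_mb": 64000,  # dummy value
    }


def publish_metrics(amqp_url, queue_name):
    """Push one metrics message to this destination's queue in the MQ."""
    exchange = Exchange("destination-status", type="direct")
    queue = Queue(queue_name, exchange=exchange, routing_key=queue_name)
    with Connection(amqp_url) as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            collect_condor_status(),
            exchange=exchange,
            routing_key=queue_name,
            declare=[queue],  # make sure the queue exists before publishing
            retry=True,
        )


if __name__ == "__main__":
    # Placeholder values; the real script would read these from the Pulsar conf.
    publish_metrics(
        "amqp://user:password@mq.example.org:5672//",
        "destination-status." + socket.getfqdn(),
    )
```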
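On the consumer side, one option is a script run periodically by Telegraf's exec input: it drains the queue, acknowledges the messages, and prints InfluxDB line protocol on stdout; if nothing was received, the destination is reported as offline. A sketch under the same assumptions (queue, destination, measurement, and field names are illustrative):

```python
import time
from queue import Empty

from kombu import Connection

AMQP_URL = "amqp://user:password@mq.example.org:5672//"  # placeholder
QUEUE_NAME = "destination-status.pulsar-example"         # placeholder
DESTINATION = "pulsar-example"                           # placeholder


def drain_queue():
    """Fetch and acknowledge all pending metric messages from the queue."""
    payloads = []
    with Connection(AMQP_URL) as conn:
        queue = conn.SimpleQueue(QUEUE_NAME)
        while True:
            try:
                message = queue.get(block=False)
            except Empty:
                break
            payloads.append(message.payload)
            message.ack()
        queue.close()
    return payloads


def to_line_protocol(metric, online):
    """Format one metric as InfluxDB line protocol for Telegraf's exec input."""
    fields = "cpus_available={cpus}i,memory_available_mb={mem}i,online={online}".format(
        cpus=metric.get("cpus_available", 0),
        mem=metric.get("memory_available_mb", 0),
        online="true" if online else "false",
    )
    return "destination_status,destination={dest} {fields} {ts}".format(
        dest=metric.get("destination", DESTINATION),
        fields=fields,
        ts=int(time.time() * 1e9),
    )


if __name__ == "__main__":
    metrics = drain_queue()
    if not metrics:
        # No message since the last run: mark the destination as offline.
        print(to_line_protocol({"destination": DESTINATION}, online=False))
    for metric in metrics:
        print(to_line_protocol(metric, online=True))
```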
@sanjaysrikakulam
Member

Here is where the scripts are collected at the moment: https://github.com/pauldg/bh2024-metric-collection-scripts/

@abdulrahmanazab
Contributor

And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?

@sanjaysrikakulam
Member

> And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?

Yes, the Pulsar destinations will push the metrics (the interval still needs to be determined) to the queue.

@sebastian-luna-valero
Contributor

Many thanks for working on this!

My question is whether this tpv-metascheduler-api repository is the right one to collect destination status.

I was wondering whether we would need to have two separate repositories instead:

  • The existing tpv-metascheduler-api repository will execute the meta-scheduling algorithms.
  • A new metrics-collector repository to collect destination status and send this back to Galaxy via RabbitMQ (e.g. reusing the scripts proposed in Add ansible example playbook for deployment #14)

Each destination will have a metric-collector installed alongside Pulsar/ARC and HTCondor/SGE/Slurm (by the way, should we design this as an opt-in service, in case some Pulsar destinations don't want to share this information?). However, there will only be one instance of the tpv-metascheduler-api service running per Galaxy instance.

What do you think?

@sanjaysrikakulam
Member

sanjaysrikakulam commented Nov 11, 2024

Yes, we will move the producer script to the pulsar-deployment repo (and/or create a dedicated Ansible role) and add an optional variable; based on this, the Ansible tasks will decide whether to copy the script and set up a cron job. The consumer will have an Ansible role, so admins can easily install it. This consumer role will also include the Telegraf task deployment tasks.

Since this is a PoC, we didn't implement SLURM and Kubernetes metrics collection in the producer script; that will follow later. We want to create a rank function that can be added to a user's TPV conf (see the sketch below) and run some tests on EU to see how this works together.
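To make the intent concrete, here is a hedged sketch of what the body of such a `rank: |` block in the TPV config could look like, assuming (as in TPV's documented examples) that the block is given `candidate_destinations` and that its last expression is used as the result. `get_latest_status`, the module it comes from, and the field names are assumptions for illustration only, not part of TPV or of the current scripts.

```python
# Contents of a `rank: |` block in the TPV config. Hypothetical sketch only:
# `get_latest_status` is an assumed helper (e.g. shipped alongside the
# consumer role) that returns the latest destination_status entry from
# InfluxDB as a dict like {'online': True, 'cpus_available': 16}.
from tpv_metrics_helpers import get_latest_status  # assumed module

def free_cpus(dest):
    status = get_latest_status(dest.id) or {}
    # Push offline destinations to the end of the ranking.
    return status.get("cpus_available", 0) if status.get("online") else -1

final_destinations = sorted(candidate_destinations, key=free_cpus, reverse=True)
final_destinations
```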

The Ansible role to deploy the API itself is already available here. I made it for the deployment on the ESG instance, where it is currently being used. I will extract it as an individual role into its own dedicated repo.

@sanjaysrikakulam
Member

xref:

  1. Updated api to handle new influx data #19
  2. Add fuzzy matchmaking algorithm to rank and sort destinations #20
