Collect destination status through RabbitMQ #17
Here is where the scripts are collected at the moment: https://github.com/pauldg/bh2024-metric-collection-scripts/
And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?
Yes, the Pulsar destinations will push the metrics (the interval needs to be determined) to the queue.
Many thanks for working on this! I was wondering whether we would need to have two separate repositories instead:
Each destination will have a … What do you think?
Yes, we will move the producer script to the pulsar-deployment repo (and/or create a dedicated Ansible role) and add an optional variable; based on this, the Ansible tasks for copying the script and setting up a cron job will make decisions. The consumer will have an Ansible role, so admins can easily install it. This consumer role will also include the Telegraf task deployment tasks. Since this is a PoC, we didn't implement the SLURM and Kubernetes metrics collection in the producer script; this will follow. We want to create a rank function that could be added to a user's conf in the TPV and run some tests on EU to see how this works together. The Ansible role to deploy the API itself is already available here. I made that for the deployment in the ESG instance and it is currently being used. I will extract it as an individual role into its dedicated repo.
In order to collect destination metrics (available CPU/memory) from Pulsar destinations in the network, we can re-purpose the RabbitMQ connection to send back HTCondor status metrics, which can be used for scheduling.
1. We write a Python script that is deployed by the admin on the Pulsar side.
2. The Python script collects the metrics from the scheduler and pushes them to its queue in the MQ using kombu (see the producer sketch after this list).
3. The Python script should have access to the Pulsar conf file (to access the MQ creds).
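A minimal producer sketch under stated assumptions: the app.yml path, the `message_queue_url` key, the `pulsar_metrics` exchange/queue names, and the `htcondor_status()` helper are hypothetical placeholders (the real script would also cover SLURM and Kubernetes later), and the push interval would be driven by the cron job mentioned above.

```python
#!/usr/bin/env python
"""Producer sketch: collect destination metrics on the Pulsar side and push
them to the destination's queue in the MQ, re-using the Pulsar MQ credentials."""

import socket
import subprocess
import time

import yaml
from kombu import Connection, Exchange, Queue

PULSAR_APP_YML = "/opt/pulsar/config/app.yml"  # assumed path to the Pulsar conf


def read_amqp_url(path=PULSAR_APP_YML):
    """Read the AMQP URL (MQ creds) from the Pulsar configuration file."""
    with open(path) as fh:
        conf = yaml.safe_load(fh)
    return conf["message_queue_url"]


def htcondor_status():
    """Hypothetical collector: sum free CPUs/memory reported by condor_status."""
    out = subprocess.run(
        ["condor_status", "-autoformat", "Cpus", "Memory"],
        capture_output=True, text=True, check=True,
    ).stdout
    cpus, memory = 0, 0
    for line in out.splitlines():
        if not line.strip():
            continue
        c, m = line.split()
        cpus += int(c)
        memory += int(m)
    return {"cpus": cpus, "memory_mb": memory}


def push_metrics(amqp_url, queue_name):
    """Publish one metrics message to the destination's queue via kombu."""
    payload = {
        "destination": socket.gethostname(),
        "timestamp": time.time(),
        "metrics": htcondor_status(),
    }
    exchange = Exchange("pulsar_metrics", type="direct")  # assumed exchange name
    queue = Queue(queue_name, exchange=exchange, routing_key=queue_name)
    with Connection(amqp_url) as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            payload,
            exchange=exchange,
            routing_key=queue_name,
            declare=[queue],  # declares exchange, queue and binding if missing
        )


if __name__ == "__main__":
    push_metrics(read_amqp_url(), queue_name="metrics_pulsar_example")
```

The Ansible role on the Pulsar side would copy this script and schedule it via cron at the agreed interval.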
1. There will be another Python script on the consumer side, running as a Telegraf task, which acknowledges and fetches the metrics from the queue in the MQ (see the consumer sketch after this list).
2. It compares the timestamp of the last entry in the InfluxDB to its local server time, pushes the new metrics, and sets a field/tag for online/offline. If there is no metric in the queue, the Telegraf task will automatically set the endpoint to offline when it pushes empty/null metrics to the InfluxDB, and the TPV API can eliminate that destination from its candidate list of destinations.
3. The consumer will run on the Galaxy side (for example on the maintenance node in the EU), which has access to the job conf where it can read the MQ credentials of each queue.
4. The consumer Telegraf task/script will be parallelized (using multiprocessing) to talk to multiple queues in the MQ.
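A minimal consumer sketch, assuming Telegraf's `inputs.exec` plugin runs the script periodically and ingests InfluxDB line protocol from stdout; the hard-coded destination list, queue names, and staleness threshold are illustrative placeholders and would come from the Galaxy job conf in practice.

```python
#!/usr/bin/env python
"""Consumer sketch: drain each destination's metrics queue, decide
online/offline by comparing the message timestamp with the local server time,
and print InfluxDB line protocol for Telegraf to ingest."""

import time
from multiprocessing import Pool

from kombu import Connection

# Assumed: one (amqp_url, queue_name) pair per destination, normally parsed
# from the Galaxy job conf rather than hard-coded.
DESTINATIONS = [
    ("amqp://user:pass@mq.example.org//", "metrics_pulsar_example"),
]
STALE_AFTER = 600  # seconds without a fresh message -> destination offline


def drain_queue(dest):
    """Fetch and acknowledge messages from one queue, keeping the newest."""
    amqp_url, queue_name = dest
    latest = None
    with Connection(amqp_url) as conn:
        queue = conn.SimpleQueue(queue_name)
        while True:
            try:
                msg = queue.get(block=False)
            except queue.Empty:
                break
            latest = msg.payload
            msg.ack()
        queue.close()
    return queue_name, latest


def to_line_protocol(queue_name, payload):
    """Emit one InfluxDB line-protocol record, tagging online/offline."""
    now = time.time()
    if payload is None or now - payload.get("timestamp", 0) > STALE_AFTER:
        return f"destination_status,queue={queue_name} online=0i"
    m = payload["metrics"]
    return (
        f"destination_status,queue={queue_name} "
        f"online=1i,cpus={m['cpus']}i,memory_mb={m['memory_mb']}i"
    )


if __name__ == "__main__":
    # One worker per destination queue, as described in point 4 above.
    with Pool(processes=len(DESTINATIONS)) as pool:
        for queue_name, payload in pool.map(drain_queue, DESTINATIONS):
            print(to_line_protocol(queue_name, payload))
```

Printing an explicit `online=0i` record for stale or empty queues keeps the offline state visible in InfluxDB, so the TPV API can drop that destination from its candidate list.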