Commit

Update Prometheus metrics in doc (#1880)
* Removing dead code

* Update monitoring guide
johnugeorge committed Aug 5, 2023
1 parent 3e08340 commit 855e096
Showing 1 changed file with 6 additions and 76 deletions.
82 changes: 6 additions & 76 deletions docs/monitoring/README.md
@@ -1,106 +1,36 @@
# Prometheus Monitoring for TFJob
# Prometheus Monitoring for Training Job

## Available Metrics

Currently available metrics to monitor are listed below.

### Metrics for Each Component Container for TFJob

Component Containers:

- tf-operator
- tf-chief
- tf-ps
- tf-worker

#### Each Container Reports on its:

Use the Prometheus graph view to run the following example queries and visualize the metrics.

_Note_: These metrics are derived from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. You can see the complete list of available metrics on the `/metrics` page of your Prometheus web UI and use them to compose your own queries.
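As an alternative to typing queries into the graph view, the same expressions can be sent to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URL. The helper name `instant_query_url` and the base URL `http://localhost:9090` are illustrative assumptions, not part of this repository.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint.

    base_url is an assumption: point it at your own Prometheus server.
    """
    # urlencode percent-encodes braces, quotes and spaces in the expression.
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Hypothetical local Prometheus address, used only for illustration.
    print(instant_query_url("http://localhost:9090", "up"))
```

Fetching the resulting URL with any HTTP client should return a JSON document, provided the server is reachable at that address.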

**CPU usage**

```
sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**GPU Usage**

```
sum (rate (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Memory Usage**

```
sum (rate (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Network Usage**

```
sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**I/O Usage**

```
sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Keep-Alive check**

```
up
```

Prometheus maintains the `up` metric on its own; see the [Prometheus documentation](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series) for details.

**Is Leader check**

```
tf_operator_is_leader
```

_Note_: In the example queries above, replace `tfjob-name` with the name of the TFJob you want to monitor.
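The substitution described in this note can be scripted. The following is a minimal sketch that fills a job name into the per-pod `rate` query pattern used above. The function name `pod_rate_query` and the job name `mnist` are hypothetical, and the name check assumes the job name is a plain lowercase DNS label (letters, digits, hyphens).

```python
import re

# Assumption: job names are lowercase DNS labels, so they contain no
# characters that would need escaping inside the PromQL regex matcher.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def pod_rate_query(metric: str, job_name: str, window: str = "1m") -> str:
    """Return the per-pod rate query for one counter metric and one job."""
    if not _DNS_LABEL.fullmatch(job_name):
        raise ValueError(f"unexpected job name: {job_name!r}")
    # Doubled braces produce literal { and } in the f-string output.
    return (
        f'sum (rate ({metric}{{pod_name=~"{job_name}-.*"}}[{window}])) '
        "by (pod_name)"
    )


if __name__ == "__main__":
    # "mnist" is a placeholder job name for illustration.
    print(pod_rate_query("container_cpu_usage_seconds_total", "mnist"))
```

The sketch is intended for the counter metrics in the examples (CPU, network, I/O). Rejecting unexpected names rather than escaping them keeps the generated matcher identical in shape to the hand-written queries.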

### Report TFJob metrics:

_Note_: If you are using the v1 release of tf-operator, these TFJob metrics do not have the `_total` suffix, so use metric names such as `tf_operator_jobs_created`. See [PR #1055](https://github.com/kubeflow/training-operator/pull/1055) for more information.

**Job Creation**

```
tf_operator_jobs_created_total
```

**Job Creation**

```
sum (rate (tf_operator_jobs_created_total[60m]))
training_operator_jobs_created_total
```

**Job Deletion**

```
tf_operator_jobs_deleted_total
training_operator_jobs_deleted_total
```

**Successful Job Completions**

```
tf_operator_jobs_successful_total
training_operator_jobs_successful_total
```

**Failed Jobs**

```
tf_operator_jobs_failed_total
training_operator_jobs_failed_total
```

**Restarted Jobs**

```
tf_operator_jobs_restarted_total
training_operator_jobs_restarted_total
```
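
The job counters above can be combined into derived queries. As a hedged sketch, the helper below builds a failure-ratio expression from two of the counters, reusing the `sum (rate (...[60m]))` form from the job-creation rate query earlier in this diff. The function name `ratio_query` and the choice of a failed-to-created ratio are illustrative assumptions, not something the guide defines.

```python
def ratio_query(numerator: str, denominator: str, window: str = "60m") -> str:
    """Build a PromQL expression dividing the rates of two counters."""
    return (
        f"sum (rate ({numerator}[{window}])) / "
        f"sum (rate ({denominator}[{window}]))"
    )


if __name__ == "__main__":
    # Metric names are the ones listed in this guide.
    print(
        ratio_query(
            "training_operator_jobs_failed_total",
            "training_operator_jobs_created_total",
        )
    )
```

If both rates are zero over the chosen window, the division evaluates to NaN in PromQL, so a dashboard built on this expression may want a wider window.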
