Add anomaly detection support to TensorboardLogger #854

Closed
wants to merge 4 commits into from

Conversation

diego-urgell
Contributor

Summary:

This Stack

Based on this RFC, we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

This Diff

To start leveraging the AnomalyLogger as easily as possible, let's make it the base class for the Tensorboard logger instead of MetricLogger. This will have no effect unless users specify the tracked_metrics attribute, which is optional. However, if they do want to use it, they need to make very few changes.

The next diff will do the same for the AIXLogger.
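
To illustrate the base-class change described above, here is a minimal sketch. It does not use the real torchtnt classes: `AnomalyLoggerSketch`, `TensorboardLoggerSketch`, the `log` signature, and the dictionary shape of `tracked_metrics` are all assumptions made for illustration, and the stand-in logger records scalars in a list instead of writing TensorBoard event files.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: a check returns True when a value looks anomalous.
Check = Callable[[float], bool]


class AnomalyLoggerSketch:
    """Stand-in base class (assumption): runs checks only for tracked metrics."""

    def __init__(self, tracked_metrics: Optional[Dict[str, List[Check]]] = None) -> None:
        # With no tracked metrics, log() never flags anything.
        self.tracked_metrics = tracked_metrics or {}
        self.warnings: List[str] = []

    def log(self, name: str, data: float, step: int) -> None:
        if any(check(data) for check in self.tracked_metrics.get(name, [])):
            self.warnings.append(f"anomaly in {name} at step {step}")


class TensorboardLoggerSketch(AnomalyLoggerSketch):
    """Stand-in for the TensorBoard logger; keeps scalars in memory."""

    def __init__(self, path: str, tracked_metrics: Optional[Dict[str, List[Check]]] = None) -> None:
        super().__init__(tracked_metrics)
        self.path = path
        self.scalars: List[Tuple[str, float, int]] = []

    def log(self, name: str, data: float, step: int) -> None:
        super().log(name, data, step)  # anomaly check first
        self.scalars.append((name, data, step))  # then the usual write


# Existing call sites that omit tracked_metrics behave as before.
plain = TensorboardLoggerSketch("/tmp/run")
plain.log("loss", 1e9, step=1)

# Opting in only requires passing the extra argument.
tracked = TensorboardLoggerSketch("/tmp/run", {"loss": [lambda v: v > 100.0]})
tracked.log("loss", 1e9, step=1)
```

In this sketch, `plain` records the scalar without any warning, while `tracked` records the same scalar and also notes one anomaly.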

Reviewed By: JKSenthil

Differential Revision: D58593222

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58593222

Diego Urgell added 3 commits June 25, 2024 15:33
Summary:
### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

To provide flexibility when detecting anomalous metric values, instead of assuming and hardcoding a predefined check (like a threshold), let's create an interface that can be overridden to implement custom checks.
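
As a rough sketch of what such an overridable interface could look like, assuming a single `is_anomaly` method (the class names, the method name, and its signature are hypothetical and may differ from what this diff actually adds):

```python
from abc import ABC, abstractmethod


class AnomalyEvaluator(ABC):
    """Hypothetical interface: subclasses decide what counts as anomalous."""

    @abstractmethod
    def is_anomaly(self, value: float) -> bool:
        """Return True if `value` should be flagged."""


class NegativeValue(AnomalyEvaluator):
    """Example custom check (illustrative only): flag negative values."""

    def is_anomaly(self, value: float) -> bool:
        return value < 0


check = NegativeValue()
flagged = check.is_anomaly(-1.0)
not_flagged = check.is_anomaly(0.5)
```

Because the base class is abstract, a custom check only has to implement one method, and the logger does not need to know which check it is running.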

Differential Revision: D58564201
Summary:
### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

To get started with anomaly detection, let's first define two evaluators:
- Threshold is the most intuitive one, and checks that a metric value is within a predefined range.
- IsNaN would be useful to quickly catch cases where the loss is NaN because of bad inputs.

Later on we can implement more interesting evaluators like outliers, changepoint detection, etc. if needed.
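
The two evaluators listed above could be sketched as follows. This is an illustration under assumptions, not the code in this diff: the constructor arguments `lower` and `upper` and the `is_anomaly` method name are hypothetical.

```python
import math
from typing import Optional


class Threshold:
    """Flags values outside [lower, upper]; either bound may be omitted."""

    def __init__(self, lower: Optional[float] = None, upper: Optional[float] = None) -> None:
        if lower is None and upper is None:
            raise ValueError("Provide at least one of lower or upper")
        self.lower = lower
        self.upper = upper

    def is_anomaly(self, value: float) -> bool:
        below = self.lower is not None and value < self.lower
        above = self.upper is not None and value > self.upper
        return below or above


class IsNaN:
    """Flags NaN values, for example a loss that became NaN after bad inputs."""

    def is_anomaly(self, value: float) -> bool:
        return math.isnan(value)


in_range = Threshold(lower=0.0, upper=1.0)
nan = float("nan")
```

Note that every comparison with NaN evaluates to False, so in this sketch `Threshold` alone would not flag a NaN value. That is one reason to keep a separate NaN check.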

Differential Revision: D58564199
Summary:
### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

After implementing the evaluators, let's add the `AnomalyLogger` class that receives some configuration of metrics to check for. If an anomaly is detected, then it will call an optional `on_anomaly_detected` method that can be overridden by the user.
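
A minimal sketch of that flow, assuming the configuration is a mapping from metric name to a list of checks and that `on_anomaly_detected` receives the metric name, value, and step (these shapes and the class names are assumptions, not the actual signatures in this diff):

```python
import logging
import math
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

# Hypothetical type: a check returns True when a value looks anomalous.
Check = Callable[[float], bool]


class AnomalyLoggerSketch:
    def __init__(self, tracked_metrics: Optional[Dict[str, List[Check]]] = None) -> None:
        self.tracked_metrics = tracked_metrics or {}

    def log(self, name: str, data: float, step: int) -> None:
        # Metrics without a configuration entry are never checked.
        for check in self.tracked_metrics.get(name, []):
            if check(data):
                logger.warning("Anomaly in %s at step %d: %s", name, step, data)
                self.on_anomaly_detected(name, data, step)
                break  # report each logged value at most once

    def on_anomaly_detected(self, name: str, data: float, step: int) -> None:
        """No-op by default; override to add side effects."""


class RecordingLogger(AnomalyLoggerSketch):
    """Example override: remember where anomalies happened."""

    def __init__(self, tracked_metrics: Optional[Dict[str, List[Check]]] = None) -> None:
        super().__init__(tracked_metrics)
        self.anomalies: List[Tuple[str, int]] = []

    def on_anomaly_detected(self, name: str, data: float, step: int) -> None:
        self.anomalies.append((name, step))


rec = RecordingLogger({"loss": [math.isnan]})
rec.log("loss", 0.3, step=1)
rec.log("loss", float("nan"), step=2)
rec.log("accuracy", float("nan"), step=2)  # not tracked, so not checked
```

Here only the NaN loss at step 2 triggers the callback. A real override could stop training or send an alert instead of recording the event.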

Next diffs will add this to our `AIXLogger` and `TensorboardLogger` as a base class.

Differential Revision: D58564200

diego-urgell added a commit to diego-urgell/tnt that referenced this pull request Jun 25, 2024
Summary:
Pull Request resolved: pytorch#854

### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

To start leveraging the AnomalyLogger as easily as possible, let's make it the base class for the Tensorboard logger instead of MetricLogger. This will have no effect unless users specify the `tracked_metrics` attribute, which is optional. However, if they do want to use it, they need to make very few changes.

The next diff will do the same for the AIXLogger.

Reviewed By: JKSenthil

Differential Revision: D58593222
2 participants