
Add logger for anomaly detection #852

Closed
wants to merge 3 commits

Conversation

diego-urgell
Contributor

Summary:

This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics and optionally executes a callback function with potential side effects. This helps users realize sooner that something has gone wrong during training.

This Diff

After implementing the evaluators, let's add the `AnomalyLogger` class, which receives a configuration of metrics to check. If an anomaly is detected, it calls an optional `on_anomaly_detected` method that can be overridden by the user.

Next diffs will add this to our `AIXLogger` and `TensorboardLogger` as a base class.
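The mechanism described above can be sketched roughly as follows. The class and hook names mirror this description, but the configuration shape and signatures are illustrative assumptions, not the actual torchtnt API:

```python
import math
import warnings
from typing import Callable, Dict, List, Optional


class AnomalyLogger:
    """Sketch of a logger that checks logged metric values against configured checks.

    The ``checks`` mapping (metric name -> list of predicates) is an assumed
    configuration shape, for illustration only.
    """

    def __init__(
        self, checks: Optional[Dict[str, List[Callable[[float], bool]]]] = None
    ) -> None:
        self._checks = checks or {}

    def log(self, name: str, value: float, step: int) -> None:
        # Run every check registered for this metric; warn and fire the hook on a hit.
        for check in self._checks.get(name, []):
            if check(value):
                warnings.warn(f"Anomaly in metric '{name}' at step {step}: {value}")
                self.on_anomaly_detected(name, value, step)

    def on_anomaly_detected(self, name: str, value: float, step: int) -> None:
        """Optional hook: override to trigger side effects (alerts, early stop, ...)."""


# A user subclasses the logger and overrides the hook:
class AlertingLogger(AnomalyLogger):
    def __init__(self, checks=None):
        super().__init__(checks)
        self.alerts = []

    def on_anomaly_detected(self, name, value, step):
        self.alerts.append((name, step))


logger = AlertingLogger(checks={"loss": [math.isnan]})
logger.log("loss", 0.7, step=1)           # within expectations, no side effect
logger.log("loss", float("nan"), step=2)  # triggers on_anomaly_detected
```

The design point is that the base `log` stays side-effect-free by default: the empty `on_anomaly_detected` means existing loggers gain anomaly warnings without changing behavior unless the user opts in by overriding the hook.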

Reviewed By: JKSenthil

Differential Revision: D58564200

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58564200

Diego Urgell added 2 commits June 25, 2024 15:33
Summary:
### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

To provide flexibility when detecting anomalous metric values, instead of assuming and hardcoding a predefined check (like a threshold), let's create an interface that can be overridden to implement custom checks.
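Such an interface might look like the following minimal sketch (the class and method names here are assumptions for illustration, not necessarily the names used in the diff):

```python
from abc import ABC, abstractmethod


class MetricAnomalyEvaluator(ABC):
    """Sketch of an evaluator interface: subclasses decide what counts as anomalous."""

    @abstractmethod
    def is_anomaly(self, value: float) -> bool:
        """Return True if ``value`` should be flagged as anomalous."""


# A custom check is just another subclass, e.g. flagging negative values:
class IsNegative(MetricAnomalyEvaluator):
    def is_anomaly(self, value: float) -> bool:
        return value < 0
```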

Differential Revision: D58564201

Summary:
### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

To get started with anomaly detection, let's first define two evaluators:
- Threshold is the most intuitive one, and checks that a metric value is within a predefined range.
- IsNaN would be useful to quickly catch cases where the loss becomes NaN because of bad inputs.

Later on, we can implement more advanced evaluators (outliers, changepoint detection, etc.) if needed.
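The two evaluators in the bullets above could be sketched like this (names follow the bullets, but constructor parameters and signatures are illustrative assumptions):

```python
import math


class Threshold:
    """Flags values outside a predefined [warn_low, warn_high] range.

    The bound names are illustrative; defaults make each bound optional.
    """

    def __init__(
        self,
        warn_low: float = float("-inf"),
        warn_high: float = float("inf"),
    ) -> None:
        self.warn_low = warn_low
        self.warn_high = warn_high

    def is_anomaly(self, value: float) -> bool:
        # Note: NaN fails both comparisons, so it is also flagged here.
        return not (self.warn_low <= value <= self.warn_high)


class IsNaN:
    """Flags NaN metric values, e.g. a loss corrupted by bad inputs."""

    def is_anomaly(self, value: float) -> bool:
        return math.isnan(value)
```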

Differential Revision: D58564199

diego-urgell added a commit to diego-urgell/tnt that referenced this pull request Jun 25, 2024
Summary:
Pull Request resolved: pytorch#852

### This Stack

Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.

### This Diff

After implementing the evaluators, let's add the `AnomalyLogger` class, which receives a configuration of metrics to check. If an anomaly is detected, it calls an optional `on_anomaly_detected` method that can be overridden by the user.

Next diffs will add this to our `AIXLogger` and `TensorboardLogger` as a base class.

Reviewed By: JKSenthil

Differential Revision: D58564200
