Add anomaly detection support to TensorboardLogger #854

Closed
wants to merge 4 commits into from

Commits on Jun 25, 2024

  1. Add base AnomalyEvaluator class

    Summary:
    ### This Stack
    
    Based on [this RFC](https://docs.google.com/document/d/1K1KQ886dynMRejR0ySH1fctOjS7gxaCS8AB1L_PHxU4/edit?usp=sharing), we are adding a new logger that warns about anomalous values in metrics, and optionally executes a callback function with potential side effects. This could be useful for users to realize sooner that something has gone wrong during training.
    
    ### This Diff
    
    To provide flexibility when detecting anomalous metric values, instead of hardcoding a predefined check (like a threshold), let's create an interface that can be overridden to implement custom checks.
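A minimal sketch of what such an overridable interface could look like. The method names `update` and `is_anomaly`, and the `NegativeValue` subclass, are assumptions made for illustration and are not taken from the diff itself.

```python
from abc import ABC, abstractmethod


class AnomalyEvaluator(ABC):
    """Hypothetical interface: subclasses decide what counts as anomalous."""

    @abstractmethod
    def update(self, value: float) -> None:
        """Record the latest metric value."""

    @abstractmethod
    def is_anomaly(self) -> bool:
        """Return True if the most recently recorded value looks anomalous."""


class NegativeValue(AnomalyEvaluator):
    """Illustrative custom check: flag any negative value."""

    def __init__(self) -> None:
        self._last = 0.0

    def update(self, value: float) -> None:
        self._last = value

    def is_anomaly(self) -> bool:
        return self._last < 0


evaluator = NegativeValue()
evaluator.update(-1.5)
print(evaluator.is_anomaly())  # True
```

Splitting `update` from `is_anomaly` leaves room for stateful checks (for example, ones that look at a window of past values), which a single stateless predicate would not allow.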
    
    Differential Revision: D58564201
    Diego Urgell authored and facebook-github-bot committed Jun 25, 2024
    2ef31b8
  2. Implement starter anomaly evaluators

    Summary:
    ### This Diff
    
    To get started with anomaly detection, let's first define two evaluators:
    - Threshold is the most intuitive one, and checks that a metric value is within a predefined range.
    - IsNaN quickly catches cases where the loss becomes NaN because of bad inputs.
    
    Later on we can implement more interesting evaluators, such as outlier or changepoint detection, if needed.
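The two starter evaluators described above could be sketched roughly as follows. The constructor parameters (`lower`, `upper`) and the `update`/`is_anomaly` method names are assumptions for illustration; the actual classes in the diff may differ.

```python
import math
from typing import Optional


class Threshold:
    """Flags values outside [lower, upper]; either bound may be omitted."""

    def __init__(
        self, lower: Optional[float] = None, upper: Optional[float] = None
    ) -> None:
        if lower is None and upper is None:
            raise ValueError("Provide at least one of lower or upper.")
        self.lower = lower
        self.upper = upper
        self._last: Optional[float] = None

    def update(self, value: float) -> None:
        self._last = value

    def is_anomaly(self) -> bool:
        if self._last is None:
            return False  # nothing observed yet
        below = self.lower is not None and self._last < self.lower
        above = self.upper is not None and self._last > self.upper
        return below or above


class IsNaN:
    """Flags a value that is NaN."""

    def __init__(self) -> None:
        self._last: Optional[float] = None

    def update(self, value: float) -> None:
        self._last = value

    def is_anomaly(self) -> bool:
        return self._last is not None and math.isnan(self._last)


loss_range = Threshold(lower=0.0, upper=10.0)
nan_check = IsNaN()
for value in (0.7, 12.0, float("nan")):
    loss_range.update(value)
    nan_check.update(value)
    print(value, loss_range.is_anomaly(), nan_check.is_anomaly())
```

Comparisons against NaN always evaluate to false, so in this sketch the range check does not flag a NaN value. That is one reason a separate NaN check is useful alongside a threshold.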
    
    Differential Revision: D58564199
    Diego Urgell authored and facebook-github-bot committed Jun 25, 2024
    c382602
  3. Implement starter anomaly evaluators

    Summary:
    ### This Diff
    
    After implementing the evaluators, let's add the `AnomalyLogger` class, which receives a configuration of the metrics to check. If an anomaly is detected, it calls an optional `on_anomaly_detected` method that can be overridden by the user.
    
    Next diffs will add this to our `AIXLogger` and `TensorboardLogger` as a base class.
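As a rough sketch of the flow described above: a logger holds a per-metric configuration, runs the evaluators on each logged value, warns, and then calls an overridable hook. The `TrackedMetric` container, the `log` signature, and the `RecordingLogger` subclass are hypothetical and only illustrate the idea; only `AnomalyLogger` and `on_anomaly_detected` are named in the diff.

```python
import logging
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


class Evaluator(Protocol):
    def update(self, value: float) -> None: ...
    def is_anomaly(self) -> bool: ...


class IsNaN:
    """Small stand-in evaluator so the sketch is self-contained."""

    def __init__(self) -> None:
        self._last: Optional[float] = None

    def update(self, value: float) -> None:
        self._last = value

    def is_anomaly(self) -> bool:
        return self._last is not None and math.isnan(self._last)


@dataclass
class TrackedMetric:
    """Hypothetical config entry: a metric name and the checks to run on it."""
    name: str
    evaluators: List[Evaluator]


class AnomalyLogger:
    def __init__(self, tracked_metrics: Optional[List[TrackedMetric]] = None) -> None:
        self._tracked: Dict[str, TrackedMetric] = {
            m.name: m for m in (tracked_metrics or [])
        }

    def log(self, name: str, value: float, step: int) -> None:
        metric = self._tracked.get(name)
        if metric is None:
            return  # untracked metrics are never evaluated
        for evaluator in metric.evaluators:
            evaluator.update(value)
        if any(e.is_anomaly() for e in metric.evaluators):
            logging.getLogger(__name__).warning(
                "Anomalous value %s for metric %s at step %d", value, name, step
            )
            self.on_anomaly_detected(name, value, step)

    def on_anomaly_detected(self, name: str, value: float, step: int) -> None:
        """No-op by default; override to react to an anomaly."""


class RecordingLogger(AnomalyLogger):
    """Example override that records which metric was flagged and when."""

    def __init__(self, tracked_metrics: Optional[List[TrackedMetric]] = None) -> None:
        super().__init__(tracked_metrics)
        self.anomalies: List[Tuple[str, int]] = []

    def on_anomaly_detected(self, name: str, value: float, step: int) -> None:
        self.anomalies.append((name, step))


logger = RecordingLogger([TrackedMetric("loss", [IsNaN()])])
logger.log("loss", 0.3, step=1)
logger.log("loss", float("nan"), step=2)
logger.log("accuracy", float("nan"), step=2)  # not tracked, so ignored
print(logger.anomalies)  # [('loss', 2)]
```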
    
    Differential Revision: D58564200
    Diego Urgell authored and facebook-github-bot committed Jun 25, 2024
    1117930

Commits on Jun 26, 2024

  1. Add anomaly detection support to TensorboardLogger (pytorch#854)

    Summary:
    Pull Request resolved: pytorch#854
    
    ### This Diff
    
    To start leveraging the AnomalyLogger as easily as possible, let's make it the base class for the Tensorboard logger instead of MetricLogger. This has no effect unless users specify the optional `tracked_metrics` attribute. If they do want to use it, they only have to make very few changes.
    
    The next diff will do the same for the AIXLogger.
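The base-class swap can be illustrated with a small stand-in. Everything here is a sketch under assumptions: `InMemoryBoardLogger` stores values in a list instead of writing to TensorBoard, the checks are reduced to plain callables to keep the example short, and the `_check` helper is hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Check = Callable[[float], bool]


class MetricLogger:
    """Stand-in for the original base interface."""

    def log(self, name: str, value: float, step: int) -> None:
        raise NotImplementedError


class AnomalyLogger(MetricLogger):
    """Adds optional anomaly checks; with no tracked metrics it does nothing extra."""

    def __init__(self, tracked_metrics: Optional[Dict[str, Check]] = None) -> None:
        self._tracked: Dict[str, Check] = tracked_metrics or {}
        self.flagged: List[Tuple[str, int]] = []

    def _check(self, name: str, value: float, step: int) -> None:
        check = self._tracked.get(name)
        if check is not None and check(value):
            self.flagged.append((name, step))


class InMemoryBoardLogger(AnomalyLogger):  # previously would extend MetricLogger
    def __init__(self, tracked_metrics: Optional[Dict[str, Check]] = None) -> None:
        super().__init__(tracked_metrics)
        self.written: List[Tuple[str, float, int]] = []

    def log(self, name: str, value: float, step: int) -> None:
        self._check(name, value, step)  # no-op when nothing is tracked
        self.written.append((name, value, step))  # normal logging path


plain = InMemoryBoardLogger()
plain.log("loss", float("nan"), step=1)

guarded = InMemoryBoardLogger(tracked_metrics={"loss": math.isnan})
guarded.log("loss", float("nan"), step=1)

print(len(plain.written), plain.flagged)      # 1 []
print(len(guarded.written), guarded.flagged)  # 1 [('loss', 1)]
```

In this sketch, existing callers that never pass `tracked_metrics` see the same logging behavior as before, while opting in only requires one extra constructor argument.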
    
    Reviewed By: JKSenthil
    
    Differential Revision: D58593222
    diego-urgell authored and facebook-github-bot committed Jun 26, 2024
    30c7b56