
[Bug] AffineAutoregressive transform leads to exploding gradients #85

Closed
francois-rozet opened this issue Dec 26, 2021 · 3 comments

Comments

@francois-rozet

Issue Description

In the Affine bijector, the scale parameter is obtained by clamping the output of the parameter network. According to some of my experiments, this results in very unstable behavior and exploding gradients, especially in low-entropy settings. I believe this is due to the discontinuities that the clamp operation introduces in the gradients.

Instead of clamping, the nflows package applies softplus to the network's output, which also bounds the scale from below while keeping the gradients smooth. In my experiments with Pyro, softplus works better than clamping and, importantly, does not suffer from exploding gradients. I would suggest replacing clamping with softplus.
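
For illustration, here is a minimal sketch of the two positive maps being compared (not the actual Pyro/flowtorch code; the clamp bounds and the epsilon are placeholder values):

import torch
import torch.nn.functional as F

raw = torch.linspace(-10.0, 10.0, 5, requires_grad=True)

# clamp-based scale: the gradient is exactly zero wherever the clamp is active
scale_clamp = torch.exp(raw.clamp(min=-5.0, max=3.0))

# softplus-based scale: strictly positive with smooth, non-zero gradients;
# the small constant keeps the scale bounded away from zero
scale_softplus = F.softplus(raw) + 1e-3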

Expected Behavior

Avoid exploding gradients. I have implemented the replacement of clamping with softplus for FlowTorch (https://github.com/francois-rozet/flowtorch/commit/9bf41e5b67a8993aa6173d6341f9d99ae5e7178b), but haven't had the time to test it properly.

Additional Context

This issue is a replica of pyro-ppl/pyro#2998

Merry Christmas 🎄

@vmoens
Contributor

vmoens commented Feb 3, 2022

Hi @francois-rozet thanks for raising this.

I agree, we should have a softplus non-linearity. I usually use f = softplus(x + bias) where bias=0.54... such that f(torch.zeros(1)) = 1.0 (otherwise the layer will 'shrink' the input).
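
To make that constant concrete, here is a quick numerical check (an illustration, not library code; the bias is just the inverse softplus of 1, which is what makes f(0) = 1):

import math
import torch
import torch.nn.functional as F

# bias = softplus^{-1}(1) = log(e - 1) ≈ 0.5413
bias = math.log(math.expm1(1.0))

def f(x):
    return F.softplus(x + bias)

print(f(torch.zeros(1)))  # tensor([1.0000]) -> an untrained layer leaves the scale at 1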

@stefanwebb what about letting the user choose which non-linearity is used for the positive mapping of the parameters? Something like

layer = AffineLayer(positive_map='softplus')

I think this will be useful for actnorm, batchnorm, etc., and for deep architectures (e.g. Glow, iResNet).
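
As a rough sketch of the idea (the positive_map argument and the mapping names here are hypothetical, not the current flowtorch API), the dispatch could be as simple as:

import math
import torch
import torch.nn.functional as F

_POSITIVE_MAPS = {
    # bias chosen so that softplus(0 + bias) == 1, i.e. an untrained layer
    # starts out with a scale of (approximately) 1
    "softplus": lambda x: F.softplus(x + math.log(math.expm1(1.0))),
    "exp": torch.exp,
    "sigmoid": torch.sigmoid,
}

def get_positive_map(name: str):
    try:
        return _POSITIVE_MAPS[name]
    except KeyError:
        raise ValueError(f"unknown positive map: {name!r}") from None

so that something like AffineLayer(positive_map='softplus') would just look up the callable once at construction time.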

@vmoens
Contributor

vmoens commented Feb 9, 2022

Side note: the clamp_preserve_gradients helper could be simplified to

def clamp_preserve_gradients(x: torch.Tensor, min: float, max: float) -> torch.Tensor:
    """
    Clamps the values of `x` to [min, max] while letting gradients pass through
    unchanged in the clamped regions: the clamp is applied to `x.data`, so
    autograd never sees it.
    """
    x.data.clamp_(min, max)
    return x

where all modifications are done in place.
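
A quick check of that behavior (assuming the in-place .data clamp above): the forward value is clamped, but the gradient of the surrounding graph is untouched because autograd never records the clamp.

import torch

x = torch.tensor(5.0, requires_grad=True)
y = clamp_preserve_gradients(x * 1.0, min=-1.0, max=1.0)
y.backward()
print(y)       # tensor(1., grad_fn=<MulBackward0>) -> value clamped to max
print(x.grad)  # tensor(1.)                         -> gradient passes through unchanged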

@stefanwebb
Contributor

@vmoens won't f = softplus(x + bias) already have a bias term added to x, since x is the output of a feedforward network? I've removed this feature to simplify the logic and broken it out into a separate PR: #109

facebook-github-bot pushed a commit that referenced this issue May 9, 2022
Summary:
### Motivation
As pointed out in #85, it may be preferable to use `softplus` rather than `exp` to calculate the scale parameter of the affine map in `bij.ops.Affine`.

### Changes proposed
Another PR, #92 by vmoens, implements `softplus`, `sigmoid`, and `exp` options for the scale parameter. I have factored that out and simplified some of the design in order to make #92 easier to review. `softplus` is now the default option for `Affine`.

Pull Request resolved: #109

Test Plan: `pytest tests/`

Reviewed By: vmoens

Differential Revision: D36169529

Pulled By: stefanwebb

fbshipit-source-id: 625387e10399291a5a404c28f4ada743d0945649