
Large difference of inference result between forward and step #135

Open
billshoo opened this issue Feb 21, 2024 · 0 comments
billshoo commented Feb 21, 2024

Albert -

Thank you for the wonderful S4 model you've invented and kept improving.

I am getting a very large difference in inference results between forward() and step() for models trained with parameters like these:

Depth: 20-30
kernel_size: 400-800
mode_init: 'diag-inv'
discretization: 'zoh'
ar_transform: 'softplus'
dt_transform: 'relu'

My test sequences are time series with tens of millions of time steps, and I just keep running step() on them, one step at a time. Its predictions deviate from forward() more and more, and eventually lose predictiveness altogether. By contrast, forward() maintains its predictiveness with a fixed receptive field of roughly 25 (depth) x 500 (kernel size).
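For concreteness, below is a minimal self-contained sketch of the kind of comparison I am running, written against a toy single-layer diagonal SSM with ZOH discretization rather than the S4 codebase itself (all names and values are illustrative). Up to the kernel length the two modes should agree to floating-point error, so any large gap at this level would point to a bug rather than to truncation:

```python
import torch

# Toy single-layer diagonal SSM with ZOH discretization (a stand-in for
# the S4 layer, not the S4 code itself).
torch.manual_seed(0)
N, L = 64, 500                       # state size, kernel/sequence length
dt = 0.01                            # illustrative step size
A = -0.5 + 1j * torch.randn(N)       # Re(A) = -1/2, as in the diag-inv init
B_in = torch.randn(N, dtype=torch.cfloat)
C = torch.randn(N, dtype=torch.cfloat)

A_bar = torch.exp(dt * A)            # ZOH: A_bar = exp(dt * A)
B_bar = (A_bar - 1) / A * B_in       # ZOH input matrix for diagonal A

u = torch.randn(L)

# forward()-style: causal convolution with a kernel truncated at length L.
k = torch.stack([(C * A_bar**t * B_bar).sum().real for t in range(L)])
y_forward = torch.stack(
    [(k[: t + 1].flip(0) * u[: t + 1]).sum() for t in range(L)]
)

# step()-style: exact recurrence x_t = A_bar x_{t-1} + B_bar u_t, y_t = Re(C x_t).
x = torch.zeros(N, dtype=torch.cfloat)
ys = []
for t in range(L):
    x = A_bar * x + B_bar * u[t]
    ys.append((C * x).sum().real)
y_step = torch.stack(ys)

print((y_forward - y_step).abs().max())  # ~0 up to floating-point noise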

I wonder:

  1. Are there any diagnostics to make sure I don't have any bugs?

For example, for a model with kernel size 500, I've tried to verify that the 500th output of step() matches the forward() output. My mental model is that step() effectively has a receptive field that grows without bound, while forward() has a fixed kernel that cuts off at 500. This method only seems to work for depth = 1: at depth 2, a layer's step() already receives inputs from the previous layer's step(), whose receptive field varies with position in the sequence, whereas with forward() every layer's receptive field is fixed at 500.

  2. If the difference turns out to be real, is there anything I can do to promote forward-step agreement? step() inference has a huge performance advantage for my use case. Since the real part of the diag-inv matrix is initialized at -1/2, and a softplus constraint is imposed on the real part of the diagonal of A during training, I don't see how it can go unstable in autoregressive generation.
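One possible source of the gap that I can think of (this is my own guess, not something verified): even with Re(A) < 0 enforced, modes whose trained dt is small decay very slowly. With ZOH, A_bar = exp(dt * A), so the kernel magnitude at the truncation length L is |A_bar|^L = exp(L * dt * Re(A)), which can still be far from zero at L = 500. In that case forward()'s truncated kernel and step()'s untruncated recurrence must disagree at long range even though the recurrence itself is perfectly stable. A toy check (the helper name and the dt value are illustrative):

```python
import torch

def kernel_tail_magnitude(A_real: torch.Tensor, dt: torch.Tensor, L: int) -> torch.Tensor:
    # Per-mode magnitude of the ZOH-discretized kernel at position L:
    # |exp(dt * A)|**L = exp(L * dt * Re(A)). Values near 1 mean the
    # truncated kernel drops a tail that the recurrence keeps integrating.
    return torch.exp(L * dt * A_real)

A_real = torch.full((4,), -0.5)   # Re(A) = -1/2, as at initialization
dt = torch.full((4,), 1e-3)       # illustrative; trained dt varies per mode
print(kernel_tail_magnitude(A_real, dt, L=500))
```

With dt = 1e-3 and Re(A) = -1/2, the kernel at position 500 still has magnitude exp(-0.25) ≈ 0.78 per mode, i.e. forward() would be discarding a substantial tail that step() keeps.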

Best Rgds,
Bill
