
RNN update to drop CUDNN, fix LSTM bug and output type stability #1367

Merged — 23 commits merged into FluxML:master on Nov 7, 2020

Conversation

jeremiedb
Contributor

@jeremiedb jeremiedb commented Oct 21, 2020

PR related to #1114 #1360 #1365

Some experiments with RNN handling.

The hidden state of each cell structure was dropped as it wasn't needed (AFAIK, it was only needed for size inference with CUDNN, and the bias size could serve as a substitute for the cells' h there as well).

Dependence on CUDNN was dropped entirely, so this is pure Flux/CUDA.jl; the file src/cuda/curnn.jl is no longer used. No modifications were made to the cell computations. Initial tests show decent performance, but this has yet to be benchmarked.

Pending issue: despite the CUDNN dependency having been dropped completely, there still seems to be an instability issue when running on GPU. This is illustrated in the test at lines 1-50 of test/rnn-test-jdb.jl. On CPU, that test runs fine through the 100 iterations. On GPU, however, the same test throws NaNs after a couple dozen iterations.
My only hypothesis so far: when iterating over the sequence through m.(x) or map(rnn, x), is the order of execution safe? I.e., is it possible that there isn't a sync() on the CUDA side between those sequence steps, which may mess up the state?

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@jeremiedb
Contributor Author

jeremiedb commented Oct 21, 2020

@maleadt

There's a non-deterministic failure that hits computations on the GPU when running an RNN. The CPU and GPU models are identical through a number of iterations, then the GPU one breaks down:

using Revise
using Flux
using Statistics: mean

feat = 32
h_size = 64
seq_len = 10
batch_size = 32

rnn = Chain(RNN(feat, h_size),
  Dense(h_size, 1, σ),
  x -> reshape(x,:))

X = [rand(feat, batch_size) for i in 1:seq_len]
Y = rand(batch_size, seq_len) ./ 10

######################################
rnn_gpu = rnn |> gpu
X_gpu = gpu(X)
Y_gpu = gpu(Y)
######################################

θ = Flux.params(rnn)
θ_gpu = Flux.params(rnn_gpu)
function loss(x,y)
  l = mean((Flux.stack(map(rnn, x),2) .- y) .^ 2f0)
  Flux.reset!(rnn)
  return l
end
function loss_gpu(x,y)
  l = mean((Flux.stack(map(rnn_gpu, x),2) .- y) .^ 2f0)
  Flux.reset!(rnn_gpu)
  return l
end

opt = ADAM(1e-4)
opt_gpu = ADAM(1e-4)

for i in 1:50
  println("iter: ", i)
  Flux.train!(loss, θ, [(X,Y)], opt)
  Flux.train!(loss_gpu, θ_gpu, [(X_gpu,Y_gpu)], opt_gpu)
  println("loss_cpu: ", loss(X, Y))
  println("loss_gpu: ", loss_gpu(X_gpu, Y_gpu))
end
iter: 1
loss_cpu: 0.30154937090167455
loss_gpu: 0.3015062
iter: 2
loss_cpu: 0.291274794331852
loss_gpu: 0.29123157
iter: 3
loss_cpu: 0.2811146438490362
loss_gpu: 0.28107145

... 

iter: 38
loss_cpu: 0.0699269985382991
loss_gpu: 0.06990228
iter: 39
loss_cpu: 0.0676047453112943
loss_gpu: 0.06758058
iter: 40
loss_cpu: 0.06539862587330023
loss_gpu: NaN
iter: 41
loss_cpu: 0.06330197006630321
loss_gpu: NaN
iter: 42
loss_cpu: 0.061308456094240696
loss_gpu: NaN

On this run it failed at iteration 40, but on another run it can be at iteration 28, etc.
A pattern that has been observed is that the number of iterations before NaNs are thrown depends on the network size: with a larger number of features or a larger hidden size, NaNs appear earlier. Any thoughts on what might be going wrong?

@maleadt
Collaborator

maleadt commented Oct 23, 2020

Nope, no immediate thoughts. Besides, I haven't worked much on the CUDNN wrappers; most of them are a copy of what used to exist in Flux and were developed here.

@jeremiedb
Contributor Author

@maleadt Thanks for the feedback. In the above example, however, the CUDNN wrappers aren't the cause. The example is based on this current PR, in which the CUDNN wrappers were dropped, so the issue would be CUDA-related (I assume).

@maleadt
Collaborator

maleadt commented Oct 23, 2020

Ah, I misunderstood. I don't see how the CUDA stack would be to blame, then. Since you're not using streams, kernels are executed in-order and adding synchronizations isn't going to help. It's possible there's some issue with a specific kernel or operation, of course, but that's impossible to say from the current code.

@jeremiedb
Contributor Author

@maleadt Considering that the RNN layer running on GPU is just the result of model |> gpu, where the model is the basic RNN defined in https://github.com/jeremiedb/Flux.jl/blob/rnn-gpu/src/layers/recurrent.jl (the same code that performs the CPU computations, which work properly), am I wrong to assume that these GPU operations should all be handled through the kernels defined in CUDA.jl?

The basic RNN cell simply applies matrix multiplication and addition, along with a broadcasted activation function, which shouldn't be any different from the basic Dense operator. As for the recurrent structure, the tricky part might be that it is a mutable struct whose state element is updated on each call: https://github.com/jeremiedb/Flux.jl/blob/a8a95d73a872fc52ddd6c361bdc872aa909e9601/src/layers/recurrent.jl#L41
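
For reference, the call in question looks roughly like this (a paraphrased sketch, not the exact source; the only GPU work is whatever the cell itself does, the struct mutation happens on the host):

function (m::Recur)(xs...)
  # run the cell on the current state and input, keep the new state for the next step
  m.state, y = m.cell(m.state, xs...)
  return y
end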

So if the reason the CPU case works but the GPU one doesn't lies outside CUDA/CuArray, I am a bit confused about where to look. Any help would be welcome!

@DhairyaLGandhi Has there been a Flux release whose RNN GPU support was validated (it might help trace back whether something broke along the way)?

@maleadt
Collaborator

maleadt commented Oct 26, 2020

@maleadt Considering that the RNN layer running on GPU is just the result of model |> gpu, where the model is the basic RNN defined in https://github.com/jeremiedb/Flux.jl/blob/rnn-gpu/src/layers/recurrent.jl (the same code that performs the CPU computations, which work properly), am I wrong to assume that these GPU operations should all be handled through the kernels defined in CUDA.jl?

Is there no other (GPU-specific) functionality from Flux involved at all? Anyway, as I said, it's possible there's an issue with a specific kernel or operation in CUDA.jl, but that's impossible to tell from this code. Happy to give it a better look if you can give me some CuArray operation(s) that don't match up with their Array counterparts.

@jeremiedb
Contributor Author

@maleadt I created a gist with a reproducible example of the RNN failure on GPU. It builds the RNN structures from scratch.
Flux utilities include @functor, Chain and train! notably, but not any GPU functionality.
https://gist.github.com/jeremiedb/999852ea469db3035b4be3ca7d7b6b67

I couldn't isolate the problem down to a single array operation. As can be seen by executing the script, the predictions of the CPU and GPU models are exactly the same for a number of iterations, until NaNs are returned on the GPU side (below are the prints from the gist showing the models' loss and some parameters), so it seems tied to the iterative process. Let me know if there are operations or tests I could perform that might help the diagnosis.

iter: 38
loss_cpu: 0.023338849
loss_gpu: 0.023338849
θ[3][1:2]: Float32[0.2143463, -0.060762767]
θ[4][1:2]: Float32[0.2847736, 0.30347818]
θ_gpu[3][1:2]: Float32[0.2143463, -0.060762767]
θ_gpu[4][1:2]: Float32[0.2847736, 0.30347818]
iter: 39
loss_cpu: 0.022398682
loss_gpu: NaN
θ[3][1:2]: Float32[0.21423565, -0.060854755]
θ[4][1:2]: Float32[0.2847141, 0.3035958]
θ_gpu[3][1:2]: Float32[NaN, NaN]
θ_gpu[4][1:2]: Float32[NaN, NaN]

@maleadt
Collaborator

maleadt commented Oct 29, 2020

Thanks, but this is still way too high-level to debug from the GPU side of things.

I started reducing further, and after a couple of hours (and getting dumbfounded by JuliaGPU/CUDA.jl#515) I'm at:

using Flux
using CUDA
using Statistics: mean
using Random

mutable struct MyRecur{T}
  cell::T
  init
  state
end

MyRecur(m, h = hidden(m)) = MyRecur(m, h, h)

function (m::MyRecur)(xs...)
  m.state, y = m.cell(m.state, xs...)
  return y
end

Flux.@functor MyRecur
Flux.trainable(a::MyRecur) = (a.cell,)

struct MyRNNCell{F,A,V}
  σ::F
  Wi::A
  Wh::A
  b::V
end

MyRNNCell(in::Integer, out::Integer, σ = tanh; init = Flux.glorot_uniform) =
    MyRNNCell(σ, init(out, in), init(out, out), init(out))

function (m::MyRNNCell)(h, x)
  h = m.σ.(m.Wi*x)
  return h, h
end

hidden(m::MyRNNCell) = m.h
Flux.@functor MyRNNCell

MyRecur(m::MyRNNCell) = MyRecur(m, zeros(length(m.b)), zeros(length(m.b)))
MyRNN(a...; ka...) = MyRecur(MyRNNCell(a...; ka...))

function main(seed=0x6e73f7d40a1175b8)
    feat = 2
    h_size = 2

    Random.seed!(seed)
    rnn = Chain(MyRNN(feat, h_size))

    X = [[0.9426079f0; 0.8691368f0]]
    Y = [0.098483585f0]

    rnn_gpu = rnn |> gpu
    X_gpu = gpu(X)
    Y_gpu = gpu(Y)

    θ = Flux.params(rnn)
    θ_gpu = Flux.params(rnn_gpu)

    function loss(x,y)
        l = mean((Flux.stack(map(rnn, x),2) .- y) .^ 2f0)
        return l
    end

    function loss_gpu(x,y)
        l = mean((Flux.stack(map(rnn_gpu, x),2) .- y) .^ 2f0)
        return l
    end

    println("CPU")
    train!(loss, θ, (X,Y))
    println()

    println("GPU")
    train!(loss_gpu, θ_gpu, (X_gpu,Y_gpu))
    println()
end

function train!(loss, ps, d)
    gs = gradient(ps) do
        loss(d...)
    end
    x = first(ps)
    @show x gs[x]
end

isinteractive() || main()
CPU
x = Float32[-0.9715874 -1.175203; -0.9317107 -0.05941889]
gs[x] = Float32[-0.07948378 -0.073288456; -0.3643974 -0.33599466]

GPU
x = Float32[-0.9715874 -1.175203; -0.9317107 -0.05941889]
gs[x] = Float32[NaN NaN; NaN NaN]

So it looks like the gradients are wrong? I'll defer to somebody who knows about Zygote here.

@jeremiedb
Contributor Author

Thanks for taking a look at this, and sorry for not nailing down the example to a basic operator.
It is reassuring that there is indeed something buggy happening around the RNN design. I'll try to isolate further what could go wrong with the gradient; hopefully a Zygote expert will help!

@jeremiedb
Contributor Author

@maleadt I think I finally put my finger on what was causing the problem... and I'm a little disappointed in myself!

By replacing the Float32 exponent with an integer in the loss function, things are back in order. That is, use this loss:
l = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2) instead of l = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2f0).

It remains unclear to me whether it should be considered an anomaly that such NaNs get thrown on some iterations, as if the 2f0 were sometimes treated as not being exactly 2, resulting in complex-number-style errors when the base is negative, while the CPU seems not subject to the same issue (or not as sensitive).
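
For clarity, the two variants side by side (loss_pow / loss_int are just illustrative names; mean and rnn_gpu are as in the script above):

# variant that intermittently produced NaN on GPU: Float32 literal exponent
loss_pow(x, y) = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2f0)

# variant that stays stable: integer literal exponent, which Julia lowers via literal_pow
loss_int(x, y) = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2)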

Anyhow, I think I now have everything in place to complete this PR on RNNs. Thanks again for your help.

@jeremiedb jeremiedb changed the title from "WIP - RNN rework and drop CUDNN" to "RNN update to drop CUDNN, fix LSTM bug and output type stability" on Oct 31, 2020
@jeremiedb
Contributor Author

@CarloLucibello Anything left to do?

@CarloLucibello
Member

just two questions:

  • is this breaking, by requiring the input sequence to contain 2d arrays instead of 1d?
  • I didn't check the recurrent layers, but in other layers we support a bias of Zeros type, which is under revision in Fix some issues with Zeros option 2 #1379. This may be a problem if we infer the hidden state's size from the bias. Should we just store the size internally instead?

@jeremiedb
Contributor Author

jeremiedb commented Nov 6, 2020

  • It is not breaking: 1D input is still supported and results in the same model as if it were 2D with a last dimension of size 1 (a quick sketch follows this list).

  • I just came across the "incremental compilation may be fatally broken for this module" when using Flux #1370. I'm still a little puzzled about this one; I'm not sure I follow the intent, as I would assume that training or not training a bias could already be handled through trainable or functor excluding the bias argument? So the idea is to introduce Zeros as an initializer that would mean no bias, while zeros means trainable? It sounds a little confusing to me. As for the RNN definitions, from my understanding it is standard to have the bias initialized with zeros yet be trainable.
    I think it would make sense to frame this PR with regard to the current functionality on master rather than the one still in development.
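
Quick sketch of the non-breaking behavior (illustrative only; exact output shapes not shown):

using Flux

rnn = RNN(3, 5)

x2 = rand(Float32, 3, 16)    # 2D input: features × batch
y2 = rnn(x2)                 # one output column per batch element

Flux.reset!(rnn)

x1 = rand(Float32, 3)        # 1D input: a single sample
y1 = rnn(x1)                 # same model, behaves like a batch with a last dimension of size 1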

@DrChainsaw
Contributor

Zeros and RNNs do not work on master, see #1279

I'm not sure I follow the intent, as I would assume that training or not training a bias could already be handled through trainable or functor excluding the bias argument?

This is for being able to use the same layer type (e.g. Dense) with or without bias, and I don't think this can be done easily through trainable or functor, can it?

@jeremiedb
Contributor Author

@DrChainsaw Thanks for the clarifications! From my understanding, the challenges related to Zeros / no bias are more general than just the RNNs. Therefore, I think it makes sense to treat this RNN PR as independent and bring in the eventual adaptation to whichever no-bias approach is selected, as needed.

@CarloLucibello
Member

Ok, let's merge this then

bors r+

@bors
Contributor

bors bot commented Nov 7, 2020

Build succeeded:

@bors bors bot merged commit 0e0e2e7 into FluxML:master Nov 7, 2020
@DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 7, 2020

CUDNN is an industry standard for CUDA usage, and dropping it can come with hidden costs, so we should be extra careful before releasing. Further, is storing the state necessary? It doesn't seem like it should be.

@jeremiedb
Contributor Author

is storing the state necessary?

In the RNN cells (RNN, GRU, LSTM), the state element refers to the initial conditions of the state, which are a learnable parameter, so it acts differently from the state element in the Recur structure. As such, this approach avoids the redundancy that was present before, where an init state lived in both the RNN cells and the Recur structure. Maybe the RNN cells' state could instead be named state_init for disambiguation? A simplified sketch of the layout is below.
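
Roughly, the layout in this PR looks like the following (a simplified sketch, not the exact source):

mutable struct Recur{T,S}
  cell::T
  state::S   # running hidden state, overwritten at each step and restored by reset!
end

struct RNNCell{F,A,V,S}
  σ::F
  Wi::A
  Wh::A
  b::V
  state::S   # learnable initial conditions of the hidden state (the element discussed above)
end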

I agree that CUDNN is a standard. However, as previously discussed, CUDNN's RNN would be beneficial in a 3D-data logic, which I think would indeed be quite valuable to add. In the current single-timestep logic, the performance tests above actually showed a slight improvement with this pure CUDA approach on fairly good-sized data for LSTMs, as well as consistent results between CPU and GPU.

Given the various issues raised around RNNs, the aim here was to remove some of the frictions encountered (a couple of issues seem to be fixed by it). I agree, though, about the importance of not breaking something else along the way, so let me know if there are tests you think would help with that assessment.

@CarloLucibello
Member

CarloLucibello commented Nov 7, 2020

I would definitely change state to init_state or state0 in the cells. We could make the change less breaking by defining a getproperty(cell, :state). This is a good time window for implementing such changes since the next release is going to be breaking.

@CarloLucibello
Member

getproperty(cell, :state)

Actually, there is no need for that since you just renamed h to state. We could add a getproperty(cell, :h) and a getproperty(rnn, :init) to smooth the transition, unless we decide that those layers' internal structure is not part of the interface contract.

@DhairyaLGandhi
Member

h referred to the hidden state; I see where the confusion came from.

@jeremiedb
Contributor Author

I would opt for state0 to refer to the hidden state's initial parameters. That would avoid ending up with something like init_init to refer to its initialization!
That way, I think the overall recurrent syntax will mostly avoid meaning collisions: state refers to the non-learnable hidden state, which is either h (RNN/GRU) or (h, c) (LSTM), and state0 to a learnable parameter.

I'm not sure of the use case for getproperty(cell, :h); could you clarify what this would be needed for?
For getproperty(rnn, :init), I'm assuming it would actually be getproperty(rnn, :state0), which would return rnn.cell.state0? Or would you like the actual name :init used, so as to match the previous name used for the initialization parameters?

@CarloLucibello
Member

I'm not sure of the use case for getproperty(cell, :h); could you clarify what this would be needed for?

For backward compatibility, since RNNCell and the others used to have an h field. But maybe we can ignore the issue if you expect the change not to cause much breakage (I'm not sure about that).

For getproperty(rnn, :init), I'm assuming it would actually be getproperty(rnn, :state0), which would return rnn.cell.state0? Or would you like the actual name :init used, so as to match the previous name used for the initialization parameters?

I was thinking about backward compatibility.

Could you file a new PR with the renaming and the getproperty with some deprecation messages?
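
Something along these lines, roughly (a hypothetical sketch; the exact names and messages are up to you):

function Base.getproperty(m::RNNCell, sym::Symbol)
  if sym === :h
    Base.depwarn("RNNCell field :h is deprecated, use :state0 instead", :RNNCell)
    return getfield(m, :state0)
  end
  return getfield(m, sym)
end

function Base.getproperty(m::Recur, sym::Symbol)
  if sym === :init
    Base.depwarn("Recur field :init is deprecated, use m.cell.state0 instead", :Recur)
    return getfield(m, :cell).state0
  end
  return getfield(m, sym)
end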

@ToucheSir
Member

RE cuDNN, JAX-based frameworks don't use the cuDNN RNN functionality at all. This is presumably because XLA doesn't support it and can optimize aggressively enough to be competitive, though.

@jeremiedb
Contributor Author

Could you file a new PR with the renaming and the getproperty with some deprecation messages?

Will do!

bors bot added a commit that referenced this pull request Nov 9, 2020
1390: RNN deprecations and naming fixes r=CarloLucibello a=jeremiedb

Following discussion in #1367 

This PR disambiguates the initial state parameters, named `state0` in the rnn cells, from the state of the rnn chain, named `state` in the `Recur` struct.

Add getproperty with deprecation messages to access the legacy `h` and `c` in the rnn cells, as well as the `init` field in `Recur` (which now points to `recur.cell.state0`).

Include both 1D and 2D input dimensions in the basic BPTT test.

Co-authored-by: jeremie.db <[email protected]>
@DhairyaLGandhi
Member

Revisiting this, perhaps we should also experiment with 3d data and look at testing. Currently we seem to have a lot of broken tests, which was not the case until recently.

@jeremiedb
Contributor Author

@DhairyaLGandhi Could you point to the actual broken tests? I'm only aware of issues that were actually fixed.
I think the CUDNN RNN is orthogonal to the single-step API, as it doesn't add much value in that context (no optimization kicks in). I agree, though, that some thought should go into how to best present tools for 3D vs single-step usage.
The current implementation simply uses CUDA.jl functionality (batched multiplication), so we shouldn't expect any more CPU-vs-GPU difference for RNN than for the Dense layer.
AFAIK, frameworks such as Flax / MXNet do have single-cell implementations, and fused ones as well (MXNet).

@DhairyaLGandhi
Member

You might want to check the tests marked broken in the linked PR #1472

@jeremiedb jeremiedb mentioned this pull request Jan 22, 2021
@jeremiedb
Contributor Author

I'm a bit confused by your comment regarding the broken tests, since from what I see these were introduced as "broken" in July 2020, prior to this RNN refactor for Flux v0.11.3:
8ab0abd#diff-27864bcbbd16e272365b3997b516f378d49d0b4670b28e4b79850bb1689bb83f

As for the underlying cause of the broken tests, it seems to be a minor thing; I've just opened a PR where those tests should now be passing: #1473

Regarding performance, as previously mentioned in the PR, I got a slight improvement vs the prior CURNN dependency, observed while on CUDA.jl v1. However, a performance regression was introduced in CUDA.jl v2, which seems to stem from a different GEMM being called: JuliaGPU/CUDA.jl#538. My understanding is that this amounts to less than a 10% regression, which should in principle revert back to improved performance once fixed.

I think a CURNN integration for 3D input is very desirable. However, for single-step and other custom cases, I don't see the benefit vs relying on the transparent integration of Flux with CUDA.jl, which is where I think Flux shines. But this is arguably just a very subjective perception of mine :)

@DhairyaLGandhi
Member

Right, I wasn't certain as to when the tests were marked as such, and the point about the regression is well taken.

bors bot added a commit that referenced this pull request Jan 22, 2021
1473: Fix RNN tests on GPU r=DhairyaLGandhi a=jeremiedb

Fix for RNN on CUDA, as discussed in #1367 .

Co-authored-by: jeremie.db <[email protected]>