Rudimentary Flux functionality crawling #159
-
@rgobbel Can you share the changes to Flux.jl and the data you used so that we can try this out?
-
Somewhere in my fumbling around to get it working, the backward-compatibility code disappeared. I've recreated it and done a minimal check that I can run FluxCLI.jl on both 1.8.5 and 1.9.0-rc2. In addition to restoring the backward-compatibility code, I changed a misleading message that mentions the GPU when it's really just CUDA. Before running anything with Metal, you need to run a setup step first (see the sketch below).
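A minimal sketch of what that setup might look like, assuming Flux's Preferences-based `gpu_backend!` switch; the patched Flux.jl in this thread may use a different mechanism:

```julia
# Hypothetical setup, assuming Flux's `gpu_backend!` preference switch.
# The patched Flux.jl discussed here may differ.
using Flux, Metal

Flux.gpu_backend!("Metal")   # writes a LocalPreferences.toml entry;
                             # takes effect after restarting Julia

# After a restart:
using Flux, Metal
@assert Metal.functional()   # sanity check that the Metal backend is usable
m = gpu(Dense(4 => 4))       # parameters should now live in MtlArrays
```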
-
Apologies for the long radio silence. I finally got most of the life/work stuff out of the way, so I'm starting to have more time to investigate this. The first thing I noticed is that a lot of the slowness was from Metal/GPU overhead, which is somewhat expected. For example, if we apply an 8x8 Dense layer to an 8x8 input, the GPU code is about 1000x slower:

```julia
julia> a = Dense(8 => 8); da = gpu(a);

julia> x = rand(Float32, 8, 8); dx = gpu(x);

julia> @btime a(x);
  199.583 ns (2 allocations: 672 bytes)

julia> @btime Metal.@sync da(dx);
  209.709 μs (355 allocations: 9.10 KiB)
```

However, if the input sizes are increased to 1024x1024, the GPU is now ahead:

```julia
julia> a = Dense(1024 => 1024); da = gpu(a);

julia> x = rand(Float32, 1024, 1024); dx = gpu(x);

julia> @btime a(x);
  3.388 ms (4 allocations: 8.00 MiB)

julia> @btime Metal.@sync da(dx);
  1.302 ms (364 allocations: 9.24 KiB)
```

These tests were done on my local machine. I also stepped through the forward-pass execution line-by-line-ish and didn't see anything obviously wrong (the correct temporary arrays are created).

However, I did recall seeing a lot of time spent in metallib compilation when profiling the full FluxCLI training loop. One possibility is that Zygote is messing up kernel caching, so kernels get recompiled over and over.
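One way to test that hypothesis (a hypothetical sketch, not from the thread) is to compare cold and warm timings of both the forward pass and the Zygote gradient; if compilation results are cached properly, only the first call should be slow:

```julia
# Hypothetical check (not from the thread): how much of the cost is one-time
# kernel compilation? Assumes a working Flux + Metal setup as above.
using Flux, Metal

da = gpu(Dense(1024 => 1024))
dx = gpu(rand(Float32, 1024, 1024))

# Forward pass: the first call includes metallib compilation; a warm call should not.
@time Metal.@sync da(dx)
@time Metal.@sync da(dx)

# Gradient: if Zygote-generated code defeats the kernel cache,
# the second call will stay slow instead of dropping to steady state.
loss(m, x) = sum(abs2, m(x))
@time Metal.@sync Flux.gradient(loss, da, dx)
@time Metal.@sync Flux.gradient(loss, da, dx)
```

If the second `gradient` call stays roughly as slow as the first, something in the Zygote-generated code is likely forcing recompilation on every call.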
-
With some fairly simple changes to Flux.jl, I got a very simple program working (a sketch of that kind of program is below). It's extremely slow (CPU-only must be at least 100 times faster), but it does reduce the error, and the error keeps dropping a little later in training. At this point it's no more than a proof of concept (no layers other than a basic Dense layer, and obviously far from optimized), but it's something. Suggestions welcomed.
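A hypothetical reconstruction of the kind of program described, assuming a single Dense layer and Flux's explicit-gradient training API (not the author's actual code):

```julia
# Hypothetical minimal example: train one Dense layer via Flux's `gpu`.
# Not the author's code; assumes the patched Flux.jl routes `gpu` to MtlArrays.
using Flux, Metal

# Toy regression problem: learn a fixed linear map.
W_true = rand(Float32, 4, 8)
X = rand(Float32, 8, 256)
Y = W_true * X

model = gpu(Dense(8 => 4))
dX, dY = gpu(X), gpu(Y)

loss(m, x, y) = Flux.mse(m(x), y)
opt = Flux.setup(Descent(0.1f0), model)

for epoch in 1:100
    grads = Flux.gradient(m -> loss(m, dX, dY), model)
    Flux.update!(opt, model, grads[1])
    epoch % 20 == 0 && println("epoch $epoch: loss = ", loss(model, dX, dY))
end
```

Even a loop this small exercises the full forward pass, Zygote backward pass, and optimizer update on the Metal device, so it should reproduce both the error reduction and the slowness described above.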