How can you continue training a model from a file? #188

opfromthestart · 2023-01-31T03:32:14Z

**Question:**
I am not sure if this is a bug or just something I am doing wrong, but when I save the model to a file and then reload it as a layer, its loss is not close to that of the saved model, and it also trains much more slowly. Is there a setting needed to tell it how to continue? Do I need to save additional data to resume properly?
Project can be found here
Lines 231-238 and 296-299 are the parts relevant to this question. If I train from scratch, the loss drops into the 80s after 20,000 iterations, but when I reload the model, the loss starts in the 130s and does not decrease significantly in 20,000 iterations.

drahnr · 2023-01-31T07:09:27Z

Which GPU do you have, and could you provide your save files? It looks very fishy to me.

drahnr · 2023-01-31T14:58:53Z

warning: `logic-ai` (bin "logic-ai") generated 1 warning
    Finished dev [unoptimized + debuginfo] target(s) in 50.09s
     Running `../cargo_target/debug/logic-ai`
2 3 0 1 
3 0 1 2 
0 1 2 3 
1 2 3 0 

Did not load
0: 95.65555, 0.008372713
thread 'main' panicked at 'Could not write to file: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:297:53
stack backtrace:
<snip>

What's the required path? Is the above output something you'd expect?

I'd be very happy to help!

opfromthestart · 2023-01-31T15:11:30Z

My GPU is a NVIDIA GeForce GTX 1650 Ti Mobile.
Here are two save files (they need to be renamed to futo.net and placed in the saves folder).
I pushed a new version of my project that should include the needed folder, so that error should no longer occur.
saves.zip

drahnr · 2023-02-01T11:52:44Z

Long story short, it's a bug. Some things became private, and the current automated testing does not include such an example. #190 addresses the principal issue: the API needed to do this at all was missing. On the other hand, that API is very rough and requires some understanding of capnp, which is not ideal. I'll create an abstraction soon™, but until then I'd recommend patching your project with that PR.

opfromthestart · 2023-02-01T16:08:47Z

The example seems to store and retrieve the config, but not the actual network parameters (e.g. the ILayer object) themselves. What would I need to change to also save and load a Layer object alongside the config?

drahnr · 2023-02-01T16:38:41Z

Edit: https://github.com/spearow/juice/blob/dda9d01c1dd81f6174b0340f12eb3d4f30551488/juice/src/layer.rs#L764-L767

drahnr · 2023-02-01T16:39:58Z

There are save and load functions, which should be what you need.

opfromthestart · 2023-02-01T18:54:08Z

I already use the save and load functions. I implemented saving of the config, but the learning stall after a reset still happens. Should I be getting the SequentialConfig from somewhere other than when I first make it? I would guess that it is probably a bug in the save and load functions rather than not being able to save configs.

drahnr · 2023-02-01T20:42:36Z

I'll dig deeper into this; at first glance, save and load appear OK. I have yet to finish a unit test for a trained network with equality checks.

drahnr · 2023-02-02T08:59:37Z

#190 does implement a unit test now, but the PartialEq implementation does not cover all items. The weights are checked for equality though, so they cannot be the root of the issue. The weight_gradients are not retained, but they are only accumulated over a minibatch anyway and are reset after each one. So this investigation needs some more time.

opfromthestart · 2023-02-11T22:02:02Z

I wrote a simple xor example in the project linked at the top. In the main function, first run only the xor_train() function, stop it once it has learned, then change the line to xor_eval() and see that the stored weights do not produce the correct results.

drahnr · 2023-02-12T07:45:48Z

I think the correct test would be:

train
...
train
eval
save
load
eval

and compare the outputs of the two eval invocations on the same input. Or is that what you meant?

I haven't gotten around to digging deeper yet.
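The sequence above can be sketched as a small round-trip check. This is only an illustration of the test shape: `ToyLinear`, its byte format, and `round_trip_outputs` are hypothetical stand-ins written for this sketch and are not part of juice's API.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Toy stand-in for a trained layer: y = weight * x + bias.
/// This is NOT juice's `Layer`; it only illustrates the test shape.
#[derive(Debug, Clone, PartialEq)]
struct ToyLinear {
    weight: f32,
    bias: f32,
}

impl ToyLinear {
    fn forward(&self, x: f32) -> f32 {
        self.weight * x + self.bias
    }

    /// One gradient-descent step on the squared error of a single sample.
    fn train_step(&mut self, x: f32, target: f32, lr: f32) {
        let err = self.forward(x) - target;
        self.weight -= lr * 2.0 * err * x;
        self.bias -= lr * 2.0 * err;
    }

    /// Hypothetical format: weight then bias, each as 4 little-endian bytes.
    fn save(&self, path: &Path) -> io::Result<()> {
        let mut bytes = Vec::with_capacity(8);
        bytes.extend_from_slice(&self.weight.to_le_bytes());
        bytes.extend_from_slice(&self.bias.to_le_bytes());
        fs::write(path, bytes)
    }

    fn load(path: &Path) -> io::Result<Self> {
        let bytes = fs::read(path)?;
        if bytes.len() != 8 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "expected 8 bytes"));
        }
        let weight = f32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        let bias = f32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
        Ok(ToyLinear { weight, bias })
    }
}

/// train ... train, eval, save, load, eval; returns both eval results.
fn round_trip_outputs(path: &Path) -> io::Result<(Vec<f32>, Vec<f32>)> {
    let mut net = ToyLinear { weight: 0.3, bias: 1.4 };
    for _ in 0..2000 {
        for &(x, target) in &[(0.0, 0.0), (1.0, 1.0)] {
            net.train_step(x, target, 0.1);
        }
    }
    let inputs = [0.0_f32, 1.0];
    let before: Vec<f32> = inputs.iter().map(|&x| net.forward(x)).collect();
    net.save(path)?;
    let loaded = ToyLinear::load(path)?;
    let after: Vec<f32> = inputs.iter().map(|&x| loaded.forward(x)).collect();
    Ok((before, after))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("toy_linear_round_trip.bin");
    let (before, after) = round_trip_outputs(&path)?;
    // A correct save/load must reproduce the outputs exactly.
    assert_eq!(before, after);
    println!("before: {:?}\nafter:  {:?}", before, after);
    Ok(())
}
```

Comparing the two eval outputs, rather than only the stored weights, also catches state that the equality check on weights does not cover.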

opfromthestart · 2023-02-13T15:38:32Z

I revised the main function so that it does that; the issue still persists.
Could it have to do with the forward function itself? It theoretically should not need to be mutable, so maybe something is being overwritten there?

opfromthestart · 2023-02-22T15:06:04Z

I think it may have to do with the loading of the bias weights. I added a third example which is just a single linear layer, and it learned to just be the identity function. When I load it from file, it has the same slope, but all the outputs are shifted. I'm guessing that there is something that is not being saved or loaded properly from the weights.

0: 0.3421242
Trained model before reload from disk:
[1.4901191e-6, 0.9999985]
Loaded net
Model after reload from disk:
[1.4076138, 2.407611]
There are 0 differences in weights.
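A toy model of that symptom, assuming the cause is that the bias is not persisted while the weight is. Everything here (`ToyLinear`, `WeightOnlySnapshot`, the `fresh_bias` value of 1.5) is hypothetical and chosen for illustration; it is not juice's code and does not reproduce the exact numbers above.

```rust
/// Toy linear layer, y = weight * x + bias. Not juice's linear layer.
#[derive(Debug, Clone, Copy)]
struct ToyLinear {
    weight: f32,
    bias: f32,
}

impl ToyLinear {
    fn forward(&self, x: f32) -> f32 {
        self.weight * x + self.bias
    }
}

/// Hypothetical snapshot that forgets the bias (the suspected bug).
struct WeightOnlySnapshot {
    weight: f32,
}

fn save_weight_only(net: &ToyLinear) -> WeightOnlySnapshot {
    WeightOnlySnapshot { weight: net.weight }
}

/// Restores into a freshly initialised layer. `fresh_bias` stands in for
/// whatever value the initialiser happens to pick.
fn load_weight_only(snap: &WeightOnlySnapshot, fresh_bias: f32) -> ToyLinear {
    ToyLinear { weight: snap.weight, bias: fresh_bias }
}

fn main() {
    // A layer that has learned the identity function.
    let trained = ToyLinear { weight: 1.0, bias: 0.0 };
    let reloaded = load_weight_only(&save_weight_only(&trained), 1.5);

    for x in [0.0_f32, 1.0] {
        println!(
            "x = {}: trained {} reloaded {}",
            x,
            trained.forward(x),
            reloaded.forward(x)
        );
    }

    // The weight survived, so the slope is unchanged...
    let slope_trained = trained.forward(1.0) - trained.forward(0.0);
    let slope_reloaded = reloaded.forward(1.0) - reloaded.forward(0.0);
    assert_eq!(slope_trained, slope_reloaded);
    // ...but every output is shifted by the bias that was never restored.
    assert_eq!(reloaded.forward(0.0) - trained.forward(0.0), 1.5);
}
```

Under this assumption a comparison of the weights alone reports no differences, even though the outputs differ.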

drahnr · 2023-02-22T16:07:53Z

I'll try to make some time to investigate further; personal life events have been consuming a lot of my spare time lately.

opfromthestart added the documentation (Documentation related troubles) and question labels Jan 31, 2023

opfromthestart assigned drahnr Jan 31, 2023

drahnr mentioned this issue Feb 1, 2023: load store example, make necessary APIs public #190 (Open)

opfromthestart mentioned this issue Feb 22, 2023: Now also saves bias layers #193 (Merged)

drahnr closed this as completed in #193 Feb 22, 2023
