
Finetuning #19

Open
Stamenov opened this issue Sep 2, 2019 · 42 comments

Comments

@Stamenov commented Sep 2, 2019

Hi,

Just wondering: since you are basing the TF train.py on nshepperd's finetuning script, I was wondering if this code also supports finetuning, or whether models trained here from scratch are finetunable with nshepperd's train.py?

Best regards.

@lopuhin (Owner) commented Sep 2, 2019

Hi, it's possible to resume training from a checkpoint (so it's the same functionality as fine-tuning), but it's not possible to fine-tune the original GPT-2 model, because the tokenizer is different.
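This repo trains with a sentencepiece vocabulary rather than OpenAI's byte-level BPE, so the token ids simply don't line up. As a rough illustration (the file name is just an example):

```python
# Illustrative only: inspect the sentencepiece vocabulary used by this repo.
# The original GPT-2 checkpoints use a different byte-level BPE, so ids are incompatible.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('sp-model.model')                # file name is an example
print(sp.get_piece_size())               # vocabulary size
print(sp.encode_as_ids('Ein Beispiel'))  # ids are specific to this vocabulary
```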

@Stamenov (Author) commented Sep 2, 2019

I am currently looking at finetuning these models, which were trained with this repo, or at least a fork of it. So I guess simply resuming with the TF version would suffice.
Thanks.

@lopuhin (Owner) commented Sep 2, 2019

Oh nice, thanks for sharing the link. Then yes, fine-tuning should work.

@Stamenov (Author) commented Sep 2, 2019

Now that I think about it, I am not sure whether the models are TF or PyTorch. Is there a way to find out, given the model files?
(screenshot of the model directory attached)
Thanks again!

@lopuhin (Owner) commented Sep 2, 2019

These are PyTorch models, which is good, because the TF code is not really supported, while the PyTorch code is better developed and supported.
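If in doubt, one quick (illustrative) check is to try torch.load on one of the files - a PyTorch checkpoint loads as a regular Python dict, while TF checkpoints ship as .ckpt/.index/.meta files:

```python
# Illustrative check only: PyTorch checkpoints can be opened with torch.load.
import torch

state = torch.load('model.pt', map_location='cpu')  # file name is an example
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])  # e.g. state_dict, seen_tokens, ...
```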

@Stamenov (Author) commented Sep 2, 2019

Cool, will try on the weekend, thanks for the blazing fast responses 🥇

@gooofy (Contributor) commented Sep 2, 2019

Hey, cool - thanks for trying out my GPT-2 models! I would be happy to hear your feedback on these.

The larger GPT-2 model is still training, so if you want I can provide an updated model this week which should have a slightly lower loss than the one released so far.

@Stamenov (Author) commented Sep 2, 2019

Hey @gooofy, this would be very cool, please do!
Thanks.

@Stamenov (Author) commented Sep 4, 2019

Hi, it's me again. I am not sure this is the right thread to follow up in, so feel free to move it / let me know.
I am trying to start an adaptation from the 355M German model, but I seem to get a mismatch in the layer sizes. I guess I need the hyperparameters from the initial training.
This is the hyperparameter configuration I get at the beginning of training:

"batch_size": 2, "epochs": 10, "g_accum_gradients": 1, "hparams": { "gradient_checkpointing": false, "n_ctx": 1024, "n_embed": 768, "n_head": 12, "n_hidden": 768, "n_layer": 12, "n_vocab": 50000 }, "lr": 0.00025 } Loading dataset from /bpe Train dataset has 419,935 tokens Validation dataset has 3,608 tokens

And this is a small part of the size mismatch errors:

size mismatch for blocks.11.attn.c_attn.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for blocks.11.attn.c_attn.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for blocks.11.attn.c_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for blocks.11.attn.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.g: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.b: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

@lopuhin (Owner) commented Sep 4, 2019

Right, on each invocation you'll need to set all hyperparameters, and the error is indeed due to a hyperparameter mismatch. The correct hyperparameters should be in the params.json file which comes with the model - unfortunately we currently can't load them automatically.
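Until that is automated, you can read them back out of params.json yourself; a minimal sketch (path is an example):

```python
# Sketch: recover the training hyperparameters from the params.json written
# next to the checkpoint, so they can be repeated on the command line.
import json
from pathlib import Path

params = json.loads(Path('path/to/model/params.json').read_text())
print(params['hparams'])   # n_ctx, n_embed, n_head, n_hidden, n_layer, n_vocab
print(params.get('argv'))  # the original command line, if stored
```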

@Stamenov (Author) commented Sep 4, 2019

Is there a CLI for the hyperparams? I can't seem to find one.

@lopuhin (Owner) commented Sep 4, 2019

Yes, it's defined implicitly via the fire library, so all arguments of main are settable via command-line arguments. Also, params.json should contain the full argument string, which can serve as an example.
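As a rough sketch of how that works (the argument names below are illustrative, not the exact signature of main):

```python
# Minimal fire example: every argument of main becomes a command-line argument,
# and keyword arguments become flags such as --n_embed=1024 --n_layer=24.
import fire

def main(run_path, corpus_path, sp_model_path,
         n_embed=768, n_head=12, n_layer=12, batch_size=2):
    print(run_path, corpus_path, sp_model_path,
          n_embed, n_head, n_layer, batch_size)

if __name__ == '__main__':
    fire.Fire(main)
```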

@Stamenov (Author) commented Sep 4, 2019

I guess resuming is also implicit whenever there are *.pt files in the model directory; furthermore, params.json is overwritten on each invocation with the current parameters.

@lopuhin (Owner) commented Sep 4, 2019

Indeed, resuming is implicit here:

transformer-lm/lm/main.py

Lines 139 to 140 in fa3f529

if model_path.exists():
    load_model()

and right, the params.json file will be overwritten, which is not great.

@Stamenov (Author) commented Sep 4, 2019

Hey, I am running into some further problems trying to resume from the big German model, even after I set the params. I would appreciate any help. Also, again, I am running the latest version from @gooofy's fork:

Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [00:00, 41376077.40it/s]
Traceback (most recent call last): | 0/417792 [00:00<?, ?it/s]
  File "/home/martin/miniconda/envs/topics/bin/gpt-2", line 11, in <module>
    load_entry_point('lm', 'console_scripts', 'gpt-2')()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 322, in fire_main
    fire.Fire(only_allow_defined_args(main))
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
    return function_to_decorate(*args, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 259, in main
    train()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 213, in train
    validate()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 219, in validate
    valid_loss=get_valid_loss())
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 233, in get_valid_loss
    logits = model(ctx)['logits']
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/model.py", line 53, in forward
    h, present = torch.utils.checkpoint.checkpoint(block, h, past[:, i] if past is not None else None)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 128, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 34, in forward
    check_backward_validity(args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in check_backward_validity
    if not any(inp.requires_grad for inp in inputs):
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in <genexpr>
    if not any(inp.requires_grad for inp in inputs):
AttributeError: 'NoneType' object has no attribute 'requires_grad'
epochs: 6787it [01:16, 88.50it/s]
  7%|██████▍

@lopuhin (Owner) commented Sep 4, 2019

Hmm, I see - this looks related to gradient checkpointing (which I haven't had a chance to try yet). I wonder if it will work if you disable it? It could be something else as well; hard to tell, sorry.

@gooofy (Contributor) commented Sep 4, 2019

here is the command line I am using for training this model - does this help?

gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000

params.json:

{
"argv": "/home/bofh/projects/ai/torch/bin/gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
"gradient_checkpointing": true,
"n_ctx": 1024,
"n_embed": 1024,
"n_head": 16,
"n_hidden": 1024,
"n_layer": 24,
"n_vocab": 50000
},
"lr": 0.00025
}

@Stamenov (Author) commented Sep 5, 2019

I have now disabled the gradient checkpointing, and I get stuck at the same place, but no error this time:
--n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing=0 --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": 0,
    "n_ctx": 1024,
    "n_embed": 1024,
    "n_head": 16,
    "n_hidden": 1024,
    "n_layer": 24,
    "n_vocab": 50000
},
"lr": 0.00025
}
Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [01:17, 87.53it/s]
  7%|███████▍  | 27648/417792 [01:17<18:14, 356.55it/s]

@gooofy (Contributor) commented Sep 5, 2019

just a wild guess: maybe you're using a different torch version?

lm                      0.1.0             /home/bofh/projects/ai/torch/transformer-lm
pytorch-pretrained-bert 0.6.2
torch                   1.2.0a0+6f6a680
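For comparison, you can print the versions on your side with:

```python
import torch

print(torch.__version__)          # e.g. 1.2.0
print(torch.version.cuda)         # CUDA version this torch build was compiled against
print(torch.cuda.is_available())  # whether a GPU is visible
```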

@Stamenov (Author) commented Sep 5, 2019

Aahh, I am indeed.
pytorch-pretrained-bert 0.6.1
torch 1.0.1.post2

@Stamenov (Author) commented Sep 5, 2019

Just wondering, which CUDA version do you use? 10.0?

@gooofy (Contributor) commented Sep 6, 2019

yes, 10.0

@gooofy (Contributor) commented Sep 6, 2019

new release has finished uploading, available here: https://zamia.org/brain/

trained for 4.5 epochs on 27GB text corpus

@Stamenov (Author) commented Sep 7, 2019

Hi,
for some reason, even after I installed PyTorch 1.2.0, CUDA 10, conda with Python 3.7.4 and NVIDIA drivers "NVIDIA-SMI 410.104", the training just quits after 7%, with no error message, similarly to my previous post.

attrs 19.1.0
certifi 2019.6.16
cycler 0.10.0
filelock 3.0.12
fire 0.1.3
json-lines 0.5.0
json-log-plots 0.0.1
kiwisolver 1.1.0
lm 0.1.0 /home/martin/dev/gtp2/gpt-2-german/transformer-lm
matplotlib 3.0.3
numpy 1.16.2
pip 19.2.2
pyparsing 2.4.2
python-dateutil 2.8.0
sentencepiece 0.1.8
setuptools 41.0.1
six 1.12.0
torch 1.2.0
tqdm 4.31.1
wheel 0.33.4

@lopuhin (Owner) commented Sep 7, 2019

@Stamenov I wonder if this could be some bug in the resume code - I didn't test it that much. Does the progress bar jump to 7% immediately, or does it get there after some time? There is no error message printed, right? Can you check the exit code? Also, I wonder if training from scratch works for you (to narrow down the issue)?

@gooofy (Contributor) commented Sep 7, 2019

I think I resumed training for this model several times over the weeks and never noticed any issue. There is, however, still this so far unexplained loss spike that happened pretty early in training; not sure if this could be related.
(attached loss plot: loss_de345-root)

@Stamenov (Author) commented Sep 9, 2019

@lopuhin It does take some time to get there; it also just jumps to 7% from 0, after using my GPU for some time (as reported by nvidia-smi) and after briefly showing a "0/3 validation" progress bar just below the overall progress bar.

Training from scratch works, but with the default params only. With the ones from the German model, as supplied by @gooofy, I get a CUDA out of memory error. Maybe this is related?

Are there any additional logs or information stored while training is going on that I could check?

EDIT: Reducing the batch size to 1 shows the same behaviour; the progress bar now shows 96% and then it quits.

@gooofy (Contributor) commented Sep 9, 2019

What GPU model are you using? My settings are aimed at 11/12 GB cards (1080 Ti / Titan X).

@Stamenov (Author) commented Sep 9, 2019

I tried it with K80 and Tesla V100, same results.

@Stamenov (Author) commented Sep 14, 2019

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case).
Basically, I think this condition:
https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185
does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The condition exits immediately, as it finds that the model has already seen many more tokens than the current dataset contains, so training must be finished.
In the case of the German model, it has already seen 4202621952 tokens, while my new finetuning dataset is only 419935.
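To put numbers on it: the loop condition is roughly seen_tokens < epochs * epoch_size, with epoch_size being the token count of the current dataset, so with my values:

```python
# Rough illustration of why resuming exits immediately on a small finetuning set.
seen_tokens = 4_202_621_952   # already seen by the German checkpoint
epoch_size = 419_935          # tokens in the new finetuning dataset
epochs = 10

print(epochs * epoch_size)                # 4,199,350
print(seen_tokens < epochs * epoch_size)  # False -> the training loop never runs
```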
How could this be fixed?

@gooofy (Contributor) commented Sep 15, 2019

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

@hbajohr commented May 1, 2020

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

Hi, this sounds great - did you implement the --finetune flag?

@khalo-sa commented

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case).
Basically, I think this condition:
https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185
does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The condition exits immediately, as it finds that the model has already seen many more tokens than the current dataset contains, so training must be finished.
In the case of the German model, it has already seen 4202621952 tokens, while my new finetuning dataset is only 419935.
How could this be fixed?

Quite some time has passed, but I just wonder if you were able to find a solution? Also, I'm not sure whether line 185 in the code is still the same line you were referring to back then.

@SaschaStenger commented

So I've tried finetuning the German model by just setting seen_tokens back to 0, as was suggested.

def load_model():
    nonlocal seen_tokens
    if torch.cuda.is_available():
        state = torch.load(model_path)
    else:
        state = torch.load(model_path, map_location=torch.device('cpu'))
    if 'seen_tokens' in state:
        seen_tokens = state['seen_tokens']
    else:  # legacy format
        seen_tokens = state['step'] * step_tokens
    if finetune:
        seen_tokens = 0  # reset so the training loop does not exit immediately
    state_dict = fixed_state_dict(state['state_dict'])
    model.load_state_dict(state_dict)
    if torch.cuda.is_available():
        optimizer.load_state_dict(torch.load(optimizer_path))
    else:
        optimizer.load_state_dict(torch.load(optimizer_path, map_location=torch.device('cpu')))
    print(f'Resuming from seen_tokens {seen_tokens:,}')

But the finetuned model performs much worse - not at all like before.
I'm getting output like:
der dem das den es ist der das
So my question would be: is there anything else that I have to take into account when finetuning such a model?
Or might it just be that my finetuning dataset isn't good? (The size of the encoded training set is around 1.5 MB.)

@hafsahabib-educator commented

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

@SaschaStenger commented

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

Sorry, so far I haven't been able to. But I'm still very interested in a solution and will look into it again and post any solution that I might find.
Although if anyone else has any suggestions on how to enable finetuning on this, I'd be more than happy to try them out.

@hafsahabib-educator commented

@SaschaStenger I am trying a few things. Will surely let you know if all goes well.

@SaschaStenger commented

Thank you @hafsabukhary. I wanted to ask if any of your approaches have been fruitful.

@hafsahabib-educator commented

@SaschaStenger I used the old main.py from https://github.com/gooofy/transformer-lm/tree/master/lm.
I updated the following code in train:

        prev_tokens = 0
        if finetune:
            print('fine tuning enabled')
            prev_tokens = seen_tokens
        while seen_tokens < prev_tokens + (epochs * epoch_size):

This way training continues. You have to use the default parameters of the German model, e.g. the vocab size.
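For example, an invocation could look like this (paths are placeholders, and --finetune assumes the flag from the patch above is exposed as an argument of main):

gpt-2 de345-root /path/to/encoded-finetune-data sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=1 --finetune --save_every=5000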

@weibelbit commented

Hi,
I have used the old main.py and tried to update the code as @hafsabukhary and @SaschaStenger proposed, but I still have the same problem as @SaschaStenger describes:

But the finetuned model performs much worse - not at all like before.
I'm getting output like:
der dem das den es ist der das

I am trying to finetune on a specific dataset. Shouldn't the finetuned model have the same grammatical quality as the German 345M model from the beginning, and pick up the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

Thank you.

@SaschaStenger commented

I am trying to finetune on a specific dataset. Shouldn't the finetuned model have the same grammatical quality as the German 345M model from the beginning, and pick up the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

I'm having similar issues.
I did add some more general text to my finetuning dataset, but it still takes quite a few iterations until it produces anything intelligible, and even then it is nowhere near the original performance.
Any help in this matter would be greatly appreciated.

@weibelbit commented

I made another finetuning test around 970 epochs. Now it sometimes seems to overfit, generating sentences that are the same as in the corpus I use (a 3.1 MB .txt file); at other times it just sticks random snippets together without any sense.
