
Finetuning #19

Open
Stamenov opened this issue Sep 2, 2019 · 42 comments

Comments

@Stamenov commented Sep 2, 2019

Hi,

Just wondering: since you are basing the TF train.py on nshepperd's finetuning script, I was wondering if this code also supports finetuning, or whether models trained here from scratch are finetunable with nshepperd's train.py?

Best regards.

@lopuhin (Owner) commented Sep 2, 2019

Hi, it's possible to resume training from a checkpoint (so it's the same functionality as fine-tuning), but it's not possible to fine-tune the original GPT-2 model, because the tokenizer is different.
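This repo trains with a sentencepiece vocabulary rather than OpenAI's byte-level BPE, so the token ids simply don't line up. As a rough illustration (the file name is just an example):

```python
# Illustrative only: inspect the sentencepiece vocabulary used by this repo.
# The original GPT-2 checkpoints use a different byte-level BPE, so ids are incompatible.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('sp-model.model')                # file name is an example
print(sp.get_piece_size())               # vocabulary size
print(sp.encode_as_ids('Ein Beispiel'))  # ids are specific to this vocabulary
```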

@Stamenov (Author) commented Sep 2, 2019

I am currently looking at finetuning these models, which were trained with this repo, or at least a fork of it. So I guess simply resuming with the TF version would suffice.
Thanks.

@lopuhin (Owner) commented Sep 2, 2019

Oh nice, thanks for sharing the link. Then yes, fine-tuning should work.

@Stamenov (Author) commented Sep 2, 2019

Now that I think about it, I am not sure whether the models are TF or PyTorch. Is there a way to find out, given the model files?
(screenshot of the model directory attached)
Thanks again!

@lopuhin (Owner) commented Sep 2, 2019

These are PyTorch models, which is good, because the TF code is not really supported, while the PyTorch code is better developed and supported.
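If in doubt, one quick (illustrative) check is to try torch.load on one of the files - a PyTorch checkpoint loads as a regular Python dict, while TF checkpoints ship as .ckpt/.index/.meta files:

```python
# Illustrative check only: PyTorch checkpoints can be opened with torch.load.
import torch

state = torch.load('model.pt', map_location='cpu')  # file name is an example
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])  # e.g. state_dict, seen_tokens, ...
```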

@Stamenov (Author) commented Sep 2, 2019

Cool, will try on the weekend, thanks for the blazing fast responses 🥇

@gooofy (Contributor) commented Sep 2, 2019

Hey, cool - thanks for trying out my GPT-2 models! I would be happy to hear your feedback on these.

The larger GPT-2 model is still training, so if you want I can provide an updated model this week which should have a slightly lower loss than the one released so far.

@Stamenov (Author) commented Sep 2, 2019

Hey @gooofy, this would be very cool, please do!
Thanks.

@Stamenov (Author) commented Sep 4, 2019

Hi, it's me again. I am not sure this is the right thread to follow up in, so feel free to move it / let me know.
I am trying to start an adaptation from the 355M German model, but I seem to get a mismatch in the layer sizes. I guess I need the hyperparameters from the initial training.
This is the hyperparameter configuration I get at the beginning of training:

"batch_size": 2, "epochs": 10, "g_accum_gradients": 1, "hparams": { "gradient_checkpointing": false, "n_ctx": 1024, "n_embed": 768, "n_head": 12, "n_hidden": 768, "n_layer": 12, "n_vocab": 50000 }, "lr": 0.00025 } Loading dataset from /bpe Train dataset has 419,935 tokens Validation dataset has 3,608 tokens

And this is a small part of the size mismatch errors:

size mismatch for blocks.11.attn.c_attn.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for blocks.11.attn.c_attn.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for blocks.11.attn.c_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for blocks.11.attn.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.g: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.b: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

@lopuhin (Owner) commented Sep 4, 2019

Right, on each invocation you'll need to set all hyperparameters, and the error is indeed due to a hyperparameter mismatch. The correct hyperparameters should be in the params.json file which comes with the model - unfortunately we currently can't load them automatically.
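Until that is automated, you can read them back out of params.json yourself; a minimal sketch (path is an example):

```python
# Sketch: recover the training hyperparameters from the params.json written
# next to the checkpoint, so they can be repeated on the command line.
import json
from pathlib import Path

params = json.loads(Path('path/to/model/params.json').read_text())
print(params['hparams'])   # n_ctx, n_embed, n_head, n_hidden, n_layer, n_vocab
print(params.get('argv'))  # the original command line, if stored
```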

@Stamenov (Author) commented Sep 4, 2019

Is there a CLI for the hyperparams? I can't seem to find one.

@lopuhin (Owner) commented Sep 4, 2019

Yes, it's defined implicitly via the fire library, so all arguments of main are settable via command-line arguments. Also, params.json should contain the full argument string, which can serve as an example.
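As a rough sketch of how that works (the argument names below are illustrative, not the exact signature of main):

```python
# Minimal fire example: every argument of main becomes a command-line argument,
# and keyword arguments become flags such as --n_embed=1024 --n_layer=24.
import fire

def main(run_path, corpus_path, sp_model_path,
         n_embed=768, n_head=12, n_layer=12, batch_size=2):
    print(run_path, corpus_path, sp_model_path,
          n_embed, n_head, n_layer, batch_size)

if __name__ == '__main__':
    fire.Fire(main)
```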

@Stamenov (Author) commented Sep 4, 2019

I guess resuming is also implicit whenever there are *.pt files in the model directory; furthermore, params.json is overwritten on each invocation with the current parameters.

@lopuhin (Owner) commented Sep 4, 2019

Indeed, resuming is implicit here:

transformer-lm/lm/main.py

Lines 139 to 140 in fa3f529

if model_path.exists():
    load_model()

and right, the params.json file will be overwritten, which is not great.

@Stamenov (Author) commented Sep 4, 2019

Hey, I am running into some further problems trying to resume from the big German model, even after I set the params. I would appreciate any help. Also, again, I am running the latest version from @gooofy's fork:

Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [00:00, 41376077.40it/s]
Traceback (most recent call last): | 0/417792 [00:00<?, ?it/s]
  File "/home/martin/miniconda/envs/topics/bin/gpt-2", line 11, in <module>
    load_entry_point('lm', 'console_scripts', 'gpt-2')()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 322, in fire_main
    fire.Fire(only_allow_defined_args(main))
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
    return function_to_decorate(*args, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 259, in main
    train()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 213, in train
    validate()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 219, in validate
    valid_loss=get_valid_loss())
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 233, in get_valid_loss
    logits = model(ctx)['logits']
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/model.py", line 53, in forward
    h, present = torch.utils.checkpoint.checkpoint(block, h, past[:, i] if past is not None else None)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 128, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 34, in forward
    check_backward_validity(args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in check_backward_validity
    if not any(inp.requires_grad for inp in inputs):
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in <genexpr>
    if not any(inp.requires_grad for inp in inputs):
AttributeError: 'NoneType' object has no attribute 'requires_grad'
epochs: 6787it [01:16, 88.50it/s]
  7%|██████▍

@lopuhin (Owner) commented Sep 4, 2019

Hmm, I see - this looks related to gradient checkpointing (which I haven't had a chance to try yet). I wonder if it will work if you disable it? It could be something else as well; hard to tell, sorry.

@gooofy (Contributor) commented Sep 4, 2019

here is the command line I am using for training this model - does this help?

gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000

params.json:

{
"argv": "/home/bofh/projects/ai/torch/bin/gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
"gradient_checkpointing": true,
"n_ctx": 1024,
"n_embed": 1024,
"n_head": 16,
"n_hidden": 1024,
"n_layer": 24,
"n_vocab": 50000
},
"lr": 0.00025
}

@Stamenov (Author) commented Sep 5, 2019

I have now disabled the gradient checkpointing, and I get stuck at the same place, but no error this time:
--n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing=0 --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": 0,
    "n_ctx": 1024,
    "n_embed": 1024,
    "n_head": 16,
    "n_hidden": 1024,
    "n_layer": 24,
    "n_vocab": 50000
},
"lr": 0.00025
}
Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [01:17, 87.53it/s]
  7%|███████▍  | 27648/417792 [01:17<18:14, 356.55it/s]

@gooofy (Contributor) commented Sep 5, 2019

just a wild guess: maybe you're using a different torch version?

lm                      0.1.0             /home/bofh/projects/ai/torch/transformer-lm
pytorch-pretrained-bert 0.6.2
torch                   1.2.0a0+6f6a680
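For comparison, you can print the versions on your side with:

```python
import torch

print(torch.__version__)          # e.g. 1.2.0
print(torch.version.cuda)         # CUDA version this torch build was compiled against
print(torch.cuda.is_available())  # whether a GPU is visible
```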

@Stamenov (Author) commented Sep 5, 2019

Aahh, I am indeed.
pytorch-pretrained-bert 0.6.1
torch 1.0.1.post2

@Stamenov (Author) commented Sep 5, 2019

Just wondering, which CUDA version do you use? 10.0?

@gooofy (Contributor) commented Sep 6, 2019

yes, 10.0

@gooofy (Contributor) commented Sep 6, 2019

new release has finished uploading, available here: https://zamia.org/brain/

trained for 4.5 epochs on 27GB text corpus

@Stamenov (Author) commented Sep 7, 2019

Hi,
for some reason, even after I installed PyTorch 1.2.0, CUDA 10, conda with Python 3.7.4 and NVIDIA drivers "NVIDIA-SMI 410.104", the training just quits after 7%, with no error message, similarly to my previous post.

attrs 19.1.0
certifi 2019.6.16
cycler 0.10.0
filelock 3.0.12
fire 0.1.3
json-lines 0.5.0
json-log-plots 0.0.1
kiwisolver 1.1.0
lm 0.1.0 /home/martin/dev/gtp2/gpt-2-german/transformer-lm
matplotlib 3.0.3
numpy 1.16.2
pip 19.2.2
pyparsing 2.4.2
python-dateutil 2.8.0
sentencepiece 0.1.8
setuptools 41.0.1
six 1.12.0
torch 1.2.0
tqdm 4.31.1
wheel 0.33.4

@lopuhin (Owner) commented Sep 7, 2019

@Stamenov I wonder if this could be some bug in the resume code - I didn't test it that much. Does the progress bar jump to 7% immediately, or does it get there after some time? There is no error message printed, right? Can you check the exit code? Also, I wonder if training from scratch works for you (to narrow down the issue)?

@gooofy (Contributor) commented Sep 7, 2019

I think I resumed training for this model several times over the weeks and never noticed any issue. There is, however, still this so far unexplained loss spike that happened pretty early in training; not sure if this could be related.
(attached loss plot: loss_de345-root)

@Stamenov (Author) commented Sep 9, 2019

@lopuhin It does take some time to get there; it also just jumps to 7% from 0, after using my GPU for some time (as reported by nvidia-smi) and after briefly showing a "0/3 validation" progress bar just below the overall progress bar.

Training from scratch works, but with the default params only. With the ones from the German model, as supplied by @gooofy, I get a CUDA out of memory error. Maybe this is related?

Are there any additional logs or information stored while training is going on that I could check?

EDIT: Reducing the batch size to 1 shows the same behaviour; the progress bar now shows 96% and then it quits.

@gooofy (Contributor) commented Sep 9, 2019

What GPU model are you using? My settings are aimed at 11/12 GB cards (1080 Ti / Titan X).

@Stamenov (Author) commented Sep 9, 2019

I tried it with K80 and Tesla V100, same results.

@Stamenov (Author) commented Sep 14, 2019

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case).
Basically, I think this condition:
https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185
does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The condition exits immediately, as it finds that the model has already seen many more tokens than the current dataset contains, so training must be finished.
In the case of the German model, it has already seen 4202621952 tokens, while my new finetuning dataset is only 419935.
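To put numbers on it: the loop condition is roughly seen_tokens < epochs * epoch_size, with epoch_size being the token count of the current dataset, so with my values:

```python
# Rough illustration of why resuming exits immediately on a small finetuning set.
seen_tokens = 4_202_621_952   # already seen by the German checkpoint
epoch_size = 419_935          # tokens in the new finetuning dataset
epochs = 10

print(epochs * epoch_size)                # 4,199,350
print(seen_tokens < epochs * epoch_size)  # False -> the training loop never runs
```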
How could this be fixed?

@gooofy (Contributor) commented Sep 15, 2019

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

@hbajohr commented May 1, 2020

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

Hi, this sounds great - did you implement the --finetune flag?

@khalo-sa commented

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case).
Basically, I think this condition:
https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185
does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The condition exits immediately, as it finds that the model has already seen many more tokens than the current dataset contains, so training must be finished.
In the case of the German model, it has already seen 4202621952 tokens, while my new finetuning dataset is only 419935.
How could this be fixed?

Quite some time has passed, but I just wonder if you were able to find a solution? Also, I'm not sure whether line 185 in the code is still the same line you were referring to back then.

@SaschaStenger commented

So I've tried finetuning the German model by just setting seen_tokens back to 0, as was suggested.

def load_model():
    nonlocal seen_tokens
    if torch.cuda.is_available():
        state = torch.load(model_path)
    else:
        state = torch.load(model_path, map_location=torch.device('cpu'))
    if 'seen_tokens' in state:
        seen_tokens = state['seen_tokens']
    else:  # legacy format
        seen_tokens = state['step'] * step_tokens
    if finetune:
        seen_tokens = 0  # reset so the training loop does not exit immediately
    state_dict = fixed_state_dict(state['state_dict'])
    model.load_state_dict(state_dict)
    if torch.cuda.is_available():
        optimizer.load_state_dict(torch.load(optimizer_path))
    else:
        optimizer.load_state_dict(torch.load(optimizer_path, map_location=torch.device('cpu')))
    print(f'Resuming from seen_tokens {seen_tokens:,}')

But the finetuned model performs much worse - not at all like before.
I'm getting output like:
der dem das den es ist der das
So my question would be: is there anything else that I have to take into account when finetuning such a model?
Or might it just be that my finetuning dataset isn't good? (The size of the encoded training set is around 1.5 MB.)

@hafsahabib-educator commented

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

@SaschaStenger commented

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

Sorry, so far I haven't been able to. But I'm still very interested in a solution and will look into it again and post any solution that I might find.
Although if anyone else has any suggestions on how to enable finetuning on this, I'd be more than happy to try them out.

@hafsahabib-educator commented

@SaschaStenger I am trying a few things. Will surely let you know if all goes well.

@SaschaStenger commented

Thank you @hafsabukhary. I wanted to ask if any of your approaches have been fruitful.

@hafsahabib-educator commented

@SaschaStenger I used the old main.py from https://github.com/gooofy/transformer-lm/tree/master/lm.
I updated the following code in train:

        prev_tokens = 0
        if finetune:
            print('fine tuning enabled')
            prev_tokens = seen_tokens
        while seen_tokens < prev_tokens + (epochs * epoch_size):

This way training continues. You have to use the default parameters of the German model, e.g. the vocab size.
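For example, an invocation could look like this (paths are placeholders, and --finetune assumes the flag from the patch above is exposed as an argument of main):

gpt-2 de345-root /path/to/encoded-finetune-data sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=1 --finetune --save_every=5000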

@weibelbit commented

Hi,
I have used the old main.py and tried to update the code as @hafsabukhary and @SaschaStenger proposed, but I still have the same problem as @SaschaStenger describes:

But the finetuned model performs much worse - not at all like before.
I'm getting output like:
der dem das den es ist der das

I am trying to finetune on a specific dataset. Shouldn't the finetuned model have the same grammatical quality as the German 345M model from the beginning, and pick up the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

Thank you.

@SaschaStenger commented

I am trying to finetune on a specific dataset. Shouldn't the finetuned model have the same grammatical quality as the German 345M model from the beginning, and pick up the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

I'm having similar issues.
I did add some more general text to my finetuning dataset, but it still takes quite a few iterations until it produces anything intelligible, and even then it is nowhere near the original performance.
Any help in this matter would be greatly appreciated.

@weibelbit commented

I made another finetuning test around 970 epochs. Now it sometimes seems to overfit, generating sentences that are the same as in the corpus I use (a 3.1 MB .txt file); at other times it just sticks random snippets together without any sense.
