Train on responses only does not work with TinyLlama-chat #1015

Closed
akhlakm opened this issue Sep 11, 2024 · 6 comments · Fixed by unslothai/unsloth-zoo#4
Labels: currently fixing (Am fixing now!), URGENT BUG (Urgent bug)

Comments

@akhlakm

akhlakm commented Sep 11, 2024

The following error occurs while using train_on_responses_only on the unsloth/tinyllama-chat-bnb-4bit model.

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''

Link to the test notebook: https://colab.research.google.com/gist/akhlakm/c7c40b0c29d112f2544168be42d3410b/llama-3-1-8b-conversational-unsloth-2x-faster-finetuning.ipynb

Also, when the chat template defined in tokenizer_config.json is used together with train_on_responses_only, I get the following error.

trainer_stats = trainer.train()
                    ^^^^^^^^^^^^^^^
  File "<string>", line 145, in train
  File "<string>", line 320, in _fast_inner_training_loop
  File "/home/user/unsloth_env/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 806, in torch_call
    batch = pad_without_fast_tokenizer_warning(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3560, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 778, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
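
(For context on that last error, here is a minimal sketch of the failure mode, under the assumption that train_on_responses_only leaves the labels column ragged or nested so the collator cannot batch it into a single tensor; this is an assumption about the cause, not taken from the trace.)

import torch

# Hypothetical labels for a batch of two examples with different lengths.
labels = [[1, 2, 3], [4, 5]]
torch.tensor(labels)  # ValueError: expected sequence of length 3 at dim 1 (got 2)
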
@LostRuins

I am getting the same error.

ValueError                                Traceback (most recent call last)
[<ipython-input-11-5017548030e3>](https://localhost:8080/#) in <cell line: 259>()
    257 # optionally train only on resps
    258 from unsloth.chat_templates import train_on_responses_only
--> 259 trainer = train_on_responses_only(
    260     trainer,
    261     instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",

2 frames
[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in train_on_responses_only(trainer, instruction_part, response_part)
   1754 
   1755     # Get most common tokens since tokenizers can tokenize stuff differently!
-> 1756     Q_must, Q_left, Q_right = _find_common_token_ids(instruction_part, tokenizer)
   1757     A_must, A_left, A_right = _find_common_token_ids(response_part,    tokenizer)
   1758 

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in _find_common_token_ids(component, tokenizer)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''

@danielhanchen added the currently fixing and URGENT BUG labels on Sep 14, 2024
@danielhanchen
Contributor

Oh whoops ok just saw your other issue @LostRuins as well - will definitely investigate this - sorry about this!

@NazimHAli
Contributor

NazimHAli commented Oct 4, 2024

I get the same error using train_on_responses_only, but with unsloth/mistral-7b-instruct-v0.3-bnb-4bit and the chatml template. Is there a workaround, or can someone point to a potential way to fix it? The line where it errors out is beyond my understanding of how it could be fixed (a hedged sketch of how that line can fail follows the traceback below).

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

[<ipython-input-15-eec2c571abe8>](https://localhost:8080/#) in <cell line: 2>()
      1 from unsloth.chat_templates import train_on_responses_only
----> 2 trainer = train_on_responses_only(
      3     trainer,
      4     instruction_part = "<|im_start|>user\n",
      5     response_part = "<|im_start|>assistant\n",

2 frames

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in <listcomp>(.0)
   1864     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1865     substring = substring.split(", ")[:-1]
-> 1866     substring = [int(x) for x in substring]
   1867 
   1868     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
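
(All three reports fail on the same line, so here is a minimal, hedged sketch of one way int() can receive an empty string there: _find_common_token_ids splits the longest common substring of the stringified token ids on ", ", and if that substring happens to begin with the separator, the first piece is empty. This is an assumption about the failure mode, not a confirmed diagnosis.)

# Hypothetical common substring that starts mid-separator.
substring = ", 517, 29962"
pieces = substring.split(", ")[:-1]   # -> ['', '517']
values = [int(x) for x in pieces]     # ValueError: invalid literal for int() with base 10: ''
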

@danielhanchen
Contributor

@NazimHAli Ok apologies I kinda forgot about this issue :( I will escalate this! Sorry on this!

@NazimHAli
Contributor

> @NazimHAli Ok apologies I kinda forgot about this issue :( I will escalate this! Sorry on this!

No worries man, there are tons of open issues and you can't fix all of them yourself! If I understood the logic here, I would fix it myself 😅

@4kasha

4kasha commented Oct 15, 2024

As suggested here, how about directly using DataCollatorForCompletionOnlyLM? The relevant part should look like this:

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Encode the response template with its surrounding context, then drop the
# leading tokens that depend on that context so the ids match inside full
# tokenized sequences (the exact trimming offset can vary by tokenizer).
response_template_with_context = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    # data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    data_collator = collator,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

I verified that this works.
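
(For the ChatML setup mentioned earlier, a hedged variant: TRL's DataCollatorForCompletionOnlyLM also accepts template strings directly, plus an instruction_template for masking user turns in multi-turn data. The template strings below are taken from NazimHAli's snippet; whether token trimming like the [2:] slice above is needed depends on the tokenizer.)

from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    instruction_template = "<|im_start|>user\n",
    response_template = "<|im_start|>assistant\n",
    tokenizer = tokenizer,
)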
