Train on responses only does not work with TinyLlama-chat #1015

Closed
akhlakm opened this issue Sep 11, 2024 · 6 comments · Fixed by unslothai/unsloth-zoo#4
Labels: currently fixing (Am fixing now!), URGENT BUG (Urgent bug)

Comments

@akhlakm

akhlakm commented Sep 11, 2024

The following error occurs while using train_on_responses_only on the unsloth/tinyllama-chat-bnb-4bit model.

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''

Link to the test notebook: https://colab.research.google.com/gist/akhlakm/c7c40b0c29d112f2544168be42d3410b/llama-3-1-8b-conversational-unsloth-2x-faster-finetuning.ipynb

Also, when the chat template defined in tokenizer_config.json is used together with train_on_responses_only, I get the following error.

trainer_stats = trainer.train()
                    ^^^^^^^^^^^^^^^
  File "<string>", line 145, in train
  File "<string>", line 320, in _fast_inner_training_loop
  File "/home/user/unsloth_env/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 806, in torch_call
    batch = pad_without_fast_tokenizer_warning(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3560, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 778, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
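
(For context on that last error, here is a minimal sketch of the failure mode, under the assumption that train_on_responses_only leaves the labels column ragged or nested so the collator cannot batch it into a single tensor; this is an assumption about the cause, not taken from the trace.)

import torch

# Hypothetical labels for a batch of two examples with different lengths.
labels = [[1, 2, 3], [4, 5]]
torch.tensor(labels)  # ValueError: expected sequence of length 3 at dim 1 (got 2)
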
@LostRuins

I am getting the same error.

ValueError                                Traceback (most recent call last)
[<ipython-input-11-5017548030e3>](https://localhost:8080/#) in <cell line: 259>()
    257 # optionally train only on resps
    258 from unsloth.chat_templates import train_on_responses_only
--> 259 trainer = train_on_responses_only(
    260     trainer,
    261     instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",

2 frames
[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in train_on_responses_only(trainer, instruction_part, response_part)
   1754 
   1755     # Get most common tokens since tokenizers can tokenize stuff differently!
-> 1756     Q_must, Q_left, Q_right = _find_common_token_ids(instruction_part, tokenizer)
   1757     A_must, A_left, A_right = _find_common_token_ids(response_part,    tokenizer)
   1758 

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in _find_common_token_ids(component, tokenizer)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''

@danielhanchen added the currently fixing and URGENT BUG labels on Sep 14, 2024
@danielhanchen
Contributor

Oh whoops ok just saw your other issue @LostRuins as well - will definitely investigate this - sorry about this!

@NazimHAli
Contributor

NazimHAli commented Oct 4, 2024

I get the same error using train_on_responses_only, but with unsloth/mistral-7b-instruct-v0.3-bnb-4bit and the chatml template. Is there a workaround, or can someone point to a potential way to fix it? The line where it errors out is beyond my understanding of how it could be fixed (a hedged sketch of how that line can fail follows the traceback below).

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

[<ipython-input-15-eec2c571abe8>](https://localhost:8080/#) in <cell line: 2>()
      1 from unsloth.chat_templates import train_on_responses_only
----> 2 trainer = train_on_responses_only(
      3     trainer,
      4     instruction_part = "<|im_start|>user\n",
      5     response_part = "<|im_start|>assistant\n",

2 frames

[/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py](https://localhost:8080/#) in <listcomp>(.0)
   1864     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1865     substring = substring.split(", ")[:-1]
-> 1866     substring = [int(x) for x in substring]
   1867 
   1868     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
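
(All three reports fail on the same line, so here is a minimal, hedged sketch of one way int() can receive an empty string there: _find_common_token_ids splits the longest common substring of the stringified token ids on ", ", and if that substring happens to begin with the separator, the first piece is empty. This is an assumption about the failure mode, not a confirmed diagnosis.)

# Hypothetical common substring that starts mid-separator.
substring = ", 517, 29962"
pieces = substring.split(", ")[:-1]   # -> ['', '517']
values = [int(x) for x in pieces]     # ValueError: invalid literal for int() with base 10: ''
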

@danielhanchen
Contributor

@NazimHAli Ok apologies I kinda forgot about this issue :( I will escalate this! Sorry on this!

@NazimHAli
Contributor

> @NazimHAli Ok apologies I kinda forgot about this issue :( I will escalate this! Sorry on this!

No worries man, there are tons of open issues and you can't fix all of them yourself! If I understood the logic here, I would fix it myself 😅

@4kasha

4kasha commented Oct 15, 2024

As suggested here, how about directly using DataCollatorForCompletionOnlyLM? The relevant part should look like this:

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Encode the response template with its surrounding context, then drop the
# leading tokens that depend on that context so the ids match inside full
# tokenized sequences (the exact trimming offset can vary by tokenizer).
response_template_with_context = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    # data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    data_collator = collator,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

I verified that this works.
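
(For the ChatML setup mentioned earlier, a hedged variant: TRL's DataCollatorForCompletionOnlyLM also accepts template strings directly, plus an instruction_template for masking user turns in multi-turn data. The template strings below are taken from NazimHAli's snippet; whether token trimming like the [2:] slice above is needed depends on the tokenizer.)

from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    instruction_template = "<|im_start|>user\n",
    response_template = "<|im_start|>assistant\n",
    tokenizer = tokenizer,
)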
