Describe the bug
I found that after fine-tuning with LoRA, token throughput drops significantly. I trained a model for unit test generation and then fused the LoRA adapter into the base weights.
For my test dataset, the LoRA-tuned model took 8:55:34 h and generated a total of 246,362 tokens. That is a throughput of about 7.67 tokens per second.
The base model took only 2:02:17 h and generated 189,509 tokens. By my calculation, that is around 21 tokens per second.
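For reference, the throughput figures are just total generated tokens divided by wall-clock seconds, e.g. for the fused model:

```python
# Throughput = generated tokens / wall-clock seconds (figures from above).
lora_seconds = 8 * 3600 + 55 * 60 + 34               # 8:55:34
print(f"LoRA-fused: {246_362 / lora_seconds:.2f} tokens/s")  # ~7.67
```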
The LoRA paper states:
> Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
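To make that concrete, here is a minimal sketch (toy shapes in plain PyTorch, not my actual model) of why the merge should be free at inference: once W' = W0 + BA is formed, the forward pass is a single matmul again, exactly like the base model.

```python
import torch

d, r = 1024, 8                               # hidden size and LoRA rank (toy values)
W0 = torch.randn(d, d, dtype=torch.float64)  # frozen base weight
B = torch.randn(d, r, dtype=torch.float64)   # trained LoRA factor B
A = torch.randn(r, d, dtype=torch.float64)   # trained LoRA factor A
x = torch.randn(1, d, dtype=torch.float64)

y_adapter = x @ W0.T + x @ (B @ A).T         # base path + low-rank path at runtime
W_merged = W0 + B @ A                        # one-time merge before deployment
y_merged = x @ W_merged.T                    # single matmul, same cost as base model

assert torch.allclose(y_adapter, y_merged)   # identical outputs by construction
```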
Is this reduction normal or within the expected range?
To Reproduce
Fine-tune with LoRA, fuse the adapter into the base weights, then run generation on the test set; a simplified snippet follows.
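My real pipeline is longer; the sketch below only shows the fuse-then-generate flow. I am assuming a Transformers + PEFT setup with placeholder model and adapter names here; only the structure matters.

```python
# Simplified fuse-then-generate flow; "base-model" and "lora-adapter" are
# placeholder names, not my actual checkpoints.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained("base-model")

model = PeftModel.from_pretrained(base, "lora-adapter")
model = model.merge_and_unload()          # fuse the adapter into the base weights
model.eval()

prompt = "Write a unit test for the following function: ..."
inputs = tok(prompt, return_tensors="pt")

start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```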
Expected behavior
I would expect a significantly higher token rate, with at most a minor impact from the LoRA tuning, since the adapter has been fused into the base weights.