Describe the bug
I found that after fine-tuning with LoRA, token throughput drops significantly. I trained a model for unit test generation and then fused the LoRA adapter into the base weights.
For my test dataset, the LoRA-tuned model took 8:55:34 h and generated a total of 246,362 tokens. That is a throughput of about 7.67 tokens per second.
The base model took only 2:02:17 h and generated 189,509 tokens. By my calculation, that is around 21 tokens per second.
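For reference, the throughput figures are just total generated tokens divided by wall-clock seconds, e.g. for the fused model:

```python
# Throughput = generated tokens / wall-clock seconds (figures from above).
lora_seconds = 8 * 3600 + 55 * 60 + 34               # 8:55:34
print(f"LoRA-fused: {246_362 / lora_seconds:.2f} tokens/s")  # ~7.67
```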
The LoRA paper states:
> Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
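To make that concrete, here is a minimal sketch (toy shapes in plain PyTorch, not my actual model) of why the merge should be free at inference: once W' = W0 + BA is formed, the forward pass is a single matmul again, exactly like the base model.

```python
import torch

d, r = 1024, 8                               # hidden size and LoRA rank (toy values)
W0 = torch.randn(d, d, dtype=torch.float64)  # frozen base weight
B = torch.randn(d, r, dtype=torch.float64)   # trained LoRA factor B
A = torch.randn(r, d, dtype=torch.float64)   # trained LoRA factor A
x = torch.randn(1, d, dtype=torch.float64)

y_adapter = x @ W0.T + x @ (B @ A).T         # base path + low-rank path at runtime
W_merged = W0 + B @ A                        # one-time merge before deployment
y_merged = x @ W_merged.T                    # single matmul, same cost as base model

assert torch.allclose(y_adapter, y_merged)   # identical outputs by construction
```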
Is this reduction normal or within the expected range?
To Reproduce
Fine-tune with LoRA, fuse the adapter into the base weights, then run generation on the test set; a simplified snippet follows.
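My real pipeline is longer; the sketch below only shows the fuse-then-generate flow. I am assuming a Transformers + PEFT setup with placeholder model and adapter names here; only the structure matters.

```python
# Simplified fuse-then-generate flow; "base-model" and "lora-adapter" are
# placeholder names, not my actual checkpoints.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained("base-model")

model = PeftModel.from_pretrained(base, "lora-adapter")
model = model.merge_and_unload()          # fuse the adapter into the base weights
model.eval()

prompt = "Write a unit test for the following function: ..."
inputs = tok(prompt, return_tensors="pt")

start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```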
Expected behavior
I would expect a significantly higher token rate, with at most a minor impact from the LoRA tuning, since the adapter has been fused into the base weights.