Tensor parallelism issues #598
This check runs regardless on NVIDIA systems, and doesn't generally cause an issue. But it looks like you have an executable called [...]. I've changed it so that it should catch any exceptions from that check, so you shouldn't have an issue. It's in the dev branch, but a new release with this and a bunch more fixes is coming real soon.
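For readers following along, here is a minimal sketch of what "catch any exceptions from that check" could look like. It is not the actual exllamav2 code; `run_probe` and its arguments are hypothetical names used only for illustration. The idea is that an optional hardware probe which shells out to an external tool should treat every launch failure (missing binary, permission denied, timeout) as "tool not available" rather than crash.

```python
import subprocess
import sys


def run_probe(cmd, timeout=5.0):
    """Run an optional external tool and return its stdout, or None on any failure.

    `cmd` is a list such as ["some-tool", "--flag"]; the name is hypothetical.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.SubprocessError):
        # OSError covers FileNotFoundError and PermissionError (for example an
        # unreadable directory on PATH); SubprocessError covers TimeoutExpired.
        return None
    if result.returncode != 0:
        # The tool ran but reported an error: treat it as unavailable.
        return None
    return result.stdout


# Usage: a probe that works, followed by one that cannot be launched.
print(run_probe([sys.executable, "--version"]))
print(run_probe(["no-such-tool-hopefully-xyz123"]))
```

Returning `None` instead of raising lets the caller carry on as if the tool were simply not installed, which seems to be the intent described above.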
After thinking about your response and the original error message a little, I discovered that I had added a directory my user doesn't have access to into PATH. That's the cause of the permission denied error. The most recent dev commit still errors identically when this is the case, but does resolve correctly when the user running the software actually has permissions to their own PATH. Thanks/Sorry! However, the ROCm stuff is the far smaller of the two issues I mentioned. Here's a log of attempting to run inference with tensor parallelism but without flash attention in the most recent dev:
Oh, right, I guess this wasn't really communicated anywhere, I apologize for that. But the TP feature currently requires flash-attn. :/ It's still a little experimental and unfinished, and I'll probably add SDPA support sometime soon, though it still won't work with the dynamic generator.
Right, I've pushed an update to the dev branch that should allow inference with TP mode even when flash-attn isn't available. It uses a slower code path for now, so it's hard to say if you'll see any speedup. Torch SDPA is very limited (it's not just paged attn that's missing but also GQA support), but I'll see about further improving performance down the line.
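To illustrate why missing GQA support costs performance: with grouped-query attention, several query heads share one key/value head, so a kernel that expects equal head counts needs the K/V heads repeated first. The sketch below shows only the head mapping, in plain Python with lists standing in for tensors. The function names are hypothetical, and this is not how exllamav2 implements its fallback path.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Return the index of the K/V head that query head `q_head` attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per K/V head
    return q_head // group_size


def expand_kv_heads(kv_heads, num_q_heads):
    """Repeat each K/V head so there is exactly one per query head.

    `kv_heads` is a list with one entry per K/V head (stand-ins for tensors).
    """
    num_kv_heads = len(kv_heads)
    return [kv_heads[kv_head_for(q, num_q_heads, num_kv_heads)]
            for q in range(num_q_heads)]


# Usage: 8 query heads sharing 2 K/V heads.
print(expand_kv_heads(["kv0", "kv1"], 8))
```

On real tensors the repeat means extra memory traffic for K and V at every step, which is one plausible reason a path without native GQA runs slower.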
No speedup, but it does function! Running the following command on an 8x P100 machine:
python test_inference.py -m /home/llama/mod/exl2/magnum-v2-123b-exl2/ -nfa -p "Once upon a time," -gs auto
With -tp:
For comparison, on my hardware, I'm used to getting around 8-10 tokens/second with a similar model in GPTQ, running on whatever aphrodite engine uses for tensor parallelism. I thought specifying -nxf might prompt exllamav2 to use SDPA, and thus help show whether it was the lack of GQA support that hurt performance so much, but it didn't actually seem to affect performance at all.
A couple issues with the new tensor parallelism implementation!
Tensor Parallelism doesn't appear to respect a lack of flash attention, even via the -nfa flag. It also doesn't document flash attention as a requirement, instead crashing on the first attempted inference run when flash attention isn't available. My hardware doesn't have support for flash attention, so it would be super cool if the tensor parallelism implementation could fall back to xformers or similar.
Attempting to run tensor parallelism without also supplying gpu-split appears to result in the code looking for AMD memory on NVIDIA computers. Adding -gs appears to fix this, but it didn't seem like intended behavior?
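The fallback requested in the first point above could be sketched roughly as follows. This is an assumption-laden illustration, not exllamav2's selection logic: `pick_attention_backend`, its `no_flash_attn` parameter, and the returned strings are hypothetical. The only things assumed about the real packages are their import names, `flash_attn` and `xformers`.

```python
import importlib.util


def module_available(name):
    """True if a top-level module with this name can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def pick_attention_backend(no_flash_attn=False, is_available=module_available):
    """Pick an attention backend, falling back when flash-attn is missing or disabled."""
    if not no_flash_attn and is_available("flash_attn"):
        return "flash_attn"
    if is_available("xformers"):
        return "xformers"
    # Last resort: PyTorch's built-in scaled dot-product attention.
    return "sdpa"


# Usage: the result depends on what is installed on this machine.
print(pick_attention_backend())
print(pick_attention_backend(no_flash_attn=True))
```

Deciding the backend once at startup, and honoring an explicit opt-out like this, would also let a missing dependency be reported before the first inference run rather than as a crash during it.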