[Feature request] Compatibility with transformers>=4.43.2 #65
Comments
Yes, also reported in #59 (comment). The streaming code unfortunately relies a lot on internals of the transformers library, so it can break at any time. Best would probably be to pin a specific version that works. Could you share which exact package requires the latest transformers version?
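Until such a pin lands in the package metadata, one stopgap (not from this thread, just a hedged illustration) is a runtime guard in your own code; the 4.43.0 bound below is only an example of "first version known to break for you":

```python
# A minimal sketch of a version guard, assuming 4.42.x is the last transformers
# series that works with the XTTS streaming code (the bound is illustrative).
import transformers
from packaging.version import Version

if Version(transformers.__version__) >= Version("4.43.0"):
    raise RuntimeError(
        "XTTS streaming has only been verified up to transformers 4.42.x; "
        f"found {transformers.__version__}"
    )
```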
It's not necessarily a package; it's a dependency for running the latest Meta LLaMA model, LLaMA 3.1, which uses the transformers library, and the latest version of transformers was recommended for it. Right now I am running the latest version that Coqui TTS allows, which works fine, but I get a lot of deprecation warnings from the transformers library.
Yeah, I'm currently trying out Google's Gemma 2 LLM, and this is going to be an issue for those doing LLM + XTTS. Gemma 2 requires a newer version of transformers because version 4.40.2 doesn't recognize it. So we're left with a choice: be less flexible about which LLMs we can use, or drop XTTS completely.
It would be helpful if you shared what package/repo/code you're running to be aware of how Coqui is used and how it is affected by external changes. But for this kind of use case the best solution is probably to put the TTS and the LLM into separate environments, so that their dependencies don't affect each other.
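One hypothetical way to do that split without rewriting the whole app is to keep XTTS in its own virtualenv and talk to it over a pipe; everything below (the `tts-env` path and `tts_worker.py` script) is an assumption for illustration, not code from this project:

```python
# Sketch: run XTTS in its own virtualenv so its pinned transformers
# version doesn't clash with the LLM's environment.
import subprocess

tts_proc = subprocess.Popen(
    ["tts-env/bin/python", "tts_worker.py"],  # hypothetical venv + worker script
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def speak(text: str) -> str:
    """Send text to the TTS worker; it replies with the path of the wav it wrote."""
    tts_proc.stdin.write(text + "\n")
    tts_proc.stdin.flush()
    return tts_proc.stdout.readline().strip()
```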
For my current use case, I am using Nextcord for my Discord bot, and I have the TTS and the LLM running in the same "cog", which is a way to isolate bot features into their own module for the sake of modularity. Separating XTTS from my LLM would require a bit of an architectural change to my private codebase, and splitting them would add some latency between the two modules, which is not ideal for a real-time application. Everything runs locally on my machine. The issue is mainly the incompatibility between the transformers versions required by newer local open-source LLMs and by XTTS. I'm using inference streaming normally, passing text into the text input parameter as written in the docs for XTTS v2.
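For reference, the XTTS v2 inference-streaming path being discussed looks roughly like this (adapted from the Coqui docs; the checkpoint paths and `ref.wav` are placeholders, and exact signatures may differ between releases):

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from a local checkpoint directory (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/")
model.cuda()

# Voice-cloning latents from a reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["ref.wav"])

# Stream audio chunks as they are generated instead of waiting for the full clip.
chunks = model.inference_stream(
    "Hello, this is a streaming test.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk.shape}")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
```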
I tried the patch (https://github.com/h2oai/h2ogpt/blob/52923ac21a1532983c72b45a8e0785f6689dc770/docs/xtt.patch) mentioned in that thread and it worked.
Just throwing this in here because I ran into another model that relies on 4.43: Microsoft Phi-3.5-mini-instruct, which is apparently very decent for how small it is. I spent a day having GPT-4o help me make Coqui streaming work with transformers 4.43, and it did work, I got it to output voice from text! But it added something that caused my VRAM to spike, and I'm not familiar enough with neural net code to figure out what it did wrong. Python is also not my strong suit!
It would help to see your streaming implementation to tell whether it's the problem. It could also be the LLM if you are running it locally: some LLMs do spike in VRAM usage as you use them, especially if you feed them context like a chat history.
I'm just using the xtts/stream_generator.py script. I haven't tried Phi-3.5 because it relies on transformers 4.43, and Coqui only works up to 4.42.4 or so right now. When I ran the GPT-modified script (with transformers 4.43 installed) no other models were loaded, so the spike in VRAM was just related to the changes it made (I'm guessing). It was pretty ugly looking, to be honest.
+1 for this. There is some transformers code that breaks on the Mac M1 family, specifically this:
This appears to be fixed in more recent transformers releases but can't be leveraged by coqui-ai-tts due to incompatibility.
I would also greatly appreciate MPS (Apple Silicon) speedup for XTTS inference 🥺
Thanks a lot to @JohnnyStreet for submitting a fix for this, which I just merged into the |
Running on
This looks to be a limitation in PyTorch that will hopefully get fixed in future versions. You can set that environment variable to avoid the error.
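For anyone hitting this on Apple Silicon, the fallback can also be enabled from Python rather than the shell (a sketch; `ref.wav` is a placeholder reference clip):

```python
import os

# Enable CPU fallback for ops that are not implemented on MPS yet.
# Set this before torch is imported, or use
# `export PYTORCH_ENABLE_MPS_FALLBACK=1` in the shell as in the benchmark
# further down this thread.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from TTS.api import TTS

device = "mps" if torch.backends.mps.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(text="Hello world!", speaker_wav="ref.wav", language="en", file_path="out.wav")
```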
I posted the original PR and I am not convinced this is a limitation [of PyTorch], but I don't have an MPS device available to debug it. This is a very hacky shot in the dark, but as a workaround you might try installing accelerate and then doing
and then reference
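The snippet referenced above isn't preserved here, but judging from the benchmark script later in the thread, the suggestion presumably amounts to something like this (a guess, not the original code):

```python
from accelerate import Accelerator
from TTS.api import TTS

# Let accelerate pick the device (mps on Apple Silicon, if available)
# and move the model there, instead of calling .to("mps") directly.
accelerator = Accelerator()
device = accelerator.device

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```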
Updated testing and still: Running on
I attempted to fix the issue myself by moving the convolution operation in
Update: tried comparing with raw MPS, with accelerate MPS, and with CPU only, lol had the same results anyway. Results here ⬇️

### Script used 📑

```python
import torch
import time
from TTS.api import TTS
from accelerate import Accelerator


def tts_generate(tts, device, text, output_file):
    """Generate speech and measure inference time."""
    print(f"Generating on {device}...")
    start_time = time.time()
    tts.tts_to_file(
        text=text,
        speaker_wav="ref.wav",
        language="en",
        file_path=output_file,
    )
    elapsed_time = time.time() - start_time
    print(f"Generation time on {device}: {elapsed_time:.2f} seconds")
    return elapsed_time


def main():
    # Initialize the TTS model
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Test on CPU
    cpu_device = torch.device("cpu")
    tts.to(cpu_device)
    cpu_time = tts_generate(tts, "CPU", "Hello world!", "output_cpu.wav")

    # Test with Raw MPS
    if torch.backends.mps.is_available():
        raw_mps_device = torch.device("mps")
        tts.to(raw_mps_device)
        mps_time = tts_generate(tts, "Raw MPS", "Hello world!", "output_mps.wav")
    else:
        print("MPS is not available on this machine.")
        mps_time = None

    # Test with Accelerate MPS
    accelerator = Accelerator()
    accel_device = accelerator.device
    tts.to(accel_device)
    accelerate_time = tts_generate(tts, f"Accelerate ({accel_device})", "Hello world!", "output_accel_mps.wav")

    # Print comparison results
    print("\n--- Inference Time Comparison ---")
    print(f"CPU Time: {cpu_time:.2f} seconds")
    if mps_time:
        print(f"Raw MPS Time: {mps_time:.2f} seconds")
        print(f"Time difference (CPU vs Raw MPS): {cpu_time - mps_time:.2f} seconds")
    print(f"Accelerate MPS Time: {accelerate_time:.2f} seconds")
    print(f"Time difference (CPU vs Accelerate MPS): {cpu_time - accelerate_time:.2f} seconds")


if __name__ == "__main__":
    main()
```

Results: 🧪

```
(newtts_test) drew@wmughal-CN4D09397T Downloads % export PYTORCH_ENABLE_MPS_FALLBACK=1
python test.py
Generating on CPU...
Generation time on CPU: 2.93 seconds
Generating on Raw MPS...
Generation time on Raw MPS: 3.04 seconds
Generating on Accelerate (mps)...
Generation time on Accelerate (mps): 3.16 seconds
--- Inference Time Comparison ---
CPU Time: 2.93 seconds
Raw MPS Time: 3.04 seconds
Time difference (CPU vs Raw MPS): -0.11 seconds
Accelerate MPS Time: 3.16 seconds
Time difference (CPU vs Accelerate MPS): -0.23 seconds
```

Output Files ⬇️
Hello, I am currently working with the new LLaMA 3.1 models by Meta and they require the newer versions of transformers, optimum, and accelerate. I ran into compatibility issues with XTTS regarding the version of transformers.
I personally use the inference streaming feature, and that's where I am having issues.
Here is an error log I got: