
[assistance] Confirmation on Data Format and Structure for Fine-Tuning #141

Open

IrisSally opened this issue Aug 28, 2024 · 2 comments

@IrisSally
Checklist

  • I have read the README.md and dependencies.md files
  • I have confirmed that no existing issue or discussion covers this bug
  • I have confirmed that the problem occurs in the latest code or the stable release
  • I have confirmed that the problem is unrelated to the API
  • I have confirmed that the problem is unrelated to the WebUI
  • I have confirmed that the problem is unrelated to Finetune

Your issue

Hi,

I am planning to fine-tune ChatTTS using my own dataset, and I would like to confirm a few details regarding the data format and requirements.

1. Data Structure and .list File Format

Based on the documentation and examples, I have organized my data as follows:

File Structure

datasets/
└── data_speaker_a/
    ├── speaker_a/
    │   ├── 1.wav
    │   ├── 2.wav
    │   └── ... (more audio files)
    └── speaker_a.list

.list File Format

Each line in the .list file is formatted as filepath|speaker|lang|text, where:

  • filepath: Relative path to the audio file (relative to the directory containing the .list file).
  • speaker: Name of the speaker.
  • lang: Language code (e.g., ZH for Chinese, EN for English).
  • text: Transcription of the audio content.

Example:

speaker_a/1.wav|John|ZH|你好
speaker_a/2.wav|John|EN|Hello

Could you please confirm if this structure and format are correct?
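For reference, the `filepath|speaker|lang|text` format described above can be sketched as a small parser. This is only my own illustration of the layout being asked about; the function names and validation rules are assumptions, not the project's actual loader:

```python
# Hypothetical parser for the filepath|speaker|lang|text .list format.
# The field checks below are assumptions for illustration, not official rules.
from pathlib import Path


def parse_list_line(line: str) -> dict:
    """Split one .list line into its four '|'-separated fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    filepath, speaker, lang, text = parts
    return {"filepath": filepath, "speaker": speaker, "lang": lang, "text": text}


def load_list_file(list_path: str) -> list:
    """Parse every non-empty line of a .list file.

    Audio paths are resolved relative to the directory containing the
    .list file, matching the layout described in the issue.
    """
    base = Path(list_path).parent
    entries = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = parse_list_line(line)
        entry["abspath"] = base / entry["filepath"]
        entries.append(entry)
    return entries
```

For example, `parse_list_line("speaker_a/1.wav|John|ZH|你好")` would return a dict with `speaker == "John"` and `lang == "ZH"`.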

2. Audio Data Specifications

I am planning to use 100 audio files, each approximately 10 seconds long, with a sampling rate of 24000 Hz for training.

Is this a suitable setup for fine-tuning the model? Are there any specific recommendations or requirements?
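As a quick sanity check on the planned setup (24000 Hz, ~10 s clips), each WAV file's actual sample rate and duration can be verified with the standard library. The expected values here restate the numbers from this issue, not requirements confirmed by the maintainers:

```python
# Sanity-check a WAV file against the setup described above
# (24000 Hz sample rate, clips around 10 s). Thresholds are assumptions.
import wave


def check_wav(path: str, expected_rate: int = 24000, max_seconds: float = 30.0):
    """Return (sample_rate, duration_seconds), warning on obvious mismatches."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if rate != expected_rate:
        print(f"{path}: sample rate {rate} != {expected_rate}")
    if duration > max_seconds:
        print(f"{path}: clip is {duration:.1f}s, consider splitting it")
    return rate, duration
```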

Thank you for your assistance!

@zhzLuke96
Member

First, it's important to note that the current fine-tuning code is still in an unusable state.

Regarding your question about the dataset format, your understanding is correct. The configuration you described is appropriate.

As for the dataset size, there's no precise limitation or recommended size. Modern TTS models are complex with multiple trainable modules, each potentially requiring different amounts of data and configurations. For example, simple embedding fine-tuning might only need 10 voice samples, but for fine-tuning the GPT module, the amount of data needed depends on your training objective. If you're just adding a new voice, 100 samples should be sufficient. However, if you need to train instructional capabilities or enhance prompt following, you might need more.

A simple suggestion would be: if the dataset quality is poor, it's better to have more data. If the quality is high, then even a small amount of data (less than 30 samples) could be enough.

By the way, almost all of the training code in this repository comes from this PR: 2noise/ChatTTS#680. I've only made simple modifications to adapt it and pre-test the entire forge inference system (because we've made some changes to ChatTTS and have an internal .spkv1.json speaker file format).

@IrisSally
Author

Thank you for your patient explanation and assistance. It's been very helpful.
