What is the format of "list" file? #9

Open
couragelfyang opened this issue Jan 6, 2020 · 2 comments

couragelfyang commented Jan 6, 2020

I see that list files such as "LibriSpeech/list/train.txt" are required parameters for main.py. These files do not seem to be provided officially with LibriSpeech. What is their format? Could you provide them, or the script that generates them?

@colinator

I believe they are just lists of utterance ids, one per line. My LibriSpeech install contains a number of files ending in .txt, where each line holds an utterance id followed by its transcription. This is how I generated the list files:

# Generates train.txt, eval.txt, validation.txt, which
# are just lists of utterance ids. This script looks
# at all the .txt files within LibriSpeech to extract
# the ids and write the files.
# An utterance id is a string like "61-70968-0009".

import os

trainroot = 'LibriSpeech/train-clean-100/'  # other training sets: 'LibriSpeech/train-clean-360/', 'LibriSpeech/train-other-500/'
devroot = 'LibriSpeech/dev-clean/'  # other dev set: 'LibriSpeech/dev-other/'
testroot = 'LibriSpeech/test-clean/'

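def generate_list(root_dir, fn):
    """Collect the utterance ids found under root_dir and write them to fn, one per line."""

    # get the utterance ids: the first whitespace-separated token of each transcript line
    utterance_ids = []
    for subdir, _, files in os.walk(root_dir):
        for filename in [f for f in files if f.endswith(".txt")]:
            with open(os.path.join(subdir, filename)) as f:
                # skip blank lines so they do not end up as empty ids
                ids = [l.split()[0] + "\n" for l in f if l.strip()]
                utterance_ids.extend(ids)

    # write them, creating the output directory first if it does not exist
    os.makedirs(os.path.dirname(fn) or ".", exist_ok=True)
    with open(fn, "w") as of:
        of.writelines(utterance_ids)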

if __name__ == "__main__":
    generate_list(trainroot, "LibriSpeech/list/train.txt")
    generate_list(testroot, "LibriSpeech/list/eval.txt")
    generate_list(devroot, "LibriSpeech/list/validation.txt")
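
To make the list format concrete, here is a minimal sketch of how one transcript line splits into an id and its text, and how an id could be mapped back to an audio file. The helper names (parse_transcript_line, utterance_id_to_flac_path, read_list_file) are hypothetical and not part of this repository, and the path layout assumes the standard LibriSpeech directory structure; whether main.py resolves ids in the same way is an assumption.

```python
import os


def parse_transcript_line(line):
    """Split one transcript line into (utterance_id, transcription)."""
    utterance_id, _, text = line.strip().partition(" ")
    return utterance_id, text


def utterance_id_to_flac_path(split_dir, utterance_id):
    """Build the expected audio path for an utterance id.

    Assumes the standard LibriSpeech layout:
    <split_dir>/<speaker>/<chapter>/<speaker>-<chapter>-<utterance>.flac
    """
    speaker, chapter, _ = utterance_id.split("-")
    return os.path.join(split_dir, speaker, chapter, utterance_id + ".flac")


def read_list_file(path):
    """Read a list file and return its utterance ids, ignoring blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    # The transcription text below is a placeholder, not real LibriSpeech data.
    uid, text = parse_transcript_line("61-70968-0009 PLACEHOLDER TRANSCRIPTION\n")
    print(uid)
    print(utterance_id_to_flac_path("LibriSpeech/test-clean", uid))
```

Running read_list_file on a generated list and checking that every resolved path exists with os.path.exists is a quick way to confirm that the list matches the data on disk.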


wubo2180 commented Jul 7, 2021

The dataset is available at http://www.openslr.org/12/
