
unable to process multiple files in python #330

Open
kartikay-eltropy opened this issue Jul 4, 2024 · 8 comments

@kartikay-eltropy

When I run
model.diarize_list("wav.scp")
I get this error after 2-3 files are processed:
ValueError: need at least one array to stack

But I am able to diarize files individually. I can't diarize multiple files using a for loop either, because I get the same error. Please help.
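For context on where that message comes from: NumPy raises it whenever it is asked to stack an empty list of arrays, which would happen here if a file contributes no segments to stack. A minimal reproduction of the message (not WeSpeaker's actual call site):

```python
import numpy as np

# np.stack requires at least one array; an empty list reproduces
# the exact error message seen above.
try:
    np.stack([])
except ValueError as e:
    print(e)  # need at least one array to stack
```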

@JiJiJiang
Collaborator

@cdliang11 Please check this error

@kartikay-eltropy
Author

kartikay-eltropy commented Jul 8, 2024

You are using
vad = SileroVAD()
This vad variable needs to be reset every time a new file comes in:
vad.reset()
Otherwise it continues with the context of the previous audio file. Please check the diarize code. I was able to debug and implement the fix.
@JiJiJiang @cdliang11
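The pattern described above can be sketched as follows. SileroVAD here is a minimal stand-in that models the stateful behaviour being described, not the real silero-vad class:

```python
class SileroVAD:
    """Stand-in for a stateful VAD: hidden state persists across calls."""

    def __init__(self):
        self.context = []  # streaming state carried between calls

    def get_speech_timestamps(self, wav_path):
        self.context.append(wav_path)  # state accumulates per file
        return [{"start": 0.0, "end": 1.0}]

    def reset(self):
        self.context = []  # clear state so files stay independent


vad = SileroVAD()
for wav in ["a.wav", "b.wav", "c.wav"]:
    segments = vad.get_speech_timestamps(wav)
    vad.reset()  # the fix: reset before the next file comes in
```

Without the reset() call, `context` keeps growing across files, which is the "continues with the context of the previous audio file" behaviour above.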

@JiJiJiang
Collaborator

You are using vad = SileroVAD(). This vad variable needs to be reset every time a new file comes in (vad.reset()); otherwise it continues with the context of the previous audio file. Please check the diarize code. I was able to debug and implement the fix. @JiJiJiang @cdliang11

Thank you for your question.
On the examples page of silero-vad, vad.reset() is only called in the stream imitation example, not in the full-audio case.
In WeSpeaker, we also use the full-audio interface.

@kartikay-eltropy
Author

But when I was creating an scp and trying to run it, it was crashing.

@JiJiJiang
Collaborator

JiJiJiang commented Jul 10, 2024

vad = SileroVAD()

And after you add vad.reset(), it runs normally?

@kartikay1999

kartikay1999 commented Jul 10, 2024

from silero_vad import SileroVAD
import numpy as np
import scipy.linalg
import torch
import torchaudio
import wespeaker
from sklearn.cluster._kmeans import k_means

vad = SileroVAD()
embedding_model = wespeaker.load_model_local('wespeaker-voxceleb-resnet34-LM')


def subsegment(fbank, seg_id, window_fs, period_fs, frame_shift):
    subsegs = []
    subseg_fbanks = []

    seg_begin, seg_end = seg_id.split('-')[-2:]
    seg_length = (int(seg_end) - int(seg_begin)) // frame_shift

    # We found that the num_frames + 2 equals to seg_length, which is caused
    # by the implementation of torchaudio.compliance.kaldi.fbank.
    # Thus, here seg_length is used to get the subsegs.
    num_frames, feat_dim = fbank.shape
    if seg_length <= window_fs:
        subseg = seg_id + "-{:08d}-{:08d}".format(0, seg_length)
        subseg_fbank = np.resize(fbank, (window_fs, feat_dim))

        subsegs.append(subseg)
        subseg_fbanks.append(subseg_fbank)
    else:
        max_subseg_begin = seg_length - window_fs + period_fs
        for subseg_begin in range(0, max_subseg_begin, period_fs):
            subseg_end = min(subseg_begin + window_fs, seg_length)
            subseg = seg_id + "-{:08d}-{:08d}".format(subseg_begin, subseg_end)
            subseg_fbank = np.resize(fbank[subseg_begin:subseg_end],
                                     (window_fs, feat_dim))

            subsegs.append(subseg)
            subseg_fbanks.append(subseg_fbank)

    return subsegs, subseg_fbanks


def cluster(embeddings, p=.01, num_spks=None, min_num_spks=1, max_num_spks=20):
    # Define utility functions
    def cosine_similarity(M):
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        return 0.5 * (1.0 + np.dot(M, M.T))

    def prune(M, p):
        m = M.shape[0]
        if m < 1000:
            n = max(m - 10, 2)
        else:
            n = int((1.0 - p) * m)

        for i in range(m):
            indexes = np.argsort(M[i, :])
            low_indexes, high_indexes = indexes[0:n], indexes[n:m]
            M[i, low_indexes] = 0.0
            M[i, high_indexes] = 1.0
        return 0.5 * (M + M.T)

    def laplacian(M):
        M[np.diag_indices(M.shape[0])] = 0.0
        D = np.diag(np.sum(np.abs(M), axis=1))
        return D - M

    def spectral(M, num_spks, min_num_spks, max_num_spks):
        eig_values, eig_vectors = scipy.linalg.eigh(M)
        num_spks = num_spks if num_spks is not None \
            else np.argmax(np.diff(eig_values[:max_num_spks + 1])) + 1
        num_spks = max(num_spks, min_num_spks)
        return eig_vectors[:, :num_spks]

    def kmeans(data):
        k = data.shape[1]
        # centroids, labels = scipy.cluster.vq.kmeans2(data, k, minit='++')
        _, labels, _ = k_means(data, k, random_state=None, n_init=10)
        return labels

    # Fallback for trivial cases
    if len(embeddings) <= 2:
        return [0] * len(embeddings)

    # Compute similarity matrix
    similarity_matrix = cosine_similarity(np.array(embeddings))
    # Prune matrix with p interval
    pruned_similarity_matrix = prune(similarity_matrix, p)
    # Compute Laplacian
    laplacian_matrix = laplacian(pruned_similarity_matrix)
    # Compute spectral embeddings
    spectral_embeddings = spectral(laplacian_matrix, num_spks, min_num_spks,
                                   max_num_spks)
    # Assign class labels
    labels = kmeans(spectral_embeddings)

    return labels



def diarize(input_wav):
    diar_window_secs = 1.5
    diar_period_secs = 0.75
    diar_frame_shift = 10
    diar_batch_size = 32
    diar_min_num_spks = 1
    diar_max_num_spks = 20
    diar_min_duration = 0.255
    diar_num_spks = None
    diar_subseg_cmn = True
    window_fs = int(diar_window_secs * 1000) // diar_frame_shift
    period_fs = int(diar_period_secs * 1000) // diar_frame_shift
    subsegs, subsegmnt_fbanks = [], []
    pcm, sample_rate = torchaudio.load(input_wav, normalize=False)
    vad_segments = vad.get_speech_timestamps(input_wav, return_seconds=True)
    # The fix: reset the VAD's internal state here so the next file
    # does not inherit this file's streaming context.
    vad.reset()

    for item in vad_segments:
        begin, end = item['start'], item['end']
        if end - begin >= diar_min_duration:
            begin_idx = int(begin * sample_rate)
            end_idx = int(end * sample_rate)
            tmp_wavform = pcm[0, begin_idx:end_idx].unsqueeze(0).to(
                torch.float)
            fbank = embedding_model.compute_fbank(tmp_wavform,
                                    sample_rate=sample_rate,
                                    cmn=False)
            tmp_subsegs, tmp_subseg_fbanks = subsegment(
                fbank=fbank,
                seg_id="{:08d}-{:08d}".format(int(begin * 1000),
                                            int(end * 1000)),
                window_fs=window_fs,
                period_fs=period_fs,
                frame_shift=diar_frame_shift)
            subsegs.extend(tmp_subsegs)
            subsegmnt_fbanks.extend(tmp_subseg_fbanks)
    if not subsegmnt_fbanks:
        # No speech segments survived; stacking an empty list downstream
        # would raise "ValueError: need at least one array to stack".
        return None

    # 3. extract embeddings for all subsegments in batches
    embeddings = embedding_model.extract_embedding_feats(subsegmnt_fbanks,
                                                         diar_batch_size,
                                                         diar_subseg_cmn)

    # 4. cluster embeddings and convert subsegment ids to timestamps
    labels = cluster(embeddings,
                     num_spks=diar_num_spks,
                     min_num_spks=diar_min_num_spks,
                     max_num_spks=diar_max_num_spks)
    subseg2label = []
    for (_subseg, _label) in zip(subsegs, labels):
        begin_ms, end_ms, begin_frames, end_frames = _subseg.split('-')
        begin = (int(begin_ms) + int(begin_frames) * diar_frame_shift) / 1000.0
        end = (int(begin_ms) + int(end_frames) * diar_frame_shift) / 1000.0
        subseg2label.append([begin, end, _label])
    return subseg2label
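With vad.reset() inside diarize(), processing a wav.scp becomes a plain per-file loop. The helpers below are illustrative, not WeSpeaker's diarize_list, and a stub lambda stands in for diarize() in the demo:

```python
import io


def read_wav_scp(fileobj):
    """Parse Kaldi-style 'utt_id /path/to.wav' lines into (utt, path) pairs."""
    entries = []
    for line in fileobj:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            entries.append((parts[0], parts[1]))
    return entries


def run_all(scp_fileobj, diarize_fn):
    """Call diarize_fn(path) once per scp entry; keep per-utterance results."""
    return {utt: diarize_fn(path) for utt, path in read_wav_scp(scp_fileobj)}


# Demo with an in-memory scp and a stub standing in for diarize():
scp = io.StringIO("utt1 /data/a.wav\nutt2 /data/b.wav\n")
out = run_all(scp, diarize_fn=lambda path: [])
print(sorted(out))  # ['utt1', 'utt2']
```

Because each file goes through its own diarize() call (which resets the VAD state internally), no state leaks between entries.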

@kartikay1999

kartikay1999 commented Jul 10, 2024

Yes, check out the above code @JiJiJiang

@JiJiJiang
Collaborator

silero-vad 5.1 has been released and is used in our CLI. Does this new version still have a similar vad.reset() problem?
