
unable to process multiple files in python #330

Open
kartikay-eltropy opened this issue Jul 4, 2024 · 8 comments

@kartikay-eltropy

When I run
model.diarize_list("wav.scp")
I get this error after 2-3 files are processed:
ValueError: need at least one array to stack

But I am able to diarize files individually. I can't diarize multiple files using a for loop either, because I get the same error. Please help.
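For context on where that message comes from: NumPy raises it whenever it is asked to stack an empty list of arrays, which would happen here if a file contributes no segments to stack. A minimal reproduction of the message (not WeSpeaker's actual call site):

```python
import numpy as np

# np.stack requires at least one array; an empty list reproduces
# the exact error message seen above.
try:
    np.stack([])
except ValueError as e:
    print(e)  # need at least one array to stack
```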

@JiJiJiang
Collaborator

@cdliang11 Please check this error

@kartikay-eltropy
Author

kartikay-eltropy commented Jul 8, 2024

You are using
vad = SileroVAD()
This vad variable needs to be reset every time a new file comes in:
vad.reset()
Otherwise it continues with the context of the previous audio file. Please check the diarize code. I was able to debug and implement the fix.
@JiJiJiang @cdliang11
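The pattern described above can be sketched as follows. SileroVAD here is a minimal stand-in that models the stateful behaviour being described, not the real silero-vad class:

```python
class SileroVAD:
    """Stand-in for a stateful VAD: hidden state persists across calls."""

    def __init__(self):
        self.context = []  # streaming state carried between calls

    def get_speech_timestamps(self, wav_path):
        self.context.append(wav_path)  # state accumulates per file
        return [{"start": 0.0, "end": 1.0}]

    def reset(self):
        self.context = []  # clear state so files stay independent


vad = SileroVAD()
for wav in ["a.wav", "b.wav", "c.wav"]:
    segments = vad.get_speech_timestamps(wav)
    vad.reset()  # the fix: reset before the next file comes in
```

Without the reset() call, `context` keeps growing across files, which is the "continues with the context of the previous audio file" behaviour above.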

@JiJiJiang
Collaborator

You are using vad = SileroVAD(). This vad variable needs to be reset every time a new file comes in (vad.reset()); otherwise it continues with the context of the previous audio file. Please check the diarize code. I was able to debug and implement the fix. @JiJiJiang @cdliang11

Thank you for your question.
On the examples page of silero-vad, vad.reset() is only called in the stream imitation example, not in the full-audio case.
In WeSpeaker, we also use the full-audio interface.

@kartikay-eltropy
Author

But when I was creating an scp and trying to run it, it was crashing.

@JiJiJiang
Collaborator

JiJiJiang commented Jul 10, 2024

vad = SileroVAD()

And after you add vad.reset(), it runs normally?

@kartikay1999

kartikay1999 commented Jul 10, 2024

from silero_vad import SileroVAD
import numpy as np
import scipy.linalg
import torch
import torchaudio
import wespeaker
from sklearn.cluster._kmeans import k_means

vad = SileroVAD()
embedding_model = wespeaker.load_model_local('wespeaker-voxceleb-resnet34-LM')


def subsegment(fbank, seg_id, window_fs, period_fs, frame_shift):
    subsegs = []
    subseg_fbanks = []

    seg_begin, seg_end = seg_id.split('-')[-2:]
    seg_length = (int(seg_end) - int(seg_begin)) // frame_shift

    # We found that the num_frames + 2 equals to seg_length, which is caused
    # by the implementation of torchaudio.compliance.kaldi.fbank.
    # Thus, here seg_length is used to get the subsegs.
    num_frames, feat_dim = fbank.shape
    if seg_length <= window_fs:
        subseg = seg_id + "-{:08d}-{:08d}".format(0, seg_length)
        subseg_fbank = np.resize(fbank, (window_fs, feat_dim))

        subsegs.append(subseg)
        subseg_fbanks.append(subseg_fbank)
    else:
        max_subseg_begin = seg_length - window_fs + period_fs
        for subseg_begin in range(0, max_subseg_begin, period_fs):
            subseg_end = min(subseg_begin + window_fs, seg_length)
            subseg = seg_id + "-{:08d}-{:08d}".format(subseg_begin, subseg_end)
            subseg_fbank = np.resize(fbank[subseg_begin:subseg_end],
                                     (window_fs, feat_dim))

            subsegs.append(subseg)
            subseg_fbanks.append(subseg_fbank)

    return subsegs, subseg_fbanks


def cluster(embeddings, p=.01, num_spks=None, min_num_spks=1, max_num_spks=20):
    # Define utility functions
    def cosine_similarity(M):
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        return 0.5 * (1.0 + np.dot(M, M.T))

    def prune(M, p):
        m = M.shape[0]
        if m < 1000:
            n = max(m - 10, 2)
        else:
            n = int((1.0 - p) * m)

        for i in range(m):
            indexes = np.argsort(M[i, :])
            low_indexes, high_indexes = indexes[0:n], indexes[n:m]
            M[i, low_indexes] = 0.0
            M[i, high_indexes] = 1.0
        return 0.5 * (M + M.T)

    def laplacian(M):
        M[np.diag_indices(M.shape[0])] = 0.0
        D = np.diag(np.sum(np.abs(M), axis=1))
        return D - M

    def spectral(M, num_spks, min_num_spks, max_num_spks):
        eig_values, eig_vectors = scipy.linalg.eigh(M)
        num_spks = num_spks if num_spks is not None \
            else np.argmax(np.diff(eig_values[:max_num_spks + 1])) + 1
        num_spks = max(num_spks, min_num_spks)
        return eig_vectors[:, :num_spks]

    def kmeans(data):
        k = data.shape[1]
        # centroids, labels = scipy.cluster.vq.kmeans2(data, k, minit='++')
        _, labels, _ = k_means(data, k, random_state=None, n_init=10)
        return labels

    # Fallback for trivial cases
    if len(embeddings) <= 2:
        return [0] * len(embeddings)

    # Compute similarity matrix
    similarity_matrix = cosine_similarity(np.array(embeddings))
    # Prune matrix with p interval
    pruned_similarity_matrix = prune(similarity_matrix, p)
    # Compute Laplacian
    laplacian_matrix = laplacian(pruned_similarity_matrix)
    # Compute spectral embeddings
    spectral_embeddings = spectral(laplacian_matrix, num_spks, min_num_spks,
                                   max_num_spks)
    # Assign class labels
    labels = kmeans(spectral_embeddings)

    return labels



def diarize(input_wav):
    diar_window_secs = 1.5
    diar_period_secs = 0.75
    diar_frame_shift = 10
    diar_batch_size = 32
    diar_min_num_spks = 1
    diar_max_num_spks = 20
    diar_min_duration = 0.255
    diar_num_spks = None
    diar_subseg_cmn = True
    window_fs = int(diar_window_secs * 1000) // diar_frame_shift
    period_fs = int(diar_period_secs * 1000) // diar_frame_shift
    subsegs, subsegmnt_fbanks = [], []
    pcm, sample_rate = torchaudio.load(input_wav, normalize=False)
    vad_segments = vad.get_speech_timestamps(input_wav, return_seconds=True)
    # The fix: reset the VAD's internal state here so the next file
    # does not inherit this file's streaming context.
    vad.reset()

    for item in vad_segments:
        begin, end = item['start'], item['end']
        if end - begin >= diar_min_duration:
            begin_idx = int(begin * sample_rate)
            end_idx = int(end * sample_rate)
            tmp_wavform = pcm[0, begin_idx:end_idx].unsqueeze(0).to(
                torch.float)
            fbank = embedding_model.compute_fbank(tmp_wavform,
                                    sample_rate=sample_rate,
                                    cmn=False)
            tmp_subsegs, tmp_subseg_fbanks = subsegment(
                fbank=fbank,
                seg_id="{:08d}-{:08d}".format(int(begin * 1000),
                                            int(end * 1000)),
                window_fs=window_fs,
                period_fs=period_fs,
                frame_shift=diar_frame_shift)
            subsegs.extend(tmp_subsegs)
            subsegmnt_fbanks.extend(tmp_subseg_fbanks)
    if not subsegmnt_fbanks:
        # No speech segments survived; stacking an empty list downstream
        # would raise "ValueError: need at least one array to stack".
        return None

    # 3. extract embeddings for all subsegments in batches
    embeddings = embedding_model.extract_embedding_feats(subsegmnt_fbanks,
                                                         diar_batch_size,
                                                         diar_subseg_cmn)

    # 4. cluster embeddings and convert subsegment ids to timestamps
    labels = cluster(embeddings,
                     num_spks=diar_num_spks,
                     min_num_spks=diar_min_num_spks,
                     max_num_spks=diar_max_num_spks)
    subseg2label = []
    for (_subseg, _label) in zip(subsegs, labels):
        begin_ms, end_ms, begin_frames, end_frames = _subseg.split('-')
        begin = (int(begin_ms) + int(begin_frames) * diar_frame_shift) / 1000.0
        end = (int(begin_ms) + int(end_frames) * diar_frame_shift) / 1000.0
        subseg2label.append([begin, end, _label])
    return subseg2label
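With vad.reset() inside diarize(), processing a wav.scp becomes a plain per-file loop. The helpers below are illustrative, not WeSpeaker's diarize_list, and a stub lambda stands in for diarize() in the demo:

```python
import io


def read_wav_scp(fileobj):
    """Parse Kaldi-style 'utt_id /path/to.wav' lines into (utt, path) pairs."""
    entries = []
    for line in fileobj:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            entries.append((parts[0], parts[1]))
    return entries


def run_all(scp_fileobj, diarize_fn):
    """Call diarize_fn(path) once per scp entry; keep per-utterance results."""
    return {utt: diarize_fn(path) for utt, path in read_wav_scp(scp_fileobj)}


# Demo with an in-memory scp and a stub standing in for diarize():
scp = io.StringIO("utt1 /data/a.wav\nutt2 /data/b.wav\n")
out = run_all(scp, diarize_fn=lambda path: [])
print(sorted(out))  # ['utt1', 'utt2']
```

Because each file goes through its own diarize() call (which resets the VAD state internally), no state leaks between entries.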

@kartikay1999

kartikay1999 commented Jul 10, 2024

Yes, check out the above code @JiJiJiang

@JiJiJiang
Collaborator

silero-vad 5.1 has been released and is used in our CLI. Does this new version still have a similar vad.reset() problem?
