We propose a new end-to-end (E2E) method for automatic speaker verification, tailored for scenarios with multiple enrollment utterances. Conventional systems separately optimize a front-end model, such as a TDNN, for extracting speaker embeddings and a back-end model, such as PLDA, for scoring; our approach instead jointly optimizes these components to avoid the suboptimal solutions that separate training can yield. The model incorporates frame-level and utterance-level attention mechanisms to exploit the relationships among the multiple enrollment utterances. We further improve optimization through data augmentation, combining conventional noise augmentation using the MUSAN and RIR datasets with a novel speaker embedding-level mixup strategy.
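As a rough illustration of what embedding-level mixup might look like (this is a generic sketch of the standard mixup recipe applied to speaker embeddings, not the authors' exact formulation; the function name and Beta-distribution parameter are assumptions):

```python
import numpy as np

def embedding_mixup(emb_a, emb_b, alpha=0.2, rng=None):
    """Mix two speaker embeddings with a coefficient sampled from Beta(alpha, alpha).

    Returns the interpolated embedding and the mixing weight, which
    would also be used to interpolate the corresponding labels.
    NOTE: hypothetical helper for illustration only.
    """
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam

# Example: mixing two toy 4-dimensional embeddings.
e1 = np.ones(4)
e2 = np.zeros(4)
mixed, lam = embedding_mixup(e1, e2, rng=np.random.default_rng(0))
# Each component of `mixed` equals lam, since e1 is all ones and e2 all zeros.
```

In the standard mixup formulation, the same weight `lam` interpolates the one-hot speaker labels, encouraging the scoring model to behave linearly between speakers.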
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi