Generative AI

Jul 1, 2022

The research topics cover text-to-audio (TTA), singing voice synthesis (SVS), music generation, and, in the future, more multimodal generative tasks.

Publications

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

This paper presents InstructSing, a novel neural vocoder that integrates differentiable digital signal processing (DDSP) with adversarial training. It converges much faster than other neural vocoders while maintaining good performance.

Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen
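As a rough illustration of the differentiable digital signal processing idea mentioned in the InstructSing summary, the sketch below builds a waveform as a sum of harmonics driven by an F0 contour. It is not the InstructSing architecture: the function name harmonic_synth, its parameters, and the fixed per-harmonic amplitudes are assumptions made for this example. Every operation used (phase accumulation, sine, weighted sum) is differentiable, so the same computation written in an autodiff framework could pass gradients back to a network that predicts F0 and amplitudes.

```python
import math


def harmonic_synth(f0_hz, amplitudes, sample_rate=48000):
    """Sum-of-sinusoids synthesizer (hypothetical helper, for illustration only).

    f0_hz      -- per-sample fundamental frequency contour in Hz
    amplitudes -- amplitudes[k - 1] is the weight of the k-th harmonic
    """
    nyquist = sample_rate / 2
    phase = 0.0
    samples = []
    for f0 in f0_hz:
        # Accumulate phase so that a time-varying F0 stays continuous.
        phase += 2 * math.pi * f0 / sample_rate
        value = 0.0
        for k, amp in enumerate(amplitudes, start=1):
            # Skip harmonics at or above Nyquist to avoid aliasing.
            if k * f0 < nyquist:
                value += amp * math.sin(k * phase)
        samples.append(value)
    return samples


# One second of a 220 Hz tone with three decaying harmonics.
tone = harmonic_synth([220.0] * 48000, [1.0, 0.5, 0.25])
```

In a vocoder setting, a signal like this would typically serve as a structured excitation that a neural network then refines, rather than as the final output.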

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

We introduce a novel token-based text-to-speech (TTS) model with 0.8B parameters, trained on a mix of real and synthetic data totaling 650k hours, to address issues like pronunciation accuracy and style consistency. This model integrates a latent variable sequence with enhanced acoustic information into the TTS system, reducing errors and style changes. Our training includes data augmentation for improved timbre consistency, and we use a few-shot voice conversion model to generate diverse voices. This approach enables learning of one-to-many mappings in speech, ensuring both diversity and timbre consistency. Our model outperforms VALL-E in pronunciation, style maintenance, and timbre continuity.

Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen

CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers

This paper presents CrossSinger, a cross-lingual singing voice synthesizer based on XiaoiceSing2. It tackles the challenge of building a multi-singer, high-fidelity singing voice synthesis system with cross-lingual capabilities using only monolingual singers during training. The system unifies language representation, incorporates language information, and removes singer biases. Experimental results show that CrossSinger can synthesize high-quality songs for different singers in various languages, including code-switching cases.

Xintong Wang, Chang Zeng, Jun Chen, Chunhui Wang

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

This paper introduces HiFi-WaveGAN, a system designed for real-time synthesis of high-quality 48 kHz singing voices from full-band mel-spectrograms. It uses an improved WaveNet as its generator, incorporates elements from HiFiGAN and UnivNet, and introduces an auxiliary spectrogram-phase loss to enhance high-frequency reconstruction and accelerate training. HiFi-WaveGAN outperforms other neural vocoders such as Parallel WaveGAN and HiFiGAN on quality metrics, while training faster and modeling high frequencies better.

Chunhui Wang, Chang Zeng, Jun Chen, Yuhao Wang, Xing He
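The HiFi-WaveGAN summary above mentions an auxiliary spectrogram-phase loss without giving its formula. The sketch below shows one plausible reading, assuming the loss is a mean absolute difference of STFT magnitudes plus a mean absolute wrapped phase difference. The function names, frame sizes, and the phase_weight parameter are assumptions for illustration, not the paper's definition.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive Hann-windowed STFT; returns a list of one-sided complex spectra."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        windowed = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(windowed))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def spectrogram_phase_loss(pred, target, phase_weight=1.0):
    """Mean |magnitude difference| + phase_weight * mean |wrapped phase difference|.

    Hypothetical formulation; phase values in near-silent bins are noisy,
    which a practical implementation would likely need to handle.
    """
    mag_total, phase_total, count = 0.0, 0.0, 0
    for frame_p, frame_t in zip(stft(pred), stft(target)):
        for a, b in zip(frame_p, frame_t):
            mag_total += abs(abs(a) - abs(b))
            diff = cmath.phase(a) - cmath.phase(b)
            # Wrap the phase difference into [-pi, pi] before taking |.|.
            phase_total += abs(math.atan2(math.sin(diff), math.cos(diff)))
            count += 1
    return (mag_total + phase_weight * phase_total) / count


# A sinusoid and a one-sample-shifted copy: similar magnitudes, different phase.
reference = [0.0, 1.0, 0.0, -1.0] * 4
shifted = [1.0, 0.0, -1.0, 0.0] * 4
loss_value = spectrogram_phase_loss(shifted, reference)
```

A magnitude-only spectral loss would barely separate these two signals, which is the intuition for adding a phase term on top of it.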

Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

This paper presents XiaoiceSing2, an enhanced singing voice synthesis system that addresses over-smoothing issues in middle- and high-frequency areas of mel-spectrograms. It employs a generative adversarial network (GAN) with improved model architecture to capture finer details.

Chunhui Wang, Chang Zeng, Xing He

SSI-Net: A Multi-Stage Speech Signal Improvement System for ICASSP 2023 SSI Challenge

We introduce SSI-Net, our submission to the ICASSP 2023 Speech Signal Improvement (SSI) Challenge, designed for real-time communication systems. SSI-Net features a multi-stage architecture. The first stage uses a time-domain restoration generative adversarial network (TRGAN) for initial speech restoration. The second stage uses a lightweight multi-scale temporal frequency convolutional network with axial self-attention (MTFAA-Lite) for full-band speech enhancement. In subjective tests on the SSI Challenge blind test set, SSI-Net achieved a P.835 mean opinion score (MOS) of 3.190 and a P.804 MOS of 3.178, ranking 3rd in Tracks 1 and 2.

Weixin Zhu, Zilin Wang, Jiuxin Lin, Chang Zeng, Tao Yu

Chang ZENG 曾畅　曾暢 (ソウチョウ)

Independent Researcher
