Chang ZENG 曾畅 曾 暢 (ソウ チョウ)

Senior Research Scientist

Shanda AI Research Tokyo

Biography

Senior Research Scientist in generative audio and voice LLMs with 7+ years of experience taking speech systems from research to production. I have led end-to-end development of expressive TTS systems and contributed to full-duplex speech systems across data curation, codec and tokenization design, large-scale multi-GPU training, and deployment. My work focuses on turning advanced speech and audio methods into product-ready AI systems for voice generation, multimodal interaction, and intelligent audio understanding.

Download my resumé.

Interests
  • Generative Audio and Voice LLMs
  • Multimodal Foundation Models
  • Speech and Singing Voice Generation
  • Speaker Recognition and Anti-Spoofing
  • Audio Separation and Enhancement
Education
  • PhD in Informatics, 2024

    National Institute of Informatics & SOKENDAI

  • MEng in Electrical Engineering and Information Systems, 2020

    The University of Tokyo

  • BEng in Measurement and Control Technology and Instruments, 2016

    Tianjin University

Publications

(2026). A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation. In arXiv.

PDF Code Dataset ArXiv HF Dataset (3rd Party)

(2026). DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes. Accepted by ICASSP 2026.

PDF ArXiv

(2026). PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes. Accepted by ICASSP 2026.

PDF ArXiv

(2025). Towards Interactive Intelligence for Digital Humans. In arXiv.

PDF Project ArXiv Demo

(2025). Critical Information Only: A Content Privacy-Preserving Framework for Detecting Audio Deepfakes. In IEEE TDSC.

PDF

(2025). SonicSim: A Customizable Simulation Platform for Speech Processing in Moving Sound Source Scenarios. Accepted by ICLR 2025.

PDF Code ArXiv

(2025). A Benchmark for Multi-Speaker Anonymization. In IEEE TIFS.

PDF

(2024). InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself. In SLT 2024.

PDF Cite Project DOI SLT2024

(2024). Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches. In SLT 2024.

PDF

(2024). HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling. In arXiv.

PDF Cite Project DOI ArXiv

(2024). Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances. In Computer Speech & Language.

PDF Cite Dataset Project DOI CSL

(2023). Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. In ICASSP 2023.

PDF Cite Project DOI ICASSP

(2023). SSI-Net: A Multi-Stage Speech Signal Improvement System for ICASSP 2023 SSI Challenge. In ICASSP 2023.

PDF Cite Project DOI ICASSP Link

(2022). Deep Spectro-temporal Artifacts for Detecting Synthesized Speech. In DDAM 2022 Workshop.

PDF Cite Project DOI ACMMM Link

(2022). Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection. In Interspeech 2022.

PDF Cite Project DOI INTERSPEECH Link

(2022). Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances. In ICASSP 2022.

PDF Cite Code Dataset Project Video DOI ICASSP

(2021). DeepLip: A Benchmark for Deep Learning-Based Audio-Visual Lip Biometrics. In ASRU 2021.

PDF Cite Project DOI ASRU Link

Skills

Languages and Tools

Python, C++, Shell, Git, MySQL

Deep Learning

PyTorch, PyTorch Lightning, Hugging Face

Speech Toolkits

SpeechBrain, WeNet, WeSpeaker, Kaldi, ESPnet

Generative Audio

Expressive TTS, codec and tokenizer design, voice LLMs

Multimodal AI

Audio-language modeling, full-duplex speech systems

Communication

Chinese, English, Japanese (JLPT N2, score 125)

Activities

Reviewer Service

  • Conferences: NeurIPS, ICLR, ICML, AISTATS, ICASSP, ICME, INTERSPEECH
  • Journals: IEEE Open Journal of Signal Processing

Academic Activities

  • Research Assistant at Yamagishi Lab, National Institute of Informatics
  • ICASSP 2022 Oral Presentation
  • Interspeech 2022 Oral Presentation
  • Interspeech 2023 Poster Presentation
  • Organizing Committee, Joint Workshop of VoicePersonae and ASVspoof 2023
  • Invited Talk at the SVDD Challenge, SLT 2024: shared insights on singing voice generation

Competitions

  • 4th/77 place, VoxCeleb Speaker Recognition Challenge 2019
  • 2nd/110 place, Zhijiang Cup Speech Recognition for Conversational Scenario 2021
  • 4th/42 place, Audio Deep Synthesis Detection Challenge 2022 Track 1
  • 5th/27 place, Audio Deep Synthesis Detection Challenge 2022 Track 2

Open Source

  • WeSpeaker contributor
  • ASV-Subtools contributor
  • HIVE dataset contributor

Experience

Shanda AI Research Tokyo
Senior AI Researcher
Sep 2025 – Present Tokyo (Hybrid)

Lead R&D of expressive TTS and full-stack voice-agent technologies for avatar and game products.

  • Built KodamaTTS from scratch on a Qwen-based foundation model for virtual human and gaming applications
  • Covered data curation, codec design, multi-node multi-GPU training, and evaluation in one pipeline
  • Achieved strong objective quality at sub-1 kbps codec bitrates and reached top benchmark rankings in three languages
  • Served as Co-PI on a joint project with Tsinghua University on cocktail-party speech interaction and audio separation
Li Auto
Multimodal Generative AI Researcher
Apr 2024 – Sep 2025 Hangzhou

Developed voice generation systems for Li Auto smart-space products and contributed to the multimodal foundation model MindGPT-4o.

  • Proposed the GFSQ tokenizer for GPT-SoVITS to improve codebook utilization and decoding quality
  • Trained a multi-timbre, multi-style voice generation model for in-car voice-blog scenarios in production
  • Built synthetic-data workflows to scale accents, dialects, languages, emotions, and scenarios
  • Led data production and audio-head pretraining and post-training for a full-duplex conversational model
Bombax XiaoIce Technology Co., Ltd
Avatar Research Intern
Jul 2022 – Jul 2023 Remote

Focused on high-fidelity 48 kHz singing voice generation in collaboration with research and engineering teams.

  • Upgraded XiaoiceSing to XiaoiceSing2 with adversarial training and achieved near-human MOS
  • Developed HiFi-WaveGAN with a pulse-sequence design for stronger 48 kHz singing synthesis quality
  • Built CrossSinger for cross-lingual multi-singer SVS in English, Japanese, and Chinese
  • Improved training efficiency with InstructSing and explored hierarchical acoustic modeling for voice LMs
Alibaba
Speech Recognition Researcher
Apr 2020 – Sep 2020 Hangzhou

Developed speech AI systems for Taobao Live compliance and broadcaster-risk control.

  • Built a large-scale speaker recognition system for broadcaster identity verification in livestream scenarios
  • Developed a spoken-term detection pipeline for policy-sensitive and illegal word monitoring
  • Researched self-supervised speech representations and implemented an ESPnet-based end-to-end ASR system