Chang ZENG 曾畅 曾 暢 (ソウ チョウ)

Independent Researcher

Biography

Independent researcher in generative audio and voice LLMs with 7+ years of experience spanning research and production. I have led end-to-end development of expressive TTS systems and contributed to full-duplex speech systems across data curation, codec and tokenization design, large-scale multi-GPU training, and deployment. My work focuses on turning advanced speech and audio methods into product-ready AI systems for voice generation, multimodal interaction, and intelligent audio understanding.

Interests
  • Generative Audio and Voice LLMs
  • Multimodal Foundation Models
  • Speech and Singing Voice Generation
  • Speaker Recognition and Antispoofing
  • Audio Separation and Enhancement
Education
  • PhD in Informatics, 2024

    National Institute of Informatics & SOKENDAI

  • MEng in Electrical Engineering and Information Systems, 2020

    The University of Tokyo

  • BEng in Measurement and Control Technology and Instruments, 2016

    Tianjin University

Publications

(2026). Speech Codec Probing from Semantic and Phonetic Perspectives. Accepted by Interspeech 2026.

(2026). StepAudio 2.5 Technical Report. In arXiv.

(2026). A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation. Accepted by ICML 2026.

(2026). BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection. In arXiv.

(2026). DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes. Accepted by ICASSP 2026.

(2026). PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes. Accepted by ICASSP 2026.

(2025). Towards Interactive Intelligence for Digital Humans. In arXiv.

(2025). Critical Information Only: A Content Privacy-Preserving Framework for Detecting Audio Deepfakes. In IEEE TDSC.

(2025). SonicSim: A Customizable Simulation Platform for Speech Processing in Moving Sound Source Scenarios. Accepted by ICLR 2025.

(2025). A Benchmark for Multi-Speaker Anonymization. In IEEE TIFS.

(2024). InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself. In SLT 2024.

(2024). Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches. In SLT 2024.

(2024). HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling. In arXiv.

(2024). Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances. In Computer Speech & Language.

(2023). Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. In ICASSP 2023.

(2023). SSI-Net: A Multi-Stage Speech Signal Improvement System for ICASSP 2023 SSI Challenge. In ICASSP 2023.

(2022). Deep Spectro-temporal Artifacts for Detecting Synthesized Speech. In DDAM 2022 Workshop.

(2022). Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection. In Interspeech 2022.

(2022). Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances. In ICASSP 2022.

(2021). DeepLip: A Benchmark for Deep Learning-Based Audio-Visual Lip Biometrics. In ASRU 2021.

Skills

Languages and Tools

Python, C++, Shell, Git, MySQL

Deep Learning

PyTorch, PyTorch Lightning, Hugging Face

Speech Toolkits

SpeechBrain, WeNet, WeSpeaker, Kaldi, ESPnet

Generative Audio

Expressive TTS, codec and tokenizer design, voice LLMs

Multimodal AI

Audio-language modeling, full-duplex speech systems

Communication

Chinese, English, Japanese

Activities

Reviewer Service

  • Conferences: NeurIPS, ICLR, ICML, ACL, ICASSP, ICME, INTERSPEECH
  • Journals: IEEE OJSP, IEEE TASLP

Academic Activities

  • Oral presentations at ICASSP 2022 and Interspeech 2022
  • Organizing Committee, Joint Workshop of VoicePersonae and ASVspoof 2023
  • Invited talk, SVDD Challenge at SLT 2024: shared insights on singing voice generation as a guest speaker

Experience

Independent Researcher
Independent
May 2026 – Present, Remote

Conduct independent research on generative audio, voice LLMs, multimodal interaction, and intelligent audio understanding.

  • Explore foundation models and data-centric methods for speech, audio, and multimodal AI
  • Continue open research collaborations and publication work across generative audio and sound understanding
 
Shanda AI Research Tokyo
Senior AI Researcher
Sep 2025 – May 2026, Tokyo (Hybrid)

Led R&D of expressive TTS and full-stack voice-agent technologies for avatar and game products.

  • Built KodamaTTS from scratch on a Qwen-based foundation model for virtual human and gaming applications
  • Covered data curation, codec design, multi-node multi-GPU training, and evaluation in one pipeline
  • Achieved strong objective quality with a codec operating below 1 kbps and reached top benchmark rankings in 3 languages
  • Served as Co-PI on a joint project with Tsinghua University on cocktail-party speech interaction and audio separation
 
Li Auto
Multimodal Generative AI Researcher
Apr 2024 – Sep 2025, Hangzhou

Developed voice generation systems for Li Auto smart-space products and contributed to the multimodal foundation model MindGPT-4o.

  • Proposed the GFSQ tokenizer for GPT-SoVITS to improve codebook utilization and decoding quality
  • Trained a multi-timbre, multi-style voice generation model used in production for in-car voice-blog scenarios
  • Built synthetic-data workflows to scale accents, dialects, languages, emotions, and scenarios
  • Led data production and audio-head pretraining and post-training for a full-duplex conversational model
 
Bombax XiaoIce Technology Co., Ltd
Avatar Research Intern
Jul 2022 – Jul 2023, Remote

Focused on high-fidelity 48 kHz singing voice generation in collaboration with research and engineering teams.

  • Upgraded XiaoiceSing to XiaoiceSing2 with adversarial training and achieved near-human MOS
  • Developed HiFi-WaveGAN with a pulse-sequence design for stronger 48 kHz singing synthesis quality
  • Built CrossSinger for cross-lingual multi-singer SVS in English, Japanese, and Chinese
  • Improved training efficiency with InstructSing and explored hierarchical acoustic modeling for voice LMs
 
Alibaba
Speech Recognition Researcher
Apr 2020 – Sep 2020, Hangzhou

Developed speech AI systems for Taobao Live compliance and broadcaster-risk control.

  • Built a large-scale speaker recognition system for broadcaster identity verification in livestream scenarios
  • Developed a spoken-term detection pipeline for policy-sensitive and illegal word monitoring
  • Researched self-supervised speech representations and implemented an ESPnet-based end-to-end ASR system