Chang ZENG
BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
BEAVER is a training-free hierarchical prompt compression framework for long-context LLM inference with strong benchmark performance and 26.4x lower latency on 128k contexts.
Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang
PDF · Code · Project · Demo · ArXiv
Speech Codec Probing from Semantic and Phonetic Perspectives
We probe widely used speech tokenizers and show they encode phonetic structure much more strongly than lexical-semantic content.
Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan
PDF · ArXiv
A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation
We propose Hive, a semantically consistent 2.4k-hour dataset for data-efficient universal sound separation with strong zero-shot generalization.
Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
PDF · Code · Dataset · ArXiv · HF Dataset (3rd Party)
Towards Interactive Intelligence for Digital Humans
Mio is an end-to-end multimodal interactive digital-human framework that combines reasoning and real-time embodiment with state-of-the-art performance.
Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, Dianwen Ng, Zixiong Su, Erwin Wu, Yuhan Wu, Dingkun Yan, Tianyu Yan, Chang Zeng, Bo Zheng, You Zhou
PDF · Project · ArXiv · Demo
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
We introduce a token-based text-to-speech (TTS) model with 0.8B parameters, trained on a mix of real and synthetic data totaling 650k hours, to address pronunciation accuracy and style consistency. The model integrates a latent variable sequence carrying enhanced acoustic information into the TTS system, reducing pronunciation errors and unintended style drift. Training includes data augmentation for improved timbre consistency, and a few-shot voice conversion model generates diverse voices. This approach enables the model to learn one-to-many mappings in speech, ensuring both diversity and timbre consistency. Our model outperforms VALL-E in pronunciation accuracy, style maintenance, and timbre continuity.
Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen
PDF · Project · DOI · ArXiv