Paper ID | Paper Title | Authors
29 | SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition | Ming-Hao Hsu (National Taiwan University)*; Hung-yi Lee (National Taiwan University)
35 | Unifying model and layer fusion for Speech Foundation Models | Yi-Jen Shih (The University of Texas at Austin)*; David Harwath (The University of Texas at Austin)
36 | Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions | Tina Raissi (RWTH Aachen University)*; Nick Rossenbach (RWTH Aachen University); Ralf Schlüter (RWTH Aachen University)
81 | Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization | Jian You (Cisco Systems)*; Xiangfeng Li (Cisco Systems); Erwan Zerhouni (Cisco Systems)
85 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction | Yangui Fang (Huazhong University of Science and Technology)*; Baixu Chen (Huazhong University of Science and Technology); Jing Peng (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China); Xu Li (AISpeech Ltd, Suzhou); Yu Xi (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China); Chengwei Zhang (Huazhong University of Science and Technology); Guohui Zhong (Huazhong University of Science and Technology)
99 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Zijin Gu (Apple)*; Tatiana Likhomanenko (Apple); Navdeep Jaitly (Apple)
116 | Revealing the Role of Audio Channels in ASR Performance Degradation | Kuan-Tang Huang (National Taiwan Normal University)*; Li-Wei Chen (National Tsing Hua University); Hung-Shin Lee (United Link Co., Ltd.); Berlin Chen (National Taiwan Normal University); Hsin-Min Wang (Academia Sinica)
120 | Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning | Yangui Fang (Huazhong University of Science and Technology)*; Jing Peng (Shanghai Jiao Tong University); Xu Li (AISpeech Ltd, Suzhou, China); Yu Xi (Shanghai Jiao Tong University); Chengwei Zhang (Huazhong University of Science and Technology); Guohui Zhong (Huazhong University of Science and Technology); Kai Yu (Shanghai Jiao Tong University)
145 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Shucong Zhang (Samsung)*; Titouan Parcollet (Samsung); Rogier van Dalen (Samsung); Sourav Bhattacharya (Samsung)
185 | PRIME: Novel Prompting Strategies for Effective Biasing Word Recognition in Contextualized ASR | Yu-Chun Liu (National Taiwan Normal University)*; Li-Ting Pai (National Taiwan Normal University); Yi-Cheng Wang (National Taiwan Normal University); Bi-Cheng Yan (National Taiwan Normal University); Hsin-Wei Wang (National Taiwan Normal University); Chi-Han Lin (E.SUN Financial Holding Co., Ltd.); Juan-Wei Xu (E.SUN Financial Holding Co., Ltd.); Berlin Chen (National Taiwan Normal University)
205 | LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness | Zongli Ye (Zhejiang University)*; Jiachen Lian (University of California, Berkeley); Akshaj Gupta (University of California, Berkeley); Xuanru Zhou (Zhejiang University); Haodong Li (Southern University of Science and Technology); Krish Patel (University of California, Berkeley); Hwi Joo Park (University of California, Berkeley); Dingkun Zhou (University of California, Berkeley); Chenxu Guo (Zhejiang University); Shuhe Li (Zhejiang University); Sam Wang (University of California, Berkeley); Iris Zhou (University of California, Berkeley); Cheol Jun Cho (University of California, Berkeley); Zoe Ezzes (University of California, San Francisco); Jet M.J. Vonk (University of California, San Francisco); Brittany T. Morin (University of California, San Francisco); Rian Bogley (University of California, San Francisco); Lisa Wauters (University of California, San Francisco); Zachary A. Miller (University of California, San Francisco); Maria Luisa Gorno-Tempini (University of California, San Francisco); Gopala Anumanchipalli (University of California, Berkeley)
208 | A Neural Model for Contextual Biasing Score Learning and Filtering | Wanting Huang (University of Iowa); Weiran Wang (University of Iowa)*
245 | Non-Autoregressive Multi-Speaker ASR with Decoupled Speaker Change Detection | Yingke Zhu (Fano)*; Lahiru Samarakoon (Fano)
257 | Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models | Yunkyu Lim (42dot)*; Jihwan Park (42dot); Hyung Yong Kim (42dot); Hanbin Lee (42dot); Byeong-Yeol Kim (42dot)
261 | Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition | Yuan Tseng (Samsung AI Center-Cambridge)*; Titouan Parcollet (Samsung AI Center-Cambridge); Rogier van Dalen (Samsung AI Center-Cambridge); Shucong Zhang (Samsung AI Center-Cambridge); Sourav Bhattacharya (Samsung AI Center-Cambridge)
10 | Sinba: Singing-to-Accompaniment Generation with Pitch Guidance via Mamba-Based Language Model | Jianwei Cui (University of Science and Technology of China)*; Shihao Chen (University of Science and Technology of China); Yu Gu (Tencent); Jie Zhang (University of Science and Technology of China); Liping Chen (University of Science and Technology of China); Na Li (Tencent); Chengxing Li (Tencent); Shan Yang (Tencent); Lirong Dai (University of Science and Technology of China)
40 | Analysing the Language of Neural Audio Codecs | Joonyong Park (The University of Tokyo)*; Shinnosuke Takamichi (Keio University, The University of Tokyo); David M. Chan (University of California, Berkeley); Shunsuke Kando (The University of Tokyo); Yuki Saito (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo)
199 | L2 Vowel Acquisition Analysis at the Inventory Level | Shuju Shi (University of Illinois Urbana-Champaign)*
275 | Benchmarking Fast Domain Adaptation for Unsupervised Speech Units | Robin San Roman (Meta)*; Manel Khentout (ENS); Tu anh Nguyen (ENS); Paul Michel (ENS); Yossi Adi (Hebrew University of Jerusalem); Emmanuel Dupoux (ENS)
324 | On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts | Kashaf Gulzar (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Sebastian P. Bayerl (Technische Hochschule Rosenheim); Florian Hönig (KST Institut GmbH); Tobias Bocklet (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm)
351 | Reliability of Lexical Richness Measures for ASR-Based Children’s Speech Assessment | Imen Talbi (Leibniz University Hannover)*; Christopher Gebauer (Leibniz University Hannover); Lars Rumberg (Leibniz Universität Hannover); Edith Beaulac (Leibniz Universität Hannover); Hanna Ehlert (Leibniz Universität Hannover); Jörn Ostermann (Leibniz Universität Hannover)
360 | LLM-Based Dictation Detection from Doctor-Patient Conversations | Siyuan Chen (Solventum Health Information Systems); Mojtaba Kadkhodaie Elyaderani (Solventum Health Information Systems); Jing Su (Solventum Health Information Systems); Susanne Burger (Solventum Health Information Systems); Thomas Schaaf (Solventum Health Information Systems)*
395 | Acoustic to Articulatory Speech Inversion for Children with Velopharyngeal Insufficiency | Saba Tabatabaee (University of Maryland)*; Suzanne Boyce (University of Cincinnati); Liran Oren (University of Cincinnati); Mark Tiede (Yale University); Carol Espy-Wilson (University of Maryland)
443 | Text-Guided Speech Representations for Language Acquisition Assessment | Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Philipp Seeberger (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (Technische Hochschule Nürnberg Georg Simon Ohm)
96 | OOQ: Outlier-Oriented Quantization for Efficient Large Language Models | Haoyu Wang (Shanghai Jiao Tong University)*; Bei Liu (Shanghai Jiao Tong University); Hang Shao (Shanghai Jiao Tong University); Bo Xiao (Meituan); Ke Zeng (Meituan); Guanglu Wan (Meituan); Yanmin Qian (Shanghai Jiao Tong University)
42 | CLAIRA: Leveraging Large Language Models to Judge Audio Captions | Tsung-Han Wu (UC Berkeley); Joseph E Gonzalez (UC Berkeley); Trevor Darrell (UC Berkeley); David Chan (University of California, Berkeley)*
Paper ID | Paper Title | Authors
12 | EMO-Debias: Benchmarking Gender De-biasing Techniques in Multi-Label Speech Emotion Recognition | Yi-Cheng Lin (National Taiwan University)*; Huang-Cheng Chou (Independent Researcher); Yu-Hsuan Li Liang (National Taiwan University); Hung-yi Lee (National Taiwan University)
32 | CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition | Yun-Shao Tsai (National Taiwan University)*; Yi-Cheng Lin (National Taiwan University); Huang-Cheng Chou (Independent Researcher); Hung-yi Lee (National Taiwan University)
55 | Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation | Yang Cui (Microsoft)*; Lei He (Microsoft); Peter Pan (Microsoft); Sheng Zhao (Microsoft)
144 | StellarTTS: Sparse Temporal Embedding for Low-Latency and Robust Speech Synthesis | Kaicheng Luo (Honor Device Co., Ltd.)*; Xuefei Gong (Honor Device Co., Ltd.); Yutao Sun (Honor Device Co., Ltd.); Jinling He (Honor Device Co., Ltd.); Yujie Hou (Honor Device Co., Ltd.); Xiaoyang Xing (Honor Device Co., Ltd.); Huiyan Li (Honor Device Co., Ltd.); Bing Han (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University)
230 | Advancing Controllable Music Generation with Latent Rectified Flow Guided by Rhythm and Harmony | Haibin Yu (Shanghai Jiao Tong University)*; Jiayi Zhou (Ant Group); Wei Wang (Shanghai Jiao Tong University); Zhiming Wang (Ant Group); Huijia Zhu (Ant Group); Yanmin Qian (Shanghai Jiao Tong University)
333 | Improving Streaming ASR via Differentially Private Fusion of Data from Multiple Sources | Virat Shejwalkar (Google)*; Om Thakkar (OpenAI); Steve Chien (Google); Nicole Rafidi (Google); Arun Narayanan (Google)
503 | HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids | Dyah A. M. G. Wisnu (Academia Sinica); Stefano Rini (National Yang Ming Chiao Tung University); Ryandhimas E. Zezario (Academia Sinica); Hsin-Min Wang (Academia Sinica); Yu Tsao (Academia Sinica)*
57 | Acoustic Phonetic Temporal Speech Representation | Yunbin Deng (MIT)*
76 | Token-based Attractors and Cross-attention in Spoof Diarization | kyo-won koo (University of Seoul)*; Chan-yeong Lim (University of Seoul); Jee-weon Jung (Carnegie Mellon University); Hye-jin Shim (Carnegie Mellon University); Ha-Jin Yu (University of Seoul)
173 | Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts | Ashi Garg (Johns Hopkins University)*; Zexin Cai (Johns Hopkins University); Henry Li (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Kevin Duh (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Matthew Wiesner (Johns Hopkins University); Nicholas Andrews (Johns Hopkins University)
211 | MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection | Zihan Pan (Institute for Infocomm Research (I2R), A*STAR, Singapore)*; Hardik Sailor (Institute for Infocomm Research (I2R), A*STAR, Singapore); Jinyang Wu (Institute for Infocomm Research (I2R), A*STAR, Singapore)
215 | Towards Generalized Source Tracing for Codec-Based Deepfake Speech | I-Ming Lin (National Taiwan University); XUANJUN CHEN (National Taiwan University)*; Lin Zhang (Johns Hopkins University); Haibin Wu (National Taiwan University); Hung-yi Lee (National Taiwan University); Jyh-Shing Roger Jang (National Taiwan University)
241 | Post-training for Deepfake Speech Detection | Wanying Ge (National Institute of Informatics)*; Xin Wang (National Institute of Informatics); Xuechen Liu (National Institute of Informatics); Junichi Yamagishi (National Institute of Informatics)
334 | Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System | Hashim Ali (University of Michigan)*; Surya Subramani (University of Michigan); Lekha Bollinani (University of Michigan); Nithin Sai Adupa (University of Michigan); Hafiz Malik (University of Michigan); Sali El-Loh (University of Michigan-Dearborn)
419 | Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention Alternative | Xi Xuan (University of Eastern Finland)*; Zimo Zhu (University of California, Santa Barbara); Wenxin Zhang (University of Chinese Academy of Sciences); Yi-Cheng Lin (National Taiwan University); Tomi Kinnunen (University of Eastern Finland)
Paper ID | Paper Title | Authors
26 | SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization | Chien-Chun Wang (National Taiwan Normal University)*; En-Lun Yu (National Taiwan Normal University); Jeih-Weih Hung (National Chi Nan University); Shih-Chieh Huang (Realtek Semiconductor Corp.); Berlin Chen (National Taiwan Normal University)
59 | PhysMVNet: Physics-Informed End-to-End MVDR Beamformer with Residual Spectral Mapping for Multichannel Speech Enhancement | Xingyu Shen (Concordia University); Wei-Ping Zhu (Concordia University)*; Benoit Champagne (McGill University)
65 | LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models | Beilong Tang (Duke Kunshan University)*; Bang Zeng (Duke Kunshan University); Ming Li (Duke Kunshan University)
80 | Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation | Jisoo Park (Chung-Ang University); Seonghak Lee (Chung-Ang University); Guisik Kim (Korea Electronics Technology Institute (KETI)); Taewoo Kim (Korea Electronics Technology Institute (KETI)); Junseok Kwon (Chung-Ang University)*
114 | MBENet: Bone-conduction and Air-conduction Fusion Network for Target Speaker Extraction | Chen Zhang (School of Marine Science and Technology, Northwestern Polytechnical University); Linfeng Feng (School of Marine Science and Technology, Northwestern Polytechnical University); Zhi Liu (Shenzhen Huangli Technologies Company Ltd., Shenzhen, China); Xiao-Lei Zhang (Research & Development Institute of Northwestern Polytechnical University in Shenzhen)*; Xuelong Li (Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Ltd., Beijing, China)
136 | Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction | Amartya Veer (Indian Institute of Science)*; Murali Kadambi (Indian Institute of Science); Chandra Mohan Sharma (Center for Artificial Intelligence and Robotics, DRDO); Anupam Mondal (Center for Artificial Intelligence and Robotics, DRDO); Prasanta Kumar Ghosh (Indian Institute of Science)
154 | Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation | Guo Chen (Tsinghua University); Kai Li (Tsinghua University)*; Runxuan Yang (Tsinghua University); Xiaolin Hu (Tsinghua University)
162 | A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References | Simon Jepsen (Aalborg University)*; Mads Græsbøll Christensen (Aalborg University); Jesper Rindom Jensen (Aalborg University)
187 | Deep Audio Zooming: Creating a Sound Barrier with Microphone Array Processing | Meng Yu (Tencent)*; Dong Yu (Tencent)
226 | EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation | Xingchen Li (Northwestern Polytechnical University)*; Boyi Kang (Northwestern Polytechnical University); Ziqian Wang (Northwestern Polytechnical University); Zihan Zhang (Northwestern Polytechnical University); Mingshuai Liu (Northwestern Polytechnical University); Zhonghua Fu (Northwestern Polytechnical University); Lei Xie (Northwestern Polytechnical University)
286 | Geometry-Agnostic Acoustic Processing: A Dynamic Spatial Network for Joint Echo Cancellation and Noise Suppression | kangqi jing (Southeast University)*; wenbin zhang (midea); jun du (University of Science and Technology of China); qing wang (University of Science and Technology of China); yu gao (midea)
296 | AdaBit-TasNet: Speech Separation with Inference Adaptable Precision | Mohamed Elminshawi (International Audio Laboratories Erlangen)*; Srikanth Raj Chetupalli (Indian Institute of Technology Bombay); Emanuël Habets (AudioLabs)
310 | Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment | Wei Wang (Shanghai Jiao Tong University)*; Wangyou Zhang (Shanghai Jiao Tong University); Chenda Li (Shanghai Jiao Tong University); JiaTong Shi (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University)
319 | URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition | Jiahe Wang (Shanghai Jiao Tong University); Chenda Li (Shanghai Jiao Tong University)*; Wei Wang (Shanghai Jiao Tong University); Wangyou Zhang (Shanghai Jiao Tong University); Samuele Cornell (Carnegie Mellon University); Marvin Sach (Technische Universität Braunschweig); Robin Scheibler (Google Deepmind); Kohei Saijo (Waseda University); Yihui Fu (Technische Universität Braunschweig); Zhaoheng Ni (Meta AI); Anurag Kumar (Meta AI); Tim Fingscheidt (Technische Universität Braunschweig); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University)
332 | Less is More: Data Curation Matters in Scaling Speech Enhancement | Chenda Li (Shanghai Jiao Tong University)*; Wangyou Zhang (Shanghai Jiao Tong University); Wei Wang (Shanghai Jiao Tong University); Robin Scheibler (Google Deepmind); Kohei Saijo (Waseda University); Samuele Cornell (Carnegie Mellon University); Yihui Fu (Technische Universität Braunschweig); Marvin Sach (Technische Universität Braunschweig); Zhaoheng Ni (Meta AI); Anurag Kumar (Meta AI); Tim Fingscheidt (Technische Universität Braunschweig); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University)
383 | Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement | Heitor Guimarães (Institut National de la Recherche Scientifique)*; Ke Tan (Meta Reality Labs); Juan Azcarreta (Meta Reality Labs); Jesus M. Alvarez (Meta Reality Labs); Prabhav Agrawal (Meta AI); Ashutosh Pandey (Meta Reality Labs); Buye Xu (Meta Reality Labs)
431 | Pitch-Assistant Harmonic Recovery for Efficient Speech Enhancement | Biao Liu (Institute of Acoustics Chinese Academy of Sciences)*; Zengqiang Shang (Institute of Acoustics Chinese Academy of Sciences); Haoyuan Xie (Institute of Acoustics Chinese Academy of Sciences); Mou Wang (Hardware Engineering System, OPPO); Xin Liu (Hardware Engineering System, OPPO); Pengyuan Zhang (Institute of Acoustics Chinese Academy of Sciences)
25 | KAN-AST: Kolmogorov-Arnold Network based Audio Spectrogram Transformer for Audio Classification | Tuan Dat Phuong (Hanoi University of Science and Technology)*; Huy Dat Tran (Institute for Infocomm Research, Agency for Science, Technology and Research)
256 | Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification | BIN WU (RIKEN AIP)*; Shinnosuke Takamichi (RIKEN AIP/Keio University); Sakriani Sakti (RIKEN AIP/NAIST); Satoshi Nakamura (RIKEN AIP/NAIST/CUHK-Shenzhen)
71 | Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora | Jing Xu (The Chinese University of Hong Kong)*; Daxin Tan (Huawei Noah's Ark Lab); Jiaqi Wang (The Chinese University of Hong Kong); Xiao Chen (Huawei Noah's Ark Lab)
44 | Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems | Bo Ren (Microsoft)*; Yu Shi (Microsoft); Jinyu Li (Microsoft)
142 | TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree | Andrei Andrusenko (NVIDIA)*; Vladimir Bataev (NVIDIA); Lilit Grigoryan (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA)
224 | Efficient ASR Domain Adaptation with Long Noun Phrases: Harnessing the Linguistic Characteristics of Japanese | Shusuke Komatsu (mocomoco inc. / Nara Institute of Science and Technology / RIKEN Guardian Robot Project)*; Kazuyo Onishi (mocomoco inc. / Nara Institute of Science and Technology / RIKEN Guardian Robot Project); Kouki Tanaka (mocomoco inc. / Nara Institute of Science and Technology); Dohyun Kim (mocomoco inc. / Nara Institute of Science and Technology); Koichiro Yoshino (Institute of Science Tokyo / RIKEN Guardian Robot Project / Nara Institute of Science and Technology)
408 | Customizing Speech Recognition Model with Large Language Model Feedback | Shaoshi Ling (Microsoft)*; Guoli Ye (Microsoft)
254 | From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents | Wen Yu Chang (National Taiwan University)*; Tzu-Hung Huang (National Taiwan University); Chih-Ho Chen (National Taiwan University); Yun-Nung Chen (National Taiwan University)
Paper ID | Paper Title | Authors
301 | Graph Connectionist Temporal Classification for Phoneme Recognition | Henry Grafé (KU Leuven)*; Hugo Van hamme (KU Leuven)
313 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data | Cheng Kang Chou (National Taiwan University)*; Chan-Jan Hsu (National Taiwan University); Ho-Lam Chung (National Taiwan University); Liang-Hsuan Tseng (National Taiwan University); Hsi-Chun Cheng (National Taiwan University); Yu-Kuan Fu (National Taiwan University); Kuan Po Huang (National Taiwan University); Hung-Yi Lee (National Taiwan University)
325 | CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese | Carlos Carvalho (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa)*; Francisco Teixeira (INESC-ID); Catarina Botelho (INESC-ID); Anna Pompili (INESC-ID); Rubén Solera-Ureña (INESC-ID); Sérgio Paulo (INESC-ID); Mariana Julião (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa); Thomas Rolland (INESC-ID); John Mendonça (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa); Diogo Pereira (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa); Isabel Trancoso (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa); Alberto Abad (INESC-ID/Instituto Superior Técnico, Universidade de Lisboa)
337 | Whisper Has an Internal Word Aligner | Sung-Lin Yeh (University of Edinburgh)*; Yen Meng (University of Edinburgh); Hao Tang (University of Edinburgh)
355 | SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR | Pu Wang (KU LEUVEN)*; Shinji Watanabe (Carnegie Mellon University); Hugo Van hamme (KU Leuven)
412 | WST: Weakly Supervised Transducer for Automatic Speech Recognition | Dongji Gao (Johns Hopkins University)*; Chenda Liao (Microsoft); Changliang Liu (Microsoft); Matthew Wiesner (Johns Hopkins University); Leibny Paola Garcia (Johns Hopkins University); Daniel Povey (Xiaomi); Sanjeev Khudanpur (Johns Hopkins University); Jian Wu (Microsoft)
433 | Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition | Mu Yang (University of Texas at Dallas)*; Szu-Jui Chen (University of Texas at Dallas); Jiamin Xie (University of Texas at Dallas); John Hansen (University of Texas at Dallas)
453 | Evaluating Self-Supervised Speech Models via Text-based LLMs | Takashi Maekaku (LY Corporation)*; Keita Goto (LY Corporation); Jinchuan Tian (carnegie mellon university); Yusuke Shinohara (LY Corporation); Shinji Watanabe (carnegie mellon university)
502 | Aggregation-Free Uncertainty Estimation for CTC-Based Automatic Speech Recognition | Lars Rumberg (Leibniz Universität Hannover)*; Christopher Gebauer (Leibniz Universität Hannover); Jörn Ostermann (Leibniz Universität Hannover)
28 | Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding | Yu Xi (Shanghai Jiao Tong University)*; Xiaoyu Gu (Shanghai Jiao Tong University); Haoyu Li (Shanghai Jiao Tong University); Jun Song (Alibaba); Bo Zheng (Alibaba); Kai Yu (Shanghai Jiao Tong University)
73 | Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition | Hao Shi (SB Intuitions)*; Yusuke Fujita (SB Intuitions); Tomoya Mizumoto (SB Intuitions); Lianbo Liu (SB Intuitions); Atsushi Kojima (SB Intuitions); Yui Sudo (SB Intuitions)
103 | SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR | Wei-Ping Huang (National Taiwan University)*; Guan-Ting Lin (National Taiwan University); Hung-yi Lee (National Taiwan University)
122 | Identifying and Calibrating Overconfidence in Noisy Speech Recognition | Mingyue Huo (University of Illinois at Urbana-Champaign)*; Yuheng Zhang (University of Illinois at Urbana-Champaign); Yan Tang (University of Illinois at Urbana-Champaign)
269 | DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition | Alexander Polok (Brno University of Technology)*; Santosh Kesiraju (Brno University of Technology); Karel Beneš (Brno University of Technology); Bolaji Yusuf (Brno University of Technology); Lukáš Burget (Brno University of Technology); Jan Černocký (Brno University of Technology)
416 | Group Relative Policy Optimization for Speech Recognition | Prashanth Gurunath Shivakumar (Amazon)*; Yile Gu (Amazon); Ankur Gandhe (Amazon); Ivan Bulyko (Amazon)
462 | A Front-End Adaptation Network for Improving Speech Recognition Performance in Packet Loss and Noisy Environments | Yehoshua Dissen (Technion - Israel Institute of Technology)*; Israel Cohen (Technion - Israel Institute of Technology); Shiry Yonash (Technion - Israel Institute of Technology); Joseph Keshet (Technion - Israel Institute of Technology)
320 | Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting | Emiru Tsunoo (Sony Group Corporation)*; Hayato Futami (Sony Group Corporation); Yosuke Kashiwagi (Sony Group Corporation); Siddhant Arora (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
283 | A Momentum-Based Framework with Contrastive Data Generation for Robust Sound Source Localization | Hyun-Soo Kim (Hanyang University); Da-Hee Yang (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
23 | MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses | Yang Liu (Meta)*; Li Wan (Meta); Yiteng Huang (Meta); Yong Xu (Meta); yangyang shi (Meta); Saurabh Adya (Meta); ming sun (meta); Florian Metze (Meta)
75 | SV-Mixer: Replacing the Transformer Encoder with Lightweight MLPs for Self-Supervised Model Compression in Speaker Verification | Jungwoo Heo (University of Seoul)*; Hyun-seo Shin (University of Seoul); Chan-yeong Lim (University of Seoul); kyo-won koo (University of Seoul); Seung-bin KIM (University of Seoul); Ji-soo SON (University of Seoul); Ha-Jin YU (University of Seoul)
195 | Multi-Target Backdoor Attacks Against Speaker Recognition | Alexandrine Fortier (École de technologie supérieure (ÉTS))*; Sonal Joshi (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Jesus Villalba Lopez (Johns Hopkins University); Najim Dehak (Johns Hopkins University); Patrick Cardinal (École de technologie supérieure (ÉTS))
364 | State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data | Sara Barahona (AUDIAS Research Group, Universidad Autónoma de Madrid)*; Ladislav Mošner (Brno University of Technology); Themos Stafylakis (Athens University of Economics and Business | Omilia | Archimedes AI/Athena RC); Oldřich Plchot (Brno University of Technology); Junyi Peng (Brno University of Technology); Lukáš Burget (Brno University of Technology); Jan Černocký (Brno University of Technology)
378 | The JHU-MIT Speaker Recognition System for NIST SRE24: Post-Evaluation Analysis | Jesus Villalba (Johns Hopkins University)*; Jonas Borgstrom (MIT Lincoln Laboratory); Prabhav Singh (Johns Hopkins University); Leibny Paola Garcia Perera (Johns Hopkins University); Pedro Torres-Carrasquillo (Johns Hopkins University); Najim Dehak (Johns Hopkins University)
404 | Geolocation-Aware Robust Spoken Language Identification | Qingzheng Wang (Carnegie Mellon University)*; Hye-jin Shim (Carnegie Mellon University); Jiancheng Sun (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
409 | CoLMbo: Speaker Language Model for Descriptive Profiling | Massa Baali (CMU)*; Shuo Han (CMU); Syed Abdul Hannan (CMU); Purusottam Samal (CMU); Karan Veer Singh (FPrime AI); Soham Deshmukh (CMU); Rita Singh (CMU); Bhiksha Raj (CMU)
45 | Speech in-context learning of paralinguistic tasks | Jeremy Wong (Institute for Infocomm Research)*; Muhammad Huzaifah (Institute for Infocomm Research); Nancy Chen (Institute for Infocomm Research); Ai Ti Aw (Institute for Infocomm Research)
58 | Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs | Umberto Cappellazzo (University of Trento)*; Minsu Kim (Meta AI); Stavros Petridis (Imperial College London)
Paper ID | Paper Title | Authors
172 | Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities | George Saon (IBM)*; Avihu Dekel (IBM); Alexander Brooks (IBM); Tohru Nagano (IBM); Abraham Daniels (IBM); Aharon Satt (IBM); Ashish Mittal (IBM); Brian Kingsbury (IBM); David Haws (IBM); Edmilson Morais (IBM); Gakuto Kurata (IBM); Hagai Aronowitz (IBM); Ibrahim Ibrahim (IBM); Jeff Kuo (IBM); Kate Soule (IBM); Luis Lastras (IBM); Masayuki Suzuki (IBM); Ron Hoory (IBM); Samuel Thomas (IBM); Sashi Novitasari (IBM); Takashi Fukuda (IBM); Vishal Sunder (IBM); Xiaodong Cui (IBM); Zvi Kons (IBM)
285 | All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR | Takafumi Moriya (NTT Corporation)*; Masato Mimura (NTT); Tomohiro Tanaka (NTT); Hiroshi Sato (NTT); Ryo Masumura (NTT); Atsunori Ogawa (NTT)
64 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR | Tuan Nguyen (Institute for Infocomm Research, A*STAR)*; Huy-Dat Tran (Institute for Infocomm Research, A*STAR)
178 | PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation | Jiajun He (Nagoya University)*; Naoki Sawada (CyberAgent); Koichi Miyazaki (CyberAgent); Tomoki Toda (Nagoya University)
190 | JOOCI: a Novel Method for Learning Comprehensive Speech Representations | Hemant Yadav (IIIT Delhi)*; Sunayana Sitaram (Microsoft research); Rajiv ratn shah (IIIT Delhi)
237 | Efficient Scaling for LLM-based ASR | Bingshen Mu (Northwestern Polytechnical University)*; Yiwen Shao (Tencent AI Lab); Kun Wei (Tencent AI Lab); Dong Yu (Tencent AI Lab); Lei Xie (Northwestern Polytechnical University)
270 | Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts | DUYGU ALTINOK (Deepgram)*
292 | Training and Inference Efficiency of Encoder-Decoder Speech Models | Piotr Żelasko (NVIDIA)*; Kunal Dhawan (NVIDIA); Daniel Galvez (NVIDIA); Krishna Puvvada (NVIDIA); Ankita Pasad (NVIDIA); Travis Bartley (NVIDIA); Nithin Koluguri (NVIDIA); Ke Hu (NVIDIA); Vitaly Lavrukhin (NVIDIA); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA)
397 | Phoneme Overlapping-Aware Pre-Training with External Text Resources for Multi-Talker ASR | Ryo Masumura (NTT Corporation)*; Tomohiro Tanaka (NTT Corporation); Naoki Makishima (NTT Corporation); Mana Ihori (NTT Corporation); Shota Orihashi (NTT Corporation); Naotaka Kawata (NTT Corporation); Taiga Yamane (NTT Corporation); Satoshi Suzuki (NTT Corporation); Takafumi Moriya (NTT Corporation)
399 | Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder | Muhammad Shakeel (Honda Research Institute Japan)*; Yui Sudo (Honda Research Institute Japan); Yifan Peng (Carnegie Mellon University); Chyi-Jiunn Lin (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
429 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation | Yuezhang PENG (Shanghai Jiao Tong University)*; Yuxin Liu (Shanghai Jiao Tong University); Yao Li (AVIC); Sheng Wang (Shanghai Jiao Tong University); Fei Wen (Shanghai Jiao Tong University); Xie Chen (Shanghai Jiao Tong University)
43 | Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling | Ju-Chieh Chou (TTIC)*; Jiawei Zhou (Stony); Karen Livescu (TTIC)
125 | AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models | Chih-Kai Yang (National Taiwan University)*; Neo Ho (National Taiwan University); Yi-Jyun Lee (National Taiwan University); Hung-yi Lee (National Taiwan University)
290 | Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting | Ramesh Gundluru (Indian Institute of Technology, Hyderabad, India)*; Shubham Gupta (Indian Institute of Technology, Hyderabad, India); Sri Rama Murty Kodukula (Indian Institute of Technology, Hyderabad, India)
291 | TokenVerse++: Towards Flexible Multitask Learning with Dynamic Task Activation | Shashi Kumar (Idiap Research Institute, Martigny)*; Srikanth Madikeri (University of Zurich); Esaú Villatoro-Tello (Idiap Research Institute); Sergio Burdisso (Idiap Research Institute); Pradeep Rangappa (Idiap Research Institute); Andrés Carofilis (Idiap Research Institute); Petr Motlicek (Idiap Research Institute); Karthik Pandia (Uniphore); Shankar Venkatesan (Uniphore); Kadri Hacioğlu (Uniphore); Andreas Stolcke (Uniphore)
300 | Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding | Tzu-wen Hsu (Purdue University)*; Ke-Han Lu (National Taiwan University); Cheng-Han Chiang (National Taiwan University); Hung-yi Lee (National Taiwan University)
406 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori (Mitsubishi Electric Research Laboratories (MERL))*; Yoshiki Masuyama (Mitsubishi Electric Research Laboratories (MERL)); Diego Romeres (Mitsubishi Electric Research Laboratories (MERL)); Devesh Jha (Mitsubishi Electric Research Laboratories (MERL)); Radu Corcodel (Mitsubishi Electric Research Laboratories (MERL)); Siddarth Jain (Mitsubishi Electric Research Laboratories (MERL)); Jonathan Le Roux (Mitsubishi Electric Research Laboratories (MERL))
407 | Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model | Haibin Wu (National Taiwan University)*; Yuxuan Hu (Microsoft); Ruchao Fan (Microsoft); Xiaofei Wang (Microsoft); Kenichi Kumatani (Microsoft); Bo Ren (Microsoft); Jianwei Yu (Microsoft); Heng Lu (Microsoft); Lijuan Wang (Microsoft); Yao Qian (Microsoft); Jinyu Li (Microsoft)
104 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi (Carnegie Mellon University)*; Chunlei Zhang (Bytedance); Jinchuan Tian (Carnegie Mellon University); Junrui Ni (UIUC); Hao Zhang (Tencent AI Lab); Shinji Watanabe (Carnegie Mellon University); Dong Yu (Tencent AI Lab)
402 | SLM-S2ST: A multimodal language model for direct speech-to-speech translation | Yuxuan Hu (Microsoft); Haibin Wu (National Taiwan University)*; Ruchao Fan (Microsoft); Xiaofei Wang (Microsoft); Heng Lu (Microsoft); Yao Qian (Microsoft); Jinyu Li (Microsoft)
422 | Evaluating Japanese Dialect Robustness across Speech and Text-based Large Language Models | Tomoya Mizumoto (SB Intuitions Corp.)*; Yusuke Fujita (SB Intuitions Corp.); Hao Shi (SB Intuitions Corp.); Lianbo Liu (SB Intuitions Corp.); Atsushi Kojima (SB Intuitions Corp.); Yui Sudo (SB Intuitions Corp.)
192 | Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training | Sathvik Udupa (Brno University of technology)*; Shinji Watanabe (Carnegie Mellon University); Petr Schwarz (Brno University of technology); Jan Cernocky (Brno University of technology)
236 | Predictive ASR and Turn-taking Prediction at Once: Towards More Responsive Spoken Dialog System | Ryo Fukuda (NTT Corporation)*; Takatomo Kano (NTT Corporation); Naohiro Tawara (NTT Corporation); Marc Delcroix (NTT Corporation); Atsunori Ogawa (NTT Corporation); Yuya Chiba (NTT Corporation); Atsushi Ando (NTT Corporation)
198 | Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style | Wonjune Kang (Massachusetts Institute of Technology)*; Deb Roy (Massachusetts Institute of Technology)
127 | WhisperNER: Unified Open Named Entity and Speech Recognition | Gil Ayache (aiOla); Menahem Pirchi (aiOla); Aviv Navon (aiOla); Aviv Shamsian (aiOla)*; Gill Hetz (aiOla); Joseph Keshet (aiOla)
175 | Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition? | Yurie Koga (The University of Tokyo)*; Shunsuke Kando (The University of Tokyo); Yusuke Miyao (The University of Tokyo, NII LLMC)
Paper ID | Paper Title | Authors
38 | Emotional Styles Hide in Deep Speaker Embeddings: Disentangle Deep Speaker Embeddings for Speaker Clustering | Chaohao Lin (Florida International University)*; Xu Zheng (Florida International University); Kaida Wu (Florida International University); Peihao Xiang (Florida International University); Ou Bai (Florida International University)
298 | ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy | Ya-Tse Wu (Department of Electrical Engineering, National Tsing Hua University)*; Chi-Chun Lee (National Tsing Hua University)
181 | On the use of self-supervised representation learning for speaker diarization and separation | Séverin BAROUDI (LIS)*; Hervé BREDIN (IRIT); Joseph RAZIK (LIS); Ricard MARXER (LIS)
253 | Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? | Shota Horiguchi (NTT, Inc.)*; Naohiro Tawara (NTT, Inc.); Takanori Ashihara (NTT, Inc.); Atsushi Ando (NTT, Inc.); Marc Delcroix (NTT, Inc.)
338 | Utilizing Kolmogorov-Arnold Network in Self-Supervised Learning for Speaker Diarization | Minh Vu (Hanoi University of Science and Technology)*; Tuan Dat Phuong (Hanoi University of Science and Technology); Kah Kuan Teh (Institute for Infocomm Research (I2R)); Van Tuan Nguyen (Institute for Infocomm Research (I2R)); Huy Dat Tran (Institute for Infocomm Research (I2R))
152 | XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation | Tianlun Zuo (Northwestern Polytechnical University)*; Jingbin Hu (Northwestern Polytechnical University); Yuke Li (Northwestern Polytechnical University); Xinfa Zhu (Northwestern Polytechnical University); Hai Li (iQIYI, Inc.); Ying Yan (iQIYI, Inc.); Junhui Liu (iQIYI, Inc.); Danming Xie (iQIYI, Inc.); Lei Xie (Northwestern Polytechnical University)
4 | Scalable Controllable Accented TTS | Henry Li Xinyuan (Johns Hopkins University)*; Zexin Cai (Johns Hopkins University); Ashi Garg (Johns Hopkins University); Kevin Duh (Johns Hopkins University); Leibny Paola García-Perera (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Nicholas Andrews (Johns Hopkins University); Matthew Wiesner (Johns Hopkins University)
33 | GenVC: Self-Supervised Zero-Shot Voice Conversion | Zexin Cai (Johns Hopkins University)*; Henry Li (Johns Hopkins University); Ashi Garg (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Kevin Duh (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Matthew Wiesner (Johns Hopkins University); Nicholas Andrews (Johns Hopkins University)
78 | REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers | Yuepeng Jiang (Northwestern Polytechnical University)*; Ziqian Ning (Northwestern Polytechnical University); Shuai Wang (School of Intelligence Science and Technology, Nanjing University, Suzhou, China); Chengjia Wang (Fuxi AI Lab, NetEase Inc.); Mengxiao Bi (Fuxi AI Lab, NetEase Inc.); Pengcheng Zhu (Fuxi AI Lab, NetEase Inc.); Zhonghua Fu (Northwestern Polytechnical University); Lei Xie (Northwestern Polytechnical University)
107 | Conan: A Chunkwise Online Network for Zero-shot Adaptive Voice Conversion | Yu Zhang (Zhejiang University)*; Baotong Tian (University of Rochester); Zhiyao Duan (University of Rochester)
110 | Layer-wise Analysis for Quality of Multilingual Synthesized Speech | Erica Cooper (National Institute of Information and Communications Technology)*; Takuma Okamoto (National Institute of Information and Communications Technology); Yamato Ohtani (National Institute of Information and Communications Technology); Tomoki Toda (Nagoya University); Hisashi Kawai (NICT)
129 | Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling | Xiaodan Chen (Institute for Infocomm Research (I2R), A*STAR)*; Xiaoxue Gao (Institute for Infocomm Research (I2R), A*STAR); Mathias Quoy (CY Cergy Paris University); Alexandre Pitti (CY Cergy Paris University); Nancy F. Chen (Institute for Infocomm Research (I2R), A*STAR)
143 | A Universal Harmonic Discriminator for High-quality GAN-based Vocoder | Nan Xu (Alibaba Digital Media & Entertainment Group)*; Zhaolong Huang (Alibaba Digital Media & Entertainment Group); Xiao Zeng (Alibaba Digital Media & Entertainment Group)
150 | Diffrhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization | Huakang Chen (Northwestern Polytechnical University); Yuepeng Jiang (Northwestern Polytechnical University); Guobin Ma (Northwestern Polytechnical University); Chunbo Hao (Northwestern Polytechnical University); Shuai Wang (School of Intelligence Science and Technology, Nanjing University, Suzhou, China); Jixun Yao (Northwestern Polytechnical University); Ziqian Ning (Northwestern Polytechnical University); Meng Meng (MiLM Plus, Xiaomi Inc.); Jian Luan (MiLM Plus, Xiaomi Inc.); Lei Xie (Northwestern Polytechnical University)*
225 | DarkStream: real-time speech anonymization with low latency | Waris Quamer (Texas A&M University)*; Ricardo Gutierrez-Osuna (Texas A&M University)
267 | SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization | Beilong Tang (Duke Kunshan University)*; Xiaoxiao Miao (Duke Kunshan University); Xin Wang (National Institute of Informatics); Ming Li (Duke Kunshan University)
289 | Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion | Arnon Turetzky (Hebrew University of Jerusalem)*; Nimrod Shabtay (IBM); Slava Shechtman (IBM); David Haws (IBM); Hagai Aronowitz (IBM); Ron Hoory (IBM); Yossi Adi (Hebrew University of Jerusalem); avihu dekel (IBM)
295 | Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation | Hongming Guo (Beijing University of Posts and Telecommunications); Ruibo Fu (Institute of Automation, Chinese Academy of Sciences)*; Yizhong Geng (Beijing University of Posts and Telecommunications); Shuchen Shi (Shanghai Polytechnic University); Tao Wang (Institute of Automation, Chinese Academy of Sciences); Chunyu Qiang (Tianjin University); Ya Li (Beijing University of Posts and Telecommunications); Zhengqi Wen (Tsinghua University); Yukun Liu (UCAS); Xuefei Liu (Qiyuan Lab); Chenxing Li (Tencent, AI lab)
348 | Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling | Navin Raj Prabhu (University of Hamburg, Signal Processing Lab)*; Danilo de Oliveira (University of Hamburg, Signal Processing Lab); Nale Lehmann-Willenbrock (University of Hamburg); Timo Gerkmann (University of Hamburg, Signal Processing Lab)
384 | Can self-supervised models of speech predict the perceived acceptability in prosodic variation? | Sarenne Wallbridge (University of Edinburgh)*; Adaeze Adigwe (University of Edinburgh); Peter Bell (University of Edinburgh)
386 | Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody | Jinsung Yoon (POSTECH); Wooyeol Jeong (POSTECH)*; Young-Joo Suh (POSTECH); Jio Gim (POSTECH)
400 | Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty | Yiwen Zhao (Carnegie Mellon University)*; Jiatong Shi (Carnegie Mellon University); Yuxun Tang (Renmin University of China); William Chen (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
430 | ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching | Han Zhu (Xiaomi Corp.)*; Wei Kang (Xiaomi Corp.); Zengwei Yao (Xiaomi Corp.); Liyong Guo (Xiaomi Corp.); Fangjun Kuang (Xiaomi Corp.); Zhaoqing Li (Xiaomi Corp.); Weiji Zhuang (Xiaomi Corp.); Long Lin (Xiaomi Corp.); Daniel Povey (Xiaomi Corp.)
450 | EmoBiMamba-TTS: Bidirectional State Space Models for Emotion-Intensity Controllable Text-to-Speech | INSUNG HAM (Korea-Univ)*; BONWHA KU (Korea-Univ); HANSEOK KO (Korea-Univ); HANSEOK KO (Catholic University of America)
458 | Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence | Yerin Ryu (Korea University)*; Inseop Shin (Korea University); Chanwoo Kim (Korea University)
46 | Obtaining objective labels and analysing annotator subjectivity by using a Rasch model for ordinal speech processing | Jeremy Wong (Institute for Infocomm Research)*; Nancy Chen (Institute for Infocomm Research)
Paper ID | Paper Title | Authors
169 | Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech | Xinyu Liang (HCLTech)*; Fredrik Cumlin (KTH Royal Institute of Technology); Victor Ungureanu (Google); Chandan K.A. Reddy (Google); Christian Schuldt (Google); Saikat Chatterjee (KTH Royal Institute of Technology)
18 | Diversity and complementarity of speech encoders across diverse tasks in a multi-modal large language model | Jeremy Wong (Institute for Infocomm Research)*; Muhammad Huzaifah (Institute for Infocomm Research); Hardik Sailor (Institute for Infocomm Research); Shuo Sun (Institute for Infocomm Research); Kye Min Tan (Institute for Infocomm Research); Bin Wang (MiroMind); Qiongqiong Wang (Institute for Infocomm Research); Wenyu Zhang (Institute for Infocomm Research); Xunlong Zou (Institute for Infocomm Research); Nancy Chen (Institute for Infocomm Research); Ai Ti Aw (Institute for Infocomm Research)
31 | Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model | Ziyang Ma (Shanghai Jiao Tong University)*; Zhuo Chen (ByteDance Inc.); Yuping Wang (ByteDance Inc.); Eng-Siong Chng (Nanyang Technological University); Xie Chen (Shanghai Jiao Tong University)
54 | Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data | Gokul Karthik Kumar (Technology Innovation Institute)*; Rishabh Saraf (Technology Innovation Institute); Ludovick Lepauloux (Technology Innovation Institute); Abdul Muneer (Technology Innovation Institute); Billel Mokeddem (Technology Innovation Institute); Hakim Hacid (Technology Innovation Institute)
133 | Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion | Donghoon Lim (Hanyang University); Youngchae Kim (Hanyang University); Dong-Hyun Kim (Hanyang University); Da-Hee Yang (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
167 | Improving Multimodal Speech-To-Slide Alignment for Academic Lectures with Vision LLMs | Thomas Ranzenberger (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Steffen Freisinger (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm)
262 | Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning | Yu Hsuan Fang (National Taiwan Normal University)*; Tien Hong Lo (National Taiwan Normal University); Yao Ting Sung (National Taiwan Normal University); Berlin Chen (National Taiwan Normal University)
273 | Interpreting the Role of Visemes in Audio-Visual Speech Recognition | Aristeidis Papadopoulos (Trinity College Dublin)*; Naomi Harte (Trinity College Dublin)
311 | MEAN-RIR: Multi-Modal Environment-Aware Network for Robust Room Impulse Response Estimation | Jiajian Chen (University of Science and Technology of China); Jiakang Chen (University of Science and Technology of China); Hang Chen (University of Science and Technology of China); Qing Wang (University of Science and Technology of China)*; Yu Gao (AI Research Center, Midea Group (Shanghai) Co., Ltd., Shanghai 201702, China); Jun Du (University of Science and Technology of China)
398 | Transcribe, translate, or transliterate: An investigation of intermediate representations in spoken language models | Tolulope Ogunremi (Stanford)*; Christopher Manning (Stanford); Dan Jurafsky (Stanford); Karen Livescu (TTIC)
438 | Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models | Qiongqiong Wang (A*STAR)*; Hardik Sailor (A*STAR); Jeremy Wong (A*STAR); Tianchi Liu (A*STAR); Shuo Sun (A*STAR); Wenyu Zhang (A*STAR); Muhammad Huzaifah (A*STAR); Nancy Chen (A*STAR); Ai Ti Aw (A*STAR)
321 | mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks | Luel Hagos Beyene (African Institute for Mathematical Sciences Research and Innovation Center)*; Vivek Verma (Université de Montréal); Min Ma (Google); Jesujoba O. Alabi (Saarland University); Fabian Schmidt (University of Würzburg); Joyce Nakatumba Nabende (Makerere University); David Ifeoluwa Adelani (McGill University)
13 | Qieemo: Multimodal Emotion Recognition Based on the ASR Backbone | jinming chen (Qifu Technology)*; jingyi fang (Qifu Technology); yuanzhong zheng (Qifu Technology); yaoxuan wang (Qifu Technology); haojun fei (Qifu Technology)
41 | Recognizing Dementia from Neuropsychological Tests with State Space Models | Liming Wang (Massachusetts Institute of Technology)*; Saurabhchand Bhati (Massachusetts Institute of Technology); Cody Karjadi (Boston University); Rhoda Au (Boston University); James Glass (Massachusetts Institute of Technology)
91 | Intermediate-Selective Feature Enhancement for Speech Emotion Recognition | li yangbiao (South China University of Technology)*; Xing Xiaofen (South China University of Technology); Mai Jialong (South China University of Technology); Xing Jingyuan (South China University of Technology); Xu Xiangmin (South China University of Technology)
299 | RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance | Jing-Han Chen (National Tsing Hua University); Bo-Hao Su (National Tsing Hua University); Ya-Tse Wu (National Tsing Hua University); Chi-Chun Lee (National Tsing Hua University)*
302 | Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition | Shreya Upadhyay (National Tsing Hua University); Carlos Busso (Carnegie Mellon University); Chi-Chun Lee (National Tsing Hua University)*
336 | Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM | Thomas Thebaud (Johns Hopkins University)*; Yen-Ju Lu (Johns Hopkins University); Matthew Wiesner (Johns Hopkins University); Peter Viechnicki (Johns Hopkins University); Najim Dehak (Johns Hopkins University)
341 | Joint ASR and Speech Attribute Prediction for Conversational Dysarthric Speech Analysis with Multimodal Language Models | Dominik Wagner (Technische Hochschule Nuernberg Georg Simon Ohm)*; Ilja Baumann (Technische Hochschule Nuernberg Georg Simon Ohm); Natalie Engert (Technische Hochschule Nuernberg Georg Simon Ohm); Elmar Nöth (Friedrich-Alexander-Universität Erlangen-Nürnberg); Korbinian Riedhammer (Technische Hochschule Nuernberg Georg Simon Ohm); Tobias Bocklet (Technische Hochschule Nuernberg Georg Simon Ohm)
367 | More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition | James Tavernor (University of Michigan)*; Emily Mower Provost (University of Michigan)
387 | Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model | Mana Ihori (NTT Corporation)*; Taiga Yamane (NTT Corporation); Naotaka Kawata (NTT Corporation); Naoki Makishima (NTT Corporation); Tomohiro Tanaka (NTT Corporation); Satoshi Suzuki (NTT Corporation); Shota Orihashi (NTT Corporation); Ryo Masumura (NTT Corporation)
428 | Robust Speech Emotion Recognition via Classifier Retraining on Mixup-Augmented Representations | Shi-wook Lee (National Institute of Advanced Industrial Science and Technology)*
Paper IDPaper TitleAuthors
124Full-Duplex-Bench: A Benchmark to Evaluate Full-Duplex Spoken Dialogue Models on Turn-taking CapabilitiesGuan-Ting Lin (National Taiwan University)*; Jiachen Lian (UC Berkeley); Tingle Li (UC Berkeley); Qirui Wang (University of Washington); Gopala Anumanchipalli (UC Berkeley); Alexander H. Liu (Massachusetts Institute of Technology); Hung-yi Lee (National Taiwan University)
147EmoTale: An Enacted Speech-emotion Dataset in DanishMaja Jønck Hjuler (University Grenoble Alpes)*; Harald Vilhelm Skat-Rørdam (Technical University of Denmark); Line Clemmensen (Technical University of Denmark); Sneha Das (Technical University of Denmark)
189FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual AbilitiesLilit Grigoryan (NVIDIA)*; Vladimir Bataev (NVIDIA); Nikolay Karpov (NVIDIA); Andrei Andrusenko (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA)
191SENSE models: an open source solution for multilingual and multimodal semantic-based tasksSalima Mdhaffar (LIA - University of Avignon)*; Haroun Elleuch (Elyadata/LIA); Chaimae Chellaf (LIA); Ha Nguyen (Oracle); Yannick Estève (LIA - University of Avignon)
196EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue SystemsJingwen Liu (Zhejiang University); Kan Jen Cheng (UC Berkeley)*; Jiachen Lian (UC Berkeley); Tingle Li (UC Berkeley); Akshay Anand (UC Berkeley); Rishi Jain (UC Berkeley); Faith Qiao (UC Berkeley); Robbie Netzorg (UC Berkeley); Huang-Cheng Chou (National Tsing Hua University); Guan-Ting Lin (National Taiwan University); Gopala Anumanchipalli (UC Berkeley)
277Benchmarking Prosody Encoding in Discrete Speech TokensKentaro Onda (The University of Tokyo)*; Satoru Fukayama (National Institute of Advanced Industrial Science and Technology (AIST)); Daisuke Saito (The University of Tokyo); Nobuaki Minematsu (The University of Tokyo)
282MNSC: Advancing Singlish Speech Understanding with Carefully Curated CorporaBin Wang (National University of Singapore)*; Xunlong Zou (A*STAR); Shuo Sun (A*STAR ); Wenyu Zhang (A*STAR ); Yingxu He (A*STAR ); Zhuohan Liu (A*STAR ); Chengwei Wei (A*STAR ); Nancy F. Chen (A*STAR ); AiTi Aw (A*STAR )
316Meta Audiobox Aesthetics: Unified Automatic Assessment for Speech, Music and SoundAndros Tjandra (Meta AI)*; Yi-Chiao Wu (Meta AI); Baishan Guo (Meta AI); John Hoffman (Meta AI); Brian Ellis (Meta AI); Apoorv Vyas (Meta AI); Bowen Shi (Meta AI); Sanyuan Chen (Meta AI); Matt Le (Meta AI); Nick Zacharov (Meta AI); Carleigh Wood (Meta AI); Ann Lee (Meta AI); Wei-Ning Hsu (Meta AI)
356CASPER: A Large Scale Spontaneous Speech DatasetCihan Xiao (Johns Hopkins University)*; Ruixing Liang (Johns Hopkins University); Xiangyu Zhang (Johns Hopkins University); Mehmet Emre Tiryaki (Johns Hopkins University); Veronica Bae (Johns Hopkins University); Lavanya Shankar (Johns Hopkins University); Rong Yang (Johns Hopkins University); Ethan Poon (Edison Academy Magnet School); Emmanuel Dupoux (Meta); Sanjeev Khudanpur (Johns Hopkins University); Leibny Paola Garcia Perera (Johns Hopkins University)
444Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task ConsistencyHaoran Wang (Shanghai Jiao Tong University)*; Guanyu Chen (Shanghai Jiao Tong University); Bohan Li (Shanghai Jiao Tong University); Hankun Wang (Shanghai Jiao Tong University); Yiwei Guo (Shanghai Jiao Tong University); Zhihan Li (Shanghai Jiao Tong University); Xie Chen (Shanghai Jiao Tong University); Kai Yu (Shanghai Jiao Tong University)
47Multi-Distillation from Speech and Music Representation ModelsJui-Chiang Wei (National Taiwan University); Yi-Cheng Lin (National Taiwan University)*; Fabian Ritter-Gutierrez (Nanyang Technological University); Hung-yi Lee (National Taiwan University)
60A correlation-permutation approach for speech-music encoders model mergingFabian Ritter-Gutierrez (Nanyang Technological University)*; Yi-Cheng Lin (National Taiwan University); Jeremy H.M Wong (Institute for Infocomm Research); Hung-yi Lee (National Taiwan University); Eng Siong Chng (Nanyang Technological University); Nancy F. Chen ( Institute for Infocomm Research)
108Emphasis Sensitivity in Speech RepresentationsShaun Cassini (University of Sheffield)*; Thomas Hain (University of Sheffield); Anton Ragni (University of Sheffield)
112Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech TransformersTzu-Quan Lin (Graduate Institute of Communication Engineering, National Taiwan University)*; Tsung-Huan Yang (Academia Sinica); Chun-Yao Chang (University of California, Los Angeles); Kuang-Ming Chen (University of Washington); Tzu-hsun Feng (National Taiwan University); Hung-yi Lee (Graduate Institute of Communication Engineering, National Taiwan University); Hao Tang (The University of Edinburgh)
158Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech SynthesisWenjie Tian (Northwestern Polytechnical University)*; Xinfa Zhu (Northwestern Polytechnical University); Hanke Xie (Northwestern Polytechnical University); Zhen Ye (Hong Kong University of Science and Technology); Wei Xue (Hong Kong University of Science and Technology); Lei Xie (Northwestern Polytechnical University)
160An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking AssessmentTien-Hong Lo (National Taiwan Normal University)*; Szu-Yu Chen (National Taiwan Normal University); Yao-Ting Sung (National Taiwan Normal University); Berlin Chen (National Taiwan Normal University)
193Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech CodecsWei-Cheng Tseng (The University of Texas at Austin)*; David Harwath (The University of Texas at Austin)
258ProtoCLAP – Prototypical Contrastive Language-Audio PretrainingAdria Mallol-Ragolta (Technical University of Munich)*; Björn Schuller (Technical University of Munich)
281Personalized Federated Learning with Fuzzy Clustering for Dysarthric Speech RecognitionJie-Shiang Yang (National Tsing Hua University); Jing-Tong Tzeng (National Tsing Hua University); Chi-Chun Lee (National Tsing Hua University)*
318Iterative Feedback in the Online Active Learning ParadigmMark Lindsey (Probity, Inc.)*; Francis Kubala (Probity, Inc.); Richard M. Stern (Carnegie Mellon University)
359PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec LearningJiatong Shi (Carnegie Mellon University)*; Haoran Wang (Shanghai Jiaotong University); William Chen (Carnegie Mellon University); Chenda Li (Shanghai Jiaotong University); Wangyou Zhang (Shanghai Jiaotong University); Jinchuan Tian (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
361USAD: Universal Speech and Audio Representation via DistillationHeng-Jui Chang (Massachusetts Institute of Technology)*; Saurabhchand Bhati (Massachusetts Institute of Technology); James Glass (Massachusetts Institute of Technology); Alexander Liu (Massachusetts Institute of Technology)
426ULTRAS - Unified Learning of Transformer Representations for Audio and Speech SignalsAmeenudeen PE (IISc Bangalore)*; Charumathi Narayanan (IISc Bangalore); Sriram Ganapathy (IISc Bangalore)
20Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?Andrew Rouditchenko (MIT CSAIL)*; Saurabhchand Bhati (MIT CSAIL); Edson Araujo (Goethe University of Frankfurt); Samuel Thomas (IBM Research AI); Rogerio Feris (MIT-IBM Watson AI Lab); Hilde Kuehne (Tuebingen AI Center); James Glass (MIT CSAIL)
Paper IDPaper TitleAuthors
465Audio Aesthetics Prediction System QAM16k Based on Pre-trained Audio EncoderLinping Xu (ByteDance)*; Ziqian Wu (ByteDance); Dejun Zhang (ByteDance)
466QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation SystemsChien-Chun Wang (National Taiwan Normal University)*; Kuan-Tang Huang (National Taiwan Normal University); Cheng-Yeh Yang (National Taiwan Normal University); Hung-Shin Lee (United Link Co., Ltd.); Hsin-Min Wang (Academia Sinica); Berlin Chen (National Taiwan Normal University)
469Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent LayerGou Nishikawa (The University of Tokyo); Wataru Nakata (The University of Tokyo)*; Yuki Saito (The University of Tokyo); Kanami Imamura (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo); Tomohiko Nakamura (The University of Tokyo)
473The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based ModelsKatsuhiko Yamamoto (CyberAgent)*; Koichi Miyazaki (CyberAgent); Shogo Seki (CyberAgent)
475DyMEvalNet: Dynamic Text-Audio-Personalization Fusion for Multimodal Music Quality AssessmentXiaoxun Wu (Ningbo University); Kailai Shen (Juphoon System Software Co., Ltd.); Yuheng Huang (Ningbo University); Naiyuan Li (Ningbo University); Diqun Yan (Ningbo University of Finance and Economics)*
476ASTAR-NTU solution to AudioMOS Challenge 2025 Track1Fabian Ritter-Gutierrez (Nanyang Technological University)*; Yi-Cheng Lin (National Taiwan University); Jui-Chiang Wei (National Taiwan University); Jeremy H.M Wong (Institute for Infocomm Research); Nancy F. Chen (Institute for Infocomm Research); Hung-yi Lee (National Taiwan University)
477Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised EmbeddingsDyah Wisnu (Academia Sinica)*; Ryandhimas Zezario (Academia Sinica); Stefano Rini (National Yang Ming Chiao Tung University); Hsin-Min Wang (Academia Sinica); Yu Tsao (Academia Sinica)
479WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS PredictionJakaria Emon (Hokkaido Denshikiki Co., Ltd.)*; Md Abu Salek (Hokkaido Denshikiki Co., Ltd.)
481KyotoMOS2: MOS Prediction for Speech Across Multiple Sampling RatesWangjin Zhou (Kyoto University)*; Yizhou Zhang (Kyoto University); Tatsuya Kawahara (Kyoto University); Keisuke Imoto (Kyoto University)
483The AudioMOS Challenge 2025Wen-Chin Huang (Nagoya University)*; Hui Wang (Nankai University); Cheng Liu (Nankai University); Yi-Chiao Wu (Meta); Andros Tjandra (Meta); Wei-Ning Hsu (Meta); Erica Cooper (National Institute of Information and Communications Technology); Yong Qin (Nankai University); Tomoki Toda (Nagoya University)
485HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality AssessmentWenze Ren (National Taiwan University)*; Yi-Cheng Lin (National Taiwan University); Wen-Chin Huang (Nagoya University); Ryandhimas Zezario (Academia Sinica); Szu-Wei Fu (Nvidia); Sung-Feng Huang (Nvidia); Erica Cooper (NICT); Haibin Wu (Independent researcher); Hung-Yu Wei (National Taiwan University); Hsin-Min Wang (Academia Sinica); Hung-yi Lee (National Taiwan University); Yu Tsao (Academia Sinica)
497MMMOS: Multi-domain Multi-axis Audio Quality AssessmentYi-Cheng Lin (National Taiwan University)*; Jia-Hung Chen (National Taiwan University); Hung-yi Lee (National Taiwan University)
471Towards Scalable and Robust Multilingual ASR for Indian Languages with MixLoRA-WhisperYeseul Park (Inha University)*; Bowon Lee (Inha University)
474MADASR 2.0: Multi-Lingual Multi-Dialect ASR Challenge in 8 Indian LanguagesSaurabh Kumar (IISc Bengaluru)*; Sumit Sharma (IISc Bengaluru); Deekshitha G (IISc Bengaluru); Abhayjeet Singh (IISc Bengaluru); Amartya Veer (IISc Bengaluru); Sathvik Udupa (IISc Bengaluru); Sandhya Badiger (IISc Bengaluru); Sanjeev Khudanpur (Johns Hopkins University); Sunayana Sitaram (Microsoft Research); Srinivasan Umesh (Indian Institute of Technology, Madras); Bhuvana Ramabhadran (Google DeepMind); Brian Kingsbury (IBM Research); Hema A. Murthy (Indian Institute of Technology, Madras); Srikanth S Narayanan (University of Southern California); Howard Lakougna (Gates Foundation); Prasanta Kumar Ghosh (Indian Institute of Science, Bangalore)
470Voice Factor Control Using FIR-Based Fast Neural Vocoder for Speech Generation ApplicationsYamato Ohtani (NICT)*; Takuma Okamoto (NICT); Tomoki Toda (Nagoya University); Hisashi Kawai (NICT)
492Open Full-duplex Voice Agent with Speech-to-Speech Language ModelZhehuai Chen (NVIDIA)*; Edresson Casanova (NVIDIA); Chen Chen (NVIDIA); Kevin Hu (NVIDIA); Ankita Pasad (NVIDIA); Elena Rastorgueva (NVIDIA); Seelan Lakshmi Narasimhan (NVIDIA); Slyne Deng (NVIDIA); Ehsan Hosseini Asl (NVIDIA); Piotr Zelasko (NVIDIA); Valentin Mendelev (NVIDIA); Subhankar Ghosh (NVIDIA); Yifan Peng (NVIDIA); Jason Li (NVIDIA); Jagadeesh Balam (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA)
496Speech Masking System Based on Spatially Separated Multiple TTS Maskers With A Compact Circular Loudspeaker ArrayTakuma Okamoto (National Institute of Information and Communications Technology)*
490AsyncVoice Agent: Real-Time Explanation for LLM Planning and ReasoningYueqian Lin (Duke University)*; Zhengmian Hu (Adobe Research); Jayakumar Subramanian (Adobe Research); Qinsi Wang (Duke University); Nikos Vlassis (Adobe Research); Hai Li (Duke University); Yiran Chen (Duke University)
493CAVIARES: Corpus for Audio-Visual Expressive Voice AgentJinsheng Chen (The University of Tokyo)*; Yuki Saito (The University of Tokyo); Dong Yang (The University of Tokyo); Naoko Tanji (The University of Tokyo); Hironori Doi (LY Corporation); Byeongseon Park (LY Corporation); Yuma Shirahata (LY Corporation); Kentaro Tachibana (LY Corporation); Hiroshi Saruwatari (The University of Tokyo)
489AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven TasksLeander Maben (Carnegie Mellon University)*; Gayathri Lakshmy (Carnegie Mellon University); Srijith Radhakrishnan (Carnegie Mellon University); Siddhant Arora (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
491ChipChat: Low-Latency Cascaded Conversational Agent in MLXTatiana Likhomanenko (Apple)*; Luke Carlson (Thinking Machines Lab); He Bai (Apple); Zijin Gu (Apple); Han Tran (Apple); Zakaria Aldeneh (Apple); Yizhe Zhang (Apple); Ruixiang Zhang (Apple); Huangjie Zheng (Apple); Navdeep Jaitly (Apple)
501Long-Form Fuzzy Speech-to-Text Alignment for 1000+ LanguagesRuizhe Huang (Meta)*; Xiaohui Zhang (Meta); Zhaoheng Ni (Meta); Moto Hira (Meta); Jeff Hwang (Meta); Vineel Pratap (Meta); Ju Lin (Meta); Ming Sun (Meta); Florian Metze (Meta)
467Efficient Deployment of Large Speech Recognition Models on GPUYuekai Zhang (Nvidia)*; Shuang Yu (Nvidia); Junjie Lai (Nvidia)
478VERSA-v2: A Modular and Scalable Toolkit for Speech and Audio Evaluation with Expanded Metrics, Visualization, and LLM IntegrationJiatong Shi (Carnegie Mellon University)*; Bo-Hao Su (Carnegie Mellon University); Shikhar Bharadwaj (Carnegie Mellon University); Yiwen Zhao (Carnegie Mellon University); Shih-Heng Wang (University of Southern California); Jionghao Han (Carnegie Mellon University); Haoran Wang (Shanghai Jiaotong University); Wei Wang (Shanghai Jiaotong University); Wenhao Feng (Renmin University of China); Yuxun Tang (Renmin University of China); Nezih Topaloğlu (Yeditepe University); Siddhant Arora (Carnegie Mellon University); Jinchuan Tian (Carnegie Mellon University); William Chen (Carnegie Mellon University); Hye-jin Shim (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiaotong University); Wen-Chin Huang (Nagoya University); Shinji Watanabe (Carnegie Mellon University)