• This tutorial takes participants on a detailed journey through multimodal speech modeling, organized into three parts: (1) Understanding, (2) Generation, and (3) Real-Time/Interactive Systems. We begin by reviewing how traditional speech processing has evolved with the advent of foundation models and audio-visual learning, and then delve into cutting-edge research that bridges audio, language, and visual modalities. The goal is to equip participants with both the conceptual frameworks and the technical know-how to understand current audio-visual foundation models, and to inspire new research directions at the nexus of speech and multimodality.

    • UC Berkeley

    • NVIDIA Research

    • Meta GenAI; incoming Assistant Professor, Duke University, NC

  • Non-semantic speech metadata and paralinguistic signals carry information that is critical to modern voice AI. Such signals include the user's spoken language, accent, voice characteristics, emotion, and intent to activate or interact with the voice assistant. This tutorial offers a comprehensive overview of modeling approaches for non-semantic speech signals, from data acquisition to recent research advances empowered by foundation models. We share advancements from both industry and academia, highlighting how diverse research tasks all contribute to the shared mission of creating scalable, personalized, and empathetic speech systems for everyone, everywhere. An illustrative code sketch of such a pipeline follows the presenter list below.

    • Google DeepMind

    • Google DeepMind

    • Indian Institute of Science, Bangalore, India

    • Google DeepMind
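
    As a flavor of the modeling described above, here is a minimal sketch of a paralinguistic linear probe: mean-pool frame features from a frozen speech foundation model and classify an utterance-level attribute such as emotion. The 768-dimensional features, four classes, and random tensors are illustrative assumptions, not the presenters' setup.

    ```python
    # Minimal sketch (assumed setup, not the presenters' code): a linear probe
    # over mean-pooled frame features from a frozen speech foundation model.
    import torch
    import torch.nn as nn

    class ParalinguisticHead(nn.Module):
        """Predict an utterance-level label from frame-level encoder features."""
        def __init__(self, feat_dim=768, num_classes=4):
            super().__init__()
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, frame_features):        # (batch, frames, feat_dim)
            pooled = frame_features.mean(dim=1)   # utterance-level embedding
            return self.classifier(pooled)        # class logits

    # Placeholder tensors standing in for encoder outputs and emotion labels
    features = torch.randn(8, 200, 768)           # 8 utterances, 200 frames each
    labels = torch.randint(0, 4, (8,))

    head = ParalinguisticHead()
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss = nn.functional.cross_entropy(head(features), labels)
    loss.backward()
    optimizer.step()
    print(f"training loss: {loss.item():.3f}")
    ```

    In practice the random tensors would be replaced by features from a pretrained encoder and labeled utterances; the same probe pattern applies to accent, language, or intent prediction.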

  • This tutorial introduces optimal transport (OT) as a powerful mathematical framework for comparing and transforming probability distributions, with a focus on speech processing applications. OT offers a geometry-aware distance metric that is particularly useful in tasks like feature learning, domain adaptation, and multimodal knowledge transfer. Participants will learn both foundational concepts and advanced techniques for integrating OT into deep learning-based speech processing pipelines. The session will also highlight recent research findings and practical insights to inspire further exploration in this emerging area. An illustrative Sinkhorn sketch follows the presenter list below.

    • Advanced Speech Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan

    • Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan
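
    The following is a minimal NumPy sketch of the core computation the tutorial builds on: the entropy-regularized (Sinkhorn) OT distance between two empirical distributions, here standing in for source- and target-domain speech embeddings. The regularization strength, iteration count, and random data are illustrative assumptions.

    ```python
    # Minimal sketch: entropic optimal transport between two sets of feature
    # vectors via Sinkhorn iterations. Illustrative only; hyperparameters are
    # arbitrary choices, not recommendations from the tutorial.
    import numpy as np

    def sinkhorn_distance(X, Y, reg=0.1, n_iters=200):
        """Entropic OT cost between uniform empirical distributions on X and Y."""
        n, m = X.shape[0], Y.shape[0]
        a = np.full(n, 1.0 / n)                      # uniform source weights
        b = np.full(m, 1.0 / m)                      # uniform target weights
        C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        C = C / C.max()                              # normalize costs for stability
        K = np.exp(-C / reg)                         # Gibbs kernel
        u = np.ones(n)
        for _ in range(n_iters):                     # Sinkhorn fixed-point updates
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]              # transport plan
        return float((P * C).sum())                  # cost of moving mass under P

    # Example: distance between embeddings from two slightly shifted domains
    rng = np.random.default_rng(0)
    src = rng.normal(0.0, 1.0, size=(64, 16))
    tgt = rng.normal(0.5, 1.0, size=(80, 16))
    print(sinkhorn_distance(src, tgt))
    ```

    Beyond the scalar distance, the transport plan P itself can be used, for example, to align or map features across domains in adaptation settings.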

  • Audio is a critical component of multimodal perception, requiring systems to demonstrate a wide range of capabilities, from transcription and classification to retrieval, reasoning, and segmentation. Fundamentally, these tasks rely on transforming raw audio into meaningful embeddings. We present the Massive Sound Embedding Benchmark (MSEB): an extensible, open-source framework designed to comprehensively evaluate the core auditory components of any multimodal system. MSEB's first release offers a suite of eight tasks, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. This tutorial will introduce the framework, review initial performance headrooms, and encourage community contributions to accelerate progress toward robust machine auditory intelligence. An illustrative embedding-evaluation sketch follows the presenter list below.

    • Staff Research Scientist, Google

    • Staff Software Engineer, Google

    • Cyril Allauzen, Senior Staff Research Scientist, Google

    • Senior Staff Research Scientist, Google

    • Distinguished Scientist, Google
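
    As a hypothetical illustration of the kind of evaluation such a benchmark standardizes, the sketch below scores placeholder embeddings on a retrieval task with recall@k. It does not use MSEB's actual API; the encoder outputs and ground truth are simulated.

    ```python
    # Hypothetical sketch of an embedding-based retrieval evaluation (recall@k).
    # Placeholder data stands in for an audio encoder's output; this is not
    # MSEB's actual API.
    import numpy as np

    def recall_at_k(query_emb, doc_emb, ground_truth, k=5):
        """Fraction of queries whose correct item is among the top-k neighbors."""
        q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
        d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
        scores = q @ d.T                             # cosine similarities
        topk = np.argsort(-scores, axis=1)[:, :k]    # indices of the k best matches
        hits = [gt in row for gt, row in zip(ground_truth, topk)]
        return float(np.mean(hits))

    # Simulated embeddings for a spoken-query retrieval task
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(100, 256))                   # 100 spoken queries
    targets = queries + 0.1 * rng.normal(size=(100, 256))   # their matching items
    print(recall_at_k(queries, targets, ground_truth=list(range(100))))
    ```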