• This tutorial takes participants on a detailed journey through multimodal speech modeling, organized into three parts: (1) Understanding, (2) Generation, and (3) Real-Time/Interactive Systems. We begin by reviewing how traditional speech processing has evolved with the advent of foundation models and audio-visual learning, and then delve into cutting-edge research that bridges audio, language, and visual modalities. The goal is to equip participants with both the conceptual frameworks and the technical know-how to understand current audio-visual foundation models, and to inspire new research directions at the nexus of speech and multimodality.

    • UC Berkeley

    • NVIDIA Research

    • Meta GenAI; incoming Assistant Professor, Duke University, NC

  • Non-semantic speech metadata and paralinguistic signals carry information that is critical to modern voice AI. Such signals include the user's spoken language, accent, voice characteristics, emotion, and intent to activate or interact with the voice assistant. This tutorial offers a comprehensive overview of the most recent modeling approaches for non-semantic speech signals, from data acquisition to recent research advances empowered by foundation models. We share advances from both industry and academia, highlighting how diverse research tasks all contribute to the shared mission of creating scalable, personalized, and empathetic speech systems for everyone, everywhere. A minimal illustrative sketch of one such modeling approach appears after the presenter list below.

    • Google DeepMind

    • Google DeepMind

    • Indian Institute of Science, Bangalore, India

    • Google DeepMind
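
A common recipe in this space is to freeze a pretrained speech foundation model and train a lightweight classifier (a "linear probe") on its pooled utterance embeddings to predict paralinguistic attributes such as emotion or accent. The sketch below illustrates the idea only: the embeddings are synthetic placeholders, and the dimensions and label set are assumptions for illustration, not the tutorial's actual pipeline.

```python
# Hypothetical linear-probe setup: synthetic vectors stand in for pooled
# foundation-model speech embeddings; integer labels stand in for emotion classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_clips, embed_dim, num_emotions = 500, 256, 4

# Placeholder embeddings; in practice these would be mean-pooled frame
# representations of each utterance from a frozen pretrained encoder.
X = rng.normal(size=(num_clips, embed_dim))
y = rng.integers(0, num_emotions, size=num_clips)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A lightweight probe on frozen embeddings is a standard baseline for
# paralinguistic tasks such as emotion, accent, or speaker-trait prediction.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```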

  • This tutorial introduces optimal transport (OT) as a powerful mathematical framework for comparing and transforming probability distributions, with a focus on speech processing applications. OT offers a geometry-aware distance metric that is particularly useful in tasks like feature learning, domain adaptation, and multimodal knowledge transfer. Participants will learn both foundational concepts and advanced techniques for integrating OT into deep learning-based speech processing pipelines. The session will also highlight recent research findings and practical insights to inspire further exploration in this emerging area. A short worked example appears after the presenter list below.

    • Advanced Speech Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan

    • Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan
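
To make the "geometry-aware distance" concrete, the hedged sketch below compares two synthetic 1-D feature distributions with the Wasserstein-1 distance from SciPy. The data and the domain-adaptation framing are illustrative assumptions, not material from the tutorial itself.

```python
# Minimal sketch of OT as a geometry-aware distance between distributions,
# using SciPy's 1-D Wasserstein distance. The samples are synthetic
# placeholders for, e.g., scalar acoustic features from two domains.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=1000)  # "source-domain" features
target = rng.normal(loc=0.5, scale=1.2, size=1000)  # "target-domain" features

# Unlike KL divergence, the Wasserstein distance remains finite and smooth
# even when the empirical supports barely overlap, which is part of what
# makes OT attractive for domain adaptation and distribution matching.
print("W1(source, target) ~", wasserstein_distance(source, target))
print("W1(source, source) ~", wasserstein_distance(source, source))
```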

  • Although tasks like speech recognition, speaker identification, and sound event detection may appear distinct, they all fundamentally create an "embedding"—a dense vector representation—to capture salient information from the audio signal. This unified perspective allows us to investigate how close current task-specific models come to optimal and to explore the potential for a single, robust sound embedding that generalizes across all applications. To accelerate this research, we have introduced the Massive Sound Embedding Benchmark (MSEB), a comprehensive evaluation suite with diverse tasks reflecting real-world challenges. Initial results reveal significant headroom for improvement over existing methods, and we invite the community to use MSEB to benchmark their models and contribute to the future of generalized audio intelligence. The MSEB library is publicly available at: https://github.com/google-research/mseb. A toy sketch of this embedding view appears after the presenter list below.

    • Staff Research Scientist, Google

    • Staff Software Engineer, Google

    • Senior Staff Research Scientist, Google

    • Distinguished Scientist, Google
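
The sketch below illustrates the shared "embedding" view in the simplest possible terms: each clip is mapped to a fixed-size dense vector and downstream decisions compare vectors by cosine similarity. The toy energy-envelope encoder is a made-up stand-in for a learned model and is not the MSEB API; it only shows the interface an embedding-based evaluation assumes.

```python
# Illustrative sketch of the shared "embedding" view: every audio task maps a
# waveform to a dense vector, and downstream decisions compare those vectors.
# toy_embed is a crude stand-in (chunked energy envelope), not a real encoder.
import numpy as np

def toy_embed(waveform: np.ndarray, dim: int = 64) -> np.ndarray:
    """Crude fixed-size embedding: chunk the signal and average magnitudes."""
    chunks = np.array_split(np.abs(waveform), dim)
    vec = np.array([c.mean() for c in chunks])
    return vec / (np.linalg.norm(vec) + 1e-9)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors are already L2-normalized, so the dot product is the cosine.
    return float(a @ b)

rng = np.random.default_rng(0)
clip_a = np.concatenate([rng.normal(size=8000), 0.05 * rng.normal(size=8000)])  # loud then quiet
clip_b = clip_a + 0.05 * rng.normal(size=16000)                                 # noisy copy of clip_a
clip_c = np.concatenate([0.05 * rng.normal(size=8000), rng.normal(size=8000)])  # quiet then loud

ea, eb, ec = toy_embed(clip_a), toy_embed(clip_b), toy_embed(clip_c)
print("similar clips:  ", cosine(ea, eb))   # close to 1.0
print("dissimilar clips:", cosine(ea, ec))  # much lower
```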