• The success of mainstream machine learning for ASR (automatic speech recognition) and similar tasks is largely based on supervised training, which requires paired [audio, text] strings as training data.  In contrast, unsupervised training assumes that no explicit [audio, text] pairs are available.  On a task like LibriSpeech, this means we are given only the audio (without any transcriptions) as training data, along with domain-specific text data for training the language model.
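
A minimal sketch, in hypothetical Python types (names assumed, not from the talk), of the two data regimes contrasted above: supervised training consumes explicit [audio, text] pairs, while the unsupervised setting provides only untranscribed audio plus unpaired, domain-specific text for the language model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SupervisedExample:
    """One training item in the supervised regime: audio paired with its transcript."""
    audio: List[float]  # waveform samples
    text: str           # ground-truth transcription for this audio

@dataclass
class UnsupervisedCorpus:
    """The unsupervised regime: two separate, unaligned collections."""
    audio_only: List[List[float]]  # untranscribed speech (e.g., the LibriSpeech audio)
    lm_text: List[str]             # unpaired, domain-specific text for language-model training

# Supervised: every utterance arrives with its transcript.
supervised_data = [SupervisedExample(audio=[0.0, 0.1, -0.2], text="hello world")]

# Unsupervised: audio and text are provided separately, with no pairing between them.
unsupervised_data = UnsupervisedCorpus(
    audio_only=[[0.0, 0.1, -0.2], [0.3, 0.2]],
    lm_text=["hello world", "speech recognition without transcripts"],
)
```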

  • Natural language processing research has evolved over the past few years from mainly task-specific models, to task-independent representation models fine-tuned for individual tasks, and finally to fully task-independent language models.  This progression reflects a desire for universality in the sense of handling arbitrary tasks within the same model.  Another dimension of universality is the ability to serve arbitrary types of language users, regardless of their choice of language, dialect, or other individual properties.  Progress toward universality has historically been pursued largely independently in separate research communities focusing on written, spoken, and signed language, even though these modalities share many similarities.  This talk will trace the recent progress toward universality across the three language modalities, highlighting a few pieces of recent work.

  • In this talk, I will describe my journey as an old-style speech scientist from classic technologies (HMMs, concatenative TTS, GMMs) to modern speech recognition, and then to image and document processing. I will focus on how modern LLM technologies have removed barriers across what used to be disjoint disciplines (speech, vision, NLP), and how even professional barriers are now largely irrelevant: scientists with no prior experience in speech can make amazing contributions to the field, and undergraduates with no previous experience can build impressive TTS engines in a matter of months. I will describe my journey building document understanding technologies and share some observations from this personal trip.

  • Large language models are famously prone to "hallucinations": generating fictitious responses not grounded in reality, often unrelated to the input or query, which necessitates mitigation strategies such as alignment and post-filtering. Modern automatic speech recognition (ASR) systems, their less famous siblings, are also prone to hallucinations, generating responses that are unrelated to the input, sometimes in the absence of any stimulus. In ASR, however, defining hallucination is more challenging, since hallucinated outputs can be indistinguishable from ordinary insertion and substitution errors. This talk will examine the problem of hallucination in ASR systems: how hallucinations may be defined, how they may be quantified, and what mitigation strategies may be employed.
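
One possible way to operationalize the distinction drawn above (a heuristic sketch, not the speaker's definition): flag a hypothesis as a candidate hallucination when almost none of its words can be aligned to the reference, as opposed to a few insertion or substitution errors within an otherwise well-aligned output.

```python
from difflib import SequenceMatcher

def hallucination_score(reference: str, hypothesis: str) -> float:
    """Fraction of hypothesis words that cannot be aligned to the reference.

    A high score on a non-empty hypothesis suggests output unrelated to the
    input (a candidate hallucination), whereas a few insertion/substitution
    errors leave most words aligned and the score low.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0
    matched = sum(block.size for block in
                  SequenceMatcher(None, ref_words, hyp_words).get_matching_blocks())
    return 1.0 - matched / len(hyp_words)

# A single substitution error: most words still align, so the score stays low.
print(hallucination_score("turn on the kitchen lights", "turn on the kitchen light"))
# Fluent but unrelated output: almost nothing aligns, so the score is high.
print(hallucination_score("turn on the kitchen lights", "the meeting starts at noon"))
```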