• Speech language models are becoming essential in numerous applications, from voice assistants to transcription services, yet challenges remain in ensuring these systems are aligned with human values such as fairness, reliability, and versatility. In this talk, I will present three projects that push the human-centered alignment of speech models across fairness, conversational clarity, and processing versatility. First, I introduce the Group-Adapted Fusion Network, which addresses fairness in speaker verification through CNN-based models, improving accuracy across demographic groups. Next, I discuss MultiTurnCleanup, a benchmark designed for cleaning up multi-turn spoken conversational transcripts, using BERT-based models to handle disfluencies and inconsistencies across conversational turns.

  • The field of speech synthesis, including text-to-speech synthesis and voice conversion, has advanced rapidly in recent years, and evaluation methodologies have been evolving as well.  Subjective human ratings collected in listening tests remain the gold standard, and online crowdsourcing platforms allow large-scale listening tests to be quickly completed by participants around the world.  However, compared to objective evaluation metrics, listening tests are more costly and time-consuming, leading researchers to consider automatic evaluation methods.

  • Large language models (LLMs) have demonstrated remarkable ability in natural conversation, reasoning, contextual understanding, and coherent generation when working with text. A natural next step is to interact with these systems through speech, the most common, natural, and expressive form of human language. The traditional solution is to recognize speech with an ASR system, feed the transcript to an LLM, and generate spoken output with TTS, a cascaded approach that has powered spoken conversational AI for years.
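
  As a rough illustration of that cascaded approach, the sketch below wires the three stages together for a single conversational turn. The names `transcribe`, `generate_reply`, and `synthesize` are hypothetical placeholders for whatever concrete ASR, LLM, and TTS systems are plugged in; they are not components described in the talk.

```python
# Minimal sketch of a cascaded spoken-dialogue turn: ASR -> LLM -> TTS.
# Each stage is a hypothetical stub to be replaced with a real system.

def transcribe(audio_path: str) -> str:
    """ASR stage: convert the user's speech into a text transcript."""
    raise NotImplementedError("plug in an ASR system here")

def generate_reply(transcript: str) -> str:
    """LLM stage: produce a text response conditioned on the transcript."""
    raise NotImplementedError("plug in an LLM call here")

def synthesize(text: str, out_path: str) -> None:
    """TTS stage: render the response text as speech audio."""
    raise NotImplementedError("plug in a TTS engine here")

def spoken_turn(audio_in: str, audio_out: str) -> str:
    """One turn of spoken interaction: speech in, speech out."""
    transcript = transcribe(audio_in)        # speech -> text
    reply_text = generate_reply(transcript)  # text -> text
    synthesize(reply_text, audio_out)        # text -> speech
    return reply_text                        # keep the text reply for logging
```

  One consequence of this design, visible even in the sketch, is that the LLM only ever sees the transcript string, so any prosodic or paralinguistic information in the input speech is lost at the ASR boundary.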