Towards Human-Centered Alignment in Speech Language Models
Speech language models are becoming essential in numerous applications, from voice assistants to transcription services, yet challenges remain in ensuring these systems are aligned with human values such as fairness, reliability, and versatility. In this talk, I will present three projects that advance the human-centered alignment of speech models along three dimensions: fairness, conversational clarity, and processing versatility. First, I introduce the Group-Adapted Fusion Network, which addresses fairness in speaker verification with CNN-based models and improves accuracy across demographic groups. Next, I discuss MultiTurnCleanup, a benchmark for cleaning up multi-turn spoken conversational transcripts, with BERT-based models that handle disfluencies and inconsistencies across conversational turns.

