Towards Unsupervised Training for Seq-to-Seq Processing
The success of mainstream machine learning for ASR (automatic speech recognition) and similar tasks rests largely on supervised training, which requires pairs of [audio,text] strings as training data. Unsupervised training, in contrast, assumes that no such explicit [audio,text] pairs are available. On a task like LibriSpeech, this means we are given only the audio strings (without any transcriptions) as training data, along with domain-specific text data for training the language model.
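To make the two data regimes concrete, the following is a minimal Python sketch of what each setting assumes as training input. The `Utterance` container and the example contents are purely illustrative and not tied to any particular toolkit or to the actual LibriSpeech format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical container for one training example (illustrative only).
@dataclass
class Utterance:
    audio: List[float]          # waveform samples or a feature sequence
    text: Optional[str] = None  # transcription; absent in the unsupervised setting

# Supervised setting: every utterance is paired with its transcription.
supervised_corpus = [
    Utterance(audio=[0.0] * 16000, text="the quick brown fox"),
]

# Unsupervised setting: audio only, with no transcriptions ...
unsupervised_audio = [
    Utterance(audio=[0.0] * 16000),  # text is None
]

# ... plus domain-specific text for training the language model,
# which is unpaired with any of the audio above.
lm_text_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "she sells sea shells by the sea shore",
]
```

The key structural point is that in the unsupervised setting the audio corpus and the text corpus are disjoint resources: no element of `lm_text_corpus` is a transcription of any element of `unsupervised_audio`.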

