Background
With the development of unsupervised pre-training, exemplified by Wav2Vec 2.0 released in 2020, these models could learn directly from the raw audio without the need for human labels. So the raw training data could be scaled to 1 million h...