Training Whisper with weak supervision on large-scale data
With Whisper’s multitasking transformer architecture covered, we’ll now explore the training strategies that instilled its advanced speech recognition skills. Rather than relying on small, meticulously annotated datasets, Whisper leverages vast quantities of web speech data with weakly supervised techniques.
The following sections dive into Whisper’s web-scale data accumulation, pseudo-labeling via machine teachers, and the architectural supports that make learning from noisy labels feasible. We’ll walk through data programming paradigms and innovations in self-training, stochastic depth, and pretraining, all of which were instrumental to Whisper’s success. By the end, you’ll grasp how weak supervision enabled robust speech comprehension, unlocking customization for accents and vocabulary in settings where obtaining reliable annotation at scale remains impractical.
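Before diving into the details, the core idea behind pseudo-labeling can be captured in a few lines. The sketch below is a toy illustration, not Whisper's actual pipeline: `teacher_predict` is a hypothetical stand-in for a trained "machine teacher" (in practice, an ASR model transcribing raw audio), and the confidence threshold is an assumed heuristic for filtering noisy labels.

```python
# Minimal sketch of pseudo-labeling (self-training) on toy data.
# "teacher_predict" is a hypothetical stand-in for a trained model;
# in a real ASR pipeline it would transcribe audio clips.

def teacher_predict(example):
    """Hypothetical teacher: returns (pseudo_label, confidence)."""
    # Stand-in logic: pretend longer inputs yield more confident transcripts.
    confidence = min(1.0, len(example) / 10)
    return example.upper(), confidence

def pseudo_label(unlabeled, threshold=0.8):
    """Keep only the pseudo-labels the teacher is confident about."""
    labeled = []
    for example in unlabeled:
        label, conf = teacher_predict(example)
        if conf >= threshold:  # discard low-confidence (noisy) labels
            labeled.append((example, label))
    return labeled

unlabeled_pool = ["hi", "hello there", "good morning all"]
train_extra = pseudo_label(unlabeled_pool)
print(len(train_extra))  # → 2: only the confident examples survive
```

The filtered pairs are then mixed into the training set for a student model; the sections that follow cover how this plays out at web scale, where the filtering step is what keeps label noise manageable.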