- In VideoBERT, along with learning the representation of language, we also learn the representation of video. It was the first model to learn representations of both video and language in a joint manner.
- The VideoBERT model is pre-trained on two tasks: masked language modeling (the cloze task) and linguistic-visual alignment.
- Similar to the next-sentence prediction task we learned about for BERT, linguistic-visual alignment is also a classification task. But here, instead of predicting whether one sentence follows another, we predict whether the language and the visual tokens are temporally aligned with each other (a minimal classifier sketch follows this list).
- In the text-only method, we mask the language tokens and train the model to predict the masked language tokens. This makes the model better at understanding language representations (see the masking sketch after this list).
- In the video-only method, we mask the visual tokens and train the model to predict the masked visual tokens. This makes the model better at understanding video representations.
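The cloze-style masking used by the text-only and video-only methods can be illustrated with a short sketch. This is not VideoBERT's actual implementation: the token values, the `>` separator, and the `mask_tokens` helper are all hypothetical, and real pipelines apply the corruption at the tensor level inside the training loop.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", ">", MASK}  # special tokens are never masked

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Cloze-task corruption: randomly replace tokens with [MASK].

    Returns the corrupted sequence plus {position: original_token},
    which serve as the prediction targets during pre-training.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL and rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    return corrupted, targets

# Hypothetical combined sequence: language tokens, then quantized
# visual tokens, joined by a separator. Masking only the language part
# gives the text-only method; masking only the visual part gives the
# video-only method; masking both gives the combined text-video method.
language = ["cut", "the", "vegetables", "into", "pieces"]
visual = ["v102", "v007", "v055"]  # made-up visual-token ids
sequence = ["[CLS]"] + language + [">"] + visual + ["[SEP]"]

masked, targets = mask_tokens(sequence, seed=1)
print(masked)   # the sequence with some tokens replaced by [MASK]
print(targets)  # the originals the model is trained to predict
```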
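The linguistic-visual alignment task can likewise be sketched as a binary classification head on top of the [CLS] representation. This is a minimal illustration, not VideoBERT's actual code: the `AlignmentHead` name, the hidden size, and the random stand-in [CLS] vectors are assumptions, with a PyTorch-style encoder presumed to exist elsewhere.

```python
import torch
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Binary classifier over the [CLS] representation: are the language
    and visual tokens temporally aligned (1) or mismatched (0)?"""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, cls_hidden):          # (batch, hidden_size)
        return self.classifier(cls_hidden)  # (batch, 2) logits

# Toy usage: random stand-ins for [CLS] vectors from the encoder.
cls_vectors = torch.randn(4, 768)
labels = torch.tensor([1, 0, 1, 0])  # 1 = aligned pair, 0 = mismatched
logits = AlignmentHead()(cls_vectors)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
print(loss.item())
```

Negative (label 0) examples are typically built by pairing the language tokens of one clip with the visual tokens of a different clip, so the classifier must learn whether the two modalities actually co-occur in time.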