Survey of transformers for video


Video Transformers: A Survey
arXiv paper abstract
arXiv PDF paper

… Transformers are a promising tool for solving video-related tasks, but some adaptations are required.

… In this survey … analyse and summarize the main contributions and trends for adapting Transformers to model video data.

… delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens.
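
To make "patches as tokens" concrete, here is a minimal sketch that cuts each frame of a video into non-overlapping square patches and flattens every patch into one token vector. The function name `video_to_patch_tokens`, the patch size of 16, and the (T, H, W, C) layout are assumptions chosen for illustration and are not taken from the survey. Real models usually follow this step with a learned linear projection or a CNN backbone, which is omitted here.

```python
import numpy as np

def video_to_patch_tokens(video, patch=16):
    """Split a video of shape (T, H, W, C) into flattened per-frame patches.

    Returns an array of shape (T * (H // patch) * (W // patch), patch * patch * C),
    i.e. one row per patch token. Hypothetical helper, for illustration only.
    """
    T, H, W, C = video.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch size"
    gh, gw = H // patch, W // patch              # patches per column / per row
    x = video.reshape(T, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, gh, gw, patch, patch, C)
    return x.reshape(T * gh * gw, patch * patch * C)

# 8 frames of 224x224 RGB -> 8 * 14 * 14 tokens, each of length 16 * 16 * 3
video = np.zeros((8, 224, 224, 3), dtype=np.float32)
tokens = video_to_patch_tokens(video)
print(tokens.shape)  # (1568, 768)
```

Even this short clip yields 1568 tokens, which suggests why dimensionality reduction before the Transformer matters. With frames as tokens, the same clip would instead give 8 tokens, one embedding per frame.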

… study how the Transformer layer has been tweaked to handle longer sequences, generally by reducing the number of tokens in a single attention operation.
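
One way to reduce the number of tokens in a single attention operation is to factorize attention into a spatial step (tokens within one frame) followed by a temporal step (the same patch position across frames). The sketch below compares this with joint attention over all tokens. It is an illustrative example of the general idea, not a method attributed to the survey: the function names are hypothetical, and the attention has a single head and no learned projections to keep it short.

```python
import numpy as np

def attention(x):
    """Single-head self-attention over the second-to-last axis (no learned weights)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def joint_attention(x):
    """x: (T, N, D). Every token attends to all T * N tokens at once."""
    T, N, D = x.shape
    return attention(x.reshape(T * N, D)).reshape(T, N, D)

def factorized_attention(x):
    """x: (T, N, D). Spatial attention per frame, then temporal attention per position."""
    x = attention(x)                                        # T groups of N tokens
    x = np.swapaxes(attention(np.swapaxes(x, 0, 1)), 0, 1)  # N groups of T tokens
    return x

def score_counts(T, N):
    """Number of pairwise attention scores computed by each variant."""
    joint = (T * N) ** 2
    factorized = T * N * N + N * T * T
    return joint, factorized

x = np.random.default_rng(0).normal(size=(8, 196, 32))
print(joint_attention(x).shape, factorized_attention(x).shape)
print(score_counts(8, 196))  # (2458624, 319872)
```

For 8 frames of 196 patches each, the factorized variant computes roughly 7.7 times fewer attention scores than the joint one, at the cost of each token no longer attending directly to every other token.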

… explore how other modalities are integrated with video and

… conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D CNN counterparts with equivalent FLOPs and no significant parameter increase.

