Classifying visual and audio events of various durations in videos with MM-Pyramid

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
arXiv paper abstract https://arxiv.org/abs/2111.12374v1
arXiv PDF paper https://arxiv.org/pdf/2111.12374v1.pdf

Recognizing and localizing events in videos is a fundamental task for video understanding … events may occur in auditory…