Classifying visual and audio events of various durations in videos with MM-Pyramid

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
arXiv paper abstract https://arxiv.org/abs/2111.12374v1
arXiv PDF paper https://arxiv.org/pdf/2111.12374v1.pdf

Recognizing and localizing events in videos is a fundamental task for video understanding … events may occur in auditory and visual modalities

… Most previous works … do not consider semantic information at multiple scales … makes … difficult to localize events

… present a Multimodal Pyramid Attentional Network (MM-Pyramid) … integrates multi-level temporal features for audio-visual event localization and … parsing.
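As a rough illustration of what "multi-level temporal features" can mean, the sketch below averages a per-segment score over windows of several lengths, so that short and long events each show up clearly at some level. This is not the paper's method: the function name `temporal_pyramid`, the window sizes, and the use of plain averaging on scalar scores are all assumptions made for the example.

```python
def temporal_pyramid(features, window_sizes=(1, 2, 4)):
    """Average a 1-D sequence over sliding windows of several sizes.

    Hypothetical helper for illustration only. Returns a dict mapping each
    window size to a list with the same length as `features`; windows are
    clipped at the sequence edges rather than padded.
    """
    n = len(features)
    levels = {}
    for size in window_sizes:
        left = (size - 1) // 2   # steps taken before the current position
        right = size // 2        # steps taken after the current position
        level = []
        for t in range(n):
            lo = max(0, t - left)
            hi = min(n, t + right + 1)
            window = features[lo:hi]
            level.append(sum(window) / len(window))
        levels[size] = level
    return levels


# A toy "event present" signal: one event spanning four segments.
scores = [0, 0, 1, 1, 1, 1, 0, 0]
levels = temporal_pyramid(scores)
```

With window size 1 the signal is returned unchanged, while larger windows smooth it, which is the basic intuition behind looking at several temporal scales at once.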

… propose a novel attentive feature pyramid module, which is composed of the fixed-size attention mechanism and dilated convolution block.
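The two building blocks named here can be sketched in a few lines of plain Python. This is one possible reading, not the authors' implementation: it assumes "fixed-size attention" means each time step attends only to neighbours inside a fixed window, it works on scalar features instead of learned vectors, and the function names, the order of the two steps, and the averaging kernel are hypothetical choices for the example.

```python
import math


def windowed_attention(seq, window=3):
    """Self-attention where each step only sees a fixed-size neighbourhood.

    Scores are plain products of scalar features (an assumption; a real
    model would use learned query/key/value projections).
    """
    n = len(seq)
    half = window // 2
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        scores = [seq[t] * seq[k] for k in range(lo, hi)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * seq[k] for w, k in zip(weights, range(lo, hi))) / total)
    return out


def dilated_conv1d(seq, kernel, dilation=1):
    """1-D dilated convolution with zero padding; output length equals input length."""
    n = len(seq)
    center = len(kernel) // 2
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation  # taps are `dilation` steps apart
            if 0 <= idx < n:
                acc += w * seq[idx]
        out.append(acc)
    return out


def pyramid_unit(seq, window, dilation, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical unit: local attention followed by a dilated convolution."""
    return dilated_conv1d(windowed_attention(seq, window), kernel, dilation)


# Stacking units with growing dilation widens the temporal context per level.
signal = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
level_1 = pyramid_unit(signal, window=3, dilation=1)
level_2 = pyramid_unit(level_1, window=3, dilation=2)
```

The design point the sketch tries to show is that the attention window stays the same size at every level, while the dilation controls how far apart the convolution taps sit, so deeper levels cover longer events without a larger kernel.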

… experiments on the AVE and LLP datasets demonstrate the effectiveness of our proposed approach on localizing events in multiple lengths.

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

Photo by Jakob Owens on Unsplash

AI News Clips by Morris Lee: News to help your R&D

A computer vision consultant in artificial intelligence and related high-tech fields for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.