Find location in video matching a sentence with TAN
Find location in video matching a sentence with TAN
Temporal Alignment Networks for Long-term Video
arXiv paper abstract https://arxiv.org/abs/2204.02968
arXiv PDF paper https://arxiv.org/pdf/2204.02968.pdf
The objective … is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to:
(1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
The challenge is to train such networks from large-scale datasets … where the associated text sentences have significant noise, and are only weakly aligned when relevant.
… describe a novel co-training method that enables to denoise and train on raw instructional videos without using manual annotation, despite the considerable noise
… proposed model … outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin
… apply the trained model in the zero-shot settings to multiple downstream video understanding tasks and achieve state-of-the-art results …
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b