Better image captioning and question answering using weakly supervised training

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
arXiv paper abstract: https://arxiv.org/abs/2108.10904
arXiv paper PDF: https://arxiv.org/pdf/2108.10904.pdf

… Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.

However, the requirement for expensive annotations … limits the scalability of existing approaches.

… relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).

… by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective.
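
To make that single objective concrete, the sketch below shows a prefix language modeling (PrefixLM) loss in PyTorch: attention is bidirectional within the prefix and causal after it, and only the tokens following the prefix are predicted. This is an illustrative toy under assumed names (`TinyPrefixLM`, the vocabulary size, the prefix length), not the SimVLM implementation; in SimVLM the prefix would hold image patch features plus any leading text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPrefixLM(nn.Module):
    """Toy stand-in for a PrefixLM transformer (hypothetical, not SimVLM)."""
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, prefix_len):
        seq_len = tokens.size(1)
        # True = blocked. Start from a causal mask, then unblock the prefix
        # columns so every position can attend to the whole prefix.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        mask[:, :prefix_len] = False
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.lm_head(h)  # per-position next-token logits

def prefix_lm_loss(model, tokens, prefix_len):
    # Shift by one: the logit at position i predicts the token at i + 1,
    # and the loss covers only tokens that come after the prefix.
    logits = model(tokens[:, :-1], prefix_len)
    targets = tokens[:, 1:]
    return F.cross_entropy(
        logits[:, prefix_len - 1:].reshape(-1, logits.size(-1)),
        targets[:, prefix_len - 1:].reshape(-1),
    )

tokens = torch.randint(0, 1000, (2, 16))  # toy ids; the first 6 act as the "image" prefix
print(prefix_lm_loss(TinyPrefixLM(), tokens, prefix_len=6).item())
```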

Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).

… demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
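
The zero-shot behavior follows naturally from this objective: captioning an unseen image is just decoding a text continuation of an image-only prefix. Below is a hedged greedy-decoding sketch that works with the toy model above; `eos_id` and the token ids are illustrative assumptions, not values from the paper.

```python
import torch

def greedy_decode(model, prefix_tokens, max_new_tokens=20, eos_id=2):
    # `model(tokens, prefix_len)` returns per-position next-token logits,
    # as in the PrefixLM sketch above. The (image-derived) prefix stays
    # fixed while generated ids are appended one at a time.
    tokens = prefix_tokens
    prefix_len = prefix_tokens.size(1)
    for _ in range(max_new_tokens):
        logits = model(tokens, prefix_len)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == eos_id).all():  # stop once every sequence emits EOS
            break
    return tokens[:, prefix_len:]  # only the generated continuation

# Usage with the toy model from the sketch above (arbitrary ids):
# caption_ids = greedy_decode(TinyPrefixLM(), torch.randint(0, 1000, (1, 6)))
```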

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

Photo by Jon Tyson on Unsplash

--

AI News Clips by Morris Lee: News to help your R&D

A computer vision consultant in artificial intelligence and related high-tech technologies for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.