Better image captioning and question answering using weakly supervised training
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
arXiv paper abstract https://arxiv.org/abs/2108.10904
arXiv PDF paper https://arxiv.org/pdf/2108.10904.pdf
… Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.
However, the requirement for expensive annotations … limits the scalability of existing approaches.
… relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
… by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective.
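The prefix language modeling (PrefixLM) objective lets the model attend bidirectionally over a prefix (e.g. image features plus the leading text tokens) while predicting the remaining text autoregressively. As a rough illustration only (not code from the paper), the attention mask such an objective implies can be sketched like this, where `prefix_len` and `seq_len` are hypothetical parameters:

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Sketch of a PrefixLM attention mask.

    Positions in the prefix attend bidirectionally to each other;
    positions in the suffix attend to the whole prefix and causally
    to earlier suffix positions. Returns a seq_len x seq_len matrix
    of 0/1 where mask[i][j] == 1 means position i may attend to j.
    """
    mask = [[0] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Bidirectional within the prefix, causal elsewhere.
            if j < prefix_len or j <= i:
                mask[i][j] = 1
    return mask


# Example: sequence of 4 tokens, first 2 form the prefix.
m = prefix_lm_mask(4, 2)
```

During training, the loss would then be computed only on the suffix tokens, so a single decoder learns both image-grounded understanding and text generation.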
Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
… demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website