Segment scenes using information from vision-language models, without neural network training, with PnP-OVSS
Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
arXiv paper abstract https://arxiv.org/abs/2311.17095
arXiv PDF paper https://arxiv.org/pdf/2311.17095.pdf
From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words.
… propose … Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) … leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation.
However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment.
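To make the over- versus under-segmentation contrast concrete, here is a minimal sketch of one common way to combine cross-attention with GradCAM: each patch's attention score is multiplied by the positive part of its gradient. The function name `gradcam_weight`, the flat per-patch lists, and the toy numbers are assumptions for illustration; this is not the paper's code, and in practice the gradient would come from backpropagating the image-text matching loss through the VLM.

```python
def gradcam_weight(attn, grad):
    """Weight each patch's cross-attention score by its positive gradient.

    attn, grad: equal-length lists of floats, one entry per image patch.
    Assumption: `grad` is the gradient of an image-text matching score
    with respect to the attention scores; here it is simply an input.
    """
    if len(attn) != len(grad):
        raise ValueError("attn and grad must have the same length")
    # Patches whose gradient is zero or negative are suppressed entirely.
    return [a * max(g, 0.0) for a, g in zip(attn, grad)]


# Toy example (made-up numbers): raw attention is non-zero on all four patches,
# but only two patches survive the gradient weighting.
attn = [0.4, 0.3, 0.2, 0.1]
grad = [1.0, -0.5, 0.0, 2.0]
weighted = gradcam_weight(attn, grad)
kept_raw = sum(1 for s in attn if s > 0)
kept_weighted = sum(1 for s in weighted if s > 0)
print(weighted, kept_raw, kept_weighted)
```

In this toy case the raw attention covers every patch, while the weighted map keeps only half of them, which mirrors the tendency described above.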
To alleviate this issue, … introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, … are able to better resolve the entire extent of the segmentation mask.
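The iterative dropping idea can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for a VLM forward pass that returns one salience score per patch, the parameters `rounds` and `drop_frac` are invented for the example, and summing the scores across rounds is an assumption about how the passes are combined.

```python
def salience_dropout(score_fn, num_patches, rounds=3, drop_frac=0.25):
    """Accumulate salience over several rounds, hiding the top patches each time.

    score_fn(visible) -> list of per-patch scores (hypothetical model call);
    `visible` is a list of bools, one per patch.
    """
    visible = [True] * num_patches
    total = [0.0] * num_patches
    k = max(1, int(num_patches * drop_frac))
    for _ in range(rounds):
        scores = score_fn(visible)
        # Only patches that are still visible contribute in this round.
        for i in range(num_patches):
            if visible[i]:
                total[i] += scores[i]
        # Hide the k most salient patches that are still visible.
        ranked = sorted((i for i in range(num_patches) if visible[i]),
                        key=lambda i: scores[i], reverse=True)
        for i in ranked[:k]:
            visible[i] = False
        if not any(visible):
            break
    return total


# Toy model (made-up numbers): patches 0-3 belong to the object, 4-7 are background.
# Attention is normalized over visible patches, so strong patches crowd out weak ones.
BASE = [8.0, 4.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]

def toy_score_fn(visible):
    masked = [b if v else 0.0 for b, v in zip(BASE, visible)]
    s = sum(masked)
    return [m / s if s > 0 else 0.0 for m in masked]

single_pass = toy_score_fn([True] * 8)
accumulated = salience_dropout(toy_score_fn, num_patches=8)
```

In this toy setup a single pass gives the weakest object patch a score well below the strongest one, while after dropout its accumulated score rises because the dominant patches were hidden in later rounds; the background patches stay at zero.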
… method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set.
PnP-OVSS … substantial improvements over a comparable baseline … and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.
Web site with my other posts by category https://morrislee1234.wixsite.com/website