Segment scenes using information from vision-language models, without neural network training, using PnP-OVSS

Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
arXiv paper abstract https://arxiv.org/abs/2311.17095
arXiv PDF paper https://arxiv.org/pdf/2311.17095.pdf

From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words.

… propose … Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) … leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation.
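To make the idea concrete, here is a minimal sketch of how text-to-image cross-attention might be turned into per-patch class labels. It assumes the common GradCAM-on-attention recipe (keep only positive gradients of the image-text matching score, multiply by the attention, average over heads); the excerpt above does not confirm that PnP-OVSS uses exactly this formula. The function names, array shapes, and toy numbers are hypothetical, and in a real VLM the gradients would come from autograd rather than being typed in by hand.

```python
import numpy as np

def gradcam_attention(attn, grad):
    """Combine cross-attention with its gradient (assumed GradCAM-style recipe).

    attn, grad: arrays of shape (heads, text_tokens, patches), where grad is
    the gradient of an image-text matching score w.r.t. attn.
    Returns a salience map of shape (text_tokens, patches).
    """
    weighted = np.maximum(grad, 0.0) * attn  # keep only positive evidence
    return weighted.mean(axis=0)             # average over attention heads

def label_patches(salience):
    """Assign each patch to the text token (class prompt) with highest salience."""
    return salience.argmax(axis=0)

# Toy example: 2 heads, 2 class prompts, 4 image patches (made-up numbers).
attn = np.array([
    [[0.40, 0.40, 0.10, 0.10], [0.10, 0.10, 0.40, 0.40]],
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
])
grad = np.array([
    [[1.0, 0.5, -0.2, 0.0], [0.0, -0.1, 0.5, 1.0]],
    [[0.2, 0.2, 0.0, 0.0], [0.0, 0.0, 0.2, 0.2]],
])

salience = gradcam_attention(attn, grad)
labels = label_patches(salience)
print(labels)  # patches 0-1 go to class 0, patches 2-3 to class 1
```

Note that the second head attends uniformly, so its raw attention says nothing about location; only the gradient weighting makes it informative in this toy setup.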

However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment.

To alleviate this issue, … introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, … are able to better resolve the entire extent of the segmentation mask.
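The iterative dropping described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `salience_fn` is a hypothetical stand-in for a VLM that returns per-patch salience given which patches are visible, and the drop fraction, number of iterations, and the choice to merge maps with an element-wise maximum are all assumptions made for the example.

```python
import numpy as np

def salience_dropout(salience_fn, num_patches, num_iters=2, drop_frac=0.25):
    """Iteratively hide the most salient patches and merge the salience maps.

    salience_fn(keep_mask) -> array of shape (num_patches,); hypothetical
    stand-in for a model's per-patch salience over the visible patches.
    """
    keep = np.ones(num_patches, dtype=bool)
    merged = np.zeros(num_patches)
    for _ in range(num_iters):
        sal = np.where(keep, salience_fn(keep), 0.0)
        merged = np.maximum(merged, sal)        # assumed merge rule
        n_drop = int(keep.sum() * drop_frac)
        if n_drop == 0:
            break
        visible = np.flatnonzero(keep)
        top = visible[np.argsort(sal[visible])[-n_drop:]]
        keep[top] = False                       # drop most-attended patches
    return merged

def make_toy_salience(evidence, temperature=0.5):
    """Toy model: softmax over visible patches, so the strongest patch dominates."""
    def fn(keep):
        logits = np.where(keep, evidence / temperature, -np.inf)
        weights = np.exp(logits - logits[keep].max())
        return weights / weights.sum()
    return fn

# Patches 0-3 belong to the object; patches 4-7 are background (made-up values).
evidence = np.array([4.0, 3.0, 2.0, 1.5, 0.2, 0.1, 0.0, 0.1])
fn = make_toy_salience(evidence)

single_pass = fn(np.ones(8, dtype=bool)) > 0.1
with_dropout = salience_dropout(fn, num_patches=8) > 0.1
print(single_pass.nonzero()[0], with_dropout.nonzero()[0])
```

In this toy setup a single pass highlights only the two most discriminative object patches, while the second pass, run with those patches hidden, lets the remaining object patches surface.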

… method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set.

PnP-OVSS … substantial improvements over a comparable baseline … and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

Photo by Mauro Shared Pictures on Unsplash

AI News Clips by Morris Lee: News to help your R&D

A computer vision consultant in artificial intelligence and related high-tech fields for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.