Segment scenes using information from vision-language models, without neural network training, with PnP-OVSS



Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
arXiv paper abstract
arXiv PDF paper

From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words.

… propose … Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) … leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation.
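To make the idea of turning text-to-image attention into a segmentation concrete, here is a minimal sketch. It assumes a per-patch attention score is already available for each class name; the function `label_patches`, the `class_maps` layout, and the plain per-patch argmax are illustrative assumptions, not the paper's code.

```python
from typing import Dict, List


def label_patches(class_maps: Dict[str, List[float]]) -> List[str]:
    """Give each patch the class name whose attention score is highest there.

    class_maps is a hypothetical structure: class name -> one score per patch.
    """
    names = list(class_maps)
    num_patches = len(class_maps[names[0]])
    return [
        max(names, key=lambda name: class_maps[name][p])
        for p in range(num_patches)
    ]


# Toy scores for a four-patch image and two class prompts (made-up numbers).
maps = {
    "sky": [0.9, 0.7, 0.1, 0.2],
    "grass": [0.1, 0.2, 0.8, 0.6],
}
labels = label_patches(maps)
print(labels)
```

In a real pipeline the scores would come from the VLM's cross-attention layers for each class prompt, and the patch labels would be upsampled to pixel resolution.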

However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment.

To alleviate this issue, … introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, … are able to better resolve the entire extent of the segmentation mask.
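The iterative dropping described above can be sketched in a few lines. This is only an illustration under stated assumptions: `attend` is a hypothetical callable standing in for the VLM (it takes a visibility mask and returns one attention score per patch), and the number of rounds, the drop fraction, and the choice to sum scores across rounds are assumed values, not settings taken from the paper.

```python
from typing import Callable, List, Sequence


def salience_dropout(
    attend: Callable[[Sequence[bool]], List[float]],
    num_patches: int,
    rounds: int = 3,
    drop_fraction: float = 0.5,
) -> List[float]:
    """Accumulate attention over several rounds, hiding the most-attended
    visible patches after each round."""
    visible = [True] * num_patches
    total = [0.0] * num_patches
    for _ in range(rounds):
        scores = attend(visible)
        live = [i for i in range(num_patches) if visible[i]]
        if not live:
            break
        # Only patches the model could still see contribute this round.
        for i in live:
            total[i] += scores[i]
        # Hide the top fraction of still-visible patches (at least one).
        k = max(1, int(len(live) * drop_fraction))
        for i in sorted(live, key=lambda j: scores[j], reverse=True)[:k]:
            visible[i] = False
    return total


def toy_attend(visible: Sequence[bool]) -> List[float]:
    """Hypothetical stand-in for a VLM: fixed salience values, renormalised
    over whichever patches are still visible."""
    base = [8.0, 4.0, 2.0, 1.0, 0.0, 0.0]
    masked = [b if v else 0.0 for b, v in zip(base, visible)]
    s = sum(masked)
    return [m / s if s else 0.0 for m in masked]


accumulated = salience_dropout(toy_attend, num_patches=6)
print(accumulated)
```

In the toy run, patch 3 gets little attention in the first pass because patches 0 to 2 dominate. Once those are hidden, it receives the full attention mass, so its accumulated score ends up larger than in a single pass.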

… method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set.

PnP-OVSS … substantial improvements over a comparable baseline … and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.



Photo by Mauro Shared Pictures on Unsplash



AI News Clips by Morris Lee: News to help your R&D

A computer vision consultant in artificial intelligence and related high-tech fields for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.