Segment scene using information from vision-language models without neural training with PnP-OVSS

AI News Clips by Morris Lee: News to help your R&D

2 min read · Nov 30, 2023


Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
arXiv paper abstract https://arxiv.org/abs/2311.17095
arXiv PDF paper https://arxiv.org/pdf/2311.17095.pdf

From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words.

… propose … Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) … leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation.
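
As a rough illustration of how per-class salience could become a segmentation, the sketch below assigns each image patch the class whose salience is highest. This is a minimal sketch under stated assumptions, not the paper's pipeline: the function name label_patches, the threshold value, the "background" fallback, and the toy salience numbers are all hypothetical.

```python
from typing import Dict, List


def label_patches(salience: Dict[str, List[float]], threshold: float = 0.0) -> List[str]:
    """Give each patch the class name with the highest salience.

    `salience` maps a class name to one score per patch (hypothetical input;
    in practice these scores would come from the VLM). Patches whose best
    score does not exceed `threshold` are labeled "background".
    """
    names = list(salience)
    num_patches = len(salience[names[0]])
    labels = []
    for p in range(num_patches):
        best = max(names, key=lambda n: salience[n][p])
        labels.append(best if salience[best][p] > threshold else "background")
    return labels


# Toy salience for four patches and two class names (made-up numbers).
maps = {"cat": [0.9, 0.6, 0.1, 0.0], "sofa": [0.2, 0.7, 0.8, 0.0]}
labels = label_patches(maps, threshold=0.05)
print(labels)
```

Because the class names are just dictionary keys, any vocabulary can be supplied at inference time, which is the "open-vocabulary" part of the idea.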

However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tends to under-segment.
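
To make the over- versus under-segmentation contrast concrete, here is a small sketch of one common way to combine attention with GradCAM: multiply each patch's attention by the positive part of its gradient. It is an assumption that this matches the weighting used in the paper; the function name gradcam_weight and the toy attention and gradient values are hypothetical.

```python
from typing import List


def gradcam_weight(attention: List[float], gradient: List[float]) -> List[float]:
    """Weight each patch's attention by the positive part of its gradient.

    `gradient` stands in for the gradient of a matching score with respect
    to the attention map (hypothetical values here, no autograd involved).
    """
    return [a * max(g, 0.0) for a, g in zip(attention, gradient)]


# Raw attention is spread fairly evenly across the four patches...
attention = [0.30, 0.25, 0.25, 0.20]
# ...but only the first patch has a large positive gradient.
gradient = [0.8, 0.1, -0.2, 0.0]

weighted = gradcam_weight(attention, gradient)
print(weighted)
```

In this toy case the raw attention would mark nearly every patch, while the gradient-weighted map keeps mainly the single most discriminative patch, which mirrors the trade-off described above.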

To alleviate this issue, … introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, … are able to better resolve the entire extent of the segmentation mask.
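
The iterative dropping idea can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the authors' implementation: score_fn is a hypothetical stand-in for a VLM forward pass, and the number of rounds, the fraction of patches dropped, and the max-based aggregation are illustrative choices, not the paper's settings. The toy_model and RELEVANCE values are made up to mimic a model that locks onto the most discriminative visible patch.

```python
from typing import Callable, List, Sequence, Set


def salience_dropout(
    score_fn: Callable[[Set[int]], Sequence[float]],
    num_patches: int,
    rounds: int = 3,
    drop_fraction: float = 0.25,
) -> List[float]:
    """Aggregate patch salience over several rounds of dropping top patches.

    score_fn(dropped) is a hypothetical stand-in for the model: it returns
    one non-negative score per patch, given the masked-out patch indices.
    """
    dropped: Set[int] = set()
    combined = [0.0] * num_patches
    for _ in range(rounds):
        scores = score_fn(dropped)
        visible = [i for i in range(num_patches) if i not in dropped]
        # Keep the best score each still-visible patch has received so far.
        for i in visible:
            combined[i] = max(combined[i], scores[i])
        # Drop the most salient visible patches so the next round has to
        # attend to other parts of the object.
        k = max(1, int(len(visible) * drop_fraction))
        top = sorted(visible, key=lambda i: scores[i], reverse=True)[:k]
        dropped.update(top)
    return combined


# Toy "true" relevance of six patches to the class name (made-up numbers).
RELEVANCE = [0.9, 0.8, 0.7, 0.1, 0.0, 0.0]


def toy_model(dropped: Set[int]) -> List[float]:
    """Peaky toy scorer: salience is sharpened around the best visible patch."""
    visible = [r for i, r in enumerate(RELEVANCE) if i not in dropped]
    peak = max(visible, default=0.0)
    if peak == 0.0:
        return [0.0] * len(RELEVANCE)
    return [0.0 if i in dropped else (r / peak) ** 4 for i, r in enumerate(RELEVANCE)]


single_pass = toy_model(set())
combined = salience_dropout(toy_model, num_patches=len(RELEVANCE))
print([s > 0.5 for s in single_pass])
print([s > 0.5 for s in combined])
```

With a 0.5 cutoff, the single pass keeps only the first two patches, while the aggregated map also recovers the third relevant patch without picking up the irrelevant ones.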

… method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set.

PnP-OVSS … substantial improvements over a comparable baseline … and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b
