Segment unknown objects using VLM to filter texts and enhance masks with CLIP as RNN


CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
arXiv paper abstract https://arxiv.org/abs/2312.07661
arXiv PDF paper https://arxiv.org/pdf/2312.07661.pdf
Project page https://torrvision.com/clip_as_rnn

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive …

… without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when there are text queries referring to non-existing concepts in the image.

… introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts.

The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights.
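To make the recurrent idea concrete, here is a minimal sketch of a filtering loop. It is not the authors' implementation. The `Segmenter` interface, the `recurrent_filter` function, the 0.5 threshold, and the toy similarity table are all hypothetical stand-ins for a frozen VLM. The loop simply re-runs the segmenter on the surviving text queries until none are dropped.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

# Hypothetical interface standing in for a frozen VLM-based segmenter:
# (image, texts) -> {text: (boolean mask, confidence score)}
Segmenter = Callable[[np.ndarray, List[str]], Dict[str, Tuple[np.ndarray, float]]]


def recurrent_filter(image, queries, segmenter, threshold=0.5, max_steps=10):
    """Re-run the segmenter, dropping low-scoring queries, until the set is stable."""
    active = list(queries)
    masks = {}
    for _ in range(max_steps):
        if not active:
            return {}
        results = segmenter(image, active)
        kept = [q for q in active if results[q][1] >= threshold]
        masks = {q: results[q][0] for q in kept}
        if kept == active:  # nothing was filtered out, so the masks are final
            break
        active = kept  # feed the surviving queries back in (the recurrent step)
    return masks


# Toy per-pixel similarity maps (made-up numbers, not real CLIP outputs).
SIMS = {
    "cat": np.array([[0.9, 0.2], [0.2, 0.2]]),
    "grass": np.array([[0.1, 0.1], [0.8, 0.8]]),
    "unicorn": np.array([[0.3, 0.3], [0.3, 0.3]]),  # concept not in the image
}


def toy_segmenter(image, texts):
    """Assign each pixel to the best-matching text; score = mean similarity in the mask.

    The toy ignores the pixel values and reads from SIMS instead.
    """
    stacked = np.stack([SIMS[t] for t in texts])
    winner = stacked.argmax(axis=0)
    out = {}
    for i, t in enumerate(texts):
        mask = winner == i
        score = float(stacked[i][mask].mean()) if mask.any() else 0.0
        out[t] = (mask, score)
    return out


final_masks = recurrent_filter(np.zeros((2, 2)), ["cat", "grass", "unicorn"], toy_segmenter)
print(sorted(final_masks))
```

In this toy run, "unicorn" claims one pixel in the first pass with a low score and is dropped. In the second pass that pixel goes to "cat", so removing the irrelevant text also changes the remaining masks.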

… model retains the VLM’s broad vocabulary space and strengthens its segmentation capability.

… method outperforms not only the training-free counterparts, but also those fine-tuned with millions of additional data samples, and sets new state-of-the-art records for both zero-shot semantic and referring image segmentation tasks …

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

Photo by Mae Mu on Unsplash


AI News Clips by Morris Lee: News to help your R&D

Written by AI News Clips by Morris Lee: News to help your R&D

A computer vision consultant in artificial intelligence and related high-tech fields for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.
