Zero-shot object recognition using over 100 times less data by flexible caption matching with OTTER


Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation
arXiv paper abstract https://arxiv.org/abs/2112.09445v2
arXiv PDF paper https://arxiv.org/pdf/2112.09445v2.pdf
GitHub https://github.com/facebookresearch/otter

… Previous works, such as CLIP, use InfoNCE loss to train a model to predict the pairing between images and text captions.
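For context, here is a minimal PyTorch sketch of that InfoNCE-style pairing loss; the embedding size, batch size, and temperature value below are illustrative assumptions, not CLIP's exact settings:

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style InfoNCE: each image should match its own caption
    (the diagonal of the similarity matrix) and vice versa."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity logits between every image and every caption
    logits = image_emb @ text_emb.t() / temperature

    # Hard one-hot targets: image i pairs with caption i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss over image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings for a batch of 8 pairs
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
print(infonce_loss(imgs, txts))
```

Note the hard one-hot targets: every image is assumed to match exactly its own caption, which is exactly the assumption that breaks down on noisy web data.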

CLIP, however, is data-hungry and requires more than 400M image-text pairs for training.

The inefficiency can be partially attributed to the fact that the image-text pairs are noisy.

… propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning.
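A minimal sketch of the idea, assuming a standard Sinkhorn iteration for entropic optimal transport; the epsilon, iteration count, and 0.5 mixing weight below are illustrative assumptions, not OTTER's exact formulation (see the paper and repository for that):

```python
import torch
import torch.nn.functional as F

def sinkhorn_soft_labels(sim, epsilon=0.05, n_iters=3):
    """Entropic optimal transport via Sinkhorn iterations: turn an
    image-text similarity matrix into a doubly-normalized soft matching
    plan, usable as soft targets in place of one-hot pairing labels."""
    K = torch.exp(sim / epsilon)            # Gibbs kernel from similarities
    for _ in range(n_iters):
        K = K / K.sum(dim=1, keepdim=True)  # normalize over captions
        K = K / K.sum(dim=0, keepdim=True)  # normalize over images
    return K / K.sum(dim=1, keepdim=True)   # each row is a label distribution

def soft_contrastive_loss(logits, soft_targets):
    """Cross-entropy against soft target distributions instead of
    hard index targets."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Toy usage with cosine similarities from random stand-in embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
sim = img @ txt.t()                          # bounded in [-1, 1]

# Mix hard (identity) pairing labels with OT-derived soft labels
targets = 0.5 * torch.eye(8) + 0.5 * sinkhorn_soft_labels(sim)
print(soft_contrastive_loss(sim / 0.07, targets))
```

The key difference from plain InfoNCE is the target matrix: instead of forcing all probability mass onto the diagonal, the transport plan lets an image partially match several plausible captions in the batch, which is more forgiving of noisy pairs.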

Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.

… Across 42 evaluations (7 dataset/architecture settings × 6 metrics), OTTER outperforms (32) or ties (2) all baselines in 34 of them.

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

Photo by Daniel Houwing on Unsplash

--


AI News Clips by Morris Lee: News to help your R&D

Written by AI News Clips by Morris Lee

A computer vision consultant in artificial intelligence and related high-tech fields for 37+ years. An innovator with 66+ patents, ready to help a firm's R&D.
