Many types of computer vision tasks possible with new customizable vision foundation model, Florence
Florence: A New Foundation Model for Computer Vision
arXiv paper abstract https://arxiv.org/abs/2111.11432
arXiv PDF paper https://arxiv.org/pdf/2111.11432.pdf
… understanding … diverse … world demands computer vision models to generalize well with minimal customization for specific tasks
… vision foundation models … such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation,
… new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth).
… incorporating … Web-scale image-text data … model … easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition.
… outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects.
… achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
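To make the zero-shot numbers above more concrete, here is a minimal sketch of how zero-shot classification over a shared image-text representation is commonly scored, and how top-1 and top-k accuracy fall out of it. This is not Florence's code or API. The function names (`cosine`, `rank_labels`, `top_k_accuracy`) and the three-dimensional toy embeddings are hypothetical and chosen only for illustration; a real model would produce the image and text vectors with its own encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_labels(image_vec, text_vecs):
    """Return label indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def top_k_accuracy(image_vecs, true_labels, text_vecs, k):
    """Percentage of images whose true label is among the k best-ranked labels."""
    hits = sum(
        1
        for image_vec, label in zip(image_vecs, true_labels)
        if label in rank_labels(image_vec, text_vecs)[:k]
    )
    return 100.0 * hits / len(true_labels)

# Toy data (assumed, not from the paper): one text embedding per class name,
# and four image embeddings that live in the same space.
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_vecs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.6, 0.1, 0.5],
    [0.1, 0.7, 0.6],
]
true_labels = [0, 1, 2, 2]

top1 = top_k_accuracy(image_vecs, true_labels, text_vecs, k=1)
top2 = top_k_accuracy(image_vecs, true_labels, text_vecs, k=2)
print(f"top-1: {top1:.1f}  top-2: {top2:.1f}")
```

No class-specific training happens here: each class is represented only by the embedding of its text description, which is what lets this kind of model classify novel categories. The reported ImageNet-1K figures are the same kind of metric with k=1 and k=5 over 1,000 classes.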
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b