Image retrieval with a sketch by combining photo and sketch information with XModalViT
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
arXiv paper abstract https://arxiv.org/abs/2210.10486v1
arXiv PDF paper https://arxiv.org/pdf/2210.10486v1.pdf
Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information.
… instances from different modalities can often provide complementary information … propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
… framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
… then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
Such encoders can then be applied to downstream tasks like cross-modal retrieval.
… demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results …
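To make the pipeline above concrete, here is a minimal PyTorch sketch of the two stages: a cross-attention module that fuses photo and sketch tokens into a single fused (teacher) representation, and contrastive plus relational distillation losses that transfer that representation to independent single-modality encoders. All names (CrossModalFusion, contrastive_distillation, relational_distillation) and hyperparameters (embedding dim 768, 8 heads, temperature 0.07) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only - names, dimensions, and losses are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Teacher: fuses photo and sketch ViT token sequences with cross-attention."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.photo_attends_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sketch_attends_photo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, photo_tokens, sketch_tokens):
        # Each modality queries the other, so the fused embedding keeps
        # complementary information from both inputs rather than discarding it.
        p, _ = self.photo_attends_sketch(photo_tokens, sketch_tokens, sketch_tokens)
        s, _ = self.sketch_attends_photo(sketch_tokens, photo_tokens, photo_tokens)
        fused = torch.cat([p.mean(dim=1), s.mean(dim=1)], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)


def contrastive_distillation(student, teacher, temperature=0.07):
    """Pulls each unimodal student embedding toward its paired fused teacher
    embedding and away from other pairs in the batch (InfoNCE-style)."""
    logits = student @ teacher.t() / temperature
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)


def relational_distillation(student, teacher):
    """Matches the pairwise similarity structure of student and teacher batches."""
    return F.mse_loss(student @ student.t(), teacher @ teacher.t())
```

At retrieval time, sketches would be embedded with the distilled sketch encoder and photos with the distilled photo encoder, with matches ranked by cosine similarity between the normalized embeddings.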
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b