Segment object in image described by text more simply using SeqTR
SeqTR: A Simple yet Universal Network for Visual Grounding
arXiv paper abstract https://arxiv.org/abs/2203.16265v1
arXiv PDF paper https://arxiv.org/pdf/2203.16265v1.pdf
GitHub https://github.com/sean-zhuh/seqtr
… propose … network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
… visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks.
To simplify … cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens.
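To make the idea of "discrete coordinate tokens" concrete, here is a minimal sketch of how a bounding box or a list of mask contour points could be turned into a token sequence. It is an illustration only, not the code from the SeqTR repository: the bin count NUM_BINS and the helper names quantize, dequantize, box_to_tokens and contour_to_tokens are hypothetical, chosen for this example.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumption: quantization resolution, not a value taken from the paper


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(token: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the centre of its bin, in pixels."""
    return (token + 0.5) / num_bins * size


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float) -> List[int]:
    """Encode a box (x1, y1, x2, y2) as four discrete tokens."""
    x1, y1, x2, y2 = box
    return [quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height)]


def contour_to_tokens(points: Sequence[Tuple[float, float]],
                      width: float, height: float) -> List[int]:
    """Encode mask contour points as a flat sequence x1, y1, x2, y2, ..."""
    tokens: List[int] = []
    for x, y in points:
        tokens.append(quantize(x, width))
        tokens.append(quantize(y, height))
    return tokens


if __name__ == "__main__":
    # A box and a short contour on a hypothetical 640x480 image.
    print(box_to_tokens((160, 120, 640, 480), 640, 480))
    print(contour_to_tokens([(0, 0), (320, 240)], 640, 480))
```

In this sketch a box and a mask differ only in how many points are encoded, which is why a single sequence predictor could in principle serve both tasks. How SeqTR actually samples contour points and orders the tokens is described in the paper and the repository.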
… visual grounding … unified in … SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling.
… SeqTR outperforms (or is on par with) the existing state-of-the-arts, proving that a simple yet universal approach for visual grounding is indeed feasible.
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b