Segment objects in a video that are mentioned in a text query

End-to-End Referring Video Object Segmentation with Multimodal Transformers
arXiv paper abstract https://arxiv.org/abs/2111.14821
arXiv PDF paper https://arxiv.org/pdf/2111.14821.pdf
GitHub https://github.com/mttr2021/MTTR

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given…

Better document understanding without OCR using Donut transformer

Donut: Document Understanding Transformer without OCR
arXiv paper abstract https://arxiv.org/abs/2111.15664
arXiv PDF paper https://arxiv.org/pdf/2111.15664.pdf

Understanding document images (e.g., invoices) has been an important research topic

… current Visual Document Understanding (VDU) systems have come to be designed based on OCR.

… suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to the OCR error propagation.

… propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.

… pre-train the model to mitigate the dependencies on large-scale real document images.

… achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets …

Get centimeter-level depth images from a smartphone using LiDAR and hand unsteadiness

The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement
arXiv paper abstract https://arxiv.org/abs/2111.13738
arXiv PDF paper https://arxiv.org/pdf/2111.13738.pdf

Modern smartphones can continuously stream multi-megapixel RGB images at 60 Hz, synchronized with high-quality 3D pose information and…

Classifying visual and audio events of various durations in videos with MM-Pyramid

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
arXiv paper abstract https://arxiv.org/abs/2111.12374v1
arXiv PDF paper https://arxiv.org/pdf/2111.12374v1.pdf

Recognizing and localizing events in videos is a fundamental task for video understanding … events may occur in…

Survey of panoptic image segmentation for objects and regions

Panoptic Segmentation: A Review
arXiv paper abstract https://arxiv.org/abs/2111.10250
arXiv PDF paper https://arxiv.org/ftp/arxiv/papers/2111/2111.10250.pdf
GitHub https://github.com/elharroussomar/Awesome-Panoptic-Segmentation

Image segmentation for video analysis plays an essential role in different research fields such as smart city, healthcare, computer vision and geoscience, and remote sensing applications.

……

Many types of computer vision tasks possible with new customizable vision foundation model, Florence

Florence: A New Foundation Model for Computer Vision
arXiv paper abstract https://arxiv.org/abs/2111.11432
arXiv PDF paper https://arxiv.org/pdf/2111.11432.pdf

… understanding … diverse … world demands computer vision models to generalize well with minimal customization for specific tasks

……

Correcting Face Distortion in Wide-Angle Videos

Correcting Face Distortion in Wide-Angle Videos
arXiv paper abstract https://arxiv.org/abs/2111.09950
arXiv PDF paper https://arxiv.org/pdf/2111.09950.pdf
Project page https://www.wslai.net/publications/video_face_correction

Video blogs and selfies … are often captured by wide-angle cameras to show human subjects and expanded background.

… due to perspective projection, faces near corners…

Train new object detector without bounding box annotations using captioned images

Towards Open Vocabulary Object Detection without Human-provided Bounding Boxes
arXiv paper abstract https://arxiv.org/abs/2111.09452
arXiv PDF paper https://arxiv.org/pdf/2111.09452.pdf

… in object detection, most existing methods are limited to a small set of object categories, due to the tremendous human effort…

AI News Clips by Morris Lee: News to help your R&D

I apply innovative technologies like machine learning, computer vision, and physics to further an organization's goals. I am a recognized innovator with 65 patents.