Real-time face distance and iris tracking on mobile phone without depth sensor

MediaPipe Iris: Real-time Iris Tracking & Depth Estimation
Google AI Blog https://ai.googleblog.com/2020/08/mediapipe-iris-real-time-iris-tracking.html
MediaPipe on the Web https://developers.googleblog.com/2020/01/mediapipe-on-web.html
arXiv paper abstract https://arxiv.org/abs/2006.11341
arXiv PDF paper https://arxiv.org/pdf/2006.11341.pdf
GitHub https://github.com/cedriclmenard/irislandmarks.pytorch

A wide range of real-world applications … rely on estimating eye position by tracking the iris.

… show that it is possible to determine the metric distance from the camera to the user — without the use of a dedicated depth sensor.
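
The Google AI Blog post linked above explains that this works because the horizontal iris diameter of the human eye is roughly constant, about 11.7 mm, across a wide population. The snippet below is a minimal sketch of the resulting similar-triangles estimate under a pinhole camera model. The function name, the focal length, and the pixel measurement are illustrative assumptions, not part of the MediaPipe API.

```python
# Assumption: pinhole camera model with a focal length known in pixels
# (for example from camera intrinsics or image metadata).
IRIS_DIAMETER_MM = 11.7  # roughly constant across people, per the blog post


def estimate_distance_mm(focal_length_px: float, iris_diameter_px: float) -> float:
    """Similar triangles: real_size / distance = pixel_size / focal_length."""
    if iris_diameter_px <= 0:
        raise ValueError("iris diameter in pixels must be positive")
    return focal_length_px * IRIS_DIAMETER_MM / iris_diameter_px


# Hypothetical numbers: a 1000 px focal length and an iris spanning 23.4 px.
distance = estimate_distance_mm(1000.0, 23.4)
print(f"{distance:.0f} mm")
```

A smaller iris in the image means a larger distance, so the accuracy of the estimate depends directly on how precisely the iris landmarks are localized.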

… announce the release of MediaPipe Iris … able to track landmarks involving the iris, pupil and the eye…


Better 3D pose estimates in video by dynamically learning joint relationships

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos
arXiv paper abstract https://arxiv.org/abs/2109.07353v1
arXiv PDF paper https://arxiv.org/pdf/2109.07353v1.pdf

Graph Convolution Network (GCN) … for 3D human pose estimation in videos. … built on the fixed human-joint affinity …

may reduce adaptation capacity of GCN to tackle complex spatio-temporal pose variations

… propose a novel Dynamical Graph Network (DG-Net), which can dynamically identify human-joint affinity, and estimate 3D pose by adaptively learning spatial/temporal joint relations from videos.

… discover spatial/temporal human-joint affinity for each video exemplar, depending on spatial distance/temporal…
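
To make the idea of a per-example affinity concrete, here is a minimal sketch that recomputes a joint-to-joint affinity matrix from the joint positions of each frame instead of using one fixed skeleton graph. The softmax over negative distances, the temperature parameter, and the toy skeleton are assumptions chosen for illustration; this is not DG-Net's actual formulation.

```python
import math


def spatial_affinity(joints, temperature=1.0):
    """Row-normalized affinity matrix: closer joints get larger weights."""
    n = len(joints)
    affinity = []
    for i in range(n):
        # Use negative Euclidean distance as the similarity score.
        scores = [-math.dist(joints[i], joints[j]) / temperature for j in range(n)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        affinity.append([e / total for e in exps])
    return affinity


# Two hypothetical frames of a 3-joint toy skeleton, as (x, y) positions.
frame_a = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
frame_b = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]  # joint 2 moved closer to joint 0

affinity_a = spatial_affinity(frame_a)
affinity_b = spatial_affinity(frame_b)
print(affinity_a[0][2], affinity_b[0][2])
```

The weight between joints 0 and 2 is larger in the second frame, so the graph that a GCN layer would aggregate over changes with the pose rather than staying fixed.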


Unsupervised learning of image classes from dynamic video stream

Online Unsupervised Learning of Visual Representations and Categories
arXiv paper abstract https://arxiv.org/abs/2109.05675v1
arXiv PDF paper https://arxiv.org/pdf/2109.05675v1.pdf

Real world learning scenarios involve a nonstationary distribution of classes … demand learning on-the-fly from few or no class labels.

… propose an unsupervised model that simultaneously performs online visual representation learning and few-shot learning of new categories without relying on any class labels.

… model … determines when to form a new class prototype. … formulate … online Gaussian mixture model
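
The decision of when to form a new prototype can be sketched as follows. This is a hard-assignment simplification of an online mixture model, not the paper's method: each prototype is a running mean, and an embedding that is farther than a threshold from every existing prototype starts a new one. The class name, the threshold, and the toy stream are hypothetical.

```python
import math


class OnlinePrototypes:
    def __init__(self, new_class_distance: float):
        self.new_class_distance = new_class_distance  # hypothetical threshold
        self.means = []   # one running mean per prototype
        self.counts = []  # number of embeddings assigned to each prototype

    def observe(self, embedding):
        """Assign the embedding to a prototype and return that prototype's index."""
        if self.means:
            distances = [math.dist(embedding, m) for m in self.means]
            best = min(range(len(distances)), key=distances.__getitem__)
            if distances[best] <= self.new_class_distance:
                # Incremental mean update, no labels involved.
                self.counts[best] += 1
                rate = 1.0 / self.counts[best]
                self.means[best] = [
                    m + rate * (x - m) for m, x in zip(self.means[best], embedding)
                ]
                return best
        # Nothing close enough: form a new class prototype.
        self.means.append(list(embedding))
        self.counts.append(1)
        return len(self.means) - 1


model = OnlinePrototypes(new_class_distance=1.0)
stream = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (0.1, 0.1)]
labels = [model.observe(x) for x in stream]
print(labels)  # [0, 0, 1, 1, 0]
```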

… includes a contrastive loss that encourages different views of the same image…


Real-time 3D hand reconstruction from a single monocular image

Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction
arXiv paper abstract https://arxiv.org/abs/2109.01723v1
arXiv PDF paper https://arxiv.org/pdf/2109.01723v1.pdf

3D hand-mesh reconstruction from RGB images facilitates many applications, including augmented reality (AR).

However, this requires not only real-time speed and accurate hand pose and shape but also plausible mesh-image alignment.

… decoupling the hand-mesh reconstruction task into three stages:

a joint stage to predict hand joints and segmentation;
a mesh stage to predict a rough hand mesh; and
a refine stage to fine-tune it with an offset mesh for mesh-image alignment.

… can promote…


Get depth, regions, and layout from panoramic image quickly and accurately with horizontal features

HoHoNet: 360 Indoor Holistic Understanding with Latent Horizontal Features
arXiv paper abstract https://arxiv.org/abs/2011.11498
arXiv PDF paper https://arxiv.org/pdf/2011.11498.pdf
GitHub https://github.com/sunset1995/HoHoNet
YouTube (5 min) https://www.youtube.com/watch?v=xXtRaRKmMpA

We present HoHoNet, a versatile and efficient framework for holistic understanding of an indoor 360-degree panorama using a Latent Horizontal Feature (LHFeat).

The compact LHFeat flattens the features along the vertical direction and has shown success in modeling per-column modality for room layout reconstruction.
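
As a rough illustration of flattening along the vertical direction, the sketch below collapses a feature map of shape [channels][height][width] into one feature vector per image column. The plain mean over rows is an assumed stand-in for whatever reduction the network actually uses, and the function name is hypothetical.

```python
def flatten_vertical(feature_map):
    """Collapse a [channels][height][width] map to [channels][width]."""
    flattened = []
    for channel in feature_map:
        height = len(channel)
        width = len(channel[0])
        # Average each column over all rows, leaving one value per column.
        flattened.append([
            sum(channel[r][c] for r in range(height)) / height
            for c in range(width)
        ])
    return flattened


# Hypothetical 1-channel feature map with 2 rows and 3 columns.
features = [[[1.0, 2.0, 3.0],
             [3.0, 4.0, 5.0]]]
print(flatten_vertical(features))  # [[2.0, 3.0, 4.0]]
```

The result has no height axis, which is why a per-column output such as room layout fits it naturally while per-pixel output needs an extra step to recover the vertical dimension.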

… allowing per-pixel dense prediction from LHFeat.

HoHoNet is fast: It runs at 52 FPS and 110 FPS with…


Use satellite images to get 3D structure of buildings and roofs

Automated LoD-2 Model Reconstruction from Very-HighResolution Satellite-derived Digital Surface Model and Orthophoto
arXiv paper abstract https://arxiv.org/abs/2109.03876
arXiv PDF paper https://arxiv.org/pdf/2109.03876.pdf

… reconstructs LoD-2 building models following a “decomposition-optimization-fitting” paradigm.

… starts … through a deep learning-based detector and vectorizes individual segments into polygons

… decomposes the complex and irregularly shaped building polygons to tightly combined elementary building rectangles

… introduced OpenStreetMap (OSM) and Graph-Cut (GC) labeling to further refine the orientation of 2D building rectangle.

… takes building-specific parameters such as hip lines … to optimize the flexibility for…


Image classification without normalization that is faster and better than with normalization

High-Performance Large-Scale Image Recognition Without Normalization
arXiv paper abstract https://arxiv.org/abs/2102.06171
arXiv PDF paper https://arxiv.org/pdf/2102.06171.pdf
GitHub https://github.com/deepmind/deepmind-research/tree/master/nfnets
Papers With Code https://paperswithcode.com/paper/high-performance-large-scale-image

Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples.
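
The interaction between examples is easy to see in a small sketch. The function below applies training-mode batch normalization to scalar activations, without the learnable scale and shift, which is a simplification. The same input value comes out different depending on which other examples share its batch.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize scalar activations with the batch's own mean and variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


example = 2.0
out_a = batch_norm([example, 0.0, 1.0])[0]    # batched with small values
out_b = batch_norm([example, 10.0, 20.0])[0]  # batched with large values
print(out_a, out_b)
```

Here the first output is positive and the second is negative, even though the input example is identical in both batches.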

… a significantly improved class of Normalizer-Free ResNets.

… smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and

… largest models attain a new state-of-the-art top-1 accuracy of…


Using an audio and vision transformer to count crowds

Audio-Visual Transformer Based Crowd Counting
arXiv paper abstract https://arxiv.org/abs/2109.01926
arXiv PDF paper https://arxiv.org/pdf/2109.01926.pdf

Crowd estimation is a very challenging problem.

… address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs

… introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality.

These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate.
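
For readers unfamiliar with cross-modal attention, here is a minimal sketch of generic scaled dot-product cross-attention in which tokens from one modality query tokens from another. It omits learned projections and multiple heads, and it is not the paper's co-attention module. The token values and variable names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token becomes a weighted average of the value tokens."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)
        ])
    return outputs


# Hypothetical 2-dimensional tokens: visual tokens attend to audio tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(visual_tokens, audio_tokens, audio_tokens)
print(fused)
```

Each visual token ends up as a mixture of audio tokens, weighted toward the audio tokens it is most similar to.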

… proposed scheme outperforms the state-of-the-art networks under all evaluation settings with up to 33.8% improvement.

We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.


Survey on improving efficiency of computer vision recognition using deep learning

Efficient Visual Recognition with Deep Neural Networks: A Survey on Recent Advances and New Directions
arXiv paper abstract https://arxiv.org/abs/2108.13055v1
arXiv PDF paper https://arxiv.org/pdf/2108.13055v1.pdf

Visual recognition is currently one of the most important and active research areas in computer vision, pattern recognition, and even the general field of artificial intelligence.

… Deep neural networks (DNNs) have largely boosted their performances on many concrete tasks

… Though recognition accuracy is usually the first concern for new progresses, efficiency is actually rather important and sometimes critical

… present the review of the…

AI News Clips by Morris Lee: News to help your R&D

I apply innovative technologies like machine learning, computer vision, and physics to further an organization's goals. I am a recognized innovator with 64 patents.
