Complete 3D scene using only images having occlusions by masked autoencoder with VoxFormer
Complete 3D scene using only images having occlusions by masked autoencoder with VoxFormer
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion
arXiv paper abstract https://arxiv.org/abs/2302.12251
arXiv PDF paper https://arxiv.org/pdf/2302.12251.pdf
… complete 3D geometry of occluded objects and scenes … is vital for recognition and understanding.
… propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images.
… framework adopts a two-stage design where … start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones.
A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces.
… Once … obtain the set of sparse queries, … apply a masked autoencoder design to propagate the information to all the voxels by self-attention.
… VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics …
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b