Abstract
We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free, 3D prior-free, generalizable 3D reconstruction from unposed multi-view images. This setting is inherently ill-posed: without ground-truth poses, learned geometric priors, or per-scene finetuning, conventional methods struggle to achieve high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation, yielding reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometric consistency across views, ensuring more accurate and stable 3D reconstructions. To demonstrate the performance of our method, we evaluate it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat surpasses previous state-of-the-art methods in both appearance and geometry quality, and also demonstrates strong cross-dataset generalization. Extensive ablation studies and analyses further validate the effectiveness of our proposed methods.
Given unposed multi-view images as input, we predict depth and Gaussian attributes from the images, as well as the relative camera poses between them. We unify a self-supervised depth estimation framework with an explicit 3D representation, achieving accurate scene reconstruction.
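A minimal sketch of this pipeline in PyTorch, assuming illustrative layer choices and names (`SelfSplatSketch` and its heads are hypothetical stand-ins, not the authors' implementation): the network maps unposed context views to per-pixel depth, pixel-aligned Gaussian attributes, and a relative pose.

```python
# Hedged sketch of the SelfSplat forward pass described above.
# All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SelfSplatSketch(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Hypothetical sub-networks standing in for the paper's modules.
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)              # per-pixel depth
        self.gauss_head = nn.Conv2d(feat_dim, 3 + 4 + 1 + 3, 1)  # scale, rotation, opacity, color
        self.pose_head = nn.Linear(feat_dim * 2, 6)              # relative pose (axis-angle + t)

    def forward(self, imgs: torch.Tensor):
        # imgs: (B, V, 3, H, W) unposed context views
        B, V, C, H, W = imgs.shape
        feats = self.backbone(imgs.flatten(0, 1))                # (B*V, F, H, W)
        depth = self.depth_head(feats).exp()                     # positive depths
        gauss = self.gauss_head(feats)                           # pixel-aligned Gaussian attributes
        # Pool per-view features and predict the pose of view 1 relative to view 0.
        pooled = feats.mean(dim=(-2, -1)).view(B, V, -1)
        rel_pose = self.pose_head(torch.cat([pooled[:, 0], pooled[:, 1]], dim=-1))
        return depth, gauss, rel_pose

model = SelfSplatSketch()
depth, gauss, rel_pose = model(torch.randn(1, 2, 3, 64, 64))
print(depth.shape, gauss.shape, rel_pose.shape)
```

The predicted depths and attributes are then combined with the estimated relative poses to place the Gaussians in a shared coordinate frame, as described above.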
Matching-aware pose network (a) and depth refinement module (b). We leverage cross-view features from the input images to achieve accurate camera pose estimation, and use the estimated poses to further refine the depth maps with spatial awareness.
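A hedged sketch of the matching-aware idea in (a), assuming a dense cross-view correlation volume as the matching signal; the `MatchingAwarePose` name and layer choices are illustrative, not the actual network:

```python
# Sketch: correlate features across two views, then regress a 6-DoF
# relative pose from the matching-aware features (assumed design).
import torch
import torch.nn as nn

class MatchingAwarePose(nn.Module):
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.regressor = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 6),
        )

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        B, C, H, W = fa.shape
        # Dense cross-view correlation: every pixel of A vs. every pixel of B.
        corr = torch.einsum('bchw,bcxy->bhwxy', fa, fb) / C ** 0.5
        # Best-match score per pixel of A, used as a matching-awareness cue.
        score = corr.view(B, H, W, H * W).max(dim=-1).values.unsqueeze(1)
        # Regress a 6-DoF relative pose (axis-angle rotation + translation).
        return self.regressor(torch.cat([fa, score], dim=1))
```

The depth refinement in (b) would then condition on the regressed pose to correct the per-view depth maps; that half is omitted here for brevity.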
Qualitative comparison of novel view synthesis on the RE10k (top two rows) and ACID (bottom row) datasets.
Qualitative comparison of novel view synthesis on the DL3DV dataset.
Epipolar line visualization. We draw the lines from the reference to the target frame using the estimated relative camera pose; their alignment demonstrates the effectiveness of our approach in capturing accurate cross-view geometry.
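For reference, this is how an epipolar line can be derived from an estimated relative pose (R, t) and shared intrinsics K; the helper below is an illustrative sketch, not the paper's code:

```python
# Epipolar line in the target image for a reference pixel, via the
# essential matrix E = [t]_x R and fundamental matrix F = K^-T E K^-1.
import numpy as np

def epipolar_line(x_ref: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray):
    """Return line coefficients (a, b, c) in the target image, i.e. the
    target pixels u satisfying a*u_x + b*u_y + c = 0."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])              # skew-symmetric [t]_x
    E = tx @ R                                     # essential matrix
    F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)  # fundamental matrix (shared K)
    x_h = np.array([x_ref[0], x_ref[1], 1.0])      # homogeneous reference pixel
    return F @ x_h                                 # epipolar line in target frame

K = np.array([[500., 0, 32], [0, 500., 32], [0, 0, 1]])
line = epipolar_line(np.array([10., 20.]), np.eye(3), np.array([0.1, 0., 0.]), K)
print(line)  # pose accuracy shows up as alignment of these lines with true matches
```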
We evaluate the model's performance on multi-view (more than two) context images, reflecting its practical application, and visualize the estimated camera trajectory on the RE10k dataset. The trajectory is constructed using only the translation component of the estimated camera poses.
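A short sketch of that trajectory construction, assuming consecutive 4x4 relative transforms and keeping only the translation component as stated above (the toy motion is made up for illustration):

```python
# Chain relative poses into a global trajectory and plot camera centers.
import numpy as np
import matplotlib.pyplot as plt

def trajectory_from_relative(rel_poses):
    """rel_poses: list of 4x4 relative transforms between consecutive
    frames. Returns camera centers expressed in the first frame."""
    pose = np.eye(4)
    centers = [pose[:3, 3].copy()]
    for T in rel_poses:
        pose = pose @ T                 # accumulate relative transforms
        centers.append(pose[:3, 3].copy())
    return np.stack(centers)

# Toy example: steady forward motion with slight sideways drift.
T = np.eye(4); T[:3, 3] = [0.02, 0.0, 0.1]
centers = trajectory_from_relative([T] * 30)
plt.plot(centers[:, 0], centers[:, 2])
plt.xlabel('x'); plt.ylabel('z')
plt.title('Estimated camera trajectory (translation only)')
plt.show()
```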
We visualize depth maps generated through rendering (rasterization) of the predicted Gaussians, which is essential for producing interpretable 3D representations.
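One common way to obtain such rendered depth maps, sketched under the assumption of a generic differentiable Gaussian rasterizer (`rasterize` below is a stand-in callable, not a real API), is to splat each Gaussian's camera-space depth in place of its color:

```python
# Hedged sketch: alpha-blending per-Gaussian depths yields an expected
# depth per pixel. Not the authors' exact implementation.
import torch

def render_depth(means_world, view_matrix, rasterize, **gauss_attrs):
    # Transform Gaussian centers into camera space.
    means_h = torch.cat([means_world, torch.ones_like(means_world[:, :1])], dim=-1)
    means_cam = (means_h @ view_matrix.T)[:, :3]
    z = means_cam[:, 2:3]                          # per-Gaussian depth
    # Splat depth as the "color"; alpha blending then returns the
    # expected depth map that is visualized above.
    depth_map = rasterize(means_world, colors=z.expand(-1, 3), **gauss_attrs)
    return depth_map[..., 0]
```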
Detailed 3D Gaussian prediction architecture. This module takes only context images as input. We employ only the encoder of CroCo, which was trained in a fully self-supervised manner, as our monocular encoder. For the multi-view encoder, we adopt the backbone of UniMatch with randomly initialized weights. We then unify the features from the monocular and multi-view encoders using a DPT block.
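A structural sketch of this wiring in PyTorch; `mono_enc`, `mv_enc`, and `fusion` are placeholders for the CroCo encoder, UniMatch backbone, and DPT block respectively, and the attribute count is an assumption:

```python
# Wiring sketch of the Gaussian prediction module: two encoders whose
# features are unified before a per-pixel attribute head.
import torch.nn as nn

class GaussianPredictor(nn.Module):
    def __init__(self, mono_enc: nn.Module, mv_enc: nn.Module, fusion: nn.Module,
                 feat_dim: int = 64, n_attrs: int = 11):
        super().__init__()
        self.mono_enc = mono_enc   # pretrained CroCo encoder (self-supervised)
        self.mv_enc = mv_enc       # UniMatch backbone, randomly initialized
        self.fusion = fusion       # DPT-style block unifying both feature sets
        self.head = nn.Conv2d(feat_dim, n_attrs, 1)  # per-pixel Gaussian attributes

    def forward(self, context_imgs):
        # context_imgs: (B, V, 3, H, W); only context images are used.
        mono_feats = self.mono_enc(context_imgs)   # per-view monocular features
        mv_feats = self.mv_enc(context_imgs)       # cross-view matching features
        fused = self.fusion(mono_feats, mv_feats)  # unified (B*V, feat_dim, H, W)
        return self.head(fused)
```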
@article{kang2024selfsplat,
title={SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting},
author={Kang, Gyeongjin and Yoo, Jisang and Park, Jihyeon and Nam, Seungtae and Im, Hyeonsoo and Kim, Sangpil and Park, Eunbyung and others},
journal={arXiv preprint arXiv:2411.17190},
year={2024}
}