Abstract
We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free, 3D prior-free, generalizable 3D reconstruction from unposed multi-view images. This setting is inherently ill-posed: without ground-truth poses, learned geometric priors, or per-scene finetuning, conventional methods struggle to achieve high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation, yielding reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometric consistency across views, ensuring more accurate and stable 3D reconstructions. To demonstrate the performance of our method, we evaluate it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat surpasses previous state-of-the-art methods in both appearance and geometry quality, and also demonstrates strong cross-dataset generalization. Extensive ablation studies and analyses further validate the effectiveness of our proposed methods.
Given unposed multi-view images as input, we predict depth and Gaussian attributes from the images, as well as the relative camera poses between them. We unify a self-supervised depth estimation framework with an explicit 3D representation, achieving accurate scene reconstruction.
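A minimal sketch of this pipeline in PyTorch, assuming illustrative layer choices and names (`SelfSplatSketch` and its heads are hypothetical stand-ins, not the authors' implementation): the network maps unposed context views to per-pixel depth, pixel-aligned Gaussian attributes, and a relative pose.

```python
# Hedged sketch of the SelfSplat forward pass described above.
# All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SelfSplatSketch(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Hypothetical sub-networks standing in for the paper's modules.
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)              # per-pixel depth
        self.gauss_head = nn.Conv2d(feat_dim, 3 + 4 + 1 + 3, 1)  # scale, rotation, opacity, color
        self.pose_head = nn.Linear(feat_dim * 2, 6)              # relative pose (axis-angle + t)

    def forward(self, imgs: torch.Tensor):
        # imgs: (B, V, 3, H, W) unposed context views
        B, V, C, H, W = imgs.shape
        feats = self.backbone(imgs.flatten(0, 1))                # (B*V, F, H, W)
        depth = self.depth_head(feats).exp()                     # positive depths
        gauss = self.gauss_head(feats)                           # pixel-aligned Gaussian attributes
        # Pool per-view features and predict the pose of view 1 relative to view 0.
        pooled = feats.mean(dim=(-2, -1)).view(B, V, -1)
        rel_pose = self.pose_head(torch.cat([pooled[:, 0], pooled[:, 1]], dim=-1))
        return depth, gauss, rel_pose

model = SelfSplatSketch()
depth, gauss, rel_pose = model(torch.randn(1, 2, 3, 64, 64))
print(depth.shape, gauss.shape, rel_pose.shape)
```

The predicted depths and attributes are then combined with the estimated relative poses to place the Gaussians in a shared coordinate frame, as described above.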
Matching-aware pose network (a) and depth refinement module (b). We leverage cross-view features from the input images to achieve accurate camera pose estimation, and use the estimated poses to further refine the depth maps with spatial awareness.
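A hedged sketch of the matching-aware idea in (a), assuming a dense cross-view correlation volume as the matching signal; the `MatchingAwarePose` name and layer choices are illustrative, not the actual network:

```python
# Sketch: correlate features across two views, then regress a 6-DoF
# relative pose from the matching-aware features (assumed design).
import torch
import torch.nn as nn

class MatchingAwarePose(nn.Module):
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.regressor = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 6),
        )

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        B, C, H, W = fa.shape
        # Dense cross-view correlation: every pixel of A vs. every pixel of B.
        corr = torch.einsum('bchw,bcxy->bhwxy', fa, fb) / C ** 0.5
        # Best-match score per pixel of A, used as a matching-awareness cue.
        score = corr.view(B, H, W, H * W).max(dim=-1).values.unsqueeze(1)
        # Regress a 6-DoF relative pose (axis-angle rotation + translation).
        return self.regressor(torch.cat([fa, score], dim=1))
```

The depth refinement in (b) would then condition on the regressed pose to correct the per-view depth maps; that half is omitted here for brevity.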
Qualitative comparison of novel view synthesis on the RE10k (top two rows) and ACID (bottom row) datasets.
Qualitative comparison of novel view synthesis on the DL3DV dataset.
Epipolar line visualization. We draw the lines from the reference to the target frame using the estimated relative camera pose; their alignment demonstrates the effectiveness of our approach in capturing accurate cross-view geometry.
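For reference, this is how an epipolar line can be derived from an estimated relative pose (R, t) and shared intrinsics K; the helper below is an illustrative sketch, not the paper's code:

```python
# Epipolar line in the target image for a reference pixel, via the
# essential matrix E = [t]_x R and fundamental matrix F = K^-T E K^-1.
import numpy as np

def epipolar_line(x_ref: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray):
    """Return line coefficients (a, b, c) in the target image, i.e. the
    target pixels u satisfying a*u_x + b*u_y + c = 0."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])              # skew-symmetric [t]_x
    E = tx @ R                                     # essential matrix
    F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)  # fundamental matrix (shared K)
    x_h = np.array([x_ref[0], x_ref[1], 1.0])      # homogeneous reference pixel
    return F @ x_h                                 # epipolar line in target frame

K = np.array([[500., 0, 32], [0, 500., 32], [0, 0, 1]])
line = epipolar_line(np.array([10., 20.]), np.eye(3), np.array([0.1, 0., 0.]), K)
print(line)  # pose accuracy shows up as alignment of these lines with true matches
```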
We evaluate the model's performance on multi-view (more than two) context images, reflecting its practical application, and visualize the estimated camera trajectory on the RE10k dataset. The trajectory is constructed using only the translation component of the estimated camera poses.
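A short sketch of that trajectory construction, assuming consecutive 4x4 relative transforms and keeping only the translation component as stated above (the toy motion is made up for illustration):

```python
# Chain relative poses into a global trajectory and plot camera centers.
import numpy as np
import matplotlib.pyplot as plt

def trajectory_from_relative(rel_poses):
    """rel_poses: list of 4x4 relative transforms between consecutive
    frames. Returns camera centers expressed in the first frame."""
    pose = np.eye(4)
    centers = [pose[:3, 3].copy()]
    for T in rel_poses:
        pose = pose @ T                 # accumulate relative transforms
        centers.append(pose[:3, 3].copy())
    return np.stack(centers)

# Toy example: steady forward motion with slight sideways drift.
T = np.eye(4); T[:3, 3] = [0.02, 0.0, 0.1]
centers = trajectory_from_relative([T] * 30)
plt.plot(centers[:, 0], centers[:, 2])
plt.xlabel('x'); plt.ylabel('z')
plt.title('Estimated camera trajectory (translation only)')
plt.show()
```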
We visualize depth maps generated through rendering (rasterization) of the predicted Gaussians, which is essential for producing interpretable 3D representations.
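One common way to obtain such rendered depth maps, sketched under the assumption of a generic differentiable Gaussian rasterizer (`rasterize` below is a stand-in callable, not a real API), is to splat each Gaussian's camera-space depth in place of its color:

```python
# Hedged sketch: alpha-blending per-Gaussian depths yields an expected
# depth per pixel. Not the authors' exact implementation.
import torch

def render_depth(means_world, view_matrix, rasterize, **gauss_attrs):
    # Transform Gaussian centers into camera space.
    means_h = torch.cat([means_world, torch.ones_like(means_world[:, :1])], dim=-1)
    means_cam = (means_h @ view_matrix.T)[:, :3]
    z = means_cam[:, 2:3]                          # per-Gaussian depth
    # Splat depth as the "color"; alpha blending then returns the
    # expected depth map that is visualized above.
    depth_map = rasterize(means_world, colors=z.expand(-1, 3), **gauss_attrs)
    return depth_map[..., 0]
```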
Detailed 3D Gaussian prediction architecture. This module takes only context images as input. We employ only the encoder of CroCo, which was trained in a fully self-supervised manner, as our monocular encoder. For the multi-view encoder, we adopt the backbone of UniMatch with randomly initialized weights. We then unify the features from the monocular and multi-view encoders using a DPT block.
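A structural sketch of this wiring in PyTorch; `mono_enc`, `mv_enc`, and `fusion` are placeholders for the CroCo encoder, UniMatch backbone, and DPT block respectively, and the attribute count is an assumption:

```python
# Wiring sketch of the Gaussian prediction module: two encoders whose
# features are unified before a per-pixel attribute head.
import torch.nn as nn

class GaussianPredictor(nn.Module):
    def __init__(self, mono_enc: nn.Module, mv_enc: nn.Module, fusion: nn.Module,
                 feat_dim: int = 64, n_attrs: int = 11):
        super().__init__()
        self.mono_enc = mono_enc   # pretrained CroCo encoder (self-supervised)
        self.mv_enc = mv_enc       # UniMatch backbone, randomly initialized
        self.fusion = fusion       # DPT-style block unifying both feature sets
        self.head = nn.Conv2d(feat_dim, n_attrs, 1)  # per-pixel Gaussian attributes

    def forward(self, context_imgs):
        # context_imgs: (B, V, 3, H, W); only context images are used.
        mono_feats = self.mono_enc(context_imgs)   # per-view monocular features
        mv_feats = self.mv_enc(context_imgs)       # cross-view matching features
        fused = self.fusion(mono_feats, mv_feats)  # unified (B*V, feat_dim, H, W)
        return self.head(fused)
```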
@article{kang2024selfsplat,
title={SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting},
author={Kang, Gyeongjin and Yoo, Jisang and Park, Jihyeon and Nam, Seungtae and Im, Hyeonsoo and Kim, Sangpil and Park, Eunbyung and others},
journal={arXiv preprint arXiv:2411.17190},
year={2024}
}