We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
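To make the dual hierarchy concrete, below is a minimal, illustrative PyTorch sketch: each stage attends over a progressively larger group of views (local-to-global) while pooling tokens within every view (fine-to-coarse). All module names, shapes, group sizes, and pooling choices here are assumptions for clarity, not the authors' implementation.

```python
# Illustrative sketch only; not the official MVP code. All hyperparameters are assumed.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention + MLP block (illustrative)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, N, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DualHierarchyStage(nn.Module):
    """One stage: attention within a group of views, then 2x token pooling per view."""

    def __init__(self, dim: int, group_size: int, pool: int = 2):
        super().__init__()
        self.group_size = group_size        # inter-view scope: single view -> group -> all views
        self.block = AttentionBlock(dim)
        self.pool = nn.AvgPool1d(pool)      # fine-to-coarse: merge tokens within each view

    def forward(self, x):                   # x: (B, V, T, D) = batch, views, tokens, dim
        B, V, T, D = x.shape
        g = self.group_size
        # Self-attention jointly over all tokens of each group of g views.
        x = x.reshape(B * V // g, g * T, D)
        x = self.block(x)
        # Pool tokens inside every view so later (broader) stages stay cheap.
        x = x.reshape(B * V, T, D).transpose(1, 2)   # (B*V, D, T)
        x = self.pool(x).transpose(1, 2)             # (B*V, T//2, D)
        return x.reshape(B, V, -1, D)


# Toy run: three stages widen the view scope (1 -> 4 -> 16 views) while halving tokens each time.
tokens = torch.randn(1, 16, 256, 64)        # 16 views, 256 tokens per view, 64-dim features
for scope in (1, 4, 16):
    tokens = DualHierarchyStage(dim=64, group_size=scope)(tokens)
print(tokens.shape)                         # torch.Size([1, 16, 32, 64])
```

Note how the cost of broadening the attention scope is offset by the shrinking number of tokens per view, which is the intuition behind the efficiency claim above.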
Inference results on the DL3DV dataset using 256 input images with a resolution of 960×540.
Inference results on the DL3DV dataset using 32 input images with a resolution of 960×540.
Given tokenized inputs, our model applies a three-stage hierarchy of alternating attention blocks, varying in both self-attention coverage and token resolution. A Pyramidal Feature Aggregation module fuses the outputs from all stages, which are then passed to a final head for dense prediction.
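The sketch below illustrates the fusion step described in this caption: the three stage outputs live at different token resolutions, so the coarser ones are upsampled back to the finest grid and fused before the dense prediction head. The specific layer choices (linear projections, nearest-neighbor upsampling, concatenation fusion) and the output channel count are assumptions, not the paper's exact design.

```python
# Hedged sketch of a pyramidal feature aggregation module; layer choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidalFeatureAggregation(nn.Module):
    def __init__(self, dim: int, num_stages: int = 3, out_channels: int = 12):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_stages))
        self.fuse = nn.Linear(num_stages * dim, dim)
        # Final head for dense, per-token prediction (e.g. pixel-aligned Gaussian parameters).
        self.head = nn.Linear(dim, out_channels)

    def forward(self, stage_outputs):        # list of (B, V, T_s, D), finest resolution first
        B, V, T0, D = stage_outputs[0].shape
        aligned = []
        for x, proj in zip(stage_outputs, self.proj):
            x = proj(x)                                        # per-stage projection
            x = x.reshape(B * V, -1, D).transpose(1, 2)        # (B*V, D, T_s)
            x = F.interpolate(x, size=T0, mode="nearest")      # upsample to finest token length
            aligned.append(x.transpose(1, 2).reshape(B, V, T0, D))
        fused = self.fuse(torch.cat(aligned, dim=-1))          # concat across stages, then mix
        return self.head(fused)                                # (B, V, T0, out_channels)


# Toy usage with three stage outputs at 256 / 128 / 64 tokens per view.
outs = [torch.randn(1, 16, t, 64) for t in (256, 128, 64)]
pred = PyramidalFeatureAggregation(dim=64)(outs)
print(pred.shape)                                              # torch.Size([1, 16, 256, 12])
```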
We compare MVP against recent feed-forward 3D reconstruction methods, Long-LRM and iLRM. We evaluate all methods on an in-domain dataset (DL3DV) and on out-of-domain datasets (Tanks&Temples and Mip-NeRF360). The input image resolution is 960×540 for all methods.
We additionally compare MVP against CLiFT and iLRM on the RealEstate10K dataset. All methods were trained exclusively on RealEstate10K, and the input image resolution is 256×256.
Use your mouse to slide the handle left or right to compare the two methods. The videos may take a few seconds to load.
Please use the buttons below each comparison video to toggle between different baseline models.
We compare the inference time as a function of the number of input views. All timings are measured at an input resolution of 960×540 on an H100 GPU. Note that the inference time accounts only for the generation of 3D Gaussians, not novel-view rendering. Long-LRM encounters an out-of-memory error when using more than 192 input views.
@misc{kang2025multiviewpyramidtransformerlook,
  title={Multi-view Pyramid Transformer: Look Coarser to See Broader},
  author={Gyeongjin Kang and Seungkwon Yang and Seungtae Nam and Younggeun Lee and Jungwoo Kim and Eunbyung Park},
  year={2025},
  eprint={2512.07806},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.07806},
}