Multi-View Pyramid Transformer

Look Coarser to See Broader

1Sungkyunkwan University 2Yonsei University

Abstract

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
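The dual hierarchy can be summarized with a minimal PyTorch-style sketch, shown below. It assumes patch tokens shaped (views, tokens per view, channels); each stage pools tokens within a view (fine-to-coarse) and widens self-attention from single views to groups of views and finally the whole scene (local-to-global). The module names, group sizes, and pooling factors are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class HierarchyStage(nn.Module):
        """One stage: pool each view's tokens by `pool` (fine-to-coarse),
        then self-attend over windows of `group` views (local-to-global)."""
        def __init__(self, dim, heads, group, pool):
            super().__init__()
            self.group, self.pool = group, pool
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            # x: [V, N, D] = (views, tokens per view, channels)
            V, N, D = x.shape
            # Fine-to-coarse: average-pool tokens within each view.
            x = x.reshape(V, N // self.pool, self.pool, D).mean(dim=2)
            # Local-to-global: attend jointly over groups of `group` views.
            x = x.reshape(V // self.group, self.group * (N // self.pool), D)
            h = self.norm(x)
            x = x + self.attn(h, h, h)[0]
            return x.reshape(V, N // self.pool, D)

    # Stage 1: per-view attention at full token resolution; stage 2: small
    # groups of views with pooled tokens; stage 3: all views, coarsest tokens.
    stages = nn.ModuleList([
        HierarchyStage(dim=256, heads=8, group=1, pool=1),
        HierarchyStage(dim=256, heads=8, group=4, pool=4),
        HierarchyStage(dim=256, heads=8, group=32, pool=4),
    ])

    tokens = torch.randn(32, 1024, 256)   # 32 views, 1024 patch tokens each
    pyramid = []                          # per-stage outputs, fused later
    for stage in stages:
        tokens = stage(tokens)
        pyramid.append(tokens)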

MVP handles a wide range of input views, reconstructing diverse large-scale scenes within 0.1–2.0 seconds.

Inference results on the DL3DV dataset using 256 input images with a resolution of 960×540.

Inference results on the DL3DV dataset using 32 input images with a resolution of 960×540.

Core architectural design


Given tokenized inputs, our model applies a three-stage hierarchy of alternating attention blocks, varying in both self-attention coverage and token resolution. A Pyramidal Feature Aggregation module fuses the outputs from all stages, which are then passed to a final head for dense prediction.
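Below is a hedged sketch of how such a Pyramidal Feature Aggregation step might fuse the per-stage outputs before the dense prediction head. The nearest-neighbor upsampling of coarse token sequences, the linear fusion layer, and the 14-channel output (e.g. per-pixel Gaussian parameters) are assumptions for illustration, not the paper's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidalFeatureAggregation(nn.Module):
        """Fuse multi-stage token features and predict dense per-token outputs."""
        def __init__(self, dim, num_stages, out_dim):
            super().__init__()
            self.fuse = nn.Linear(dim * num_stages, dim)
            self.head = nn.Linear(dim, out_dim)

        def forward(self, pyramid):
            # pyramid[i]: [V, N_i, D], ordered fine to coarse (N_0 >= N_1 >= ...)
            target_len = pyramid[0].shape[1]
            upsampled = []
            for feat in pyramid:
                # Upsample coarse token sequences back to the finest resolution.
                feat = F.interpolate(feat.transpose(1, 2), size=target_len,
                                     mode="nearest").transpose(1, 2)
                upsampled.append(feat)
            fused = self.fuse(torch.cat(upsampled, dim=-1))   # [V, N_0, D]
            return self.head(fused)                           # [V, N_0, out_dim]

    # Dummy stage outputs standing in for the three hierarchy stages.
    pyramid = [torch.randn(32, 1024, 256),
               torch.randn(32, 256, 256),
               torch.randn(32, 64, 256)]
    pfa = PyramidalFeatureAggregation(dim=256, num_stages=3, out_dim=14)
    dense_out = pfa(pyramid)              # one prediction per fine token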

Quantitative Comparison

We compare MVP against recent feed-forward 3D reconstruction methods, Long-LRM and iLRM. We evaluate all methods on an in-domain dataset, DL3DV, and on out-of-domain datasets, Tanks&Temples and Mip-NeRF360. The resolution of input images is 960×540 for all methods.


We additionally compare MVP against CLiFT and iLRM on the RealEstate10K dataset. All methods were trained exclusively on RealEstate10K, and the resolution of input images is 256×256.


Qualitative Comparison

Use the mouse to slide the handle left/right to compare the two methods. It might take a few seconds for the videos to load.
Please use the buttons below each comparison video to toggle between different baseline models.

32-view reconstruction

MVP (0.17s)
Long-LRM (0.84s)

128-view reconstruction

MVP (0.77s)
Long-LRM (6.39s)

4-view error map results on RealEstate10K

MVP (Ours)
CLiFT

Inference Time Comparison

We compare the inference time as a function of the number of input views. We measure all timings at an input resolution of 960×540 on an H100 GPU. Note that the inference time only accounts for the generation of 3D Gaussians. For novel view rendering, Long-LRM encounters a memory error when using more than 192 input views.

BibTeX


        @misc{kang2025multiviewpyramidtransformerlook,
              title={Multi-view Pyramid Transformer: Look Coarser to See Broader}, 
              author={Gyeongjin Kang and Seungkwon Yang and Seungtae Nam and Younggeun Lee and Jungwoo Kim and Eunbyung Park},
              year={2025},
              eprint={2512.07806},
              archivePrefix={arXiv},
              primaryClass={cs.CV},
              url={https://arxiv.org/abs/2512.07806}, 
        }