Efficient-LVSM
Faster, Cheaper, and Better Large View Synthesis Model Via Decoupled Co-Refinement Attention

Xiaosong Jia1*, Yihang Sun2*, Junqi You2*, Songbur Wong2, Zichen Zou1, Junchi Yan2†, Zuxuan Wu1, Yu-Gang Jiang1
1Institute of Trustworthy Embodied AI (TEAI), Fudan University, 2School of Computer Science & School of Artificial Intelligence, Shanghai Jiao Tong University

Interactive View Synthesis Demo

Use the controls below to navigate between different viewpoints. Move forward and backward along the track, and rotate the camera view at each position.
Note: The 3D scene and camera are for display only and are not generated by our model.

Keyboard Controls: W/S - Move Forward/Backward; ↑↓←→ - Rotate Camera; +/- - Adjust Input Views

(Interactive demo: Scene and Camera Track, Input Views (1-4), and Synthesized Output View panels, with Track Position and Camera Rotate controls.)

Compare with Baseline

Efficient-LVSM synthesizes novel views with better quality and faster speed than LVSM. (Video playback speed reflects inference speed.)

Abstract

Feedforward models for novel view synthesis (NVS) have recently been advanced by transformer-based methods such as LVSM, which apply attention among all input and target views. In this work, we argue that this full self-attention design is suboptimal: it suffers from quadratic complexity with respect to the number of input views and from rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues through a decoupled co-refinement mechanism, applying intra-view self-attention to input views and self-then-cross attention to target views, eliminating unnecessary computation. Efficient-LVSM achieves 30.6 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.9 dB, with 2× faster training convergence and 4.4× faster inference. It attains state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and, thanks to its decoupled design, supports incremental inference with a KV-cache.
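To make the decoupled co-refinement mechanism concrete, below is a minimal PyTorch sketch of one dual-stream layer, assuming intra-view self-attention for the input stream and self-then-cross attention for the target stream as stated above. The class name CoRefinementBlock, the layer widths, and the omitted feed-forward sub-layers are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class CoRefinementBlock(nn.Module):
    """One dual-stream layer (sketch): input tokens refine themselves per view;
    target tokens self-attend, then cross-attend to the input tokens."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.input_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.target_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.target_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_t1 = nn.LayerNorm(dim)
        self.norm_t2 = nn.LayerNorm(dim)

    def forward(self, input_tokens, target_tokens):
        # input_tokens: (B * V_in, N, dim). Each input view attends only within itself,
        # so cost grows linearly, not quadratically, with the number of input views.
        x = self.norm_in(input_tokens)
        input_tokens = input_tokens + self.input_self(x, x, x, need_weights=False)[0]

        # target_tokens: (B, M, dim). Target tokens first self-attend among themselves...
        t = self.norm_t1(target_tokens)
        target_tokens = target_tokens + self.target_self(t, t, t, need_weights=False)[0]

        # ...then cross-attend to the flattened input stream to gather scene context.
        ctx = input_tokens.reshape(target_tokens.shape[0], -1, input_tokens.shape[-1])
        t = self.norm_t2(target_tokens)
        target_tokens = target_tokens + self.target_cross(t, ctx, ctx, need_weights=False)[0]
        return input_tokens, target_tokens

Because input tokens never attend across views or to target tokens in this sketch, adding an input view only adds its own intra-view self-attention cost plus a longer key/value sequence in the cross-attention.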

Pipeline Architecture


PCA Visualization of Input- and Target-View Features at Different Layers.


Comparison of Latent Novel View Synthesis Paradigms. The proposed decoupled architecture disentangles the input and target streams, preserving the integrity and specialization of each stream with high efficiency, which yields better rendering quality and faster inference.


Efficient-LVSM Model Structure. Efficient-LVSM patchifies posed input images and target-view Plücker rays into tokens. The tokens of each input view pass separately through an encoder to extract contextual information. Target tokens self-attend among themselves and then cross-attend to the input tokens to render novel views.
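As a rough sketch of this tokenization step, the snippet below assumes the common Plücker ray parameterization (per-pixel direction d and moment o × d); the patch size, channel counts, and helper names (plucker_rays, Patchify) are illustrative choices rather than the paper's exact configuration.

import torch
import torch.nn as nn


def plucker_rays(K, c2w, h, w):
    """Per-pixel Plücker embedding (6, h, w) from intrinsics K (3, 3) and camera-to-world pose c2w (4, 4)."""
    ys, xs = torch.meshgrid(torch.arange(h) + 0.5, torch.arange(w) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)        # (h, w, 3) homogeneous pixel coords
    dirs = pix @ torch.linalg.inv(K.float()).T                      # back-project to camera-frame rays
    dirs = dirs @ c2w[:3, :3].float().T                             # rotate into the world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].float().expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)                      # Plücker moment o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)       # (6, h, w)


class Patchify(nn.Module):
    """Project non-overlapping patches to tokens with a strided convolution."""

    def __init__(self, in_ch, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                           # x: (B, in_ch, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)              # (B, num_patches, dim)


# Input views: RGB concatenated with their Plücker ray maps (3 + 6 channels).
# Target views: Plücker ray maps alone (6 channels), since no image exists yet.
tokenize_input = Patchify(in_ch=9)
tokenize_target = Patchify(in_ch=6)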

Inference Efficiency

Inference time and memory usage scale far more gently with the number of input views than those of conventional methods.
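The incremental inference with a KV-cache mentioned in the abstract can be sketched in a self-contained way: because the keys and values of the cross-attention come only from the input stream, they can be projected once, cached, and reused for every rendered target view, and simply extended when a new input view arrives. The module name CachedCrossAttention and the dimensions below are assumptions for illustration, not the paper's API.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedCrossAttention(nn.Module):
    """Cross-attention whose keys/values come from the input stream and are cached."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def cache_inputs(self, input_tokens):
        """Project input-view tokens to (K, V) once; reuse them for every target view."""
        b, n, _ = input_tokens.shape
        k, v = self.kv(input_tokens).chunk(2, dim=-1)
        k = k.view(b, n, self.heads, self.dk).transpose(1, 2)       # (b, heads, n, dk)
        v = v.view(b, n, self.heads, self.dk).transpose(1, 2)
        return k, v

    def forward(self, target_tokens, cache):
        k, v = cache
        b, m, _ = target_tokens.shape
        q = self.q(target_tokens).view(b, m, self.heads, self.dk).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v)                 # (b, heads, m, dk)
        return self.out(o.transpose(1, 2).reshape(b, m, -1))


attn = CachedCrossAttention()
input_tokens = torch.randn(1, 2 * 256, 768)          # tokens from two encoded input views
cache = attn.cache_inputs(input_tokens)              # computed once, reused below
novel_view = attn(torch.randn(1, 256, 768), cache)   # cheap per additional target view
# A third input view would only append its projected (K, V) along dim=2 of the cache.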

Training Efficiency

Efficient-LVSM matches LVSM's performance while converging 2× faster during training.

BibTeX


      @article{jia2025efficientlvsm,
        title={Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention},
        author={Jia, Xiaosong and Sun, Yihang and You, Junqi and Wong, Songbur and Zou, Zichen and Yan, Junchi and Wu, Zuxuan and Jiang, Yu-Gang},
        journal={International Conference on Learning Representations (ICLR)},
        year={2026}
      }