NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection


Unified attention-driven volumetric learning framework for state-of-the-art NeRF-based 3D object detection

R25: 98.5% (3D-FRONT dataset)

R50: 77.2% (3D-FRONT dataset)

AP25: 87.3% (3D-FRONT dataset)

AP50: 66.6% (3D-FRONT dataset)

📖

Overview

Three-dimensional (3D) object detection based on Neural Radiance Fields (NeRF) has emerged as a promising direction for reconstructing complex environments from posed RGB images. However, existing NeRF-based detectors often suffer from coarse feature encoding and limited attention to multi-scale volumetric structure, which leads to inaccurate localization and poor generalization in real-world scenarios. NeRF-UAVeL addresses these challenges with a unified attention-driven volumetric learning framework for detection that integrates four novel modules into a NeRF-derived 3D volumetric backbone: Multi-dimensional Volumetric Attention Pooling (MVAP), Tri-Scale Asymmetric Convolutional Aggregation (TACA), Dual-Domain Attention Fusion (DDAF), and Volumetric Cross-Window Attention Fusion (V-CWAF). Extensive experiments on the 3D-FRONT and ScanNet datasets show that NeRF-UAVeL outperforms both point cloud-based and multi-view-based methods, improving over the baseline by 6.7% in AP50 and 7.3% in R50 on 3D-FRONT, with gains of 6.9% in AP50 and 2.9% in R50 on ScanNet.
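For readers unfamiliar with the metric names, the sketch below illustrates how a recall figure such as R25 or R50 is typically computed. It assumes the suffixes denote IoU thresholds of 0.25 and 0.5, which is the usual convention but is not spelled out on this page. It also uses axis-aligned boxes for simplicity, whereas the actual evaluation may use oriented boxes. The names `iou_3d` and `recall_at` are hypothetical helpers, not part of any released code.

```python
from typing import Sequence, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumption for this sketch.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def recall_at(preds: Sequence[Box], gts: Sequence[Box], thresh: float) -> float:
    """Fraction of ground-truth boxes covered by some prediction with IoU >= thresh."""
    if not gts:
        return 0.0
    hits = sum(1 for g in gts if any(iou_3d(p, g) >= thresh for p in preds))
    return hits / len(gts)


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
pred = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # shifted by half the box width
print(iou_3d(pred, gt))                # 4 / 12, about 0.333
print(recall_at([pred], [gt], 0.25))   # counted as a hit at the looser threshold
print(recall_at([pred], [gt], 0.5))    # missed at the stricter threshold
```

The same prediction counts as a hit at the 0.25 threshold and as a miss at 0.5, which is why the R50 and AP50 figures are lower than their R25 and AP25 counterparts.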

🎬

3D Object Detection Visualizations

💡

Novel Contributions

Multi-dimensional Volumetric Attention Pooling (MVAP)

Enhances spatial selectivity through adaptive attention-based pooling, enabling fine-grained volumetric feature representation in NeRF-derived volumes.
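To make the idea of attention-based pooling concrete, here is a minimal NumPy sketch. It is not the MVAP module itself, whose design is not described on this page. It assumes non-overlapping k×k×k windows and uses the channel-mean activation as a stand-in for a learned attention score. The function name `attention_pool3d` is hypothetical.

```python
import numpy as np


def attention_pool3d(x: np.ndarray, k: int = 2) -> np.ndarray:
    """Softmax-weighted pooling over non-overlapping k*k*k windows.

    x has shape (C, D, H, W) with D, H, W divisible by k.
    Returns an array of shape (C, D//k, H//k, W//k).
    """
    c, d, h, w = x.shape
    # Split each spatial axis into (blocks, k), then move the k^3 window voxels last.
    win = x.reshape(c, d // k, k, h // k, k, w // k, k)
    win = win.transpose(0, 1, 3, 5, 2, 4, 6).reshape(c, d // k, h // k, w // k, k ** 3)
    # Assumed score: channel-mean activation. A learned scorer would replace this line.
    score = win.mean(axis=0, keepdims=True)
    score = score - score.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(score)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum over each window; weights are shared across channels.
    return (win * weights).sum(axis=-1)


features = np.random.default_rng(0).normal(size=(8, 4, 4, 4))
pooled = attention_pool3d(features, k=2)
print(pooled.shape)  # (8, 2, 2, 2)
```

Unlike max pooling, every voxel in a window contributes to the output. Unlike average pooling, voxels with higher scores contribute more, which is one plausible reading of "adaptive" in the description above.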

Tri-Scale Asymmetric Convolutional Aggregation (TACA)

Captures multi-scale volumetric features through asymmetric convolutional branches, enabling robust multi-scale structure understanding.
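The following NumPy sketch shows one common way to build asymmetric convolutional branches at three scales. It is an illustration under stated assumptions, not the TACA module: the kernel sizes (3, 5, 7), the factorization of each k×k×k kernel into k×1×1, 1×k×1, and 1×1×k passes, the averaging kernels standing in for learned weights, and the mean over branches are all chosen for this example. The function names are hypothetical.

```python
import numpy as np


def conv1d_along(x: np.ndarray, kernel: np.ndarray, axis: int) -> np.ndarray:
    """Zero-padded 1D convolution along one axis; output shape equals input shape."""
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, x)


def asymmetric_branch(x: np.ndarray, k: int) -> np.ndarray:
    """Approximate a k*k*k kernel with three 1D passes (depth, height, width)."""
    kernel = np.ones(k) / k  # placeholder for learned weights
    out = x
    for axis in (1, 2, 3):   # x is (C, D, H, W); axis 0 is channels
        out = conv1d_along(out, kernel, axis)
    return out


def tri_scale_aggregate(x: np.ndarray, scales=(3, 5, 7)) -> np.ndarray:
    """Average the outputs of one asymmetric branch per scale."""
    return sum(asymmetric_branch(x, k) for k in scales) / len(scales)


volume = np.random.default_rng(0).normal(size=(4, 9, 9, 9))
fused = tri_scale_aggregate(volume)
print(fused.shape)  # (4, 9, 9, 9)
```

The appeal of the factorized form is cost: three 1D passes of length k use 3k weights per branch, compared with k³ for a full cubic kernel, so larger receptive fields stay cheap. Each spatial dimension of the input must be at least as large as the largest kernel for `mode="same"` to preserve the shape.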

Dual-Domain Attention Fusion (DDAF)

Applies lightweight channel and spatial recalibration for refined feature emphasis, improving localization accuracy across spatial and frequency domains.
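As a rough illustration of channel and spatial recalibration, the NumPy sketch below gates a feature volume first per channel and then per voxel. It is a generic squeeze-and-gate pattern, not the DDAF module. The weight matrix `w`, the use of global average pooling as the channel descriptor, and the channel-mean map as the spatial descriptor are assumptions made for this example. It covers only the channel and spatial gating mentioned above.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def channel_gate(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Per-channel gate in (0, 1). x is (C, D, H, W); w is a hypothetical (C, C) matrix."""
    descriptor = x.mean(axis=(1, 2, 3))             # global average per channel, shape (C,)
    return sigmoid(w @ descriptor)[:, None, None, None]


def spatial_gate(x: np.ndarray) -> np.ndarray:
    """Per-voxel gate in (0, 1), shared across channels; shape (1, D, H, W)."""
    return sigmoid(x.mean(axis=0, keepdims=True))


def dual_recalibrate(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Scale features by both gates; the output shape matches the input."""
    return x * channel_gate(x, w) * spatial_gate(x)


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4, 4))
weights = rng.normal(size=(8, 8)) * 0.1  # stand-in for learned parameters
print(dual_recalibrate(features, weights).shape)  # (8, 4, 4, 4)
```

Both gates add few parameters relative to a convolution (here only the C×C matrix), which is consistent with the "lightweight" description. Because each gate lies strictly between 0 and 1, the output never exceeds the input in magnitude.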