SCIDE: Structure-Content Integrated Diffusion for Faster Endoscopic Depth Estimation

Min Tan, Yushun Tao, Boyun Zheng, Gaosheng Xie, Zeyang Xia, Senior Member, IEEE and Jing Xiong, Member, IEEE

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 518055, Shenzhen, China. University of Chinese Academy of Sciences, 101400, Beijing, China. Department of Electronic Engineering, Chinese University of Hong Kong, 999077, Hong Kong SAR, China. School of Mechanical Engineering, Shanghai Jiao Tong University, 200240, Shanghai, China.

Abstract 📝

Background: Monocular depth estimation in endoscopic environments is crucial for surgical video understanding, robotic navigation, and 3D reconstruction. However, existing discriminative approaches for depth estimation often struggle with challenging conditions such as complex illumination and narrow luminal spaces.

Methods: To address these challenges, we propose the Structure-Content Integrated Diffusion Estimation (SCIDE), which combines structure and content priors to guide depth estimation.

Results: Experimental results show that our SCIDE framework not only achieves state-of-the-art accuracy but also significantly reduces inference time, making real-time applications feasible in surgical settings.

Images 🖼️

Figure 1 (Method Overview): Animal Experiment

Figure 2 (Network Architecture): SCIDE Architecture

Figure 3 (SCIDE Module): Quantitative Comparison

Figure 4 (Training Pipeline): Comparison of Model Performance

Figure 5 (Dataset Overview): Visualization of Depth Estimation Results

Figure 6 (Experimental Setup): Comparison of Depth Map Fidelity

Figure 7 (Qualitative Results): Ablation Results for SC-Extractor

Figure 8 (Quantitative Results): Depth of Random Sampling

Figure 9 (Error Analysis): Parameter-Performance and Inference Time Landscape of FODS

Results 📊

By incorporating SC-Extractor and FODS, our method improves the robustness and accuracy of depth predictions under complex lighting conditions and within narrow luminal endoscopic environments. Experimental evaluations on EndoSLAM, EndoMapper, and our custom-collected phantom dataset demonstrate the competitive performance of our method. Furthermore, FODS preserves high-fidelity depth estimation while significantly accelerating inference, making the method suitable for real-time surgical applications.

EndoSLAM Dataset Results (SCIDE)

Absolute Relative Error (Abs Rel): 0.215±0.077
Squared Relative Error (Sq Rel): 0.027±0.019
Root Mean Square Error (RMSE): 0.0875±0.013
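
For readers unfamiliar with these error metrics, the following is a minimal sketch of how Abs Rel, Sq Rel, and RMSE are commonly computed in the monocular depth literature. It is not taken from the SCIDE code base: the function name `depth_errors`, the flat-list input format, and the choice to normalize Sq Rel by the ground-truth depth (rather than its square) are assumptions made for illustration.

```python
import math


def depth_errors(pred, gt):
    """Return (abs_rel, sq_rel, rmse) for two equal-length depth sequences.

    Hypothetical helper for illustration only. Pixels whose ground-truth
    depth is not positive are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Sq Rel: mean of (pred - gt)^2 / gt (assumed convention)
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return abs_rel, sq_rel, rmse


if __name__ == "__main__":
    # Toy values, not data from the paper; the last pixel is invalid.
    print(depth_errors([1.0, 2.0, 4.0, 9.0], [1.0, 4.0, 2.0, 0.0]))
```

All three are lower-is-better, which matches the ↓ markers used in the tables on this page.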

EndoMapper Dataset Results (SCIDE)

Structural Similarity Index (SSIM): 0.612±0.070
Edge Consistency (EC): 0.237±0.015
Histogram Similarity (HS): 0.237±0.012

FODS Performance Metrics

Inference Speedup: 74.2%
Accuracy (δ₃): 0.972±0.030
Inference Time: 0.312 s

Key Observations:

  • With FODS, SCIDE achieves a 74.2% inference speedup while maintaining high accuracy
  • Depth map fidelity comparable to 20-step sampling is reached in just 5-10 steps
  • Parameter ranges of ε = 0.02-0.06 and 5-15 sampling steps balance performance and efficiency

SC-Extractor Ablation Study

| Model | Structure Extractor | Content Extractor | RMSE ↓ | δ₃ ↑ |
| --- | --- | --- | --- | --- |
| #1 (Baseline) | | | 0.0901±0.0028 | 0.715±0.047 |
| #2 | | | 0.0893±0.0025 | 0.786±0.045 |
| #3 | | | 0.0888±0.0024 | 0.862±0.049 |
| #4 (SCIDE) | | | 0.0875±0.0015 | 0.972±0.043 |

Key Findings:

  • Full SC-Extractor configuration achieves lowest RMSE (0.0875) and highest δ₃ accuracy (0.972)
  • Content Extractor contributes more significantly to accuracy improvement
  • Combined extractors provide synergistic benefits for depth estimation

| Model | Architecture | Abs Rel ↓ | Sq Rel ↓ | log10 ↓ | RMSE ↓ | δ₁ ↑ | δ₃ ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DepthFM | Diffusion | 1.72 (0.805) | 1.064 (1.341) | 0.368 (0.087) | 0.340 (0.152) | 0.178 (0.056) | 0.494 (0.126) |
| GeoWizard | Diffusion | 1.22 (0.466) | 0.384 (0.245) | 0.325 (0.072) | 0.272 (0.064) | 0.198 (0.087) | 0.537 (0.139) |
| Marigold | Diffusion | 1.09 (0.393) | 0.303 (0.144) | 0.316 (0.058) | 0.263 (0.045) | 0.208 (0.076) | 0.560 (0.112) |
| DMP | Diffusion | 0.935 (0.351) | 0.206 (0.087) | 0.258 (0.056) | 0.217 (0.032) | 0.269 (0.083) | 0.658 (0.120) |
| DPT | Transformer | 0.799 (0.469) | 0.175 (0.136) | 0.239 (0.087) | 0.207 (0.046) | 0.315 (0.135) | 0.698 (0.167) |
| Depth Anything | Transformer | 0.450 (0.307) | 0.071 (0.082) | 0.161 (0.066) | 0.156 (0.041) | 0.500 (0.174) | 0.843 (0.112) |
| Endo-SfMLearner | ResNet | 0.350 (0.185) | 0.221 (0.024) | 0.287 (0.181) | 0.475 (0.444) | 0.506 (0.230) | 0.889 (0.145) |
| SCIDE (Ours) | Diffusion | 0.215 (0.077) | 0.027 (0.019) | 0.102 (0.0313) | 0.0875 (0.013) | 0.824 (0.080) | 0.972 (0.030) |
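
The δ₁ and δ₃ columns are threshold accuracies. As a rough sketch of the usual definition in the depth estimation literature (assumed here, since this page does not spell it out), δₖ is the fraction of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25ᵏ. The function name `threshold_accuracy` and its parameters are hypothetical and not part of any SCIDE release.

```python
def threshold_accuracy(pred, gt, k=1, base=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < base**k.

    Hypothetical helper for illustration only. A pixel counts as valid
    when both the predicted and ground-truth depths are positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    limit = base ** k
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < limit)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy values, not data from the paper.
    pred = [1.0, 1.2, 1.5, 2.0]
    gt = [1.0, 1.0, 1.0, 1.0]
    print(threshold_accuracy(pred, gt, k=1))  # delta_1
    print(threshold_accuracy(pred, gt, k=3))  # delta_3
```

Because the threshold grows with k, δ₃ is never smaller than δ₁ on the same predictions, which is consistent with every row of the table above.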