Min Tan, Yushun Tao, Boyun Zheng, Gaosheng Xie, Zeyang Xia, Senior Member, IEEE, and Jing Xiong, Member, IEEE
Background: Monocular depth estimation in endoscopic environments is crucial for surgical video understanding, robotic navigation, and 3D reconstruction. However, existing discriminative approaches for depth estimation often struggle with challenging conditions such as complex illumination and narrow luminal spaces.
Methods: To address these challenges, we propose Structure-Content Integrated Diffusion Estimation (SCIDE), a framework that combines structure and content priors to guide depth estimation.
Results: Experimental results show that our SCIDE framework not only achieves state-of-the-art accuracy but also significantly reduces inference time, making real-time applications feasible in surgical settings.
Figures and tables (titles only): Animal Experiment; SCIDE Architecture; Quantitative Comparison; Comparison of Model Performance; Visualization of Depth Estimation Results; Comparison of Depth Map Fidelity; Ablation Results for SC-Extractor; Depth of Random Sampling; Parameter-Performance and Inference Time Landscape of FODS.
By incorporating SC-Extractor and FODS, our method improves the robustness and accuracy of depth predictions under complex lighting conditions and in narrow luminal endoscopic environments. Experimental evaluations on EndoSLAM, EndoMapper, and our custom-collected phantom dataset demonstrate the competitive performance of our method. Furthermore, FODS not only preserves high-fidelity depth estimation but also significantly accelerates inference, making the method suitable for real-time surgical applications.
| Model | Structure Extractor | Content Extractor | RMSE ↓ | δ₃ ↑ |
|---|---|---|---|---|
| #1 (Baseline) | ✗ | ✗ | 0.0901±0.0028 | 0.715±0.047 |
| #2 | ✓ | ✗ | 0.0893±0.0025 | 0.786±0.045 |
| #3 | ✗ | ✓ | 0.0888±0.0024 | 0.862±0.049 |
| #4 (SCIDE) | ✓ | ✓ | 0.0875±0.0015 | 0.972±0.043 |
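
To make the ablation easier to read, the sketch below computes each variant's relative RMSE reduction and absolute δ₃ gain over the baseline, using only the mean values from the table above. The dictionary layout and the function name `improvements` are illustrative choices, not part of the authors' code, and the reported standard deviations are ignored here.

```python
# Mean (RMSE, delta_3) values copied from the ablation table; std values are omitted.
ablation = {
    "#1 (Baseline)": (0.0901, 0.715),
    "#2": (0.0893, 0.786),
    "#3": (0.0888, 0.862),
    "#4 (SCIDE)": (0.0875, 0.972),
}


def improvements(table, baseline_key="#1 (Baseline)"):
    """Return relative RMSE reduction (%) and absolute delta_3 gain vs. the baseline."""
    base_rmse, base_d3 = table[baseline_key]
    result = {}
    for name, (rmse, d3) in table.items():
        result[name] = {
            # Lower RMSE is better, so a positive value means an improvement.
            "rmse_reduction_pct": 100.0 * (base_rmse - rmse) / base_rmse,
            # Higher delta_3 is better, so a positive value means an improvement.
            "delta3_gain": d3 - base_d3,
        }
    return result


for name, stats in improvements(ablation).items():
    print(f"{name}: RMSE -{stats['rmse_reduction_pct']:.2f}%, "
          f"delta_3 +{stats['delta3_gain']:.3f}")
```

Read this way, the two extractors change mean RMSE only slightly, while δ₃ rises substantially when both are enabled.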
Key Findings:
| Model | Architecture | Abs Rel↓ | Sq Rel↓ | log10↓ | RMSE↓ | δ₁↑ | δ₃↑ |
|---|---|---|---|---|---|---|---|
| DepthFM | Diffusion | 1.72 (0.805) | 1.064 (1.341) | 0.368 (0.087) | 0.340 (0.152) | 0.178 (0.056) | 0.494 (0.126) |
| GeoWizard | Diffusion | 1.22 (0.466) | 0.384 (0.245) | 0.325 (0.072) | 0.272 (0.064) | 0.198 (0.087) | 0.537 (0.139) |
| Marigold | Diffusion | 1.09 (0.393) | 0.303 (0.144) | 0.316 (0.058) | 0.263 (0.045) | 0.208 (0.076) | 0.560 (0.112) |
| DMP | Diffusion | 0.935 (0.351) | 0.206 (0.087) | 0.258 (0.056) | 0.217 (0.032) | 0.269 (0.083) | 0.658 (0.120) |
| DPT | Transformer | 0.799 (0.469) | 0.175 (0.136) | 0.239 (0.087) | 0.207 (0.046) | 0.315 (0.135) | 0.698 (0.167) |
| Depth Anything | Transformer | 0.450 (0.307) | 0.071 (0.082) | 0.161 (0.066) | 0.156 (0.041) | 0.500 (0.174) | 0.843 (0.112) |
| Endo-SfMLearner | ResNet | 0.350 (0.185) | 0.221 (0.024) | 0.287 (0.181) | 0.475 (0.444) | 0.506 (0.230) | 0.889 (0.145) |
| SCIDE (Ours) | Diffusion | 0.215 (0.077) | 0.027 (0.019) | 0.102 (0.0313) | 0.0875 (0.013) | 0.824 (0.080) | 0.972 (0.030) |
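
For readers unfamiliar with the column headers, the following is a minimal NumPy sketch of how these metrics are commonly defined in the monocular depth literature. It assumes the conventional definitions (δₖ as the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25ᵏ, and Sq Rel normalized by the ground-truth depth). The source does not state its exact formulas, masking, or scale alignment, so this may differ from the authors' evaluation code; the function name `depth_metrics` is hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt, eps=1e-8):
    """Common depth-estimation metrics over pixels with valid (positive) ground truth.

    Assumed conventional definitions; not taken from the authors' code.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Keep only pixels with positive ground-truth depth.
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]
    # Clip predictions so ratios and logarithms stay finite.
    pred = np.maximum(pred, eps)

    diff = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)

    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }


# Toy usage on a two-pixel "depth map"; real use would pass full H x W arrays.
example = depth_metrics(pred=[1.5, 2.0], gt=[1.0, 2.0])
print(example)
```

Error metrics (Abs Rel, Sq Rel, log10, RMSE) are lower-is-better, while the threshold accuracies δ₁ and δ₃ are higher-is-better, matching the arrows in the table header.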