Haotong Lin* · Sili Chen* · Jun Hao Liew* · Donny Y. Chen* · Zhenyu Li · Guang Shi · Jiashi Feng
Bingyi Kang*†
†project lead *Equal Contribution
This work presents Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from arbitrary visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights:
- 💎 A single plain transformer (e.g., a vanilla DINO encoder) is sufficient as a backbone, without architectural specialization.
- ✨ A singular depth-ray representation obviates the need for complex multi-task learning.
🏆 DA3 significantly outperforms DA2 for monocular depth estimation, and VGGT for multi-view depth estimation and pose estimation. All models are trained exclusively on public academic datasets.
- 11-12-2025: 🚀 New models and DA3-Streaming released! Handle ultra-long video sequences with less than 12 GB of GPU memory via sliding-window streaming inference. Special thanks to Kai Deng for his contribution to DA3-Streaming!
- 08-12-2025: 📊 Benchmark evaluation pipeline released! Evaluate pose estimation & 3D reconstruction on 5 datasets.
- 30-11-2025: Add `use_ray_pose` and `ref_view_strategy` (reference view selection for multi-view inputs).
- 25-11-2025: Add Awesome DA3 Projects, a community-driven section featuring DA3-based applications.
- 14-11-2025: Paper, project page, code, and models are all released.
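The sliding-window streaming idea behind DA3-Streaming can be illustrated generically. The sketch below is ours, not DA3-Streaming's actual code, and the window/overlap values are hypothetical: a long sequence is covered by overlapping windows so that peak memory depends on the window size rather than the sequence length, while the overlap region is what a streaming method can use to align consecutive windows.

```python
def sliding_windows(num_frames, window=32, overlap=8):
    """Yield overlapping [start, end) frame ranges covering a long sequence.

    Each window can be processed independently, bounding peak memory by the
    window size; the `overlap` frames shared between consecutive windows are
    what a streaming pipeline would use to stitch their predictions together.
    """
    step = window - overlap
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += step
```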
We release three series of models, each tailored for specific use cases in visual geometry.
- 🌟 DA3 Main Series (DA3-Giant, DA3-Large, DA3-Base, DA3-Small): These are our flagship foundation models, trained with a unified depth-ray representation. By varying the input configuration, a single model can perform a wide range of tasks:
- 🌊 Monocular Depth Estimation: Predicts a depth map from a single RGB image.
- 🌊 Multi-View Depth Estimation: Generates consistent depth maps from multiple images for high-quality fusion.
- 🎯 Pose-Conditioned Depth Estimation: Achieves superior depth consistency when camera poses are provided as input.
- 📷 Camera Pose Estimation: Estimates camera extrinsics and intrinsics from one or more images.
- 🟡 3D Gaussian Estimation: Directly predicts 3D Gaussians, enabling high-fidelity novel view synthesis.
- 📐 DA3 Metric Series (DA3Metric-Large): A specialized model fine-tuned for metric depth estimation in monocular settings, ideal for applications requiring real-world scale.
- 🔍 DA3 Monocular Series (DA3Mono-Large): A dedicated model for high-quality relative monocular depth estimation. Unlike disparity-based models (e.g., Depth Anything 2), it directly predicts depth, resulting in superior geometric accuracy.
🔗 Leveraging these models, we developed a nested series (DA3Nested-Giant-Large), which combines an any-view giant model with a metric model to reconstruct visual geometry at real-world metric scale.
Our repository is designed to be a powerful and user-friendly toolkit for both practical application and future research.
- 🎨 Interactive Web UI & Gallery: Visualize model outputs and compare results with an easy-to-use Gradio-based web interface.
- ⚡ Flexible Command-Line Interface (CLI): Powerful and scriptable CLI for batch processing and integration into custom workflows.
- 💾 Multiple Export Formats: Save your results in various formats, including glb, npz, depth images, ply, 3DGS videos, etc., to seamlessly connect with other tools.
- 🔧 Extensible and Modular Design: The codebase is structured to facilitate future research and the integration of new models or functionalities.
```bash
pip install xformers "torch>=2" torchvision
pip install -e .          # Basic
pip install --no-build-isolation git+https://github.com/nerfstudio-project/gsplat.git@0b4dddf04cb687367602c01196913cde6a743d70  # for Gaussian head
pip install -e ".[app]"   # Gradio app, requires python>=3.10
pip install -e ".[all]"   # All extras
```

For detailed model information, please refer to the Model Cards section below.
```python
import glob
import os

import torch

from depth_anything_3.api import DepthAnything3

device = torch.device("cuda")
model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")
model = model.to(device=device)
example_path = "assets/examples/SOH"
images = sorted(glob.glob(os.path.join(example_path, "*.png")))
prediction = model.inference(images)
# prediction.processed_images : [N, H, W, 3] uint8 array
print(prediction.processed_images.shape)
# prediction.depth : [N, H, W] float32 array
print(prediction.depth.shape)
# prediction.conf : [N, H, W] float32 array
print(prediction.conf.shape)
# prediction.extrinsics : [N, 3, 4] float32 array (OpenCV w2c, i.e., COLMAP convention)
print(prediction.extrinsics.shape)
# prediction.intrinsics : [N, 3, 3] float32 array
print(prediction.intrinsics.shape)
```

```bash
export MODEL_DIR=depth-anything/DA3NESTED-GIANT-LARGE
# This can be a Hugging Face repository or a local directory.
# If you encounter network issues, consider using a mirror: export HF_ENDPOINT=https://hf-mirror.com
# Alternatively, you can download the model directly from Hugging Face.
export GALLERY_DIR=workspace/gallery
mkdir -p $GALLERY_DIR

# CLI auto mode with backend reuse
da3 backend --model-dir ${MODEL_DIR} --gallery-dir ${GALLERY_DIR}  # Cache the model on the GPU
da3 auto assets/examples/SOH \
  --export-format glb \
  --export-dir ${GALLERY_DIR}/TEST_BACKEND/SOH \
  --use-backend

# CLI video processing with feature visualization
da3 video assets/examples/robot_unitree.mp4 \
  --fps 15 \
  --use-backend \
  --export-dir ${GALLERY_DIR}/TEST_BACKEND/robo \
  --export-format glb-feat_vis \
  --feat-vis-fps 15 \
  --process-res-method lower_bound_resize \
  --export-feat "11,21,31"

# CLI auto mode without backend reuse
da3 auto assets/examples/SOH \
  --export-format glb \
  --export-dir ${GALLERY_DIR}/TEST_CLI/SOH \
  --model-dir ${MODEL_DIR}
```
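As an illustration of how the Python API outputs fit together, the per-view depth, intrinsics, and extrinsics can be combined to unproject pixels into a world-space point cloud. The helper below is a sketch of standard pinhole unprojection, not part of the DA3 API, and assumes the OpenCV world-to-camera (w2c) extrinsics convention noted in the comments above:

```python
import numpy as np

def unproject_to_world(depth, K, w2c):
    """Unproject a depth map to world-space 3D points.

    depth: [H, W] float32, K: [3, 3] intrinsics,
    w2c: [3, 4] OpenCV world-to-camera extrinsics (R | t).
    Returns an [H*W, 3] array of world-space points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project to camera space: X_cam = depth * K^-1 [u, v, 1]^T
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    R, t = w2c[:, :3], w2c[:, 3]
    # Invert world-to-camera: X_world = R^T (X_cam - t)
    return (cam - t) @ R
```

Concatenating the result over all N views (after confidence filtering) yields a fused point cloud consistent with the exported glb/ply outputs.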
The model architecture is defined in DepthAnything3Net and specified with a YAML config file located at src/depth_anything_3/configs. Input and output processing are handled by DepthAnything3. To customize the model architecture, simply create a new config file (e.g., path/to/new/config) as follows:
```yaml
__object__:
  path: depth_anything_3.model.da3
  name: DepthAnything3Net
  args: as_params
net:
  __object__:
    path: depth_anything_3.model.dinov2.dinov2
    name: DinoV2
    args: as_params
  name: vitb
  out_layers: [5, 7, 9, 11]
  alt_start: 4
  qknorm_start: 4
  rope_start: 4
  cat_token: True
head:
  __object__:
    path: depth_anything_3.model.dualdpt
    name: DualDPT
    args: as_params
  dim_in: &head_dim_in 1536
  output_dim: 2
  features: &head_features 128
  out_channels: &head_out_channels [96, 192, 384, 768]
```

Then the model can be created with the following code snippet.
```python
from depth_anything_3.cfg import create_object, load_config

model = create_object(load_config("path/to/new/config"))
```

Generally, you should observe that DA3-LARGE achieves results comparable to VGGT.
The Nested series uses an any-view model to estimate pose and depth, and a monocular metric depth estimator for scaling.
Models with the -1.1 suffix were retrained after fixing a training bug; prefer these refreshed checkpoints. The original DA3NESTED-GIANT-LARGE, DA3-GIANT, and DA3-LARGE remain available but are deprecated. You can expect much better performance on street scenes with the -1.1 models.
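A minimal sketch of this scaling idea: align the up-to-scale any-view depth to the monocular metric depth with a robust per-scene scale factor. This is an illustration of the general technique (median-ratio alignment), not DA3's exact procedure, and the function name is ours:

```python
import numpy as np

def align_to_metric(anyview_depth, metric_depth, eps=1e-6):
    # Robust per-scene scale: median ratio between the metric depth and the
    # (up-to-scale) any-view depth over valid pixels.
    mask = anyview_depth > eps
    scale = np.median(metric_depth[mask] / anyview_depth[mask])
    return scale * anyview_depth  # any-view geometry, now in metric units
```

A median is used rather than a mean so that a few outlier pixels (e.g., sky or occlusion boundaries) do not skew the scale.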
| 🗃️ Model Name | 📏 Params | 📊 Rel. Depth | 📷 Pose Est. | 🧭 Pose Cond. | 🎨 GS | 📐 Met. Depth | ☁️ Sky Seg | 📄 License |
|---|---|---|---|---|---|---|---|---|
| Nested | | | | | | | | |
| DA3NESTED-GIANT-LARGE-1.1 | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| DA3NESTED-GIANT-LARGE | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| Any-view Model | | | | | | | | |
| DA3-GIANT-1.1 | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| DA3-GIANT | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| DA3-LARGE-1.1 | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| DA3-LARGE | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| DA3-BASE | 0.12B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| DA3-SMALL | 0.08B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| Monocular Metric Depth | | | | | | | | |
| DA3METRIC-LARGE | 0.35B | ✅ | | | | ✅ | ✅ | Apache 2.0 |
| Monocular Depth | | | | | | | | |
| DA3MONO-LARGE | 0.35B | ✅ | | | | | ✅ | Apache 2.0 |
- Monocular Metric Depth: To obtain metric depth in meters from DA3METRIC-LARGE, use `metric_depth = focal * net_output / 300.`, where `focal` is the focal length in pixels (typically the average of fx and fy from the camera intrinsic matrix K). Note that the output from DA3NESTED-GIANT-LARGE is already in meters.
- Ray Head (`use_ray_pose`): Our API and CLI support the `use_ray_pose` argument, which makes the model derive the camera pose from the ray head. This is generally slightly slower but more accurate. Note that the default is `False` for faster inference.

  AUC3 results for DA3NESTED-GIANT-LARGE:

  | Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
  |---|---|---|---|---|---|
  | ray_head | 84.4 | 52.6 | 93.9 | 29.5 | 89.4 |
  | cam_head | 80.3 | 48.4 | 94.1 | 28.5 | 85.0 |
Older GPUs without XFormers support: See Issue #11. Thanks to @S-Mahoney for the solution!
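The metric-depth conversion above (`metric_depth = focal * net_output / 300.`) can be wrapped in a small helper. The function name below is ours, but the formula and the fx/fy averaging are exactly as documented above:

```python
import numpy as np

def to_metric_depth(net_output, K):
    """Convert DA3METRIC-LARGE network output to depth in meters.

    K is the 3x3 camera intrinsic matrix; the focal length in pixels is
    taken as the average of fx and fy, as recommended above.
    """
    focal = 0.5 * (K[0, 0] + K[1, 1])
    return focal * net_output / 300.0
```

Because the conversion is linear in the focal length, an inaccurate focal estimate scales the whole depth map by the same factor without distorting its shape.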
A community-curated list of Depth Anything 3 integrations across 3D tools, creative pipelines, robotics, and web/VR viewers, including but not limited to the projects below. You are welcome to submit your DA3-based project via PR, and we will review and feature it if applicable.
- DA3-blender: Blender addon for DA3-based 3D reconstruction from a set of images.
- ComfyUI-DepthAnythingV3: ComfyUI nodes for Depth Anything 3, supporting single/multi-view and video-consistent depth with optional point-cloud export.
- DA3-ROS2-Wrapper: Real-time DA3 depth in ROS2 with multi-camera support.
- DA3-ROS2-CPP-TensorRT: A ROS2 C++ node for DA3 depth estimation using TensorRT for real-time inference.
- VideoDepthViewer3D: Streams videos with DA3 metric depth to a Three.js/WebXR 3D viewer for VR/stereo playback.
Bingyi Kang · Haotong Lin · Sili Chen · Jun Hao Liew · Donny Y. Chen · Kai Deng
If you find Depth Anything 3 useful in your research or projects, please cite our work:
@article{depthanything3,
title={Depth Anything 3: Recovering the visual space from any views},
author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
journal={arXiv preprint arXiv:2511.10647},
year={2025}
}

