Accepted by IEEE Robotics and Automation Letters (RA-L)
Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu
In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions.
We also reproduce TartanVO as a strong baseline and release the open-source implementation.
Demo data from DROID-W
# Create conda environment
conda create -n mvoformer python=3.11 -y
conda activate mvoformer
# Install PyTorch (CUDA 12.x)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
The Deformable Attention CUDA ops are pre-compiled for Python 3.11. If you encounter import errors, recompile:
cd Network/Deformable_ops
bash make.sh
cd ../..
MVOFormer/
├── assets/ # Demo video and paper PDF
│ ├── MVOFormer.mp4
│ └── MVOFormer.pdf
├── Configs/
│ └── MVOFormer.yaml # Main configuration file
├── Model/ # Pretrained model checkpoints
│ ├── stage_1_model.pth # Flow-only pretrained (200 epochs)
│ └── MVOFormer.pth # Final model (50 epochs)
├── Network/
│ ├── Deformable_ops/ # Deformable attention CUDA ops
│ ├── Model/ # MVOFormer model (transformer, backbone, etc.)
│ ├── SeaRAFT/ # SeaRAFT optical flow model
│ └── dinov3/ # DINOv3 visual-semantic backbone
├── Tool/
│ ├── Datasets/ # Dataset loading & augmentation
│ ├── Evaluator/ # Trajectory evaluation (ATE, RPE, KITTI)
│ ├── Train_Test/ # Trainer, Tester, and Inference modules
│ └── Utils/ # Utilities (logging, seeding, transforms)
├── Outputs/ # Checkpoints and logs (gitignored)
├── train.py # Training & evaluation script
├── infer.py # Inference script (no GT poses needed)
The code supports TartanAir, TartanAir-Shibuya, KITTI, TUM-RGBD, Bonn, EuRoC, and ETH3D-SLAM datasets.
Each dataset should be organized as:
dataset/
{split}_img/ # RGB images
{split}_flow_sea/ # Pre-computed optical flow (.npy)
{split}_pose/ # Ground-truth poses (.txt, 7-DoF: xyz + quaternion)
Optical flow can be pre-computed using SEA-RAFT. Update Configs/MVOFormer.yaml with your dataset paths.
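As an illustration of the pose format above, the following sketch converts one 7-DoF pose line into a 4x4 homogeneous matrix. It assumes each line is laid out as `tx ty tz qx qy qz qw` (the README states only "xyz + quaternion", so the quaternion order is an assumption), and `pose7_to_matrix` is a hypothetical helper, not a function from this repository.

```python
import math

def pose7_to_matrix(line):
    """Convert one 'tx ty tz qx qy qz qw' line into a 4x4 row-major matrix.

    Assumption: the quaternion is stored as (qx, qy, qz, qw); change the
    unpacking below if your files use (qw, qx, qy, qz).
    """
    tx, ty, tz, qx, qy, qz, qw = (float(v) for v in line.split())
    # Normalize so small rounding errors do not yield a non-rotation matrix.
    n = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    qx, qy, qz, qw = qx / n, qy / n, qz / n, qw / n
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw), tx],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw), ty],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy), tz],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Applying this to every line of a file in `{split}_pose/` would give one camera pose per frame, which is a convenient form for sanity-checking a dataset before training.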
| Model | Source | Placement |
|---|---|---|
| DINOv3 backbone | facebookresearch/dinov3 | Network/dinov3/weights/ |
| SEA-RAFT optical flow | princeton-vl/SEA-RAFT | Network/SeaRAFT/models/ |
| Stage 1 (flow-only) | Google Drive | Model/stage_1_model.pth |
| MVOFormer (full) | Google Drive | Model/MVOFormer.pth |
The first stage trains MVOFormer using ground-truth optical flow only (without semantic features) for 200 epochs. This stage learns basic motion understanding.
CUDA_VISIBLE_DEVICES=0 python train.py --mode train \
--set model.is_Semantics=False \
--set trainer.max_epoch=200 \
--set trainer.pretrain_model=None
After training, rename the output checkpoint to Model/stage_1_model.pth.
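The `--set section.key=value` flags override entries of the YAML configuration. How `train.py` parses them is not shown here, so the following is only a sketch of one plausible approach; `apply_override` is a hypothetical name, and the use of `ast.literal_eval` for value parsing is an assumption.

```python
import ast

def apply_override(cfg, assignment):
    """Apply one 'section.key=value' override to a nested dict in place."""
    dotted_key, raw_value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    try:
        # Parses Python literals such as False, 200, 5e-5, or None.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything else (for example a file path) stays a plain string.
        value = raw_value
    node[leaf] = value
    return cfg

# Stage-1 overrides from the command above, applied to a toy config.
cfg = {"model": {"is_Semantics": True}, "trainer": {"max_epoch": 50}}
for item in ["model.is_Semantics=False", "trainer.max_epoch=200", "trainer.pretrain_model=None"]:
    apply_override(cfg, item)
```

With this scheme, `None` and `False` become Python literals rather than strings, which matters when the trainer later checks whether a pretrained model was requested.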
The second stage loads the flow-only checkpoint and adds DINOv3 semantic features, training for 50 epochs. The optical flow used in this stage is pre-computed by SEA-RAFT and saved locally (under {split}_flow_sea/), rather than ground-truth flow.
CUDA_VISIBLE_DEVICES=0 python train.py --mode train \
--set model.is_Semantics=True \
--set trainer.pretrain_model=./Model/stage_1_model.pth \
--set trainer.max_epoch=50
| Component | Description |
|---|---|
| Model | MVOFormer with DINOv3 backbone (81.98M params, 52.53M trainable) |
| Optimizer | AdamW (lr=5e-5, weight_decay=1e-4) |
| LR Schedule | Cosine decay with 3-epoch linear warmup (init_lr=1e-5, min_lr=1e-7) |
| Batch Size | 64 |
| Mixed Precision | BF16 (automatic if GPU supports it) |
| Gradient Clipping | max_norm=1.0 |
| Loss | Weighted translation + rotation regression with uncertainty learning |
| Augmentation | Spatial random crop (scale up to 2.5×), color jitter (brightness/contrast/saturation) |
| Datasets | TartanAir (305K samples) + TartanAir-Shibuya (×10 repeat) |
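The learning-rate schedule in the table can be sketched as a function of the epoch index. This is a minimal illustration using the listed values (lr=5e-5, init_lr=1e-5, min_lr=1e-7, 3 warmup epochs) and assuming the 50-epoch second stage; whether the repository steps the schedule per epoch or per iteration is not stated, and `lr_at_epoch` is a hypothetical name.

```python
import math

def lr_at_epoch(epoch, max_epoch=50, base_lr=5e-5, init_lr=1e-5,
                min_lr=1e-7, warmup_epochs=3):
    """Learning rate for a (possibly fractional) epoch index."""
    if epoch < warmup_epochs:
        # Linear ramp from init_lr to base_lr over the warmup epochs.
        return init_lr + (base_lr - init_lr) * epoch / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, max_epoch - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 1e-5, reaches 5e-5 at the end of warmup, and decays to 1e-7 by the final epoch.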
During training, the model saves:
- `checkpoint_epoch_{N}.pth`: every `save_frequency` epochs (default: 5)
- `checkpoint_best.pth`: epoch with lowest validation loss
- `checkpoint_final.pth`: latest epoch
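The saving policy above can be summarized in a few lines. This sketch only decides which file names would be written after an epoch; it assumes 1-based epoch numbering, and `checkpoints_to_write` is a hypothetical helper rather than the trainer's actual code.

```python
def checkpoints_to_write(epoch, val_loss, best_val_loss, save_frequency=5):
    """Return the checkpoint file names to write after finishing `epoch`."""
    # The final checkpoint is overwritten with the latest epoch every time.
    names = ["checkpoint_final.pth"]
    if epoch % save_frequency == 0:
        names.append(f"checkpoint_epoch_{epoch}.pth")
    if val_loss < best_val_loss:
        names.append("checkpoint_best.pth")
    return names
```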
Evaluate a specific checkpoint on test datasets:
CUDA_VISIBLE_DEVICES=0 python train.py --mode eval --config Configs/MVOFormer.yaml --checkpoint 50
Uses checkpoint at Outputs/{model_name}/checkpoint_epoch_{N}.pth.
The evaluation:
- Loads the specified checkpoint (`checkpoint_epoch_50.pth` in `Outputs/{model_name}/`, e.g. `Outputs/MVOFormer/checkpoint_epoch_50.pth`).
- Iterates over all test sequences defined in `cfg['dataset']['test_datasets']`.
- For each sequence, runs the model frame by frame and computes relative poses.
- Evaluates the trajectory using ATE (Absolute Trajectory Error) and scale.
- Saves trajectory plots as `.png` and estimated poses as `.txt` in `Outputs/results/`.
- Reports the mean ATE per dataset and the overall average.
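The core of the evaluation steps above (chaining relative poses into a trajectory, then scoring it against ground truth) can be sketched as follows. This is a simplified illustration, not the repository's evaluator: it assumes each relative pose is a 4x4 matrix expressing frame k+1 in the coordinates of frame k, and it aligns scale only, whereas a full ATE evaluation typically also applies a rigid alignment. All function names are hypothetical.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 row-major matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def integrate(relative_poses):
    """Chain frame-to-frame poses into camera positions, starting at the origin."""
    pose = [[float(i == j) for j in range(4)] for i in range(4)]
    positions = [(0.0, 0.0, 0.0)]
    for rel in relative_poses:
        pose = matmul4(pose, rel)
        positions.append((pose[0][3], pose[1][3], pose[2][3]))
    return positions

def ate_rmse(est, gt):
    """Return (RMSE, scale) after least-squares scale alignment of est to gt."""
    dot = sum(e * g for pe, pg in zip(est, gt) for e, g in zip(pe, pg))
    sq = sum(e * e for pe in est for e in pe)
    scale = dot / sq if sq > 0 else 1.0
    err = sum((scale * e - g) ** 2 for pe, pg in zip(est, gt) for e, g in zip(pe, pg))
    return math.sqrt(err / len(est)), scale
```

The scale factor matters for monocular odometry because the estimated translation is only defined up to scale; the sketch reports it alongside the error for that reason.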
# Set tester.mode: all in config to sweep all checkpoints
For inference on new data without ground-truth poses:
# Dataset inference (requires pre-computed flow in dataset folder)
python infer.py --config Configs/MVOFormer.yaml --checkpoint ./Model/MVOFormer.pth --mode single
# Raw image folder (on-the-fly flow computation via SEA-RAFT)
python infer.py --img_folder /path/to/images --checkpoint ./Model/MVOFormer.pth \
--fx 320 --fy 320 --cx 320 --cy 240
# Video file
python infer.py --video /path/to/video.mp4 --checkpoint ./Model/MVOFormer.pth
Other options:
- `--checkpoint_epoch 50`: use `Outputs/{model_name}/checkpoint_epoch_50.pth` instead of `--checkpoint`
- `--mode all`: sweep all `checkpoint_epoch_*.pth` in `Outputs/{model_name}/`
- `--fx`, `--fy`, `--cx`, `--cy`: camera intrinsics (defaults: image center, fx=fy=320)
When using --img_folder or --video, optical flow is computed on-the-fly using SEA-RAFT. The trajectory is saved to Outputs/{model_name}_{results}/trajectory.png.
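The stated intrinsics defaults (principal point at the image center, fx=fy=320) can be written out as a small helper. This is a sketch of the described behavior, not the code in `infer.py`; `resolve_intrinsics` is a hypothetical name, and reading "image center" as width/2 and height/2 is an assumption.

```python
def resolve_intrinsics(width, height, fx=None, fy=None, cx=None, cy=None):
    """Fill in missing intrinsics: 320 px focal length, principal point at the image center."""
    return {
        "fx": 320.0 if fx is None else float(fx),
        "fy": 320.0 if fy is None else float(fy),
        "cx": width / 2.0 if cx is None else float(cx),
        "cy": height / 2.0 if cy is None else float(cy),
    }
```

For a 640x480 image this reproduces the values passed explicitly in the raw image folder example (`--fx 320 --fy 320 --cx 320 --cy 240`). For cameras with a different focal length, passing the calibrated values is likely to matter, since the defaults only match TartanAir-style imagery.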
Key parameters in Configs/MVOFormer.yaml:
| Parameter | Default | Description |
|---|---|---|
| `model.DINOv3_version` | `smallplus` | DINOv3 backbone variant |
| `model.num_queries` | `100` | Number of transformer queries |
| `model.enc_layers` | `3` | Encoder layers |
| `model.dec_layers` | `3` | Decoder layers |
| `model.is_Semantics` | `True` | Enable DINOv3 semantic features |
| `model.with_pose_refine` | `False` | Enable pose refinement branch |
| `dataset.batch_size` | `64` | Training batch size |
| `trainer.max_epoch` | `50` | Total training epochs |
| `trainer.amp_dtype` | `bf16` | Mixed precision (bf16/fp16/fp32) |
| `trainer.pretrain_model` | `./Model/stage_1_model.pth` | Flow-only pretrained weights |
| `trainer.save_frequency` | `5` | Save checkpoint every N epochs |
| `optimizer.lr` | `0.00005` | Learning rate |
| `inference.mode` | `single` | Inference mode (single/all) |
| `inference.datasets` | — | Datasets for inference (same format as `test_datasets`) |
| Type | Intrinsics (fx, fy, cx, cy) |
|---|---|
| `tartanair` | `320.0, 320.0, 320.0, 240.0` |
| `tartanair_shibuya` | `772.55, 772.55, 320.0, 180.0` |
| `kitti` | `707.09, 707.09, 601.89, 183.11` |
| `euroc` | `458.65, 457.30, 367.22, 248.38` |
| `tum` | `517.3, 516.5, 318.6, 255.3` |
| `bonn` | `517.3, 516.5, 318.6, 255.3` |
| `ETH3D` | `726.21, 726.21, 359.20, 202.47` |
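For reference, the intrinsics table above maps directly onto the standard 3x3 pinhole camera matrix. The sketch below copies the table values into a dictionary and builds the matrix for a given dataset type; `INTRINSICS` and `camera_matrix` are hypothetical names, and how the repository stores these values internally is not shown here.

```python
# (fx, fy, cx, cy) per dataset type, copied from the table above.
INTRINSICS = {
    "tartanair": (320.0, 320.0, 320.0, 240.0),
    "tartanair_shibuya": (772.55, 772.55, 320.0, 180.0),
    "kitti": (707.09, 707.09, 601.89, 183.11),
    "euroc": (458.65, 457.30, 367.22, 248.38),
    "tum": (517.3, 516.5, 318.6, 255.3),
    "bonn": (517.3, 516.5, 318.6, 255.3),
    "ETH3D": (726.21, 726.21, 359.20, 202.47),
}

def camera_matrix(dataset_type):
    """Build the 3x3 pinhole matrix K for one of the dataset types above."""
    fx, fy, cx, cy = INTRINSICS[dataset_type]
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]
```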
Quantitative comparison (ATE ↓) on four benchmarks.
| Method | KITTI | TartanAir | TUM-RGBD | ETH3D-SLAM |
|---|---|---|---|---|
| ORB-SLAM3 [1] ‡ | — | 14.38 | — | — |
| DROID-VO [7] ‡ | 54.19 | 0.58 | 0.116 | 0.238 |
| DPVO [23] ‡ | 53.61 | 0.21 | 0.107 | 0.203 |
| TartanVO [6] † | 33.94 | 3.34 | 0.320 | 0.421 |
| DytanVO [15] † | 24.96 | 3.90 | 0.259 | 0.364 |
| MVOFormer (Ours) † | 19.61 | 1.36 | 0.187 | 0.276 |
‡ Multi-Frame methods (with global optimization / loop closure)

† Frame-to-Frame learning-based methods
Qualitative results on KITTI Odometry sequences 00–10.
@article{li2026mvoformer,
author = {Jituo Li and Shunwang Sun and Jialu Zhang and Xinqi Liu and Jinyao Hu and Zhicheng Lu and Sajad Saeedi and Guodong Lu},
title = {MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry},
journal = {arXiv preprint arXiv:2606.16474},
year = {2026},
url = {https://arxiv.org/abs/2606.16474}
}
@article{wang2024sea,
title={SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow},
author={Wang, Yihan and Lipson, Lahav and Deng, Jia},
journal={arXiv preprint arXiv:2405.14793},
year={2024}
}
@inproceedings{tartanvo2020corl,
title = {TartanVO: A Generalizable Learning-based VO},
author = {Wang, Wenshan and Hu, Yaoyu and Scherer, Sebastian},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2020}
}
@misc{opentartanvo2025,
title = {OpenTartanVO: An Open-Source Reproduction and Engineering Optimization of TartanVO},
author = {Zhang, Jialu and Sun, Shunwang and Xue, Tingxi},
year = {2025},
howpublished = {\url{https://github.com/Sun-Shun/OpenTartanVO}}
}
This project is licensed under the BSD 3-Clause License; see the LICENSE file for details.
Copyright (c) 2026, Zhejiang University
