MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

Accepted by IEEE Robotics and Automation Letters (RA-L)

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

Abstract

In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions.

We also reproduce TartanVO as a strong baseline and release the open-source implementation.

Demo video: assets/MVOFormer.mp4 (demo data from DROID-W)

Installation

# Create conda environment
conda create -n mvoformer python=3.11 -y
conda activate mvoformer

# Install PyTorch (CUDA 12.1 wheels)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

The Deformable Attention CUDA ops are pre-compiled for Python 3.11. If you encounter import errors, recompile:

cd Network/Deformable_ops
bash make.sh
cd ../..

Repository Structure

MVOFormer/
├── assets/                    # Demo video and paper PDF
│   ├── MVOFormer.mp4
│   └── MVOFormer.pdf
├── Configs/
│   └── MVOFormer.yaml         # Main configuration file
├── Model/                     # Pretrained model checkpoints
│   ├── stage_1_model.pth      # Flow-only pretrained (200 epochs)
│   └── MVOFormer.pth          # Final model (50 epochs)
├── Network/
│   ├── Deformable_ops/        # Deformable attention CUDA ops
│   ├── Model/                 # MVOFormer model (transformer, backbone, etc.)
│   ├── SeaRAFT/               # SEA-RAFT optical flow model
│   └── dinov3/                # DINOv3 visual-semantic backbone
├── Tool/
│   ├── Datasets/              # Dataset loading & augmentation
│   ├── Evaluator/             # Trajectory evaluation (ATE, RPE, KITTI)
│   ├── Train_Test/            # Trainer, Tester, and Inference modules
│   └── Utils/                 # Utilities (logging, seeding, transforms)
├── Outputs/                   # Checkpoints and logs (gitignored)
├── train.py                   # Training & evaluation script
└── infer.py                   # Inference script (no GT poses needed)

Dataset Preparation

The code supports TartanAir, TartanAir-Shibuya, KITTI, TUM-RGBD, Bonn, EuRoC, and ETH3D-SLAM datasets.

Each dataset should be organized as:

dataset/
  {split}_img/        # RGB images
  {split}_flow_sea/   # Pre-computed optical flow (.npy)
  {split}_pose/       # Ground-truth poses (.txt, 7 values per line: xyz translation + quaternion)

Optical flow can be pre-computed using SEA-RAFT. Update Configs/MVOFormer.yaml with your dataset paths.
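As a rough illustration of the pose format above, the following sketch converts one pose line into a 4x4 transformation matrix. The function name `parse_pose_line` is hypothetical, and the value order `tx ty tz qx qy qz qw` is an assumption that the README does not confirm; check it against your dataset before relying on it.

```python
import math


def parse_pose_line(line):
    """Parse one pose line of 7 floats into a 4x4 pose matrix (nested lists).

    Assumed order (not confirmed by the README): tx ty tz qx qy qz qw.
    """
    values = [float(v) for v in line.split()]
    if len(values) != 7:
        raise ValueError(f"expected 7 values, got {len(values)}")
    tx, ty, tz, qx, qy, qz, qw = values

    # Normalise the quaternion so rounding errors do not skew the rotation.
    norm = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    if norm == 0.0:
        raise ValueError("zero-norm quaternion")
    qx, qy, qz, qw = qx / norm, qy / norm, qz / norm, qw / norm

    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw), tx],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw), ty],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy), tz],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Calling `parse_pose_line` on each line of a `{split}_pose/` file would give one matrix per frame, which is a convenient form for sanity-checking a dataset before training.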

Pretrained Model Weights

Model	Source	Placement
DINOv3 backbone	facebookresearch/dinov3	`Network/dinov3/weights/`
SEA-RAFT optical flow	princeton-vl/SEA-RAFT	`Network/SeaRAFT/models/`
Stage 1 (flow-only)	Google Drive	`Model/stage_1_model.pth`
MVOFormer (full)	Google Drive	`Model/MVOFormer.pth`

Training Pipeline

Stage 1: Flow-Only Pretraining

The first stage trains MVOFormer using ground-truth optical flow only (without semantic features) for 200 epochs. In this stage the model learns to estimate pose from motion cues alone.

CUDA_VISIBLE_DEVICES=0 python train.py --mode train \
  --set model.is_Semantics=False \
  --set trainer.max_epoch=200 \
  --set trainer.pretrain_model=None

After training, rename the output checkpoint to Model/stage_1_model.pth.
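The `--set key=value` flags override entries in the YAML configuration from the command line. The sketch below shows one way such dotted overrides could be applied to a nested dictionary. It is an illustration only: `apply_override` is a hypothetical helper, and the actual parsing in `train.py` may differ.

```python
import ast


def apply_override(cfg, assignment):
    """Apply one 'dotted.key=value' override to a nested dict in place."""
    dotted_key, sep, raw_value = assignment.partition("=")
    if not sep:
        raise ValueError(f"override must look like key=value, got {assignment!r}")
    try:
        # Turns 'False', '200' or 'None' into the matching Python value.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything else (for example a file path) is kept as a plain string.
        value = raw_value
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return cfg


# Example mirroring the Stage 1 command above.
cfg = {"model": {"is_Semantics": True}, "trainer": {"max_epoch": 50}}
for item in ["model.is_Semantics=False", "trainer.max_epoch=200", "trainer.pretrain_model=None"]:
    apply_override(cfg, item)
```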

Stage 2: Full Training with Semantics

The second stage loads the flow-only checkpoint and adds DINOv3 semantic features, training for 50 epochs. The optical flow used in this stage is pre-computed by SEA-RAFT and saved locally (under {split}_flow_sea/), rather than ground-truth flow.

CUDA_VISIBLE_DEVICES=0 python train.py --mode train \
  --set model.is_Semantics=True \
  --set trainer.pretrain_model=./Model/stage_1_model.pth \
  --set trainer.max_epoch=50

Training Details

Component	Description
Model	MVOFormer with DINOv3 backbone (81.98M params, 52.53M trainable)
Optimizer	AdamW (lr=5e-5, weight_decay=1e-4)
LR Schedule	Cosine decay with 3-epoch linear warmup (init_lr=1e-5, min_lr=1e-7)
Batch Size	64
Mixed Precision	BF16 (automatic if GPU supports it)
Gradient Clipping	max_norm=1.0
Loss	Weighted translation + rotation regression with uncertainty learning
Augmentation	Spatial random crop (scale up to 2.5×), color jitter (brightness/contrast/saturation)
Datasets	TartanAir (305K samples) + TartanAir-Shibuya (×10 repeat)
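To make the learning-rate schedule in the table concrete, here is a minimal sketch of a cosine decay with a 3-epoch linear warmup, using the listed values (init_lr=1e-5, lr=5e-5, min_lr=1e-7). The function `lr_at_epoch` is hypothetical, and the per-epoch granularity is an assumption; the trainer may update the rate per iteration instead.

```python
import math


def lr_at_epoch(epoch, max_epoch=50, warmup_epochs=3,
                init_lr=1e-5, peak_lr=5e-5, min_lr=1e-7):
    """Learning rate at a (possibly fractional) epoch index, starting from 0."""
    if epoch < warmup_epochs:
        # Linear ramp from init_lr up to peak_lr during warmup.
        return init_lr + (peak_lr - init_lr) * epoch / warmup_epochs
    # Cosine decay from peak_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, max_epoch - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 1e-5, reaches 5e-5 at the end of warmup, and approaches 1e-7 at the final epoch.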

Checkpoints

During training, the model saves:

checkpoint_epoch_{N}.pth — every save_frequency epochs (default: 5)
checkpoint_best.pth — epoch with lowest validation loss
checkpoint_final.pth — latest epoch

Evaluation Pipeline

Evaluation with Ground-Truth Poses

Evaluate a specific checkpoint on test datasets:

CUDA_VISIBLE_DEVICES=0 python train.py --mode eval --config Configs/MVOFormer.yaml --checkpoint 50

Here, --checkpoint N selects Outputs/{model_name}/checkpoint_epoch_{N}.pth.

The evaluation:

Loads the specified checkpoint (checkpoint_epoch_50.pth in Outputs/{model_name}/, e.g. Outputs/MVOFormer/checkpoint_epoch_50.pth).
Iterates over all test sequences defined in cfg['dataset']['test_datasets'].
For each sequence, runs the model frame by frame and computes relative poses.
Evaluates each trajectory using ATE (Absolute Trajectory Error) and scale.
Saves trajectory plots as .png and estimated poses as .txt in Outputs/results/.
Reports the mean ATE per dataset and the overall average.
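The steps above chain frame-to-frame motions into a trajectory and then score it. The following sketch illustrates that idea in plain Python. It is a simplification and not the repository's evaluator: the function names are hypothetical, the composition convention (right-multiplying each relative pose) is an assumption, and the error uses a least-squares scale fit only, whereas a full ATE evaluation usually also aligns rotation and translation.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def accumulate_trajectory(relative_poses):
    """Chain frame-to-frame 4x4 motions into absolute poses, starting at identity."""
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    trajectory = [pose]
    for rel in relative_poses:
        pose = matmul4(pose, rel)  # assumed convention: T_0k+1 = T_0k * T_kk+1
        trajectory.append(pose)
    return trajectory


def positions(trajectory):
    """Extract the translation column of each pose."""
    return [(T[0][3], T[1][3], T[2][3]) for T in trajectory]


def scale_aligned_ate(est_xyz, gt_xyz):
    """Return (scale, RMSE) after fitting one scale factor to the estimate."""
    num = sum(e[i] * g[i] for e, g in zip(est_xyz, gt_xyz) for i in range(3))
    den = sum(e[i] * e[i] for e in est_xyz for i in range(3))
    scale = num / den if den > 0 else 1.0
    squared = [sum((scale * e[i] - g[i]) ** 2 for i in range(3))
               for e, g in zip(est_xyz, gt_xyz)]
    return scale, math.sqrt(sum(squared) / len(squared))
```

The scale fit matters for monocular odometry because a single camera cannot observe metric scale, so an estimated trajectory is typically compared with ground truth only after scaling.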

Multi-Checkpoint Sweep

To evaluate every saved checkpoint in one run, set tester.mode: all in the config file.
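A sweep needs the checkpoints in epoch order, and plain string sorting would place epoch 10 before epoch 5. The sketch below shows one way to pick out and order the `checkpoint_epoch_{N}.pth` files; `sorted_epoch_checkpoints` is a hypothetical helper, not a function from this repository.

```python
import re

# Matches checkpoint_epoch_{N}.pth and captures the epoch number N.
_EPOCH_RE = re.compile(r"^checkpoint_epoch_(\d+)\.pth$")


def sorted_epoch_checkpoints(filenames):
    """Return (epoch, filename) pairs for per-epoch checkpoints, sorted by epoch."""
    found = []
    for name in filenames:
        match = _EPOCH_RE.match(name)
        if match:  # skips checkpoint_best.pth, checkpoint_final.pth, logs, etc.
            found.append((int(match.group(1)), name))
    return sorted(found)
```

In practice the input could come from `os.listdir("Outputs/MVOFormer")`, with each returned filename joined back onto that directory.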

Inference Pipeline (No Ground-Truth Poses)

For inference on new data without ground-truth poses:

# Dataset inference (requires pre-computed flow in dataset folder)
python infer.py --config Configs/MVOFormer.yaml --checkpoint ./Model/MVOFormer.pth --mode single

# Raw image folder (on-the-fly flow computation via SEA-RAFT)
python infer.py --img_folder /path/to/images --checkpoint ./Model/MVOFormer.pth \
    --fx 320 --fy 320 --cx 320 --cy 240

# Video file
python infer.py --video /path/to/video.mp4 --checkpoint ./Model/MVOFormer.pth

Other options:

--checkpoint_epoch 50 — use Outputs/{model_name}/checkpoint_epoch_50.pth instead of --checkpoint
--mode all — sweep all checkpoint_epoch_*.pth in Outputs/{model_name}/
--fx, --fy, --cx, --cy — camera intrinsics (defaults: image center, fx=fy=320)

When using --img_folder or --video, optical flow is computed on-the-fly using SEA-RAFT. The trajectory is saved to Outputs/{model_name}_{results}/trajectory.png.
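The intrinsics defaults described above (principal point at the image center, fx=fy=320) can be sketched as follows. `resolve_intrinsics` is a hypothetical helper that only illustrates the fallback behaviour; the actual argument handling in `infer.py` may differ.

```python
def resolve_intrinsics(width, height, fx=None, fy=None, cx=None, cy=None):
    """Fill in any missing intrinsics with the documented defaults."""
    return {
        "fx": 320.0 if fx is None else float(fx),
        "fy": 320.0 if fy is None else float(fy),
        # Default principal point: the image center.
        "cx": width / 2.0 if cx is None else float(cx),
        "cy": height / 2.0 if cy is None else float(cy),
    }
```

For a 640x480 image these defaults give the same values as the explicit `--fx 320 --fy 320 --cx 320 --cy 240` command above. For real cameras, passing calibrated intrinsics is likely to matter, since the focal length affects how image motion maps to camera motion.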

Configuration Reference

Key parameters in Configs/MVOFormer.yaml:

Parameter	Default	Description
`model.DINOv3_version`	`smallplus`	DINOv3 backbone variant
`model.num_queries`	`100`	Number of transformer queries
`model.enc_layers`	`3`	Encoder layers
`model.dec_layers`	`3`	Decoder layers
`model.is_Semantics`	`True`	Enable DINOv3 semantic features
`model.with_pose_refine`	`False`	Enable pose refinement branch
`dataset.batch_size`	`64`	Training batch size
`trainer.max_epoch`	`50`	Total training epochs
`trainer.amp_dtype`	`bf16`	Mixed precision (bf16/fp16/fp32)
`trainer.pretrain_model`	`./Model/stage_1_model.pth`	Flow-only pretrained weights
`trainer.save_frequency`	`5`	Save checkpoint every N epochs
`optimizer.lr`	`0.00005`	Learning rate
`inference.mode`	`single`	Inference mode (single/all)
`inference.datasets`	—	Datasets for inference (same format as test_datasets)

Supported Dataset Types

Type	Intrinsics (fx, fy, cx, cy)
`tartanair`	320.0, 320.0, 320.0, 240.0
`tartanair_shibuya`	772.55, 772.55, 320.0, 180.0
`kitti`	707.09, 707.09, 601.89, 183.11
`euroc`	458.65, 457.30, 367.22, 248.38
`tum`	517.3, 516.5, 318.6, 255.3
`bonn`	517.3, 516.5, 318.6, 255.3
`ETH3D`	726.21, 726.21, 359.20, 202.47
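The table can be read as a lookup from dataset type to pinhole intrinsics. The sketch below copies those values into a dictionary and shows how they map a pixel to normalized image coordinates. The names `DATASET_INTRINSICS`, `intrinsics_matrix`, and `normalize_pixel` are hypothetical and are not taken from the repository code.

```python
# (fx, fy, cx, cy) per dataset type, copied from the table above.
DATASET_INTRINSICS = {
    "tartanair": (320.0, 320.0, 320.0, 240.0),
    "tartanair_shibuya": (772.55, 772.55, 320.0, 180.0),
    "kitti": (707.09, 707.09, 601.89, 183.11),
    "euroc": (458.65, 457.30, 367.22, 248.38),
    "tum": (517.3, 516.5, 318.6, 255.3),
    "bonn": (517.3, 516.5, 318.6, 255.3),
    "ETH3D": (726.21, 726.21, 359.20, 202.47),
}


def intrinsics_matrix(dataset_type):
    """Build the 3x3 pinhole camera matrix K for a dataset type."""
    fx, fy, cx, cy = DATASET_INTRINSICS[dataset_type]
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def normalize_pixel(dataset_type, u, v):
    """Map pixel (u, v) to normalized image coordinates ((u - cx) / fx, (v - cy) / fy)."""
    fx, fy, cx, cy = DATASET_INTRINSICS[dataset_type]
    return ((u - cx) / fx, (v - cy) / fy)
```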

Results

Quantitative comparison (ATE ↓) on four benchmarks.

Method	KITTI	TartanAir	TUM-RGBD	ETH3D-SLAM
ORB-SLAM3 [1] ‡	—	14.38	—	—
DROID-VO [7] ‡	54.19	0.58	0.116	0.238
DPVO [23] ‡	53.61	0.21	0.107	0.203
TartanVO [6] †	33.94	3.34	0.320	0.421
DytanVO [15] †	24.96	3.90	0.259	0.364
MVOFormer (Ours) †	19.61	1.36	0.187	0.276

‡ Multi-Frame methods (with global optimization / loop closure)
† Frame-to-Frame learning-based methods
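To compare the frame-to-frame methods more directly, the relative ATE reduction can be computed from the table values. The snippet below is plain arithmetic on the numbers listed above; the dictionary layout and the helper `relative_reduction` are illustrative only.

```python
# ATE values copied from the table above (frame-to-frame methods only).
ATE = {
    "TartanVO": {"KITTI": 33.94, "TartanAir": 3.34, "TUM-RGBD": 0.320, "ETH3D-SLAM": 0.421},
    "DytanVO": {"KITTI": 24.96, "TartanAir": 3.90, "TUM-RGBD": 0.259, "ETH3D-SLAM": 0.364},
    "MVOFormer": {"KITTI": 19.61, "TartanAir": 1.36, "TUM-RGBD": 0.187, "ETH3D-SLAM": 0.276},
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


for benchmark, ours in ATE["MVOFormer"].items():
    for baseline in ("TartanVO", "DytanVO"):
        gain = relative_reduction(ATE[baseline][benchmark], ours)
        print(f"{benchmark}: {gain:.1f}% lower ATE than {baseline}")
```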

Qualitative results on KITTI Odometry sequences 00–10.

Citation

@article{li2026mvoformer,
  author    = {Jituo Li and Shunwang Sun and Jialu Zhang and Xinqi Liu and Jinyao Hu and Zhicheng Lu and Sajad Saeedi and Guodong Lu},
  title     = {MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry},
  journal   = {arXiv preprint arXiv:2606.16474},
  year      = {2026},
  url       = {https://arxiv.org/abs/2606.16474}
}

References

@article{wang2024sea,
  title={SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow},
  author={Wang, Yihan and Lipson, Lahav and Deng, Jia},
  journal={arXiv preprint arXiv:2405.14793},
  year={2024}
}

@inproceedings{tartanvo2020corl,
  title     = {TartanVO: A Generalizable Learning-based VO},
  author    = {Wang, Wenshan and Hu, Yaoyu and Scherer, Sebastian},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2020}
}

@misc{opentartanvo2025,
  title     = {OpenTartanVO: An Open-Source Reproduction and Engineering Optimization of TartanVO},
  author    = {Zhang, Jialu and Sun, Shunwang and Xue, Tingxi},
  year      = {2025},
  howpublished = {\url{https://github.com/Sun-Shun/OpenTartanVO}}
}

License

This project is licensed under the BSD 3-Clause License — see the LICENSE file for details.
