DUST: Dual-Stream Diffusion
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
ICML 2026
TL;DR
DUST augments VLAs with world modeling through a dual-stream diffusion transformer that jointly denoises actions and future visual states in separate-but-linked pathways.
Abstract
Augmenting Vision-Language-Action models (VLAs) with world models is promising for robotic policy learning, but jointly predicting states and actions is difficult because of the modality gap between them. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST uses independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that improves performance through inference-time scaling. Experimental results on simulated benchmarks such as RoboCasa and GR-1 show that DUST achieves gains of up to 6% over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2–5% improvement. In real-world tasks with the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint training with heterogeneous robot and human datasets.
Key Results
Up to 6% gain on simulation benchmarks (RoboCasa and GR-1)
10% higher success rate on real-world Franka tasks
Motivation
Existing approaches for joint world-modeling and action prediction face a fundamental trade-off. Joint diffusion models force both modalities into a single latent space, causing mismatches between low-dimensional actions and high-dimensional visual predictions. Causal designs separate modalities but limit information flow to one direction. DUST resolves this by maintaining dual streams that interact through shared attention while preserving modality-specific structure.
Architecture
DUST is built on a frozen vision-language model (VLM) backbone that provides semantic features from the current observation and task instruction. The core diffusion model is a stack of multimodal diffusion transformer (MMDiT) blocks in which action and vision tokens flow through separate pathways and are concatenated only inside the shared cross-modal attention layers. Each stream receives its own timestep embedding via adaptive layer normalization, which allows the two streams to be noised on decoupled schedules during training. After the shared MMDiT layers, modality-specific DiT blocks perform the remaining denoising for each stream.
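To make the "separate pathways, shared attention" idea concrete, here is a minimal sketch in plain Python. It is not the DUST implementation: the function names, the parameter layout, the single attention head, and the equal hidden width for both streams are all assumptions made for illustration, and the timestep-conditioned adaptive layer normalization, residual connections, and VLM conditioning are left out. Each stream has its own query, key, value, and output projections. The tokens meet only inside one attention operation over the concatenated sequence and are then split back into their streams.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random matrix, used here for toy weights and toy tokens."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_stream(dim, rng):
    """Hypothetical per-stream weights: each modality owns its projections."""
    return {name: rand_matrix(dim, dim, rng) for name in ("wq", "wk", "wv", "wo")}


def dual_stream_attention(action_tokens, vision_tokens, params):
    """Project each stream separately, attend jointly, then split again."""
    queries, keys, values = [], [], []
    streams = ((action_tokens, params["action"]), (vision_tokens, params["vision"]))
    for tokens, p in streams:
        queries += matmul(tokens, p["wq"])  # stream-specific projections
        keys += matmul(tokens, p["wk"])
        values += matmul(tokens, p["wv"])

    # Shared attention over the concatenated [action; vision] sequence.
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = [[scale * sum(q * k for q, k in zip(qi, kj)) for kj in keys]
              for qi in queries]
    mixed = matmul([softmax(row) for row in scores], values)

    # Split back so each stream continues through its own output projection.
    n_action = len(action_tokens)
    action_out = matmul(mixed[:n_action], params["action"]["wo"])
    vision_out = matmul(mixed[n_action:], params["vision"]["wo"])
    return action_out, vision_out


rng = random.Random(0)
dim = 4
params = {"action": init_stream(dim, rng), "vision": init_stream(dim, rng)}
action_tokens = rand_matrix(2, dim, rng)   # e.g. a short action chunk
vision_tokens = rand_matrix(6, dim, rng)   # e.g. future image embeddings
action_out, vision_out = dual_stream_attention(action_tokens, vision_tokens, params)
```

Because the weights are never shared, each modality keeps its own representation, while the joint attention lets information flow in both directions between actions and predicted visual states.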
Asynchronous Joint Sampling
During inference, DUST jointly samples actions and future visual observations. Since image embeddings operate in a higher-dimensional space and benefit from more denoising steps, we introduce asynchronous denoising: vision tokens are updated at every fine-grained step while action tokens are updated less frequently. This test-time scaling strategy provides a tunable trade-off between inference speed and predictive accuracy, yielding an additional 2–5% boost in success rate.
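The asynchronous schedule described above can be sketched as a simple Euler integration loop with one clock per stream. This is an illustrative sketch under stated assumptions, not the authors' sampler: `async_sample`, `velocity_fn`, `n_action_steps`, and `vision_ratio` are hypothetical names, the flow is assumed to run from noise at time 0 to data at time 1, and the choice to update actions on the first fine-grained step of each group is a guess. The toy velocity field stands in for the trained model so that the example runs on its own.

```python
def async_sample(velocity_fn, action, vision, n_action_steps, vision_ratio):
    """Euler sampling where vision takes `vision_ratio` steps per action step."""
    n_vision_steps = n_action_steps * vision_ratio
    dt_vision = 1.0 / n_vision_steps
    dt_action = 1.0 / n_action_steps
    for step in range(n_vision_steps):
        # Each stream has its own timestep, matching the per-stream embeddings.
        t_vision = step / n_vision_steps
        t_action = (step // vision_ratio) / n_action_steps
        v_action, v_vision = velocity_fn(action, vision, t_action, t_vision)

        # Vision tokens are refined at every fine-grained step.
        vision = [x + dt_vision * v for x, v in zip(vision, v_vision)]

        # Action tokens are updated only once per group of vision steps.
        if step % vision_ratio == 0:
            action = [x + dt_action * v for x, v in zip(action, v_action)]
    return action, vision


# Toy stand-in for the trained model: velocities that point at fixed targets.
ACTION_TARGET = [0.5, -1.0]
VISION_TARGET = [1.0, 2.0, 3.0]


def toy_velocity(action, vision, t_action, t_vision):
    v_action = [(g - a) / (1.0 - t_action) for g, a in zip(ACTION_TARGET, action)]
    v_vision = [(g - x) / (1.0 - t_vision) for g, x in zip(VISION_TARGET, vision)]
    return v_action, v_vision


final_action, final_vision = async_sample(
    toy_velocity,
    action=[0.0, 0.0],
    vision=[0.0, 0.0, 0.0],
    n_action_steps=4,
    vision_ratio=4,
)
```

In this sketch, raising `vision_ratio` spends more compute on the vision stream while keeping the number of action updates fixed, which is the speed versus accuracy knob the text describes.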
Qualitative Results
DUST produces physically consistent future predictions that guide accurate action generation across diverse manipulation tasks, including pick-and-place, insertion, and tool use.
BibTeX
@inproceedings{won2026dust,
title={Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model},
author={Won, John and Lee, Kyungmin and Jang, Huiwon and Kim, Dongyoung and Shin, Jinwoo},
booktitle={International Conference on Machine Learning (ICML)},
year={2026}
}