Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

ICML 2026

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

Kim Jaechul Graduate School of AI, KAIST • RLWRLD


TL;DR

DUST augments VLAs with world modeling through a dual-stream diffusion transformer that jointly denoises actions and future visual states in separate-but-linked pathways.

Abstract

Augmenting Vision-Language-Action models (VLAs) with world models is promising for robotic policy learning, but jointly predicting states and actions is difficult because of the modality gap between them. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST uses independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that improves performance through inference-time scaling. Experimental results on simulated benchmarks such as RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2–5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 13% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint training with heterogeneous robot and human datasets.

Key Results

+6%: over baseline VLAs on simulation benchmarks

+13%: over baselines on real-world Franka tasks

~40 Hz: inference speed

Motivation

Existing approaches for joint world-modeling and action prediction face a fundamental trade-off. Joint diffusion models force both modalities into a single latent space, causing mismatches between low-dimensional actions and high-dimensional visual predictions. Causal designs separate modalities but limit information flow to one direction. DUST resolves this by maintaining dual streams that interact through shared attention while preserving modality-specific structure.

Figure panels: (a) Joint Diffusion, (b) Causal.

Architecture

DUST is built upon a frozen vision-language model (VLM) backbone that provides semantic features from the current observation and task instruction. The core diffusion model is a stack of multimodal diffusion transformer (MMDiT) blocks in which action and vision tokens travel through separate pathways and are concatenated only inside the shared cross-modal attention layers. Each stream receives its own timestep embedding via adaptive layer normalization, enabling decoupled noise scheduling during training. After the shared MMDiT layers, modality-specific DiT blocks handle specialized denoising for each stream.
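The "separate pathways, shared attention" idea can be illustrated with a minimal sketch in plain Python. It is not the DUST implementation: the helper names (`make_stream_params`, `dual_stream_attention`), the single attention head, and the random weights are all assumptions for illustration, and the per-stream timestep conditioning via adaptive layer normalization is omitted. Each stream has its own projection weights, and the tokens are joined only for the attention call.

```python
import math
import random


def linear(tokens, weight):
    """Apply a (d_in x d_out) weight matrix to every token vector."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.2) for _ in range(cols)] for _ in range(rows)]


def make_stream_params(rng, dim):
    """Hypothetical per-stream weights: separate Q, K, V, and output projections."""
    return {name: random_matrix(rng, dim, dim) for name in ("q", "k", "v", "o")}


def add(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def dual_stream_attention(action_tokens, vision_tokens, action_params, vision_params):
    """Project each stream with its own weights, attend jointly, then split again."""
    n_action = len(action_tokens)
    # Modality-specific projections: the two streams never share these weights.
    q = linear(action_tokens, action_params["q"]) + linear(vision_tokens, vision_params["q"])
    k = linear(action_tokens, action_params["k"]) + linear(vision_tokens, vision_params["k"])
    v = linear(action_tokens, action_params["v"]) + linear(vision_tokens, vision_params["v"])
    # The only point where the streams are concatenated: joint attention.
    mixed = attention(q, k, v)
    # Split back into streams; each gets its own output projection and a residual.
    action_out = add(action_tokens, linear(mixed[:n_action], action_params["o"]))
    vision_out = add(vision_tokens, linear(mixed[n_action:], vision_params["o"]))
    return action_out, vision_out


rng = random.Random(0)
dim = 4
action_params = make_stream_params(rng, dim)
vision_params = make_stream_params(rng, dim)
actions = random_matrix(rng, 2, dim)   # 2 action tokens
vision = random_matrix(rng, 5, dim)    # 5 future-observation tokens
a_out, v_out = dual_stream_attention(actions, vision, action_params, vision_params)
```

Because the streams are split again after attention, each keeps its own token count and parameters, while the joint attention step lets action tokens read from vision tokens and vice versa.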

DUST architecture. A frozen VLM provides conditioning features to the dual-stream diffusion model, which jointly denoises action and future observation tokens through shared MMDiT blocks followed by modality-specific DiT blocks.
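The independent noise perturbations and decoupled flow matching loss mentioned in the abstract can be sketched as follows. This is a toy version under stated assumptions, not the paper's training code: it assumes a linear interpolation path with pure noise at t=0 and clean data at t=1, uniform timestep sampling, flat lists in place of token tensors, and a hypothetical `vision_weight` coefficient for balancing the two terms. The point it shows is that each modality draws its own timestep and contributes its own velocity-matching term.

```python
import random


def interpolate(data, noise, t):
    """Linear path (assumed convention): pure noise at t=0, clean data at t=1."""
    return [(1.0 - t) * n + t * x for x, n in zip(data, noise)]


def velocity_target(data, noise):
    """Velocity of the linear path, constant in t."""
    return [x - n for x, n in zip(data, noise)]


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def decoupled_flow_matching_loss(model, actions, vision, rng, vision_weight=1.0):
    """One training-loss evaluation with independent timesteps per modality."""
    # Independent noise levels: each modality draws its own timestep.
    t_action = rng.random()
    t_vision = rng.random()
    noise_a = [rng.gauss(0.0, 1.0) for _ in actions]
    noise_v = [rng.gauss(0.0, 1.0) for _ in vision]
    noisy_a = interpolate(actions, noise_a, t_action)
    noisy_v = interpolate(vision, noise_v, t_vision)
    # The model sees both noisy streams and both timesteps,
    # and returns one velocity prediction per stream.
    pred_a, pred_v = model(noisy_a, noisy_v, t_action, t_vision)
    # Decoupled loss: a separate velocity-matching term for each stream.
    loss_a = mse(pred_a, velocity_target(actions, noise_a))
    loss_v = mse(pred_v, velocity_target(vision, noise_v))
    return loss_a + vision_weight * loss_v, loss_a, loss_v


def zero_model(noisy_a, noisy_v, t_action, t_vision):
    """Placeholder network that predicts zero velocity for both streams."""
    return [0.0] * len(noisy_a), [0.0] * len(noisy_v)


rng = random.Random(0)
total, loss_a, loss_v = decoupled_flow_matching_loss(
    zero_model, [0.1, -0.2, 0.3], [0.5] * 8, rng, vision_weight=0.5
)
```

Sampling the two timesteps independently means the model is trained on combinations such as nearly clean vision with very noisy actions, and the reverse, which is one way to read the abstract's claim about learning cross-modal causal relationships.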

Asynchronous Joint Sampling

During inference, DUST jointly samples actions and future visual observations. Since image embeddings operate in a higher-dimensional space and benefit from more denoising steps, we introduce asynchronous denoising: vision tokens are updated at every fine-grained step while action tokens are updated less frequently. This test-time scaling strategy provides a tunable trade-off between inference speed and predictive accuracy, yielding an additional 2–5% boost in success rate.

Asynchronous joint sampling. Vision tokens receive more denoising steps than action tokens, enabling test-time scaling of visual prediction quality.

Qualitative Results

DUST produces physically consistent future predictions that guide accurate action generation across diverse manipulation tasks, including pick-and-place, insertion, and tool use.

Example rollouts showing DUST's predicted future observations alongside actual task execution in real-world and simulated environments.

BibTeX

@inproceedings{won2026dust,
  title={Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model},
  author={Won, John and Lee, Kyungmin and Jang, Huiwon and Kim, Dongyoung and Shin, Jinwoo},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
