NVIDIA Unveils SANA-WM: An Open-Source World Model Capable of Generating Minute-Scale 720p Video on a Single GPU

The development of world models—systems designed to synthesize realistic video sequences from initial images and action sets—is rapidly becoming critical for advancements in robotics, simulation, and embodied AI. A primary challenge facing this field is the scaling required to generate high-resolution, minute-long video without needing prohibitively large computational clusters for both training and inference. Many existing open-source baseline models either mandate multi-GPU setups or must reduce resolution to remain computationally feasible.

NVIDIA’s new model, SANA-WM (SANA-Video World Model), directly addresses these limitations. This open-source system is built on the SANA-Video codebase and is available through the NVlabs/Sana GitHub repository. It is a 2.6B-parameter Diffusion Transformer (DiT) that generates one-minute 720p video with metric-scale 6-DoF camera control. Notably, it supports three single-GPU inference modes: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a distilled autoregressive variant optimized for fast deployment.

The distilled variant illustrates the model's efficiency: it can denoise an entire 60-second 720p clip in just 34 seconds on a single RTX 5090 with NVFP4 quantization.

Architectural Innovations for Scale and Stability

SANA-WM integrates four major architectural designs to achieve its performance goals, overcoming common bottlenecks found in video generation models:

Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention has memory and compute costs that grow quadratically with sequence length, a critical issue when modeling the 961 frames of a full-minute clip. SANA-Video, its predecessor, used cumulative ReLU-based linear attention, which keeps the recurrent state at a constant size but lacks a decay mechanism, so drift accumulates over long sequences.

SANA-WM replaces most standard attention blocks with frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDNs used in language models, the frame-wise variant processes an entire latent frame in each recurrent step. It relies on two mechanisms: a decay gate $\gamma$, which diminishes the influence of older frames, and a delta-rule correction $\beta$, which updates only the difference between the current prediction and the target value. These changes keep the recurrent state at a constant $D \times D$, independent of video duration. Gradient stability is maintained through an algebraic key-scaling approach, in which keys are scaled by $1/\sqrt{D \cdot S}$ ($D$ being the head dimension and $S$ the number of spatial tokens per frame).
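The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the NVlabs implementation: the function name `gdn_frame_step`, the scalar gates, and the exact ordering of decay and correction are assumptions made for the example. Only the decay gate, the delta-rule correction, the $D \times D$ state, and the $1/\sqrt{D \cdot S}$ key scaling come from the description.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def gdn_frame_step(state, keys, values, gamma, beta):
    """One recurrent step over a whole latent frame (hypothetical sketch).

    state:  D x D matrix mapping keys to values
    keys:   S x D matrix, one row per spatial token of the frame
    values: S x D matrix
    gamma:  decay gate in [0, 1]; beta: delta-rule step size
    """
    s_tokens, d = len(keys), len(keys[0])
    scale = 1.0 / math.sqrt(d * s_tokens)              # key scaling 1/sqrt(D*S)
    k = [[x * scale for x in row] for row in keys]
    decayed = [[gamma * x for x in row] for row in state]   # forget older frames
    pred = matmul(k, decayed)                          # what the state predicts
    err = [[v - p for v, p in zip(vr, pr)] for vr, pr in zip(values, pred)]
    update = matmul(transpose(k), err)                 # D x D correction term
    return [[m + beta * u for m, u in zip(mr, ur)] for mr, ur in zip(decayed, update)]

# Toy rollout: the state never grows, however many frames are processed.
D, S = 4, 6
state = [[0.0] * D for _ in range(D)]
for t in range(10):
    keys = [[math.sin(t + i + j) for j in range(D)] for i in range(S)]
    values = [[math.cos(t + i * j) for j in range(D)] for i in range(S)]
    state = gdn_frame_step(state, keys, values, gamma=0.95, beta=0.5)
print(len(state), len(state[0]))  # prints: 4 4
```

Because the update writes only the prediction error into the state, a frame the state already predicts well changes it very little, which is the property the delta rule is meant to provide.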

Dual-Branch Camera Control

To ensure the model accurately follows a continuous 6-DoF camera trajectory—rather than merely matching motion descriptions—SANA-WM uses two complementary branches operating at different temporal resolutions:

Coarse Branch (UCPE attention): Operates at the latent-frame rate. It calculates a ray-local camera basis from the camera-to-world pose and intrinsics, applying Unified Camera Positional Encoding (UCPE) to capture global trajectory structure across the entire sequence.
Fine Branch (Plücker mixing): Addresses temporal compression mismatch. Since each latent token summarizes eight raw frames, the fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction $d$ and moment $o \times d$) from all eight raw frames within one VAE temporal stride. These are packed into a 48-channel tensor and injected after each self-attention output via a zero-initialized projection, restoring fine-grained camera motion details invisible to the coarse branch.
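The 48-channel figure in the fine branch follows from 6 Plücker coordinates times 8 raw frames. The sketch below shows that packing for a single pixel. The helper names `plucker_ray` and `pack_stride`, the channel ordering, and the toy camera motion are assumptions for illustration; a real raymap would repeat this for every pixel using the camera intrinsics and pose.

```python
import math

def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """6D Plücker coordinates (d, o x d) of a ray, with d normalized."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    return d + cross(origin, d)

def pack_stride(rays_per_frame):
    """Concatenate the 6D rays of 8 raw frames into 48 channels for one pixel."""
    assert len(rays_per_frame) == 8, "one VAE temporal stride covers 8 raw frames"
    return [c for ray in rays_per_frame for c in ray]

# Toy example: the camera slides along x while looking down the z axis.
rays = [plucker_ray((0.1 * i, 0.0, 0.0), (0.0, 0.0, 1.0)) for i in range(8)]
channels = pack_stride(rays)
print(len(channels))  # prints: 48
```

The direction part stays fixed across the eight frames in this toy case while the moment part changes with the camera position, which is exactly the sub-latent-frame motion that a single per-latent pose cannot express.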

Two-Stage Generation Pipeline

While SANA-WM produces spatiotemporally consistent outputs in Stage-1, structural artifacts can still appear over extended durations. A secondary refiner corrects these issues. This refiner is initialized using the 17B LTX-2 model and fine-tuned with rank-384 LoRA adapters on paired synthetic and real video data. It employs truncated-$\sigma$ flow matching: Stage-1 latents are perturbed by a large starting noise ($\sigma_{\text{start}} = 0.9$), allowing the refiner to map this noisy input toward a high-fidelity target. Crucially, inference requires only three Euler denoising steps. This refinement process significantly reduces long-horizon visual drift ($\Delta\text{IQ}$) from $3.79$ to $1.17$ on the Simple-Trajectory split and from $3.09$ to $0.31$ on the Hard-Trajectory split.
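A rough sketch of the refiner's sampling loop, under stated assumptions: the interpolation $x_\sigma = (1 - \sigma)\,x + \sigma\,\epsilon$, the uniformly spaced noise levels, and the names `perturb`, `euler_refine`, and `oracle_velocity` are illustrative choices, not details confirmed by the source. The toy velocity function stands in for the trained refiner network and simply points along the straight line toward a known target.

```python
import random

def perturb(latent, noise, sigma):
    """Mix a Stage-1 latent with noise at level sigma (assumed linear path)."""
    return [(1.0 - sigma) * x + sigma * n for x, n in zip(latent, noise)]

def euler_refine(x, velocity_fn, sigma_start=0.9, steps=3):
    """Integrate from sigma_start down to 0 with a few Euler steps."""
    sigmas = [sigma_start * (1.0 - i / steps) for i in range(steps + 1)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for the refiner: an oracle that knows the clean target.
target = [1.0, -2.0, 0.5]

def oracle_velocity(x, sigma):
    return [(xi - ti) / sigma for xi, ti in zip(x, target)]

random.seed(0)
stage1_latent = [0.8, -1.5, 0.7]                 # imperfect Stage-1 output
noise = [random.gauss(0.0, 1.0) for _ in stage1_latent]
noisy = perturb(stage1_latent, noise, sigma=0.9)
refined = euler_refine(noisy, oracle_velocity)
print(refined)
```

Starting at a noise level below 1 means the refiner keeps part of the Stage-1 structure instead of generating from pure noise, which is one plausible reason so few steps suffice.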

Robust Data Annotation Pipeline

The training process for camera-controlled video generation demands metric-scale 6-DoF pose annotations, data typically unavailable in standard datasets. The development team modified the VIPE annotation engine by replacing its depth backend with Pi3X (which provides long-sequence-consistent depth) fused with MoGe-2 (ensuring accurate per-frame metric scale). Additionally, they expanded the bundle adjustment stage to treat focal lengths and principal points as per-frame variables instead of shared global intrinsics, greatly improving annotation robustness for internet video footage.
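The article does not say how the two depth sources are fused. One common approach, assumed here purely for illustration, is to rescale the sequence-consistent relative depth so that it agrees with the metric estimate, using a robust statistic such as the median ratio. The function name `align_to_metric` is hypothetical, and whether the scale is fitted per frame or per clip is left open.

```python
from statistics import median

def align_to_metric(relative_depth, metric_depth):
    """Fit one scalar that maps relative depth onto metric depth.

    Both inputs are flat lists of per-pixel depths for the same pixels.
    The median of the ratios is robust to a few outlier pixels.
    """
    ratios = [m / r for r, m in zip(relative_depth, metric_depth) if r > 0 and m > 0]
    scale = median(ratios)
    return scale, [scale * r for r in relative_depth]

# Toy example: the last metric value is an outlier and is ignored by the median.
scale, aligned = align_to_metric([1.0, 2.0, 4.0], [2.0, 4.0, 80.0])
print(scale, aligned)  # prints: 2.0 [2.0, 4.0, 8.0]
```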

The resulting pipeline processed a total corpus of 212,975 clips drawn from seven open-source sources: SpatialVID-HQ (real, 10s), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s).

Training Methodology and Infrastructure

SANA-WM was trained in two phases on 64 H100 GPUs. First, the LTX-2 VAE was adapted to the SANA-Video SFT data over approximately 50K steps, taking about $3.5$ days. The main DiT training then followed a progressive four-stage schedule spanning roughly 15 days:

Stage 1 (GDN Adaptation): The pre-trained SANA-Video model was adapted to the frame-wise GDN architecture on short (5s) video clips, replacing cumulative linear attention with recurrent GDN blocks.
Stage 2 (Hybrid Attention): Hybrid attention was introduced by substituting every fourth GDN block with a standard softmax attention block in the same 5s clip setting, balancing quality against efficiency.
Stage 3 (Minute-Scale + Dual-Branch Control): Training was extended to 961-frame (60-second) sequences, and Dual-Branch Camera Control was incorporated. This stage used Context-Parallel (CP=2) sharding, a mathematically exact parallelization method that minimizes communication overhead.
Stage 4 (SFT + Distillation): A chunk-causal autoregressive variant was fine-tuned, then distilled with self-forcing to reduce sampling to four denoising steps. Attention-sink tokens and local temporal windows were added to the softmax layers to keep memory usage constant during long rollouts.
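The constant-memory claim in Stage 4 can be illustrated with a small cache that keeps a fixed number of early "sink" entries plus a sliding window of recent ones. This is a hypothetical sketch: the class name `SinkWindowCache`, the chunk-level granularity, and the sizes used are assumptions, and real key/value tensors are replaced by integer chunk ids.

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first `num_sink` chunks forever and only the last `window` others."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                       # attention-sink entries, never evicted
        self.recent = deque(maxlen=window)   # local temporal window, oldest evicted

    def append(self, chunk_kv):
        if len(self.sink) < self.num_sink:
            self.sink.append(chunk_kv)
        else:
            self.recent.append(chunk_kv)

    def visible(self):
        """Entries a softmax layer would attend to for the next chunk."""
        return self.sink + list(self.recent)

# Roll out 20 chunks; the cache never holds more than 1 + 3 entries.
cache = SinkWindowCache(num_sink=1, window=3)
for chunk_id in range(20):
    cache.append(chunk_id)
print(cache.visible())  # prints: [0, 17, 18, 19]
```

Since the visible set is bounded by `num_sink + window`, attention cost per chunk stays flat however long the rollout runs, while the GDN blocks carry longer-range context in their fixed-size state.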

The development team also achieved significant performance gains by implementing custom fused Triton kernels for GDN scan and gate operations, contributing an estimated $1.5\times$ to $2\times$ efficiency boost throughout all training stages.

Benchmark Performance

Researchers evaluated SANA-WM on a specialized 60-second world-model benchmark comprising 80 initial scenes generated by Nano Banana Pro across four categories: game, indoor, outdoor-city, and outdoor-nature. The evaluation used the multi-step, undistilled autoregressive setting for comparison against leading models like LingBot-World (14B+14B parameters on 8 GPUs) and HY-WorldPlay (8B parameters on 8 GPUs).

When utilizing the second-stage refiner, SANA-WM achieved state-of-the-art results across key metrics:

Camera Accuracy: The model recorded rotation errors (RotErr) of $4.50^\circ$ and $8.34^\circ$, translation errors (TransErr) of $1.39 / 1.39$, and CamMC scores of $1.41 / 1.44$. These figures were superior to all compared methods, including LingBot-World and HY-WorldPlay.
Visual Quality: The VBench Overall score was $80.62$ (Simple split) and $81.89$ (Hard split), demonstrating visual quality comparable to LingBot-World ($81.82 / 81.89$). Critically, this was achieved while generating 720p output on a single GPU per clip.
Throughput: The full pipeline (including the refiner) demonstrated a throughput of 22.0 videos/hour on 8 H100s. This represents a $36\times$ increase in speed compared to LingBot-World’s rate of $0.6$ videos/hour.
Memory Footprint: The full pipeline requires $74.7\text{ GB}$ of memory, fitting within the $80\text{ GB}$ H100 budget.
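The throughput and memory figures can be cross-checked with simple arithmetic. The snippet below uses only the numbers quoted above; the variable names are illustrative.

```python
sana_wm_videos_per_hour = 22.0   # full pipeline with refiner, 8 H100s
lingbot_videos_per_hour = 0.6    # LingBot-World
pipeline_memory_gb = 74.7
h100_memory_gb = 80.0

speedup = sana_wm_videos_per_hour / lingbot_videos_per_hour
headroom_gb = h100_memory_gb - pipeline_memory_gb

print(f"speedup: {speedup:.1f}x")           # prints: speedup: 36.7x
print(f"headroom: {headroom_gb:.1f} GB")    # prints: headroom: 5.3 GB
```

The ratio comes out near 36.7, consistent with the roughly 36-fold figure quoted, and the pipeline leaves about 5 GB of slack on an 80 GB card.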

Written by

Hue

The girl with pink hair, usually arguing about GPU benchmarks or checking her crypto portfolio between gaming sessions. She writes about PC tech, games, and crypto.
