Week 07 · April 2026

Generative Modeling via Drifting - One Step Is All You Need

April 12, 2026 · by Satish K C · 7 min read
Deep Learning · Generative Models · Computer Vision

The Paper

"Generative Modeling via Drifting" was published in February 2026 by Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He - researchers at MIT and Harvard University. The central claim is that generative modeling does not need iterative inference at all. The authors introduce Drifting Models, a new paradigm where the generator's output distribution evolves during training through a learned drifting field, and at inference time, a single forward pass produces the final sample. On ImageNet 256x256, the model achieves an FID of 1.54 in latent space and 1.61 in pixel space - both state-of-the-art for one-step generators.

Read the Paper on arXiv →

The Problem Before This Paper

Diffusion and flow-based models produce high-quality images but require hundreds of iterative denoising steps at inference time. DiT-XL/2 needs 500 function evaluations (NFE) to reach an FID of 2.27 on ImageNet 256x256. SiT-XL/2 with REPA needs the same 500 steps for 1.42 FID. Consistency models and distillation methods attempt to reduce step count but typically sacrifice quality - and many still require a pre-trained multi-step teacher model. GANs offer single-step generation but have historically struggled with training instability and mode collapse, with StyleGAN-XL reaching only 2.30 FID and BigGAN reaching 6.95 FID on the same benchmark. No existing paradigm cleanly delivered both one-step inference and state-of-the-art quality without auxiliary teacher models.

What They Built

Drifting Models define a drifting field V that governs how generated samples should move to better match the data distribution. The field is composed of two opposing forces: an attraction term that pulls generated samples toward nearby data points, and a repulsion term that pushes them away from other generated samples. The equilibrium condition - when the generated distribution matches the data distribution - is guaranteed by the field's anti-symmetric property: V_{p,q} = -V_{q,p}, which forces the field to vanish when p equals q. During training, the network f_theta maps noise to samples, and the optimizer updates weights by regressing toward "drifted targets" - the current output shifted by the estimated field V.

V_{p,q}(x) = V^+_p(x) - V^-_q(x)
Loss = ||f_theta(epsilon) - stopgrad(f_theta(epsilon) + V(f_theta(epsilon)))||^2
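
To make the training objective concrete, here is a minimal PyTorch-style sketch of one drifted-target regression step. This is my reading of the loss above, not the authors' code; in particular, estimate_drift is a hypothetical placeholder for the paper's kernel-based field estimator (sketched in the next subsection).

    import torch

    def drifting_train_step(f_theta, optimizer, data_batch, estimate_drift):
        # One training step: regress the generator toward drifted targets.
        # f_theta:        generator network mapping noise -> samples
        # estimate_drift: callable (gen, data) -> field V; a stand-in for
        #                 the paper's kernel-based estimator (assumption)
        noise = torch.randn_like(data_batch)     # epsilon ~ N(0, I)
        gen = f_theta(noise)                     # one-step generation

        with torch.no_grad():                    # implements the stopgrad(...)
            v = estimate_drift(gen, data_batch)  # drifting field V at generated points
            target = gen + v                     # the "drifted target"

        loss = ((gen - target) ** 2).mean()      # ||f_theta(eps) - sg(f_theta(eps) + V)||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()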

The attraction and repulsion forces use a kernel function k(x,y) = exp(-||x-y||/tau) with softmax normalization over mini-batch samples. The architecture is a DiT-style transformer with patch size 2 operating in the latent space of a pre-trained SD-VAE encoder (32x32x4 latent resolution). A key design choice is the feature encoder - a ResNet-style MAE pre-trained on the latent space - that extracts multi-scale features for computing the drifting field. The entire system trains end-to-end without any teacher model, distillation, or adversarial loss.

k(x, y) = exp(-||x - y|| / tau)
Equilibrium: anti-symmetry gives V_{p,p} = -V_{p,p}, hence V_{p,q} = 0 when p = q
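
The sketch below shows one plausible way to compute the attraction and repulsion terms with this kernel, using softmax-normalized weights over the mini-batch. The function names, shapes, and the handling of the self-term in the repulsion weights are my assumptions; note that the paper computes the field on the encoder's multi-scale features rather than on raw samples as done here.

    import torch
    import torch.nn.functional as F

    def kernel_weights(x, y, tau):
        # Softmax-normalized weights for the kernel k(x, y) = exp(-||x - y|| / tau).
        # x: (n, d) generated samples; y: (m, d) reference points.
        # Returns an (n, m) matrix whose rows sum to 1.
        dist = torch.cdist(x, y)                 # pairwise Euclidean distances
        return F.softmax(-dist / tau, dim=1)     # normalize over the mini-batch axis

    def estimate_drift(gen, data, tau=1.0):
        # V = V^+ (attraction toward data) - V^- (repulsion from generated samples).
        w_attr = kernel_weights(gen, data, tau)  # weights on nearby data points
        w_rep = kernel_weights(gen, gen, tau)    # weights on other generated samples
        attract = w_attr @ data - gen            # kernel-weighted mean of data, relative to x
        repel = w_rep @ gen - gen                # kernel-weighted mean of generated, relative to x
        return attract - repel                   # V = V^+ - V^-

With this in place, estimate_drift can be passed directly to the drifting_train_step sketch above.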

Key Findings

Results

On ImageNet 256x256, Drifting Model L/2 in latent space achieves FID 1.54 with Inception Score 258.9 using a single forward pass. In pixel space, the L/16 variant reaches FID 1.61 with IS 307.5 - matching PixelDiT/16 (1.61 FID at 400 steps) exactly, but in one step. For comparison, StyleGAN-XL achieves 2.30 FID and BigGAN reaches 6.95 FID, both also single-step. Among multi-step methods, DiT-XL/2 achieves 2.27 FID at 500 NFE, while SiT-XL/2 with REPA pushes to 1.42 FID at the same step count. Training scales predictably: the B/2 model improves from 3.36 FID at 100 epochs to 1.75 FID at 1280 epochs, and upgrading to L/2 at 1280 epochs reaches the final FID of 1.54.

Why This Matters for AI and Automation

My Take

The elegance of this work is in the formulation. Rather than trying to compress a multi-step process into fewer steps (distillation) or stabilize adversarial training (GANs), Drifting Models reframe the problem entirely: let the optimizer itself be the iterative process, and let inference be a single deterministic mapping. That anti-symmetry yields a natural equilibrium condition is a clean theoretical contribution, and the ablation showing how sharply quality degrades without it (FID worsening from 8.46 to 177.14) confirms it is a structural requirement, not a cosmetic design choice. The fact that the same framework transfers directly to robotics policy generation strengthens the claim that this is a genuine paradigm shift, not an image-specific trick.

The open question is scaling behavior. The current results use ImageNet 256x256 - a well-studied benchmark but far from the resolution and diversity demands of production text-to-image systems. Whether the drifting field formulation remains stable and effective at 1024x1024 resolution with text conditioning, and whether it can match the diversity and controllability of classifier-free guided diffusion at scale, will determine whether this paradigm moves from research milestone to production deployment. The kernel-based field computation also raises questions about mini-batch sensitivity and compute cost at very large batch sizes.

Discussion question: If one-step generators now match the quality of 500-step diffusion models on standard benchmarks, what remaining advantages - if any - do iterative methods retain that could keep them relevant in production systems?
