The Paper
"Generative Modeling via Drifting" was published in February 2026 by Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He - researchers at MIT and Harvard University. The central claim is that generative modeling does not need iterative inference at all. The authors introduce Drifting Models, a new paradigm where the generator's output distribution evolves during training through a learned drifting field, and at inference time, a single forward pass produces the final sample. On ImageNet 256x256, the model achieves an FID of 1.54 in latent space and 1.61 in pixel space - both state-of-the-art for one-step generators.
The Problem Before This Paper
Diffusion and flow-based models produce high-quality images but require hundreds of iterative denoising steps at inference time. DiT-XL/2 needs 500 function evaluations (NFE) to reach an FID of 2.27 on ImageNet 256x256. SiT-XL/2 with REPA needs the same 500 steps for 1.42 FID. Consistency models and distillation methods attempt to reduce step count but typically sacrifice quality - and many still require a pre-trained multi-step teacher model. GANs offer single-step generation but have historically struggled with training instability and mode collapse, with StyleGAN-XL reaching only 2.30 FID and BigGAN reaching 6.95 FID on the same benchmark. No existing paradigm cleanly delivered both one-step inference and state-of-the-art quality without auxiliary teacher models.
What They Built
Drifting Models define a drifting field V that governs how generated samples should move to better match the data distribution. The field is composed of two opposing forces: an attraction term that pulls generated samples toward nearby data points, and a repulsion term that pushes them away from other generated samples. The equilibrium condition - the generated distribution matching the data distribution - is guaranteed by the field's anti-symmetric property, V_{p,q} = -V_{q,p}: setting q = p gives V_{p,p} = -V_{p,p}, which forces the field to vanish when the two distributions coincide. During training, the network f_theta maps noise to samples, and the optimizer updates weights by regressing toward "drifted targets" - the current output shifted by the estimated field V.
V_{p,q}(x) = V^+_p(x) - V^-_q(x)
Loss = ||f_theta(epsilon) - stopgrad(f_theta(epsilon) + V(f_theta(epsilon)))||^2
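In code, the training step is plain regression toward a stop-gradient target. Here is a minimal PyTorch-style sketch; `f_theta` and `drifting_field` are hypothetical callables standing in for the generator and the estimated field V, not the paper's actual interfaces:

```python
import torch

def drifting_train_step(f_theta, drifting_field, optimizer, batch_size, noise_dim):
    """One training step: regress the generator toward its own drifted output."""
    epsilon = torch.randn(batch_size, noise_dim)   # input noise
    x = f_theta(epsilon)                           # one-step generated samples

    # Drifted target: current output shifted by the estimated field, computed
    # under no_grad so it acts as a fixed regression target (the stopgrad above).
    with torch.no_grad():
        target = x + drifting_field(x)

    # ||f_theta(epsilon) - stopgrad(f_theta(epsilon) + V(f_theta(epsilon)))||^2
    loss = ((x - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The stop-gradient keeps the drifted target fixed during backpropagation, so each optimizer step nudges the output distribution one increment along the field - the training loop, not the sampler, is the iterative process.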
The attraction and repulsion forces use a kernel function k(x,y) = exp(-||x-y||/tau) with softmax normalization over mini-batch samples. The architecture is a DiT-style transformer with patch size 2 operating in the latent space of a pre-trained SD-VAE encoder (32x32x4 latent resolution). A key design choice is the feature encoder - a ResNet-style MAE pre-trained on the latent space - that extracts multi-scale features for computing the drifting field. The entire system trains end-to-end without any teacher model, distillation, or adversarial loss.
k(x, y) = exp(-||x - y|| / tau)
Equilibrium: V_{p,q} = 0 when p = q (anti-symmetric property)
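A sketch of how such a kernel-weighted field could be computed over a mini-batch, assuming the field is evaluated on flattened samples rather than the paper's multi-scale encoder features; `kernel_weights` and `drifting_field` are illustrative names, and the exact normalization is an assumption:

```python
import torch
import torch.nn.functional as F

def kernel_weights(x, y, tau):
    """Softmax-normalized kernel k(x, y) = exp(-||x - y|| / tau) over a mini-batch.

    Returns an (N, M) row-stochastic matrix: row i weights how strongly each
    y_j influences sample x_i.
    """
    dist = torch.cdist(x, y)              # (N, M) pairwise Euclidean distances
    return F.softmax(-dist / tau, dim=1)  # softmax(-d/tau) == normalized exp(-d/tau)

def drifting_field(x_gen, x_data, tau=1.0):
    """V = V+ - V-: attraction toward data, repulsion from other generated samples."""
    # Attraction: move each generated sample toward a kernel-weighted mean of data.
    v_attract = kernel_weights(x_gen, x_data, tau) @ x_data - x_gen

    # Repulsion: move each generated sample away from the kernel-weighted mean
    # of the generated batch itself.
    v_repel = kernel_weights(x_gen, x_gen, tau) @ x_gen - x_gen

    return v_attract - v_repel
```

Note the equilibrium property falls out directly: if the generated batch matches the data batch, the two weight matrices coincide, attraction and repulsion cancel, and the field is zero - the anti-symmetric balance the ablations show is structurally required.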
Key Findings
- State-of-the-art one-step generation. Drifting Model L/2 achieves FID 1.54 (latent) and 1.61 (pixel) on ImageNet 256x256 with a single forward pass - beating all prior one-step methods including iMeanFlow (1.72 FID) and AdvFlow (2.38 FID).
- Competitive with 500-step diffusion models. The one-step FID of 1.54 is within striking distance of SiT-XL/2+REPA (1.42 FID at 500 NFE) and LightningDiT (1.35 FID at 500 NFE), while requiring 500x fewer function evaluations.
- Anti-symmetry is critical. Ablation studies show that breaking the balance between attraction and repulsion causes FID to collapse: a 1.5x attraction bias degrades FID from 8.46 to 41.05, and an attraction-only field reaches 177.14 FID.
- Scales beyond images to robotics. A one-step Drifting Policy matched or exceeded 100-step Diffusion Policy performance across single-stage and multi-stage robot manipulation tasks.
Results
On ImageNet 256x256, Drifting Model L/2 in latent space achieves FID 1.54 with Inception Score 258.9 using a single forward pass. In pixel space, the L/16 variant reaches FID 1.61 with IS 307.5 - matching PixelDiT/16 (1.61 FID at 400 steps) exactly, but in one step. For comparison, StyleGAN-XL achieves 2.30 FID and BigGAN reaches 6.95 FID, both also single-step. Among multi-step methods, DiT-XL/2 achieves 2.27 FID at 500 NFE, while SiT-XL/2 with REPA pushes to 1.42 FID at the same step count. Training scales predictably: the B/2 model improves from 3.36 FID at 100 epochs to 1.75 FID at 1280 epochs, and upgrading to L/2 at 1280 epochs reaches the final 1.54.
Why This Matters for AI and Automation
- Latency reduction. One-step inference makes real-time image generation practical. Applications that currently pay for hundreds of diffusion steps - product image generation, design automation, synthetic data pipelines - can run 100-500x faster at the same quality level.
- No teacher dependency. Unlike distillation approaches (consistency distillation, progressive distillation), Drifting Models train from scratch. This eliminates the need to first train an expensive multi-step teacher, simplifying the training pipeline and reducing total compute.
- Robotics implications. The demonstrated transfer to robot manipulation policies suggests the paradigm generalizes beyond image synthesis. Any domain currently using diffusion-based planners or policy generators - warehouse automation, robotic assembly, autonomous navigation - could benefit from the same one-step speedup.
- Kaiming He's involvement. This is a signal paper. A track record that includes ResNet, MAE, and Mask R-CNN means this paradigm will receive significant follow-up attention from the research community.
My Take
The elegance of this work is in the formulation. Rather than trying to compress a multi-step process into fewer steps (distillation) or stabilize adversarial training (GANs), Drifting Models reframe the problem entirely: let the optimizer itself be the iterative process, and let inference be a single deterministic mapping. The anti-symmetry property providing a natural equilibrium condition is a clean theoretical contribution - the ablation results showing how quickly quality degrades without it (8.46 to 177.14 FID) confirm this is not a cosmetic design choice but a structural requirement. The fact that the same framework transfers directly to robotics policy generation strengthens the claim that this is a genuine paradigm shift, not an image-specific trick.
The open question is scaling behavior. The current results use ImageNet 256x256 - a well-studied benchmark but far from the resolution and diversity demands of production text-to-image systems. Whether the drifting field formulation remains stable and effective at 1024x1024 resolution with text conditioning, and whether it can match the diversity and controllability of classifier-free guided diffusion at scale, will determine whether this paradigm moves from research milestone to production deployment. The kernel-based field computation also raises questions about mini-batch sensitivity and compute cost at very large batch sizes.
Discussion question: If one-step generators now match the quality of 500-step diffusion models on standard benchmarks, what remaining advantages - if any - do iterative methods retain that could keep them relevant in production systems?