Definitely massive: a revolution at the heart of image generation!
Representation Autoencoders (RAEs) are a simple yet powerful upgrade that replaces the traditional VAE with pretrained encoders—such as DINO, SigLIP, or MAE—paired with trained decoders.
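Concretely, the pretrained encoder stays frozen and only a lightweight decoder is trained to map its patch tokens back to pixels, so the diffusion model can operate directly in the encoder's representation space. A minimal PyTorch sketch of the idea (the toy MLP decoder and all names here are ours for illustration, not the paper's architecture; the encoder is assumed to return (B, N, D) patch tokens):

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    """Sketch of a Representation Autoencoder: a frozen pretrained
    encoder (e.g. DINO/SigLIP/MAE) plus a small trainable pixel decoder."""

    def __init__(self, encoder: nn.Module, latent_dim: int = 768, patch: int = 16):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():   # encoder is frozen; only the decoder trains
            p.requires_grad = False
        # toy MLP decoder for illustration; the paper trains a heavier
        # decoder with richer reconstruction losses
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 4 * latent_dim),
            nn.GELU(),
            nn.Linear(4 * latent_dim, patch * patch * 3),
        )
        self.patch = patch

    @torch.no_grad()
    def encode(self, images):
        # assumed: encoder maps (B, 3, H, W) images to (B, N, D) patch tokens
        return self.encoder(images)

    def decode(self, tokens, h, w):
        # (B, N, D) tokens -> (B, 3, h*patch, w*patch) pixels via unpatchify
        px = self.decoder(tokens)                        # (B, N, patch*patch*3)
        B, N, _ = px.shape
        px = px.view(B, h, w, self.patch, self.patch, 3)
        return px.permute(0, 5, 1, 3, 2, 4).reshape(
            B, 3, h * self.patch, w * self.patch)
```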
Why it matters:
- Richer latent spaces – semantically meaningful, not just reconstructive
- Faster convergence – no auxiliary representation-alignment loss (e.g., REPA) needed
- Higher fidelity – ImageNet FID of 1.51 at 256×256 without guidance, and 1.13 at both 256×256 and 512×512 with guidance
By rethinking the foundation, RAEs make diffusion transformers simpler, stronger, and smarter.
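To make "the foundation" concrete: the diffusion transformer is trained on these semantic tokens instead of VAE latents. A hedged usage sketch building on the class above (the dummy encoder stands in for a real pretrained ViT, and the flow-matching interpolation is shown as one common objective, not necessarily the paper's exact recipe):

```python
# usage sketch: diffuse in RAE token space instead of VAE latent space.
# DummyEncoder is a stand-in for a real pretrained ViT (DINO/SigLIP/MAE).
class DummyEncoder(nn.Module):
    def forward(self, x):                       # (B, 3, 256, 256) -> (B, 256, 768) tokens
        return torch.randn(x.size(0), 16 * 16, 768)

rae = RAE(encoder=DummyEncoder(), latent_dim=768, patch=16)
z = rae.encode(torch.randn(8, 3, 256, 256))     # (8, 256, 768) semantic latents
t = torch.rand(z.size(0), 1, 1)                 # per-sample timestep in [0, 1)
z_t = (1 - t) * z + t * torch.randn_like(z)     # flow-matching-style interpolation
# a DiT backbone would be trained to predict the velocity from (z_t, t);
# generated tokens are rendered to pixels with rae.decode(z_hat, h=16, w=16)
```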
Three years ago, DiT replaced the legacy U-Net with a transformer-based denoising backbone. We knew the bulky VAEs would be the next to go -- we just waited until we could do it right.
Today, we introduce Representation Autoencoders (RAEs).
>> Retire VAEs. Use RAEs. 👇(1/n)