Excited to release BFM-Zero, an unsupervised RL approach for learning a humanoid Behavior Foundation Model.
Existing general-purpose humanoid whole-body controllers rely on explicit motion-tracking rewards, on-policy policy-gradient methods such as PPO, and distillation into a single policy.
In contrast, BFM-Zero directly learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to perform multiple tasks zero-shot: (1) natural transitions from any pose to any goal pose, (2) real-time motion following, (3) optimization of any user-specified reward function at test time, and more.
How does it work? We don't give the model any task-specific reward during training. It builds on recent advances in Forward-Backward (FB) models, in which a latent-conditioned policy, a deep "forward dynamics model," and a deep "inverse dynamics model" are jointly learned. This way, the learned representation space captures humanoid dynamics and unifies different tasks.
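Rough sketch of the FB idea in PyTorch (illustrative only; the network sizes, names like `F_net`/`B_net`, and the simplified loss are placeholders, not our actual implementation):

```python
# Minimal Forward-Backward (FB) sketch (illustrative, not the actual
# BFM-Zero code). F(s, a, z) and B(s') are trained so that F(s, a, z)^T B(s')
# approximates the successor measure of the latent-conditioned policy
# pi(a | s, z); no task reward is used anywhere.
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed latent size, for illustration only

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class FBModel(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=LATENT_DIM):
        super().__init__()
        # "Forward" representation: conditioned on (s, a, z).
        self.F_net = mlp(state_dim + action_dim + latent_dim, latent_dim)
        # "Backward" representation: conditioned on the (future) state only.
        self.B_net = mlp(state_dim, latent_dim)

    def forward(self, s, a, z, s_future):
        F = self.F_net(torch.cat([s, a, z], dim=-1))
        B = self.B_net(s_future)
        return F, B

def q_value(F, z):
    """Key FB property: for any task encoded by z, the Q-function of
    pi(.|., z) is Q(s, a, z) = F(s, a, z)^T z, which is what the
    latent-conditioned actor maximizes."""
    return (F * z).sum(dim=-1)

def fb_td_loss(F, B, F_next, B_next, gamma=0.99):
    """TD-style loss so that F(s,a,z)^T B(.) tracks the discounted state
    occupancy under pi(.|., z). F_next is F(s', a'~pi(s', z), z) from a
    target network; off-diagonal batch pairs serve as 'negative' future
    states. A simplified variant of the published FB objective."""
    M = F @ B.t()                                 # predicted successor measure, batch x batch
    target = gamma * (F_next @ B_next.t()).detach()
    # The true next state of each transition sits on the diagonal and gets +1.
    target = target + torch.eye(M.shape[0], device=M.device)
    return ((M - target) ** 2).mean()
```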
More details: lecar-lab.github.io/BFM-Zero…
Meet BFM-Zero: A Promptable Humanoid Behavioral Foundation Model w/ Unsupervised RL 👉 lecar-lab.github.io/BFM-Zero…
🧩ONE latent space for ALL tasks
⚡Zero-shot goal reaching, tracking, and reward optimization (any reward at test time), from ONE policy (see the sketch after this list)
🤖Natural recovery & transition
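Concretely, all three prompt types reduce to picking a latent z for the same policy. A rough sketch (reusing the illustrative `B_net` from the training sketch above; these function names are placeholders, not our API):

```python
# Illustrative test-time "prompting" (not BFM-Zero's actual interface):
# every task becomes a choice of latent z for the single policy pi(a | s, z).
import torch

@torch.no_grad()
def z_from_goal(B_net, goal_state):
    """Goal reaching: use the backward embedding of the goal pose."""
    return B_net(goal_state)

@torch.no_grad()
def z_from_motion(B_net, reference_window):
    """Motion following: average the backward embeddings over a short
    reference window, re-computed each control step as the window slides."""
    return B_net(reference_window).mean(dim=0)

@torch.no_grad()
def z_from_reward(B_net, states, rewards):
    """Reward optimization: z ~ E[r(s) B(s)] over states sampled from
    replay/offline data, i.e. regress the test-time reward onto B."""
    return (rewards.unsqueeze(-1) * B_net(states)).mean(dim=0)

# At run time, one policy handles every prompt:
#   a = policy(s, z)
```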
To make our claim more accurate and rigorous: BFM-Zero is "unsupervised" at the task level. Training doesn't involve any *task-related* reward, but it still uses "auxiliary" rewards, such as a joint-angle-limit penalty (only for sim2real) and AMP-style rewards, to "shape" the latent space to be more human-like.
Without the AMP-style reward it still works, but the motions are less natural and human-like. Similarly, without the auxiliary rewards it works in sim but does not transfer smoothly to the real robot.
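For concreteness, a rough sketch of how such auxiliary terms can be folded in as a shaping signal on top of the otherwise task-free training (the weights, helper names, and exact AMP formulation here are placeholders, not our actual setup):

```python
# Illustrative auxiliary shaping (not the actual BFM-Zero reward code):
# an AMP-style discriminator score keeps behaviors human-like, and a
# joint-angle-limit penalty is added only for sim-to-real transfer.
import torch

def auxiliary_reward(disc, joint_pos, joint_limits, obs_pair,
                     w_amp=0.5, w_limit=0.1, sim2real=True):
    # AMP-style term: higher when the transition looks like reference
    # motion-capture data to a learned discriminator. log D(s, s') is one
    # common GAN-style imitation reward; the exact AMP form may differ.
    d = disc(obs_pair).squeeze(-1)          # discriminator logit
    r_amp = -torch.log1p(torch.exp(-d))     # = log sigmoid(d)
    # Joint-angle-limit penalty (only needed for real-robot deployment).
    low, high = joint_limits
    violation = torch.relu(low - joint_pos) + torch.relu(joint_pos - high)
    r_limit = -violation.sum(dim=-1)
    r = w_amp * r_amp
    if sim2real:
        r = r + w_limit * r_limit
    return r  # added as shaping on top of the task-free FB training signal
```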

