Excited to release BFM-Zero, an unsupervised RL approach for learning a humanoid Behavior Foundation Model.
Existing general-purpose humanoid whole-body controllers rely on explicit motion-tracking rewards, on-policy policy-gradient methods such as PPO, and distillation into a single policy.
In contrast, BFM-Zero directly learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to perform multiple tasks zero-shot: (1) natural transitions from any pose to any goal pose, (2) real-time motion following, (3) optimization of any user-specified reward function at test time, and more.
How does it work? We don't give the model any task-specific reward during training. It builds on recent advances in Forward-Backward (FB) models, in which a latent-conditioned policy, a deep "forward dynamics model," and a deep "inverse dynamics model" are jointly learned. This way, the learned representation space captures humanoid dynamics and unifies different tasks.
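Rough sketch of the FB idea in PyTorch (illustrative only; the network sizes, names like `F_net`/`B_net`, and the simplified loss are placeholders, not our actual implementation):

```python
# Minimal Forward-Backward (FB) sketch (illustrative, not the actual
# BFM-Zero code). F(s, a, z) and B(s') are trained so that F(s, a, z)^T B(s')
# approximates the successor measure of the latent-conditioned policy
# pi(a | s, z); no task reward is used anywhere.
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed latent size, for illustration only

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class FBModel(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=LATENT_DIM):
        super().__init__()
        # "Forward" representation: conditioned on (s, a, z).
        self.F_net = mlp(state_dim + action_dim + latent_dim, latent_dim)
        # "Backward" representation: conditioned on the (future) state only.
        self.B_net = mlp(state_dim, latent_dim)

    def forward(self, s, a, z, s_future):
        F = self.F_net(torch.cat([s, a, z], dim=-1))
        B = self.B_net(s_future)
        return F, B

def q_value(F, z):
    """Key FB property: for any task encoded by z, the Q-function of
    pi(.|., z) is Q(s, a, z) = F(s, a, z)^T z, which is what the
    latent-conditioned actor maximizes."""
    return (F * z).sum(dim=-1)

def fb_td_loss(F, B, F_next, B_next, gamma=0.99):
    """TD-style loss so that F(s,a,z)^T B(.) tracks the discounted state
    occupancy under pi(.|., z). F_next is F(s', a'~pi(s', z), z) from a
    target network; off-diagonal batch pairs serve as 'negative' future
    states. A simplified variant of the published FB objective."""
    M = F @ B.t()                                 # predicted successor measure, batch x batch
    target = gamma * (F_next @ B_next.t()).detach()
    # The true next state of each transition sits on the diagonal and gets +1.
    target = target + torch.eye(M.shape[0], device=M.device)
    return ((M - target) ** 2).mean()
```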
More details: lecar-lab.github.io/BFM-Zero…
Meet BFM-Zero: A Promptable Humanoid Behavioral Foundation Model w/ Unsupervised RL 👉 lecar-lab.github.io/BFM-Zero…
🧩ONE latent space for ALL tasks
⚡Zero-shot goal reaching, tracking, and reward optimization (any reward at test time), from ONE policy (see the sketch after this list)
🤖Natural recovery & transition
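Concretely, all three prompt types reduce to picking a latent z for the same policy. A rough sketch (reusing the illustrative `B_net` from the training sketch above; these function names are placeholders, not our API):

```python
# Illustrative test-time "prompting" (not BFM-Zero's actual interface):
# every task becomes a choice of latent z for the single policy pi(a | s, z).
import torch

@torch.no_grad()
def z_from_goal(B_net, goal_state):
    """Goal reaching: use the backward embedding of the goal pose."""
    return B_net(goal_state)

@torch.no_grad()
def z_from_motion(B_net, reference_window):
    """Motion following: average the backward embeddings over a short
    reference window, re-computed each control step as the window slides."""
    return B_net(reference_window).mean(dim=0)

@torch.no_grad()
def z_from_reward(B_net, states, rewards):
    """Reward optimization: z ~ E[r(s) B(s)] over states sampled from
    replay/offline data, i.e. regress the test-time reward onto B."""
    return (rewards.unsqueeze(-1) * B_net(states)).mean(dim=0)

# At run time, one policy handles every prompt:
#   a = policy(s, z)
```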
To make our claim more accurate and rigorous: BFM-Zero is "unsupervised" at the task level. Training doesn't involve any *task-related* reward, but it still uses "auxiliary" rewards, such as a joint-angle-limit penalty (only for sim2real) and AMP-style rewards, to "shape" the latent space to be more human-like.
Without the AMP-style reward it still works, but the motions are less natural and human-like. Similarly, without the auxiliary rewards it works in sim but does not transfer smoothly to the real robot.
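For concreteness, a rough sketch of how such auxiliary terms can be folded in as a shaping signal on top of the otherwise task-free training (the weights, helper names, and exact AMP formulation here are placeholders, not our actual setup):

```python
# Illustrative auxiliary shaping (not the actual BFM-Zero reward code):
# an AMP-style discriminator score keeps behaviors human-like, and a
# joint-angle-limit penalty is added only for sim-to-real transfer.
import torch

def auxiliary_reward(disc, joint_pos, joint_limits, obs_pair,
                     w_amp=0.5, w_limit=0.1, sim2real=True):
    # AMP-style term: higher when the transition looks like reference
    # motion-capture data to a learned discriminator. log D(s, s') is one
    # common GAN-style imitation reward; the exact AMP form may differ.
    d = disc(obs_pair).squeeze(-1)          # discriminator logit
    r_amp = -torch.log1p(torch.exp(-d))     # = log sigmoid(d)
    # Joint-angle-limit penalty (only needed for real-robot deployment).
    low, high = joint_limits
    violation = torch.relu(low - joint_pos) + torch.relu(joint_pos - high)
    r_limit = -violation.sum(dim=-1)
    r = w_amp * r_amp
    if sim2real:
        r = r + w_limit * r_limit
    return r  # added as shaping on top of the task-free FB training signal
```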

