A lot of people assume we use reinforcement learning to train Ace. The founding team has an extensive RL background, but RL is not how we'll get to computer AGI. The single best way we know to create artificial intelligence is still large-scale behaviour cloning.
PSA: agents acting in an environment is *not* reinforcement learning. Reinforcement learning is about having a reward signal for which you get reinforcement (hence the name). If your "reward" is 99.99% "accurately predicting the consequences of an action" (which is just regular unsupervised learning) and 0.01% some additional specific goal (which is actual RL), then calling the training procedure "reinforcement learning" is technically accurate but is very much a sin against the truth.

RL always was and will always remain "the cherry on the top", for fundamental information-theoretic reasons: a sparse 1-D reward signal just doesn't carry enough information to train complex agents with trillions of parameters in sufficiently complex environments, whereas predicting the outcome of every action is a maximally dense feedback signal in terms of the information the environment provides.

I really find it somewhat offensive that RL people try to bucket everything about agents acting in environments into the RL bucket, because if you are slightly less in the weeds, you buy this and eventually end up with an incorrect and imprecise understanding of the world.
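The information-density point can be made concrete with a back-of-the-envelope calculation. This is a minimal sketch with illustrative numbers I'm assuming for the example (episode length, vocabulary size), not figures from the thread:

```python
import numpy as np

# Assumed, purely illustrative numbers:
T = 1_000    # steps per episode
V = 50_000   # size of the discrete outcome/token vocabulary at each step

# Dense prediction signal (behaviour cloning / next-step prediction):
# every step supplies a full target over V outcomes, so supervision is
# on the order of T * log2(V) bits per episode.
dense_bits = T * np.log2(V)

# Sparse RL signal: a single scalar success/failure reward per episode
# carries at most 1 bit.
sparse_bits = 1.0

print(f"dense supervision ≈ {dense_bits:,.0f} bits/episode")
print(f"sparse reward     ≈ {sparse_bits:.0f} bit/episode")
print(f"ratio             ≈ {dense_bits / sparse_bits:,.0f}x")
```

Under these (made-up) numbers the dense signal carries roughly four orders of magnitude more information per episode, which is the sense in which a sparse reward can only ever be the cherry on top.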
This also negates a lot of AGI x-risk concerns, imo. The typical safety-ist argument: RL will let agents blow past human-level performance in the blink of an eye. But the current paradigm is divergence minimization with respect to human intelligence; it converges to roughly human-level performance.

Apr 19, 2025 · 11:10 PM UTC

Replying to @sherjilozair
Aren't the strongest baselines for LM digital control offline RL policies? Which, tbf, are still unlikely to experience an ASI foom.
Replying to @sherjilozair
you guys built a good task recording system right?
Replying to @sherjilozair
But obviously at some point we will surpass human performance and move beyond behaviour cloning (already happening), right?