Let’s get agents to learn fast! 🤖🔥 Research Scientist @Oracle | PhD @UniOfOxford, MS & BS @BrownUniversity, Predoc @Microsoft

Joined April 2014
Pinned Tweet
Big news—our survey paper “A Tutorial on Meta-Reinforcement Learning” is officially published! Meta-RL = learning how to adapt through interaction. It embraces The Bitter Lesson: don’t hardcode agents—train them to adapt on their own. arxiv.org/abs/2301.08028 🧵⬇️
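For intuition, here is a minimal sketch of the meta-RL training loop the tutorial formalizes: an outer loop samples tasks from a distribution, and all adaptation happens inside the policy's memory, through interaction alone. The TaskDistribution and RecurrentPolicy interfaces here are illustrative, not from the paper.

# Minimal meta-RL training loop (sketch; interfaces are illustrative).
import torch

def meta_train(task_dist, policy, optimizer, meta_iters=1000, episodes_per_trial=3):
    for _ in range(meta_iters):
        env = task_dist.sample()          # outer loop: draw a fresh MDP
        hidden = policy.initial_state()   # inner loop: adaptation lives in memory
        log_probs, rewards = [], []
        for _ in range(episodes_per_trial):
            obs, done = env.reset(), False
            while not done:
                action, log_p, hidden = policy.act(obs, hidden)
                obs, reward, done = env.step(action)
                log_probs.append(log_p)
                rewards.append(reward)
        # REINFORCE on the whole trial's return: early episodes only pay off
        # through later ones, so the policy is trained to adapt, not just act.
        loss = -torch.stack(log_probs).sum() * sum(rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()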
Summer 2026 Internship — Oracle (Boston, MA). My fantastic research team is hiring! Projects include a data scientist agent with in-context learning, evolutionary search (a la AlphaEvolve), AI feedback, and RL/ES. Apply here! eeho.fa.us2.oraclecloud.com/… 📧 jake.beck@oracle.com
Excited to host @ml_collective Office Hours! Available for advice, mentorship, and discussion on RL, LLMs, meta-learning, and beyond. mlcollective.org/services/
Welcome to the newest MLC Office Hours host, @jakeABeck, researcher at Oracle! Schedule a chat with Jacob at the link below to talk about RL, LLMs, hypernetworks, meta-learning, multi-agent RL, AI feedback, philosophy, and more! mlcollective.org/services/
4️⃣ Superintelligence does not beget super-power. Some systems are inherently unpredictable, and prediction doesn’t guarantee control. Knowing how a hurricane forms doesn’t mean you can steer one.
3️⃣ We already live alongside “misaligned superintelligences” in the form of adversarial nation states. North Korea would love to destroy the US, and yet here we are. The benefits of superintelligence are limited by real-world constraints.
2️⃣ AI is trained to follow prompts, making it highly amenable to alignment — with failures solely due to incompetence. We only train otherwise for safety, e.g. refusing to build a bioweapon. The real danger isn’t disobedient machines; it’s humans, misaligned with each other.
1️⃣ Exponential AI self-improvement is shaky. The real bottleneck isn’t code; it’s compute & data. In these areas, AIs training AIs are just as limited by the world as humans training AIs. For both, we’ve nearly exhausted the internet’s data.
AI optimists “don’t have counter-arguments — they just call names.” — @So8res on a podcast with @ESYudkowsky + Sam Harris. Curious what you two think of these counter-arguments. And since @ylecun was called out by name, I’d love his take too…
Jacob Beck retweeted
E71: Jake Beck, Alex Goldie, & Cornelius Braun on Sutton's OaK, Metalearning, LLMs, Squirrels at @RL_Conference 2025. A few thoughts with @jakeABeck, @AlexDGoldie and @corbraun after @RichardSSutton's fascinating lecture on his OaK architecture at @UAlberta
Running out of data: arxiv.org/pdf/2211.04325 (@pvllss et al.)
Repeated sampling: arxiv.org/pdf/2411.17501v1 (@benediktstroebl et al.)
Results for RL+LLM are mixed, with positive results where reasoning chains are short, the LLM already has a sense of what to do, and we already know the answer — leaving us to exhaust our finite pool of supervised data. The same goes for repeated sampling. Time for exploration bonuses.
New paper alert: unifying insights from Limit-of-RLVR and ProRL — does current RLVR actually expand reasoning? Turns out: RLVR is mostly an efficient sampler over a shrinking reasoning boundary, and very rarely an explorer that expands it. Exploration is the holy grail for LLMs and may require going beyond 0/1 rewards.
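For concreteness, the classic way to reward exploration itself is an intrinsic bonus added on top of the extrinsic reward, e.g. a count-based novelty bonus. A minimal tabular sketch (the class name and beta coefficient are illustrative):

# Count-based exploration bonus (tabular sketch; names are illustrative).
from collections import defaultdict
import math

class CountBonus:
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def shaped_reward(self, state, extrinsic):
        self.counts[state] += 1
        # The bonus decays as a state is revisited, so novelty pays.
        return extrinsic + self.beta / math.sqrt(self.counts[state])

Replacing the verifier's 0/1 signal with shaped_reward(state, r) is one way an RLVR-style trainer could pay for reaching novel reasoning states, not just correct answers.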
@codewithimanshu @MihneaDeVries @yuxili99 LLMs lack the nuances of real-world interactive learning—but I’ve seen them explore, interact, and update beliefs in disparate domains. My point is that, LLM or not, we can apply the same recipe to RL. Just add some gridworlds!
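And a gridworld really is just a few lines. A toy MDP like this (illustrative, not from any paper) is all the recipe needs to train on:

# Toy gridworld MDP (illustrative): reach the far corner for reward 1.
class GridWorld:
    def __init__(self, size=5):
        self.size, self.goal = size, (size - 1, size - 1)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):  # 0=up, 1=down, 2=left, 3=right
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (row, col)
        done = self.pos == self.goal
        return self.pos, float(done), done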
Thoughts on this? The possibility of exponential AI self-improvement is shaky. The real bottleneck isn’t code; it’s compute & data. In these areas, AIs training AIs are just as limited by the world as humans training AIs. For both, we’ve nearly exhausted the internet’s data.
Thoughts on this? AI is trained to follow prompts, making it highly amenable to alignment. The only time we train it not to obey is for safety constraints, such as refusing to build a bioweapon. The real danger isn’t disobedient machines. It’s humans, misaligned with each other.
Citations: @juliancodaforno et al., 2023 and @janexwang et al., 2016
We’ve also been meta-learning model-based learning algos with seq models for years—long before the current LLM wave: arxiv.org/pdf/1611.05763
LLMs can handle non-stationary in-context learning, and even in-context meta-learning: arxiv.org/abs/2305.12907! More important than the LLMs themselves is the recipe they validate: a flexible sequence model with memory (Turing-complete, as Dale Schuurmans noted) + tons of data (or MDPs)!
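Concretely, that recipe means conditioning a sequence model on the full interaction history, feeding back the previous action and reward so learning can happen in-context, as in the 2016 paper cited above. A sketch with illustrative names and sizes:

# In-context RL recipe: a sequence model over (obs, prev action, prev reward).
# Sketch only; layer sizes and names are illustrative.
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Input = current obs + one-hot previous action + previous reward.
        self.rnn = nn.GRU(obs_dim + n_actions + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, prev_reward, state=None):
        a = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a, prev_reward.unsqueeze(-1)], dim=-1)
        out, state = self.rnn(x.unsqueeze(1), state)  # memory carries the learning
        return self.head(out.squeeze(1)), state       # action logits + new memory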
Fantastic talk from @RichardSSutton at @RL_Conference with shoutouts to meta-RL. Honored to be called “more extreme” than Rich (by Rich) for taking the Bitter Lesson to heart and suggesting we meta-learn all the components he discussed. My Q: Aren’t LLMs already doing all this?
Jacob Beck retweeted
🥳 It’s an honour to have been awarded the Outstanding Paper for Scientific Understanding in RL at RLC for our work, ‘How Should We Meta-Learn RL Algorithms?’ Thank you to the organisers @RL_Conference for putting on a great conference, and congratulations to the other winners!
Heading to @rl_conference (RLC) today! Come say hi if you’re around—happy to chat about meta-RL, coding agents, and my work at @RLBrew_RLC tomorrow on VLM feedback for RL 🍁
LLMs do not natively understand how to act as agents in the world 🤔⁉️ But VLMs can evaluate success—and provide feedback for training generalist agents with RL ✅ Offline RL from AI Feedback charts the path. To be presented at @RLBrew_RLC @rl_conference! github.com/Jacooba/OfflineRL…
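Conceptually, the pipeline relabels an offline dataset with VLM judgments and then runs standard offline RL on it. A sketch with hypothetical interfaces (not the linked repo's actual API):

# Offline RL from AI feedback, conceptually.
# vlm.score and traj.with_rewards are hypothetical interfaces,
# not the actual API of the linked repo.
def relabel_with_vlm_feedback(trajectories, vlm, task_prompt):
    labeled = []
    for traj in trajectories:
        # The VLM judges whether the final frame shows task success.
        score = vlm.score(image=traj.frames[-1], prompt=task_prompt)
        labeled.append(traj.with_rewards([0.0] * (len(traj.frames) - 1) + [score]))
    return labeled

# labeled = relabel_with_vlm_feedback(dataset, vlm, "Did the agent reach the goal?")
# policy = train_offline_rl(labeled)   # e.g., IQL/CQL on the relabeled data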