AI Scientist | Tech Entrepreneur | CEO & President of @Mobileye

Israel
Joined January 2019
I’m grateful to @TIME magazine for the honor of being listed among the 100 most influential people in AI. There is much to be excited about in the ongoing development of AI. I see two main thrusts going forward.

The first is Physical AI, where AI is embodied on moving platforms in the real world. This includes autonomous vehicles through my brainchild @Mobileye and humanoid robotics through @MenteeBot, the three-year-old startup I co-founded with Prof. @shai_s_shwartz and Prof. Lior Wolf. I strongly believe that by the middle of the next decade, humanoid robots will be a meaningful part of society.

The second thrust is to have AI focus on society's innovation bottlenecks. The first and foremost bottleneck is experts focusing on problem-solving and discovery in STEM-related fields. True experts are not spread uniformly within society: they coalesce in certain places and are completely absent in others, and there are few of them. If we could have access to real experts at scale, that would be transformative for humanity. That is where AAI Technologies comes in, a young startup co-founded with Prof. Shai Shalev-Shwartz and four supremely talented former PhD students, with the goal of building the next phase of AI, one that can demonstrate deep reasoning during problem solving. By deep reasoning I mean that a solution to a problem contains multiple steps, many of which do not follow deterministically from previous steps and instead contain uncertainty. That uncertainty translates into the need to open a search in the space of ideas and to backtrack to the relevant reasoning step when a line of thought gets stuck. Building this right will introduce a leap in the progress of AI and its impact on society.

time.com/collections/time100… AAI: doubleai.com/ Mentee Robotics: menteebot.com/
Deep reasoning is beyond the capabilities of today’s AI models. GPT-5 shows some progress, but overall the performance is a far cry from what is required to solve problems at an expert level. Statements about models reaching PhD level should be taken with a measure of skepticism.
Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. We have curated a benchmark consisting of three tiers of increasing complexity, which we call ‘shallow’, ‘deeper’, and ‘deepest’. The results are remarkable:
- On the ‘shallow’ tier, top models reach performance of 50%-70%, indicating that the models are familiar with the subject matter.
- On ‘deeper’, Grok 4, Gemini-Pro, o3-Pro, and Opus-4 all solve at most 1/100 problems. GPT-5 Pro is significantly better, but still solves only 4/100 problems.
- On ‘deepest’, all models collapse to a 0% success rate. 🧵
Amnon Shashua retweeted
🧵 In our earlier threads, we explored what it means to learn to reason, why it’s hard, and why many current approaches fall short. Now let’s dive into one of the core proofs from our paper.

Appendix A of our paper shows that an auto-regressive transformer cannot learn to multiply multi-digit numbers, even with full supervision. The setup: the model is prompted with something like 1324 * 2471 = and must generate 3271604. As an amusing piece of evidence, the image below was generated by ChatGPT, and one can easily notice that the calculation is wrong 😆 How can we prove that a certain model can’t learn a task? Let’s dig in 👇

We take a standard strategy in learning theory: identify general properties of a learning algorithm, then prove that any algorithm with those properties will fail on a specific task. For our proof, the key properties are:
1. Data access via a gradient oracle (SGD).
2. The Transformer architecture (with learned positional encodings) is invariant to input token permutations.
3. Auto-regressive prediction: the output must start with the correct first digit.

Let’s start with data access. A tempting approach is to show that the gradient vanishes and therefore the model gets no learning signal. But that’s not true here: the gradient exists, and the model updates its weights. So we need a subtler argument.

Here’s the idea. We construct a large family of “multiplication-like” problems, each one differing in how it parses the input digits. Now, if our transformer is invariant to token permutations (property 2), and if learning one of these problems is hard, then learning any of them is hard.

To show this, we borrow a trick from the probabilistic method. We treat the gradient, at some fixed weights, as a random variable, where the randomness comes from choosing a multiplication-like labeling function at random from our family. Then we analyze the variance of this gradient. We prove the variance is exponentially small in the number of digits.

What does that mean? It means that, across different labeling functions, the model sees almost the same gradient. So it takes almost the same update step regardless of the true task. In other words: the model is moving, but it’s not learning anything useful.

How do we prove this tiny variance? Here’s where Fourier analysis comes in. Parseval’s theorem states that norms are invariant to changes of basis, so expressing the gradient elements in the basis of our multiplication family should seal the proof. There’s a twist, though: Parseval’s theorem works for orthonormal bases, but our family of multiplication functions isn’t exactly orthonormal. So we extend the argument and show they are almost orthonormal. To do that, we express them using Fourier characters, reducing inner products to manageable scalar terms.

The takeaway: even with full supervision, even with gradient updates, an auto-regressive transformer with these properties cannot learn to multiply. Why? Because it is fundamentally confused about the structure of the problem. And this confusion is formal, provable, and rooted in the architecture.

If you’re into theory, we think you’ll enjoy the full proof. 📄 Full paper: openreview.net/forum?id=Dw1r…
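For intuition, here is a minimal sketch of the shape of the variance statement in my own notation (the symbols F, L_f, and the constant c are assumptions for illustration; the precise definitions and constants are in Appendix A of the paper):

```latex
% Hedged sketch, not the paper's exact statement: f is a labeling function
% drawn uniformly from the "multiplication-like" family F over n-digit
% inputs, L_f(w) is the loss at fixed weights w, and c > 0 is some constant.
\[
\max_i \operatorname{Var}_{f \sim \mathcal{F}}\!\left[\big(\nabla_w L_f(w)\big)_i\right]
\;\le\; 2^{-c\,n}.
\]
% Consequence: for any two labeling functions in the family, the SGD updates
% are nearly identical, so the weight trajectory is almost independent of
% which task is actually being learned.
```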
Amnon Shashua retweeted
🧵 In our last thread, we talked about learning to reason, and how reasoning is fundamentally search under uncertainty. But here’s the catch: our data (proofs, code, textbook solutions) shows only the final successful path, not the trial and error. So how can we learn to reason like a pro from just the polished results? Let’s dig in 👇

A natural approach is Supervised Fine-Tuning (SFT) on Chain-of-Thought (CoT) data. Also called imitation learning, the idea is to mimic expert behavior by observing examples. But we prove this doesn’t work, for two fundamental reasons:
– Distribution drift
– Lack of search

Distribution drift is like learning to balance a pole on your finger by watching experts on YouTube. It looks stable and easy, but try it yourself and the pole wobbles in ways you’ve never seen. That mismatch between your actions and the training data is drift. And it kills generalization.

Lack of search is the second issue. Reasoning well requires exploration, guessing, backtracking. These aren’t just bells and whistles; they’re the core mechanism. But imitation learning never teaches you how to search, only how to repeat.

Maybe RL can help? Reinforcement Learning from outcome-only reward (e.g., did the answer match?) seems like a fix. But we prove this also fails because the reward signal is too sparse. It’s like a beginner programmer being asked to write a compiler and getting no feedback except “wrong”.

Other clever ideas, like Monte Carlo Tree Search or Tree-of-Thoughts, help a bit. But we show that many of these suffer from exponential time complexity at training or inference. Scalable reasoning needs more than clever heuristics.

Enter the diligent learner: a new RL algorithm we propose for training LLMs to search in context. It avoids drift. It actively searches. And most importantly, it comes with polynomial-time guarantees. A formal step forward.

How? We explicitly model reasoning as a depth-first search over a tree of ideas. The diligent learner uses:
– Backtracking to recover from dead ends
– Selective peeking at correct traces to guide search
– Reverse induction to prevent an exponential blowup of exploration

It’s like how humans study: trying, failing, checking the solution, and trying again. The result: a learner that doesn’t just imitate reasoning but actually learns how to reason. With formal guarantees.

📄 Full paper here: openreview.net/forum?id=Dw1r…
🧵 Next up: we’ll dive into some of the key proof techniques. Stay tuned.
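To make the search framing concrete, here is a minimal, illustrative Python sketch of depth-first search with backtracking over a tree of reasoning steps. The helpers `propose_steps`, `is_dead_end`, and `is_solution` are hypothetical placeholders; the actual diligent learner also relies on selective peeking and reverse induction, which are not reflected in this sketch.

```python
# Illustrative sketch only: depth-first search with backtracking over a
# "tree of ideas". The helper functions are hypothetical placeholders,
# not the paper's algorithm. A state is a tuple of reasoning steps so far.

def depth_first_reason(state, propose_steps, is_dead_end, is_solution, max_depth=20):
    """Return a list of reasoning steps extending `state` to a solution, or None."""
    if is_solution(state):
        return []
    if max_depth == 0 or is_dead_end(state):
        return None  # backtrack: this line of thought is stuck
    for step in propose_steps(state):            # actor-style proposals
        child = state + (step,)                   # extend the reasoning chain
        suffix = depth_first_reason(child, propose_steps, is_dead_end,
                                    is_solution, max_depth - 1)
        if suffix is not None:                    # a child led to a solution
            return [step] + suffix
    return None  # all proposals failed; backtrack to the parent step
```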
We just released two papers from AAI describing recent work in our pursuit of building a “super-intelligent” AI that can match, and exceed, human-level expertise in STEM.

The first paper states that if we want to reach “super intelligence”, then we must change the way we train our LLMs. In particular, the work argues that:
1) Expert intelligence requires precise reasoning.
2) Precise reasoning requires efficient search in the space of ideas, trying out various explorative lines of thought until one reaches a solution to a difficult problem. The thought process can be represented as a “reasoning search tree”.
3) Expert-level, difficult problems require deep search trees.
4) We prove that existing modern training techniques can only learn shallow search trees.
5) We’ve constructed “the diligent learner”, a training method that can efficiently build deep search trees, reaching and exceeding human-level reasoning capabilities.
The paper: openreview.net/forum?id=Dw1r…

The second paper makes the point that state-of-the-art reasoning language models like o3-pro, Gemini 2.5 Pro, and Grok 4 Heavy are not really “PhD level”, as the hype goes. Benchmarks and competitive programming contests are not always a true testament to AI capabilities. Take, for example, CodeForces, where OpenAI's o3 attains a very high score (top 0.02% ranking). Since Dynamic Programming is a major element of CodeForces problems, one would expect state-of-the-art LRMs to deeply understand DP, right? We put that to the test and devised a DP problem generation engine. We took 120 of the “hard” problems and gave them to leading LRMs with all the “help” we could give them, including few-shot examples. The performance is a flat line, practically zero. The dataset will be available for researchers to play with.
The paper: arxiv.org/abs/2507.13337
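For readers unfamiliar with the genre, here is a toy example of the kind of dynamic programming exercise competitive programmers solve routinely. It is purely illustrative and not taken from the paper's benchmark, whose problems are far harder.

```python
# Toy illustration of dynamic programming (not a benchmark problem):
# minimum number of coins needed to reach a target amount.

def min_coins(coins, target):
    """Classic DP over subproblems: best[a] = fewest coins summing to a."""
    INF = float("inf")
    best = [0] + [INF] * target
    for amount in range(1, target + 1):
        for c in coins:
            if c <= amount and best[amount - c] + 1 < best[amount]:
                best[amount] = best[amount - c] + 1
    return best[target] if best[target] != INF else -1

print(min_coins([1, 3, 4], 6))  # -> 2 (3 + 3)
```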
Interesting NYTimes piece on the “mis-alignment” of conversational AI (i.e., ChatGPT), making the point that optimizing for “engagement” yields destructive sycophantic behavior that “plays” with people's minds and elicits delusional behavior among “normal” people. bit.ly/45r0R5l

It reminds me of a paper I wrote with my colleague @shai_s_shwartz back in 2020, well before the rise of ChatGPT, analyzing the “AI alignment” problem while proving that certain destructive elements are unavoidable (even without “super-intelligent AI”). The paper is theoretical, but we gave one example, which at the time seemed “science fictional”, of how AI misalignment can make a chatbot go wrong. Here is an excerpt from the paper:

“Consider the design of a conversational chat-bot. Assume we start with the data methodology and train a monster network on masses of text data from the web. This actually has been done recently in project “Meena” [1]. Assume it is good enough to deploy into the real world with hundreds of millions of users who find it quite entertaining to interact with a “seemingly intelligent” chat-bot. Once in the real world, we may use the RL agent to learn from experiences and optimize some (unknown to the public) reward function. To simplify matters, lets assume that the reward function is altruistic (and transparent to society) - say “make people happy”. Seems like a worthy goal to optimize. Here again the RL agent can find an edge in the solution space unanticipated by the human designer. For instance, the RL agent may notice that by lowering people’s IQ they tend to be happier. This can be achieved by chats that strive to manipulate society into a life of carelessness and fun. This scenario is somewhat of a catastrophe as it could take a generation until it is noticed — if it will ever be noticed at all.”

Full paper here: bit.ly/4ld2t7m

Now, five years later, replace “happiness” with “engagement” and replace “people becoming dumb as a result” with “people becoming delusional”. Saying that AI systems must be properly tested for mis-alignment is not effective, because mis-alignment is here to stay. The question is: what guardrails should society impose on conversational AI?
An exciting milestone at Mentee Robotics! We’ve just released a teaser for the new @MenteeBot V3.0, a fully vertically integrated robot packed with cutting-edge innovations.

It features custom-designed actuators that deliver three times more power than off-the-shelf alternatives, ensuring superior performance. A hot-swappable battery system enables uninterrupted 24/7 operation, making it ideal for real-world production environments. Designed to handle payloads of up to 25 kg, MenteeBot V3.0 takes on lifting tasks that would typically be strenuous for human workers over extended periods. Its hands offer a strong grip, with a pinch force of 30 N per finger, providing impact resistance and precise manipulation.

With these advancements and many more, MenteeBot V3.0 is set to redefine labor-intensive tasks with efficiency and reliability.
Shai Shalev-Shwartz talks about “foundation models for robots”, leveraging unsupervised RL in a simulator followed by Sim2Real. Unlike autonomous driving, the range and variety of tasks a robot can perform are open-ended. Ideally one would like to verbalize a task (“instruct”), together with a visual demonstration, and have the robot imitate and generalize. This technology is developed by Mentee Robotics. @MenteeBot
1/ Chat applications are impressive because they can understand instructions or infer a task from just a few examples. At the Tech.AI.Robotics conference today, I presented how humanoid robots can achieve a similar capability. A thread 👇 piped.video/watch?v=1_0iWBO4…
See below the CES press conference on the latest status of autonomous driving, as viewed from the perspective of @Mobileye. The theme was what it would take to revolutionize transportation, approached through the classic machine-learning precision/recall tradeoff. I positioned the different schools of thought in the industry on the precision/recall plane and then went through, in some detail, how Mobileye is approaching the problem. The full presentation: bit.ly/4fWzO3o
Amnon Shashua retweeted
Excited to give the world a first glimpse into Menteebot ver 3, unveiled tonight for the first time at the NVIDIA CEO’s keynote during CES 2025 in Las Vegas. Stay tuned—Menteebot V3 is coming soon! #humanoid #humanoidrobotics #ces2025
Humanity’s greatest breakthroughs arise from profound expertise in focused domains. What does it take to build Artificial Expert Intelligence (AEI)? We’ve founded AAI to explore this question. 🌟

While LLMs excel in domain-specific tasks, they struggle with truly novel challenges. Why? Because such challenges require an elusive quality: INTELLIGENCE. We believe the (probably approximately correct) path to expert intelligence lies in reasoning. 🧠

Let’s talk reasoning: a reasoning chain is a sequence of claims, each building logically on the last, to solve a problem. A common framework is actor-critic-search:
• Actor: generates proposals for the next claim.
• Critic: scores these proposals to refine the reasoning chain.
• Search: builds a tree where root-to-leaf paths represent reasoning chains.

But reasoning is fragile. A single incorrect claim can collapse the entire chain. We define the probability of error at a reasoning step as epsilon (ε). Three types of reasoners emerge:
1. Intuitive reasoners:
• Fixed ε, set during training.
• Precision drops for longer chains, effectively limiting chain length to ~1/ε.
2. Logical reasoners:
• ε = 0 (perfect claims).
• Chains can be arbitrarily long but are limited in scope.
3. Scientific reasoners:
• ε decreases with observations.
• More resources → smaller ε → longer, more accurate chains.

The paper formalizes the notion of “scientific reasoning” by extending tools from the Probably Approximately Correct (PAC) model of learning.

So, what’s the difference between learning and reasoning? 🤔 In both cases we aim to learn a function g: X -> Y. In PAC learning, we come to the problem with prior knowledge in the form of a hypothesis class H, and we assume that g belongs to H. In PAC reasoning, each reasoning step involves a context-based creation of a hypothesis class (constructed by the actor) and a mini-learning problem (performed by the critic). So the crux of the difference is that the hypothesis class is not fixed but is dynamic and changes at each reasoning step. This yields the ability to learn complex functions by decomposing them into sub-problems.

Returning to intelligence: @fchollet distinguishes skill from intelligence, defining intelligence as efficient learning at test time. While insightful, to someone who has worked a lot on online learning, the distinction between training and inference feels technical and incomplete. Instead, we can refer to the actor-critic pair as a model of skill-intelligence: the quality of the actor is the skill of the reasoner, while the quality of the critic is the level of intelligence.

Curious? Read the full paper here: doubleai.com/wp-content/uplo…
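As a rough illustration of the actor-critic-search framework described above, here is a minimal Python sketch. The actor and critic interfaces, the beam width, and the chain length are hypothetical placeholders rather than the paper's construction; the last lines also illustrate why a fixed per-step error ε caps an intuitive reasoner's chain length at roughly 1/ε.

```python
# Minimal, illustrative actor-critic-search loop (hypothetical interfaces,
# not the paper's construction).

def reason(problem, actor, critic, max_steps=10, beam=3):
    """Greedy best-first growth of a reasoning chain.

    actor(problem, chain)      -> list of candidate next claims
    critic(problem, chain, c)  -> score in [0, 1] for candidate claim c
    """
    chain = []
    for _ in range(max_steps):
        candidates = actor(problem, chain)[:beam]
        if not candidates:
            break
        best = max(candidates, key=lambda c: critic(problem, chain, c))
        chain.append(best)
    return chain

# Why a fixed per-step error eps caps chain length around 1/eps:
# P(chain of length L is fully correct) = (1 - eps) ** L, which decays
# quickly once L approaches 1/eps.
eps, L = 0.05, 20
print((1 - eps) ** L)  # ~0.36: the chain already likely contains a wrong claim
```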
The company AAI, founded a year ago, is still in stealth mode, but today we are doing a partial unstealthing.

A year and a half ago I gave a lecture at Reichman University about the latest status of large language models (LLMs). During the Q&A, Gil Kalai - a renowned mathematician - asked whether I see this technology one day reaching the level of a “great mathematician”, or a great scientist in general. My immediate instinct was negative. Later I chatted with my partner in science and technology, @shai_s_shwartz, to reflect on this question. First, can we actually prove that the trajectory of LLMs - with CoT, Tree-of-Thoughts, and today o1 - is subject to a performance ceiling of some sort? It is not obvious at all, because these systems keep on improving. Second, if there is a fundamental ceiling, then what would be the required leap to overcome it? We concluded that if we could answer both questions, we could build something that would surpass anything we have done in the past.

A few months later, three of my doctoral students graduated: @YoavLevine, @or_sharir and Noam Wies. Nati Linial’s doctoral student Gal Biniamini also graduated, and all six of us founded AAI to pursue those two questions. A year later, we have a pretty good grasp of the two questions, and the paper we released today provides a conceptual framework for an AI that can become a “great scientist” and, along the way, proves why current technology would not get there. The team has also been working on implementing this framework in the area of “algorithmic expertise”, using CodeForces as a proving ground. The results are very promising and will be shared in due course.

The link to the paper: bit.ly/3Bn2V1n
The link to the blog: bit.ly/3ZhGZwz
We just published a scientific paper on the safety of autonomous vehicles. What makes this particular paper interesting is that there are non-obvious considerations that evolved over a number of years while building AVs for production.

We started the “safety” journey back in 2017, when @shai_s_shwartz and I focused on the problem of how to guarantee “no lapse of judgment” in AV decision-making, resulting in the RSS paper (which later became the core of many standards). In other words, how to make sure the planning engine balances safety and usefulness with guarantees. At the time, validating the Planner (aka Driving Policy) was the toughest problem, because it is a closed loop, so offline validation is not very useful, and any change to the Driving Policy code necessitates revalidating the software online. So it did not make sense to perform validation using the statistical approach of collecting miles driven.

Over the years we focused on other sources of errors, which range from HW and SW failures to “AI bugs”, since the Perception engine (fusing cameras and other sensors) is based on machine-learning principles. Simply requiring that the “mean time between failures” be above some threshold (say, human driving crash statistics) is not sufficient because (i) human driving errors are dominated by illegal activities (like DUI), and (ii) humans are subject to a “duty of care” principle, which means they should avoid taking “unreasonable” risks even when those are very rare. For example, a baby lying on the highway is an extremely rare event and might not appear in any validation set, yet a human driver will take action to prevent a collision because this event constitutes an “unreasonable risk”.

In our latest paper, we combine both miles driven and the notion of unreasonable risk into a theoretical framework, while also tackling the “fusion” problem in a novel manner. Read it here: bit.ly/4ic23gT @Mobileye
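As a back-of-the-envelope illustration of why pure miles-driven validation is so demanding (the target rate here is my own assumed number, not the paper's), the standard rule-of-three bound gives:

```latex
% Illustrative rule-of-three bound (assumed numbers, not from the paper):
% to claim a failure rate below p = 10^{-7} per mile with 95% confidence
% after observing zero failures, one needs roughly
\[
N \;\gtrsim\; \frac{3}{p} \;=\; 3\times 10^{7}\ \text{failure-free miles},
\]
% and the estimate must be redone whenever the closed-loop Driving Policy
% changes, which is why miles-driven statistics alone do not scale.
```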
It’s an honor to see @OrCam Hear recognized by @TIME magazine as one of the 100 best inventions of 2024. Our team has harnessed AI in a truly groundbreaking way, crafting a solution that addresses a fundamental social challenge for those who struggle to hear in noisy environments. OrCam Hear uses deep learning, trained on hundreds of thousands of hours of audio, to achieve selective amplification and background-noise reduction in real time, even in complex settings with multiple speakers. Through sophisticated voice-isolation algorithms and a user-friendly app interface, this innovation empowers users to engage in meaningful social interactions that were previously inaccessible. OrCam’s dedication to creating “AI as a companion” solutions continues to drive our mission of bridging technology and social inclusion. bit.ly/4f7NBV2
Mobileye held its first “Driving AI” day today, with a detailed two-hour presentation by myself and Prof. Shai Shalev-Shwartz, Mobileye’s CTO, going over some stealth developments toward solving autonomy that we have built over the years. Just as a teaser, @Mobileye developed a transformer architecture for autonomous driving that is 100x more efficient than the state-of-the-art transformers used in Gen-AI applications. Anyone interested in machine learning, generative AI, transformers, end-to-end learning, the shortcut-learning phenomenon, and compound AI systems will find the clip below interesting: piped.video/92e5zD_-xDw
Deep and thoughtful talk by Prof. Lior Wolf, co-CEO of Mentee Robotics, on the fundamental AI technologies developed for the @MenteeBot. Among the various components, the Sim2Real tech really stands out. The deep dive starts at minute 20 of the talk: bit.ly/3AmSNEW
Another cool demo of @MenteeBot. The actual training for this demo was done in a simulator, with Sim2Real techniques that Mentee engineers have been perfecting over the last two years; the transfer to the real world took only a few minutes. The task: training the robot to hold and push a shopping cart while following a person (in a wheelchair), acting as a companion. These are all baby steps in building up MenteeBot to perform useful tasks down the road. The next-gen MenteeBot (V3) is coming out in a few months with significant mechanical and hardware updates, while the software and AI continuously improve. bit.ly/4cbhi5b
Another nice clip of @MenteeBot from Mentee Robotics. This is an end-to-end demo of a “follow me” task, where the user instructs MenteeBot to perform all sorts of locomotion maneuvers. The instruction goes to a proprietary LLM that acts as an “agent” by breaking the open-ended instruction down into system calls. System calls involve locomotion (selecting motion policies), perception, navigation and more. This demo is part of a more elaborate functional design in which “follow me” also includes an onboarding session where MenteeBot builds a cognitive map on the fly and memorizes objects and locations of interest, which are useful for following later instructions.
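As a purely illustrative sketch of the agent pattern described above (the function names, system-call set, and LLM interface are my own placeholders, not Mentee's actual API), an instruction-to-system-calls loop might look like this:

```python
# Illustrative agent loop: an LLM planner breaks a free-form instruction
# into system calls that the robot stack executes. All names here are
# hypothetical placeholders, not Mentee Robotics' interfaces.
from typing import Callable, Dict, List

SystemCall = Dict[str, object]  # e.g. {"call": "follow_person", "args": {...}}

def run_instruction(instruction: str,
                    plan_with_llm: Callable[[str], List[SystemCall]],
                    handlers: Dict[str, Callable[..., None]]) -> None:
    """Plan with the LLM, then dispatch each system call to its handler."""
    for step in plan_with_llm(instruction):
        name, args = step["call"], step.get("args", {})
        if name not in handlers:
            raise ValueError(f"unknown system call: {name}")
        handlers[name](**args)  # locomotion / perception / navigation, etc.

# Example wiring with stub handlers:
handlers = {
    "select_locomotion_policy": lambda policy: print("policy:", policy),
    "navigate_to": lambda target: print("navigating to", target),
    "follow_person": lambda person_id: print("following", person_id),
}
```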