#Girldad of twins. Leading GenAI @ Meta (llama, imagine, meta ai and more)

Menlo Park, CA
Joined March 2024
Ahmad Al-Dahle retweeted
📢We show that continuous latent reasoning has a theoretical advantage over discrete token reasoning (arxiv.org/abs/2505.12514): for a graph with n vertices and diameter D, a two-layer transformer with D steps of continuous CoTs can solve the directed graph reachability problem, while the best known result for constant-depth transformers with discrete CoTs requires O(n^2) decoding steps. The underlying idea is like classical vs. quantum: continuous thoughts can encode multiple candidate graph paths simultaneously and perform an implicit "parallel search" over that "superposition", while a discrete token sequence can only take one path at a time.
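A rough way to see the "superposition" intuition in code (my sketch, not the paper's construction): let a plain set-valued BFS frontier stand in for the continuous thought vector, so one propagation step advances every candidate path at once.

```python
# Sketch: a continuous thought can hold a whole frontier of candidate
# vertices at once, so D propagation steps cover any path of length <= D.
# A discrete token trace, by contrast, commits to one vertex per step.

def reachable_in_D_steps(adj, source, target, D):
    """Parallel-frontier search: one 'step' expands every candidate at once."""
    frontier = {source}  # "superposition" of all paths explored so far
    for _ in range(D):
        if target in frontier:
            return True
        frontier |= {v for u in frontier for v in adj.get(u, [])}
    return target in frontier

# Directed graph with diameter 3: 0 -> 1 -> 2 -> 3, plus a distractor branch.
adj = {0: [1, 4], 1: [2], 2: [3], 4: []}
print(reachable_in_D_steps(adj, 0, 3, 3))  # True
```

A single-path decoder exploring this graph would have to backtrack out of the distractor branch, which is where the O(n^2) discrete-step cost comes from.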
Ahmad Al-Dahle retweeted
Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ai.meta.com/blog/v-jepa-2-wo… As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️ai.meta.com/blog/v-jepa-2-wo…
Ahmad Al-Dahle retweeted
Since the original "Attention Is All You Need" Transformer, 300 architectures have been contributed to the Transformers library. See the rise and fall of these architectures over time; crazy to see how BERT remains on top, but Llama is catching up fast!
Ahmad Al-Dahle retweeted
Our CRAG-MM Challenge (KDD Cup 2025) invites you to develop innovative multi-modal, multi-turn question-answering systems with a focus on RAG, using agentic tools to retrieve information. The goal is to improve visual reasoning: aicrowd.com/challenges/meta-…
Ahmad Al-Dahle retweeted
You can now run Llama 4 on your local device!🦙 We shrank Maverick (402B) from 400GB to 122GB (-70%) and Scout from 115GB to 33.8GB (-75%). Our Dynamic 1.78-bit GGUFs ensure optimal accuracy by selectively quantizing layers. GGUFs: huggingface.co/collections/u… Guide: docs.unsloth.ai/basics/tutor…
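A back-of-envelope check on those sizes (my arithmetic, assuming the files are essentially all weights and GB means 10^9 bytes): the average bits per weight comes out well above the headline 1.78 bits, which is consistent with "selectively quantizing layers", i.e. some layers are kept at higher precision.

```python
# Back-of-envelope: average bits per weight implied by the file sizes above.
# Assumes the whole file is weights and GB = 10^9 bytes (both rough).

def bits_per_weight(size_gb, n_params_b):
    return size_gb * 1e9 * 8 / (n_params_b * 1e9)

print(round(bits_per_weight(122, 402), 2))   # Maverick: ~2.43 bits/weight
print(round(bits_per_weight(33.8, 109), 2))  # Scout: ~2.48 bits/weight
```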
Ahmad Al-Dahle retweeted
Llama 4 Intelligence Index Update: We have now replicated Meta's claimed values for MMLU Pro and GPQA Diamond, pushing our Intelligence Index scores for both Scout and Maverick higher. Key update details:
➤ We noted in our first post 48 hours ago that we noticed discrepancies between our measured results and Meta's claimed scores on our multi-choice eval datasets (MMLU Pro and GPQA Diamond)
➤ After further experiments and close review, we have decided that, in accordance with our published principle against unfairly penalizing models where they get the content of questions correct but format answers differently, we will allow Llama 4's answer style of 'The best answer is A' as a legitimate answer for our multi-choice evals
➤ This leads to a jump in score for both Scout and Maverick (largest for Scout) in 2 of the 7 evals that make up the Artificial Analysis Intelligence Index, and therefore a jump in their Intelligence Index scores
➤ Scout's Intelligence Index has moved from 36 to 43, and Maverick's from 49 to 50.
Overall, we continue to conclude that both Scout and Maverick are very impressive models and a significant contribution to the open weights AI ecosystem. While DeepSeek V3 0324 maintains a small lead over Maverick, we continue to note that Maverick has ~half the active parameters (17B vs 37B) and ~60% of the total parameters (402B vs 671B), while also supporting image inputs.
All our tests have been performed on the Hugging Face release version of the Llama 4 weights for both Scout and Maverick, including testing via a range of third-party cloud providers. None of our eval results are based on the experimental chat-tuned model provided to LMArena (Llama-4-Maverick-03-26-Experimental). We can also share that we have observed third-party cloud APIs generally stabilizing over the last 48 hours.
We will soon release endpoint-level comparison data to allow developers to understand whether any cloud providers are still serving versions of Llama 4 with accuracy issues.
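The formatting-leniency policy described above might look something like this in practice. This is a hypothetical sketch, not Artificial Analysis's actual grading code: a lenient extractor that accepts both a bare letter and the verbose "The best answer is A" style.

```python
import re

# Hypothetical sketch of a lenient multiple-choice grader: accept both a
# bare letter ("A", "(C)", "B.") and a verbose style ("The best answer is A"),
# so models are not penalized for answer formatting alone.
def extract_choice(answer):
    text = answer.strip()
    # Verbose style: "The best answer is A" / "the answer is (b)"
    m = re.search(r"answer is\s*\(?([A-D])\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Bare letter, optionally wrapped or punctuated.
    m = re.match(r"\(?([A-D])\)?\.?$", text, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(extract_choice("The best answer is A"))  # A
print(extract_choice("(c)"))                   # C
```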
Ahmad Al-Dahle retweeted
llama 4 scout on @groqinc paired with @elevenlabsio is incredible for multilingual voice agents. insanely smooth even switching between different languages thanks to low latency. and for those who have been asking about its turkish - i've been testing and it's pretty good. :)
Llama 4 supports 12 different languages out of the box, making it a powerful brain for your voice agents! Running on @GroqInc Cloud and integrated with @ElevenLabsDevs Conversational AI, it creates a fantastic multilingual agent setup. Give it a try below! 👇
Ahmad Al-Dahle retweeted
Llama-4-Maverick is CRAZY GOOD at powering agents 🤯 It's now the top open model on the smolagents LLM leaderboard, beating the much larger DeepSeek-R1! Congrats @ThomasScialom and team!
We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models. That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners. We've also heard claims that we trained on test sets -- that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations. We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value.
Ahmad Al-Dahle retweeted
my detailed personal benchmarks ran overnight.
- Scout is best at summarization and function calling, exactly what you want from a cheap long-ctx model. this is going to be a workhorse in coding flows and RAG applications. the single-shot ICL recall is very, very good.
- Maverick was built for replacing developers and doing agentic / tool-calling work. it is very consistent in instruction following, very-long-context ICL, and parallel multi-tool calls. this is EXACTLY the model and capabilities i want in my coder-style flows. it is not creative, but i have V3 and R1 for that. multimodal is very good at OCR and charts and graphs, outperforming both 4o and Qwen 2.5 VL 72B in my typical tests. the only thing i haven't tested is computer use, but i doubt it will beat Sonnet or Qwen at that, as both models were explicitly trained for it. the output is kind of bland (hence the constant 4o comparisons) with little personality, which is totally fine: this is a professional tool built for professional work (testing it on RP or the like will lead to terrible results). i'm not sure what more you could ask for in an agent-focused model.
- V3-0324 is not consistent enough with tool-calling output to be useful, but when it gets it right, it is always the clear and best choice. however, it excels at creativity, problem solving, and multi-turn interactions. this will continue to be my non-function-calling workhorse. the 131k ctx feels remarkably restrictive now, though. i am going to do some more long-ctx testing on V3 because i'm almost positive i can get more out of it (200k - 300k ideally), but i think this is where MLA is going to show its tradeoffs. FIM and completion are also huge V3-specific wins, and places where it not only excels but is really in a league of its own.
- R1 continues to be the smartest and most creative model available when used single-shot, single-turn, and prompted correctly. it's the genius in the corner who can't make eye contact, but if you properly specify a problem it will be solved with an incredibly high degree of confidence. function calling (really all of the V3 features) works as expected, but the <think> formatting is a bit half-baked, and doubly so when you use it with tool use. however, with proper parsing and sampling effort, it's a truly remarkable model.
- All of these models benefit tremendously from proper sampling and lovingly crafted matmuls and accumulations. they are all much better and smarter than what is generally available from lmsys or openrouter.
I am incredibly bullish on Behemoth and R2 and cannot wait to fold them into my daily workflow. I have never been happier about the state of open source models since the R1 launch; used correctly, they provide a viable alternative to frontier models for the first time. I am happy to answer any specific questions, but this is probably my last general post on this. i gotta get back to work ...
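The "half-baked <think> formatting" issue above is typically handled by splitting the reasoning block off before any tool-call parsing. A minimal sketch (my own, using only the <think> tag mentioned in the post; the tolerance for a missing closing tag is one of the rough edges described):

```python
import re

# Sketch: separate R1-style <think> reasoning from the final answer before
# tool-call parsing. Tolerates a missing closing </think> tag, one of the
# half-baked formatting cases mentioned above.
def split_think(output):
    m = re.search(r"<think>(.*?)(?:</think>|$)", output, re.DOTALL)
    if not m:
        return "", output.strip()
    thoughts = m.group(1).strip()
    answer = (output[:m.start()] + output[m.end():]).strip()
    return thoughts, answer

thoughts, answer = split_think('<think>check the args</think>{"tool": "search"}')
print(answer)  # {"tool": "search"}
```

Only `answer` is then handed to the tool-call parser, so stray reasoning text cannot corrupt the function-call JSON.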
Ahmad Al-Dahle retweeted
Llama-4 Series on BigCodeBench-Hard *Inference via NVIDIA NIM
Llama-4-Maverick: ranked 41st/192, similar to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13. 29.1% Complete, 25% Instruct
Llama-4-Scout: ranked 97th/192. 16.9% Complete, 16.9% Instruct
Also, new visuals on the leaderboard!
1. Recommendation --- plot on a mixture of top and recent models
2. Time View --- plot on a release time scale
3. Score Meter --- better indicator inside the table
I'll present BigCodeBench at #ICLR2025 on Fri 25 Apr, 3:30 p.m. - 5:00 p.m. CST in Oral Session 4B. See you in 🇸🇬! See more results: bigcode-bench.github.io/
Ahmad Al-Dahle retweeted
Llama 4 takes 43 seconds to analyse 900k tokens!
Ahmad Al-Dahle retweeted
Meta Llama, number four,
Coming Saturday, explore!
Zuck announces, proud and loud,
Fans and devs, a buzzing crowd.
Llama 4, it’s on the way,
Fireworks AI scrambles—hey!
Startups racing, GPUs hot,
“Launch the model—wait we cannot!”
Llama Llama, context long,
Support is deep, compute is strong.
Servers humming, tests to run,
Saturday deadline—not so fun!
Meta’s dropping shiny tech,
Engineers are neck-and-neck.
Zuck smiles wide, with tech delight,
Llama 4 debuts tonight!
Fireworks AI, fast and bold,
Integrations to uphold.
Launch is ready, code reviewed—
Fireworks shouts: “Thanks, Zuck, dude!”
Ahmad Al-Dahle retweeted
Congratulations to @togethercompute, @FireworksAI_HQ , @databricks, @DeepInfra, @CentML_Inc and @GroqInc on having day-one Llama 4 inference endpoints live! Keep an eye out for endpoints coming this week from @Azure, @CerebrasSystems, @SambaNovaAI and more. Both @Meta's Llama 4 Scout and Maverick only have 17B active parameters - so although their total sizes are relatively large at 109B and 402B respectively, these models have the potential to enable extremely fast and efficient inference. All providers serving Llama 4 so far are offering it at cheaper prices than their Llama 3.3 70B endpoints, including for Llama 4 Maverick. This makes Maverick an incredibly compelling model with a wide range of inference options.
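The "fast and efficient inference" point follows from per-token compute scaling with *active* rather than total parameters. A rough comparison using the standard ~2 FLOPs per active parameter per token estimate (my arithmetic, with the parameter counts quoted above):

```python
# Rough per-token compute: ~2 FLOPs per active parameter per token (the
# standard dense-forward estimate). MoE models only pay for the experts a
# token is routed to, so Maverick's 402B total behaves more like a 17B
# dense model at inference time.
def flops_per_token(active_params_b):
    return 2 * active_params_b * 1e9

models = [("Scout", 17, 109), ("Maverick", 17, 402),
          ("Llama 3.3 70B (dense)", 70, 70)]
for name, active, total in models:
    print(f"{name}: {flops_per_token(active) / 1e9:.0f} GFLOPs/token, "
          f"{active / total:.0%} of weights active")
```

By this estimate both Llama 4 models need roughly half the per-token compute of dense Llama 3.3 70B, which is consistent with providers pricing them cheaper.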
Ahmad Al-Dahle retweeted
Kraken literally cooked with Llama 4 on Groq. Insane speed!
Ahmad Al-Dahle retweeted
L4 Maverick feels very much like a smarter 4o to me. i feel pretty confident in saying that was an explicit goal. i can also confirm it works very well at 1M+ ctx len. i don't have any 10M+ ctx evals but i'll try to throw something together just to satisfy my own curiosity
Ahmad Al-Dahle retweeted
Llama 4 (Maverick) easily one-shots my Brick Breaker vibe check. Output speed for 700+ words felt on par with ChatGPT/Claude on a good day using vLLM; excited to see how much faster we can run it!
Ahmad Al-Dahle retweeted
llama-4-scout-17b-16e-instruct prompt: write a p5.js script that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
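The physics that prompt asks for largely reduces to one step: reflecting the ball's velocity about the wall's unit normal, damped by restitution (bounce) and friction (tangential drag). A minimal sketch of that core step in plain Python (not p5.js; for a spinning hexagon you would first subtract the wall's own velocity at the contact point):

```python
# Core collision step for the bouncing-ball prompt: reflect velocity about
# a wall's unit normal (nx, ny), damping the normal component by
# restitution and the tangential component by friction.
def bounce(vx, vy, nx, ny, restitution=0.9, friction=0.99):
    v_dot_n = vx * nx + vy * ny
    if v_dot_n >= 0:                 # moving away from the wall: no bounce
        return vx, vy
    vnx, vny = v_dot_n * nx, v_dot_n * ny   # normal component
    vtx, vty = vx - vnx, vy - vny           # tangential component
    return (vtx * friction - vnx * restitution,
            vty * friction - vny * restitution)

# Ball falling onto a floor (normal pointing up): vertical speed flips and
# shrinks, horizontal speed barely slows.
print(tuple(round(v, 2) for v in bounce(3.0, -10.0, 0.0, 1.0)))  # (2.97, 9.0)
```

Gravity is then just `vy -= g * dt` each frame, and the hexagon's rotation only changes which wall normal you feed in (and the wall's velocity you subtract before calling `bounce`).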
🚨 o3-mini crushed DeepSeek R1 🚨 "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
Ahmad Al-Dahle retweeted
Llama 4 Maverick has big model smell, thank you so much @AIatMeta 🙏🏼 I have some prompts for an upcoming eval and based on those I tested, it is on the level of other frontier models. Really happy :)
👀👀 🦙🦙🦙🦙
BREAKING: Meta's Llama 4 Maverick just hit #2 overall - becoming the 4th org to break 1400+ on Arena!🔥 Highlights: - #1 open model, surpassing DeepSeek - Tied #1 in Hard Prompts, Coding, Math, Creative Writing - Huge leap over Llama 3 405B: 1268 → 1417 - #5 under style control Huge congrats to @AIatMeta — and another big win for open-source! 👏 More analysis below⬇️