Yuandong Tian · Jun 18, 2025 · 5:20 AM UTC

Yuandong Tian

Ahmad Al-Dahle retweeted

Yuandong Tian

@tydsh

Jun 18

📢We show that continuous latent reasoning has a theoretical advantage over discrete token reasoning (arxiv.org/abs/2505.12514): For a graph with n vertices and graph diameter D, a two-layer transformer with D steps of continuous CoTs can solve the directed graph reachability problem, while the best known result of constant-depth transformers with discrete CoTs requires O(n^2) decoding steps. The underlying idea is like classic vs quantum: continuous thoughts can encode multiple candidate graph paths simultaneously and does an implicit "parallel search" with such "superposition", while discrete token sequence can only take one path at a time.

Reasoning by Superposition: A Theoretical Perspective on Chain of...

Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate...

arxiv.org

169

1,041

AI at Meta · Jun 11, 2025 · 2:35 PM UTC

Ahmad Al-Dahle retweeted

AI at Meta

@AIatMeta

Jun 11

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ai.meta.com/blog/v-jepa-2-wo… As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️ai.meta.com/blog/v-jepa-2-wo…

349

105

1,915

Lysandre · May 19, 2025 · 5:42 PM UTC

Ahmad Al-Dahle retweeted

Lysandre

@LysandreJik

May 19

Since the initial Attention is All you Need, 300 architectures have been contributed to Transformers. See the rise and fall of these architectures over time; crazy to see how BERT remains on top, but Llama is catching up fast!

145

Rohit Patel · Apr 22, 2025 · 6:14 PM UTC

Ahmad Al-Dahle retweeted

Rohit Patel @_Rohit_Patel_

Apr 22

Our CRAG-MM Challenge (KDD Cup 2025) invites you to develop innovative multi-modal, multi-turn question-answering systems with a focus on RAG, using agentic tools to retrieve information. The goal is to improve visual reasoning: aicrowd.com/challenges/meta-…

AIcrowd | Meta CRAG - MM Challenge 2025 | Challenges

Improve RAG with Real-World Benchmarks | KDD Cup 2025

aicrowd.com

Unsloth AI · Apr 8, 2025 · 8:42 PM UTC

Ahmad Al-Dahle retweeted

Unsloth AI

@UnslothAI

Apr 8

You can now run Llama 4 on your local device!🦙 We shrank Maverick (402B) from 400GB to 122GB (-70%). Scout: 115GB to 33.8GB (-75%) Our Dynamic 1.78bit GGUFs ensures optimal accuracy by selectively quantizing layers GGUFs: huggingface.co/collections/u… Guide: docs.unsloth.ai/basics/tutor…

129

814

Artificial Analysis · Apr 8, 2025 · 3:07 PM UTC

Ahmad Al-Dahle retweeted

Artificial Analysis

@ArtificialAnlys

Apr 8

Llama 4 Intelligence Index Update: We have now replicated Meta’s claimed values for MMLU Pro and GPQA Diamond, pushing our Intelligence Index scores for both Scout and Maverick higher Key update details: ➤ We noted in our first post 48 hours ago that we noticed discrepancies between our measured results and Meta’s claimed scores for our multi-choice eval datasets (MMLU Pro and GPQA Diamond) ➤ After further experiments and and close review, we have decided that in accordance with our published principle against unfairly penalizing models where they get the content of questions correct but format answers differently, we will allow Llama 4’s answer style of ‘The best answer is A’ as legitimate answer for our multi-choice evals ➤ This leads to a jump in score for both Scout and Maverick (largest for Scout) in 2/7 of the evals that make up Artificial Analysis Intelligence Index, and therefore a jump in their Intelligence Index scores ➤ Scout’s Intelligence Index has moved from 36 to 43, and Maverick’s Intelligence Index has moved from 49 to 50. Overall, we continue to conclude that both Scout and Maverick are very impressive models and a significant contribution to the open weights AI ecosystem. While DeepSeek V3 0324 maintains a small lead over Maverick, we continue to note that Maverick has ~half the active parameters (17B vs 37B), and ~60% of the total parameters (402B vs 671B), while also supporting image inputs. All our tests have been performed on the Hugging Face release version of the Llama 4 weights for both Scout and Maverick, including testing via a range of third party cloud providers. None of our eval results are based on the experimental chat-tuned model provided to LMArena (Llama-4-Maverick-03-26-Experimental). We can also share that we have observed third party cloud APIs generally stabilizing over the last 48 hours. We will soon release endpoint-level comparison data to allow developers to understand whether any cloud providers are still serving versions of Llama 4 with accuracy issues.

161

727

Hatice Ozen · Apr 7, 2025 · 10:40 PM UTC

Ahmad Al-Dahle retweeted

Hatice Ozen

@ozenhati

Apr 7

llama 4 scout on @groqinc paired with @elevenlabsio is incredible for multilingual voice agents. insanely smooth even switching between different languages thanks to low latency. and for those who have been asking about its turkish - i've been testing and it's pretty good. :)

Thor 雷神 ⚡️

@thorwebdev

Apr 7

Llama 4 supports 12 different languages out of the box, making it a powerful brain for your voice agents! Running on @GroqInc Cloud and integrated with @ElevenLabsDevs Conversational AI, it creates a fantastic multilingual agent setup. Give it a try below! 👇

m_ric · Apr 8, 2025 · 9:03 AM UTC

Ahmad Al-Dahle retweeted

m_ric

@AymericRoucher

Apr 8

Llama-4-Maverick is CRAZY GOOD to power agents 🤯 It's now the top open model on smolagents LLM leaderboard, beating the much larger DeepSeek-R1! Congrats @ThomasScialom and team!

151

Ahmad Al-Dahle · Apr 7, 2025 · 5:49 PM UTC

Ahmad Al-Dahle

@Ahmad_Al_Dahle

Apr 7

We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models. That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners. We've also heard claims that we trained on test sets -- that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations. We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value.

1,053

xjdr · Apr 7, 2025 · 4:15 PM UTC

Ahmad Al-Dahle retweeted

xjdr

@_xjdr

Apr 7

my detailed personal benchmarks ran overnight. - Scout is best at summarization and function calling. exactly what you want from a cheap long ctx model. this is going to be a workhorse in coding flows and RAG applications. the single shot ICL recall is very very good. - Maverick was built for replacing developers and doing agenic / tool calling work. it is very consistent in instruction following, very long context ICL and parallel multi tool calls. this is EXACTLY the model and capabilities i want in my coder style flows. it is not creative, i have V3 and R1 for that tho. multimodal is very good at OCR and charts and graphs outperforming both 4o and qwen 2.5 VL 72 in my typical tests. the only thing i haven't tested is computer use but i doubt it will beat sonnet or qwen at that as both models were explicitly trained for it. The output is kind of bland (hence the constant 4o comparisons) with little personality, which is totally fine. this is a professional tool built for professional work (testing it on RP or the like will lead to terrible results). Im not sure what more you could ask for in a agent focused model. - V3-0324 is not consistent enough with tool calling output to be useful but when it gets it right, it is always the clear and best choice. however, it excels at creativity, problem solving and multi-turn interactions. this will continue to be my non-function calling workhorse. the 131k ctx feels remarkably restrictive now tho. i am going to do some more long ctx testing on V3 cause im almost positive i can get more out of it (200k - 300k ideally), but i think this is where MLA is going to show its tradeoffs. FIM and completion are also huge V3 specific wins here and places where it not only excels but is really in a league of its own. - R1 continues to be the smartest and most creative model available when used single shot, single turn and when prompted correctly. its the genius in the corner who cant make eye contact but if you properly specify a problem it will be solved with an incredibly high degree of confidence. Function calling (really all of the V3 features) work as expected but the <think> formatting is a bit 1/2 baked and doubly so when you use them with tool use. however, with proper parsing and sampling effort, its a truly remarkable model. - All of these models benefit tremendously from proper sampling and lovingly crafted matmuls and accumulations. they are all much better and smarter than what is generally available from lmsys or openrouter. I am incredibly bullish on Behemoth and R2 and cannot wait to fold them into my daily workflow. I have never been happier about the state of open source models and since the R1 launch and when used correctly they provide a viable alternative to frontier models for the first time. I am happy to answer and specific questions but this is probably my last general post on this. i gotta get back to work ...

460

Terry Yue Zhuo · Apr 7, 2025 · 2:11 PM UTC

Ahmad Al-Dahle retweeted

Terry Yue Zhuo

@terryyuezhuo

Apr 7

Llama-4 Series on BigCodeBench-Hard *Inference via NVIDIA NIM Llama-4 Maverick Ranked 41th/192 Similar to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13 29.1% Complete 25% Instruct Llama-4-Scout Ranked 97th/192 16.9% Complete 16.9% Instruct Also, new visuals on the leaderboard! 1. Recommendation --- Plot on a mixture of top and recent models 2. Time View --- Plot on a release time scale 3. Score Meter --- Better indicator inside the table I'll present BigCodeBench at #ICLR2025 on Fri 25 Apr 3:30 p.m. CST — 5 p.m. CST in Oral Session 4B. See you in 🇸🇬! See more results: bigcode-bench.github.io/

Sanyam Bhutani · Apr 7, 2025 · 1:47 AM UTC

Ahmad Al-Dahle retweeted

Sanyam Bhutani

@bhutanisanyam1

Apr 7

Llama 4 takes 43 seconds to analyse 900k tokens!

Dmytro Dzhulgakov · Apr 6, 2025 · 8:47 PM UTC

Ahmad Al-Dahle retweeted

Dmytro Dzhulgakov

@dzhulgakov

Apr 6

Meta Llama, number four, Coming Saturday, explore! Zuck announces, proud and loud, Fans and devs, a buzzing crowd. Llama 4, it’s on the way, Fireworks AI scrambles—hey! Startups racing, GPUs hot, “Launch the model—wait we cannot!” Llama Llama, context long, Support is deep, compute is strong. Servers humming, tests to run, Saturday deadline—not so fun! Meta’s dropping shiny tech, Engineers are neck-and-neck. Zuck smiles wide, with tech delight, Llama 4 debuts tonight! Fireworks AI, fast and bold, Integrations to uphold. Launch is ready, code reviewed— Fireworks shouts: “Thanks, Zuck, dude!”

Artificial Analysis · Apr 6, 2025 · 2:35 PM UTC

Ahmad Al-Dahle retweeted

Artificial Analysis

@ArtificialAnlys

Apr 6

Congratulations to @togethercompute, @FireworksAI_HQ , @databricks, @DeepInfra, @CentML_Inc and @GroqInc on having day-one Llama 4 inference endpoints live! Keep an eye out for endpoints coming this week from @Azure, @CerebrasSystems, @SambaNovaAI and more. Both @Meta's Llama 4 Scout and Maverick only have 17B active parameters - so although their total sizes are relatively large at 109B and 402B respectively, these models have the potential to enable extremely fast and efficient inference. All providers serving Llama 4 so far are offering it at cheaper prices than their Llama 3.3 70B endpoints, including for Llama 4 Maverick. This makes Maverick as an incredibly compelling model with a wide range of inference options.

174

Ray Fernando · Apr 6, 2025 · 5:15 AM UTC

Ahmad Al-Dahle retweeted

Ray Fernando

@RayFernando1337

Apr 6

Kraken literally cooked with Llama 4 on Groq. Insane speed!

xjdr · Apr 6, 2025 · 4:37 PM UTC

Ahmad Al-Dahle retweeted

xjdr

@_xjdr

Apr 6

L4 Maverick feels very much like a smarter 4o to me. i feel pretty confident in saying that was an explicit goal. i can also confirm it works very well at 1M+ ctx len. i dont have any 10M+ ctx evals but i'll try to throw something together just to satisfy my own curiosity

286

Philip Kiely · Apr 6, 2025 · 12:27 AM UTC

Ahmad Al-Dahle retweeted

Philip Kiely

@philipkiely

Apr 6

Llama 4 (Maverick) easily one-shots my Brick Breaker vibe check Output speed for 700+ words felt on par with ChatGPT/Claude on a good day using vLLM, excited to see how much faster we can run it!

AK · Apr 6, 2025 · 3:42 AM UTC

Ahmad Al-Dahle retweeted

@_akhaliq

Apr 6

llama-4-scout-17b-16e-instruct prompt: write a p5.js script that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically

Flavio Adamo

@flavioAd

Jan 31

🚨 o3-mini crushed DeepSeek R1 🚨 "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"

414

Xeophon · Apr 5, 2025 · 7:45 PM UTC

Ahmad Al-Dahle retweeted

Xeophon

@xeophon_

Apr 5

Llama 4 Maverick has big model smell, thank you so much @AIatMeta 🙏🏼 I have some prompts for an upcoming eval and based on those I tested, it is on the level of other frontier models. Really happy :)

Ahmad Al-Dahle · Apr 5, 2025 · 7:37 PM UTC

Ahmad Al-Dahle

@Ahmad_Al_Dahle

Apr 5

👀👀 🦙🦙🦙🦙

lmarena.ai

@arena

Apr 5

BREAKING: Meta's Llama 4 Maverick just hit #2 overall - becoming the 4th org to break 1400+ on Arena!🔥 Highlights: - #1 open model, surpassing DeepSeek - Tied #1 in Hard Prompts, Coding, Math, Creative Writing - Huge leap over Llama 3 405B: 1268 → 1417 - #5 under style control Huge congrats to @AIatMeta — and another big win for open-source! 👏 More analysis below⬇️

251