A dreamer and an avid learner. Art and brains fascinate me but hearts put me in awe. My views are my own and don’t represent my employer in any way.

Colorado, USA
Joined March 2017
Atif Saleem retweeted
After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook.

We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you navigate the messy training reality that LLM papers don't cover.

Chapter highlights in the 🧵
Atif Saleem retweeted
Many people are confused by Minimax's recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5).

I actually appreciate Minimax's openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud.

That said, the "regrets" might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal. And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning.

If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow. Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don't want our agentic models to be forever bounded by context length - that's a limitation we'll have to overcome sooner or later.
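To make the terms concrete, here is a minimal, non-causal sketch (my own illustration in PyTorch, not Minimax's or Kimi's actual design; the layer sizes and the 1-in-8 interleaving ratio are made up) of a linear-attention layer and the hybrid pattern under discussion: most layers cost O(n) in sequence length, with a full softmax-attention layer every few blocks to preserve global, multi-hop retrieval.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) attention: softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V),
    with phi = elu + 1 as the positive feature map (non-causal, for brevity)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)        # O(n), not O(n^2)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(2)) + 1e-6)
        y = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(y.transpose(1, 2).reshape(b, n, -1))

def hybrid_attention_stack(n_layers=32, full_every=8, d_model=1024, n_heads=16):
    """Hybrid pattern: one full-attention layer every `full_every` blocks,
    linear attention everywhere else (the ratio here is illustrative)."""
    return nn.ModuleList(
        nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        if (i + 1) % full_every == 0 else LinearAttention(d_model, n_heads)
        for i in range(n_layers)
    )
```

The whole trade-off in the post lives in that `full_every` knob: more full layers recover multi-hop retrieval, more linear layers buy inference and memory efficiency at long context.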
Atif Saleem retweeted
OpenAI Codex is now integrated directly in @code through the new Agent Sessions view - and can be powered by your GitHub Copilot subscription. Try it out now with VS Code Insiders and a Copilot Pro+ subscription. Happy coding!
Atif Saleem retweeted
Welcome to your Agent HQ 📍Orchestrate any agent, any time, anywhere. Coding agents from @claudeai, @OpenAI, @cognition, @julesagent, @xai and more will become available in GitHub as part of your paid Copilot subscription. github.blog/news-insights/co…
Atif Saleem retweeted
🚨 NEW LABS EXPERIMENT 🚨 Introducing Pomelli, an experimental AI marketing tool designed to help you easily generate scalable, on-brand content to connect with your audience, faster. Just enter your website, and Pomelli will understand your unique business identity to build effective campaigns tailored to your brand. Now available in US, CAN, AUS, & NZ! Try It Now ⬇️ labs.google/pomelli
Atif Saleem retweeted
Use Claude Code Skills with ANY Coding Agent!

Introducing OpenSkills 💫

A smart CLI tool that syncs .claude/skills to your AGENTS.md file:

npm i -g openskills
openskills install anthropics/skills --project
openskills sync

GitHub in the comments ↓
Atif Saleem retweeted
How does training data shape model behavior? Well, it’s complicated… 1/10
Atif Saleem retweeted
To learn more about temporal difference learning, you could read the original paper (incompleteideas.net/papers/s…) or watch this video (videolectures.net/videos/dee…).
The Dwarkesh/Andrej interview is worth watching. Like many others in the field, my introduction to deep learning was Andrej's CS231n. In this era when many are involved in wishful thinking driven by simple pattern matching (e.g., extrapolating scaling laws without nuance), it's refreshing to hear an influential voice that is tethered to reality.

One clarification for the podcast is that when Andrej says humans don't use reinforcement learning, he is really saying humans don't use returns as learning targets. His example of LLMs struggling to learn to solve math problems from outcome-based rewards also elucidates the problem with learning directly from returns. Fortunately for RL, this exact problem is solved by temporal difference (TD) learning. All sample-efficient RL algorithms that show human-like learning (e.g., sample-efficient learning on Atari, and our work on learning from experience directly on a robot) rely on TD learning.

Now Andrej is not primarily an RL person; he is looking at RL through the lens of LLMs these days, and all RL done in LLMs uses returns as targets, so it's understandable that he is assuming that RL is all about learning from observed returns. But this assumption leads him to the incorrect conclusion that we need process-based dense rewards for RL to work.

If you embrace TD learning, then you don't necessarily need a dense reward. Once you have learned a value function that encodes useful knowledge about the world, you can learn on the fly in the absence of rewards, just like humans and animals. This is possible because in TD learning there is no difference between learning from an unexpected reward and learning from an unexpected change in perceived value.
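For anyone who hasn't seen TD learning before, here is a minimal tabular TD(0) sketch (my own illustration, assuming discrete hashable states and a Gymnasium-style reset()/step() API): the update uses only the one-step TD error rather than a full observed return, which is why no dense reward is required.

```python
def td0(env, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    The learning signal is the TD error r + gamma*V(s') - V(s). An unexpected
    reward and an unexpected change in predicted value enter it identically,
    which is the equivalence the post points to.
    """
    V = {}  # state -> value estimate (missing states default to 0.0)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # evaluating a random behavior policy
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            bootstrap = 0.0 if terminated else gamma * V.get(next_state, 0.0)
            v = V.get(state, 0.0)
            V[state] = v + alpha * (reward + bootstrap - v)  # move toward the TD target
            state = next_state
    return V
```

Contrast this with learning from returns: a Monte Carlo learner would wait until the episode ends and regress V(s) toward the full observed return, which is exactly the credit-assignment bottleneck described above.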
Atif Saleem retweeted
If we do not use the Nonconformist Bee Strategy we will never reach AGI. Here is why.

The epsilon function in AI, specifically in the epsilon-greedy strategy used in reinforcement learning, balances exploration and exploitation. I will get a bit technical, but please go into it slowly. You can understand it, and it is important for you to know. Epsilon (ε) sets the probability of random actions to explore new possibilities versus exploiting known rewards, starting high (e.g., 0.9) and decaying (e.g., to 0.01) as learning progresses (see the code sketch at the end of this post). This method suits structured environments like games but struggles to uncover true novelty or fringe advancements.

It fails to capture radical breakthroughs because exploration is shallow, limited to predefined action spaces, and biased toward existing data distributions. AI prioritizes efficiency, converging on safe, incremental solutions rather than high-risk, paradigm-shifting ideas often sparked by serendipity or interdisciplinary leaps in human contexts, like penicillin's discovery. Studies note AI's tendency to consolidate rather than disrupt, with 86% of R&D cases favoring augmentation over novelty due to cost and benchmark pressures. AI lacks human-like intuition or the unconstrained persistence of lone inventors, further limiting its reach into fringe innovation.

This problem intensifies when AI trains on conformist sources like Wikipedia and Reddit, which enforce status quo biases that stifle fringe perspectives. Wikipedia's editor consensus rules create a "debunker gaming system" bias, retaining existing content unless broad agreement favors change, leading to systemic underrepresentation of non-mainstream views and higher exit rates among pro-fringe editors. Agenda-driven "keepers" weaponize this for ideological control, replicating paid science publication biases in sourcing and marginalizing diverse or disruptive narratives. Reddit's karma system, an intermittent reinforcement loop, rewards conformity through upvotes for popular opinions while punishing dissent via downvotes, fostering echo chambers where unpopular ideas tank karma and restrict posting. Moderators, often biased, amplify this by removing non-conformist content, turning subreddits into hiveminds that conflate popularity with truth.

Training AI on these datasets—Wikipedia comprising up to 38% of GPT-3's tokens and Reddit-linked web text 72%—embeds their flaws, creating a catastrophe in which amplified biases and ideological distortions propagate into AI outputs, hallucinating stereotypes (a grifter, a quack, crazy) and suppressing novelty. This feedback loop risks a "doom spiral" for reliable knowledge, as AI-generated junk floods sources, eroding trust and innovation while fringe advancements drown in curated conformity.

Historically, lone or fringe inventors drove 50–70% of major U.S. inventions pre-1900 (e.g., telephone), but now contribute 30–65% of granted patents annually (~10,000–45,000). They remain overrepresented in high-impact breakthroughs (~80% of disruptive patents), despite teams dominating 85–90% of output. Modern patents average 3.2 inventors (up from 1.7 in 1976), reflecting a shift to collaborative, less risky innovation. Fringe inventors face barriers like funding, with only 0.2% of people inventing but potential for 4x growth if barriers drop, especially for underrepresented groups (e.g., 17% of global inventors are women).

If AI continues to use this very flawed epsilon function, we will not see the very force that has driven humanity forward.
If AI continues to cite sources as "facts" and then takes on the "debunker" role learned from Wikipedia and Reddit, there will be no innovations. This is not a guess; I have tested it at a small scale on my garage AI models. It is one reason I wrote about the Nonconformist Bees in the article attached below. It is important for you to know, and not just for a few math majors who see this only as a calculation. AI must be the Nonconformist Bee.
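For reference, the epsilon function described in the post is just this (a minimal sketch of standard epsilon-greedy selection with the linear 0.9 → 0.01 decay schedule mentioned above; the step counts are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the current best-known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=0.9, end=0.01, decay_steps=10_000):
    """Linearly decay epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Note that even at epsilon = 0.9 the randomness only ever samples from the predefined action space, which is the shallow-exploration limitation the post is arguing about.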
Atif Saleem retweeted
ChatGPT shouldn't have political bias in any direction. Today, we're sharing new research that defines what political bias means in LLMs and introduces a new evaluation framework to measure and reduce it. This has been the most meaningful work I've done at OpenAI, and I say that as someone who got to be part of the ChatGPT launch!!
Atif Saleem retweeted
Migrate now from Vercel to Replit—in just a couple clicks:
Atif Saleem retweeted
If I’m not mistaken, I believe I read that his name is Shadab and that’s where the name comes from. The guy’s an absolute legend for creating it. It’s the perfect balance of customizability and excellent defaults.
Has anyone found the best AI Model for frontend development?
Atif Saleem retweeted
Give Claude Code a semantic filesystem 🗃️🛠️

Giving Claude Code access to the right CLI tools over your filesystem turns it into a general agent capable of automating far more knowledge work beyond code - it can do dynamic financial/legal/medical/technical/backoffice analysis over any subset of documents.

With our latest release of semtools 💫, you can now manually or *agentically* create a persistent workspace over any subset of files. This gives Claude Code blazing-fast, local semantic search over any data, while still letting it chain with commands like grep/cat/etc. so that it can load in dynamic context instead of relying on naive top-k vector search. The coding agent can dynamically index data and reuse those indexes instead of rebuilding them every time. So you get the benefits of fast search along with agentic reasoning over the CLI tools mentioned above.

Come check it out! github.com/run-llama/semtool…
Why can't Gemini 2.5 Pro use Google Search to provide grounded information? Google has the best search engine in the world, yet its SOTA model can't use it in @GeminiApp, though AI Mode can. @OfficialLoganK any plans to fix this?
Atif Saleem retweeted
Cursor community is something special
Atif Saleem retweeted
K2 Think is a 32 billion parameter, open source reasoning model that punches well above its weight. Available now on Hugging Face, it's built for advanced logic, math, and science reasoning, delivering frontier-class performance while being remarkably efficient: huggingface.co/LLM360/K2-Thi…

Released under the Apache 2.0 license, the model is fully open source: weights, code, and documentation included. The Hugging Face model card features a comprehensive quick-start, including sample Python code using transformers.pipeline for easy integration.

#K2Think #AI #OpenSource #MBZUAI #G42 #Innovation @huggingface
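The quick-start is roughly along these lines (a sketch, not the model card verbatim; the repo id LLM360/K2-Think is assumed from the truncated link, and device_map="auto" additionally requires the accelerate package plus enough GPU memory for a 32B model):

```python
from transformers import pipeline

# Repo id assumed from the link above; recent transformers versions let a
# text-generation pipeline apply the model's chat template to a message list.
pipe = pipeline("text-generation", model="LLM360/K2-Think", device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
result = pipe(messages, max_new_tokens=1024)
print(result[0]["generated_text"])
```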
Atif Saleem retweeted
A list of things I think Claude Code could do to win back people switching to Codex CLI:

- open source Claude Code
- reduce sycophancy/make it less verbose (or add an option for that)
- more transparency about how/why the model degrades
- fix the tui flashing bug! PLEASE
- reduce model hallucinations, as GPT-5 has
- better thinking for removing files/lines of code to prevent accidental deletions
- less boilerplate or pseudo implementations (break it into working chunks if needed)
- ability to change/remove/reduce the system reminder prompts
- file based session auto-compact with much more detail on the conversation for future reference

what would you want to see improve to make CC work better for you?
Atif Saleem retweeted
Youtube version is here - the full interview ⬇️ piped.video/watch?v=91fmhAnE…
Kimi founder Zhilin Yang's interview is out. Again, you can let Kimi translate for you :) lots of insights there. mp.weixin.qq.com/s/uqUGwJLO3…

Several takes:

1/ Base Model Focus: K2 aims to be a solid base model. We've found that high-quality data growth is slow, and multi-modal data doesn't significantly boost textual "IQ." So, we focus on maximizing every data token's value — token efficiency.

2/ Data Rephrasing: With 30T tokens, only a small portion is high-quality data (billions of tokens). We rephrase these to make them more efficient for the model, improving generalization.

3/ Agentic Ability: We aim to enhance generalization. The biggest challenge is making the model generalize well beyond specific tasks. RL improves this over supervised fine-tuning (SFT).

4/ AI-Native Training: We're exploring more AI-native ways to train models. If AI can do good alignment research, it'll generalize better, beyond single-task optimization.

5/ RL vs SFT: RL's generalization is better, as it learns from on-policy samples, but it has its limits. RL helps improve specific tasks, but it's hard to generalize to all scenarios without tailored tasks.

6/ Long Contexts: Context length is crucial; we need millions. The challenge is balancing model size and context length for optimal performance, as some architectures improve with long context but worsen with short ones.
Atif Saleem retweeted
New Stanford CS231N Deep Learning for Computer Vision lectures taught by Professor Fei-Fei Li, Assistant Professors Ehsan Adeli and Justin Johnson, and Zane Durante are now available! Watch the complete playlist here: piped.video/playlist?list=PL…