A dreamer and an avid learner. Art and brains fascinate me but hearts put me in awe. My views are my own and don’t represent my employer in any way.

Colorado, USA
Joined March 2017
Atif Saleem retweeted
After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook.

We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you navigate the messy training reality that LLM papers don't cover.

Chapter highlights in the 🧵
Atif Saleem retweeted
Many people are confused by Minimax's recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5).

I actually appreciate Minimax's openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud.

That said, the "regrets" might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal. And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning.

If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow. Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don't want our agentic models to be forever bounded by context length - that's a limitation we'll have to overcome sooner or later.
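To make the terms concrete, here is a minimal, non-causal sketch (my own illustration in PyTorch, not Minimax's or Kimi's actual design; the layer sizes and the 1-in-8 interleaving ratio are made up) of a linear-attention layer and the hybrid pattern under discussion: most layers cost O(n) in sequence length, with a full softmax-attention layer every few blocks to preserve global, multi-hop retrieval.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) attention: softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V),
    with phi = elu + 1 as the positive feature map (non-causal, for brevity)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)        # O(n), not O(n^2)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(2)) + 1e-6)
        y = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(y.transpose(1, 2).reshape(b, n, -1))

def hybrid_attention_stack(n_layers=32, full_every=8, d_model=1024, n_heads=16):
    """Hybrid pattern: one full-attention layer every `full_every` blocks,
    linear attention everywhere else (the ratio here is illustrative)."""
    return nn.ModuleList(
        nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        if (i + 1) % full_every == 0 else LinearAttention(d_model, n_heads)
        for i in range(n_layers)
    )
```

The whole trade-off in the post lives in that `full_every` knob: more full layers recover multi-hop retrieval, more linear layers buy inference and memory efficiency at long context.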
Atif Saleem retweeted
OpenAI Codex is now integrated directly in @code through the new Agent Sessions view - and can be powered by your GitHub Copilot subscription. Try it out now with VS Code Insiders and a Copilot Pro+ subscription. Happy coding!
Atif Saleem retweeted
Welcome to your Agent HQ 📍Orchestrate any agent, any time, anywhere. Coding agents from @claudeai, @OpenAI, @cognition, @julesagent, @xai and more will become available in GitHub as part of your paid Copilot subscription. github.blog/news-insights/co…
Atif Saleem retweeted
🚨 NEW LABS EXPERIMENT 🚨 Introducing Pomelli, an experimental AI marketing tool designed to help you easily generate scalable, on-brand content to connect with your audience, faster. Just enter your website, and Pomelli will understand your unique business identity to build effective campaigns tailored to your brand. Now available in US, CAN, AUS, & NZ! Try It Now ⬇️ labs.google/pomelli
Atif Saleem retweeted
Use Claude Code Skills with ANY Coding Agent!

Introducing OpenSkills 💫

A smart CLI tool that syncs .claude/skills to your AGENTS.md file:

npm i -g openskills
openskills install anthropics/skills --project
openskills sync

GitHub in the comments ↓
Atif Saleem retweeted
How does training data shape model behavior? Well, it’s complicated… 1/10
Atif Saleem retweeted
To learn more about temporal difference learning, you could read the original paper (incompleteideas.net/papers/s…) or watch this video (videolectures.net/videos/dee…).
The Dwarkesh/Andrej interview is worth watching. Like many others in the field, my introduction to deep learning was Andrej's CS231n. In this era when many are involved in wishful thinking driven by simple pattern matching (e.g., extrapolating scaling laws without nuance), it's refreshing to hear an influential voice that is tethered to reality.

One clarification for the podcast is that when Andrej says humans don't use reinforcement learning, he is really saying humans don't use returns as learning targets. His example of LLMs struggling to learn to solve math problems from outcome-based rewards also elucidates the problem with learning directly from returns. Fortunately for RL, this exact problem is solved by temporal difference (TD) learning. All sample-efficient RL algorithms that show human-like learning (e.g., sample-efficient learning on Atari, and our work on learning from experience directly on a robot) rely on TD learning.

Now Andrej is not primarily an RL person; he is looking at RL through the lens of LLMs these days, and all RL done in LLMs uses returns as targets, so it's understandable that he is assuming that RL is all about learning from observed returns. But this assumption leads him to the incorrect conclusion that we need process-based dense rewards for RL to work.

If you embrace TD learning, then you don't necessarily need a dense reward. Once you have learned a value function that encodes useful knowledge about the world, you can learn on the fly in the absence of rewards, just like humans and animals. This is possible because in TD learning there is no difference between learning from an unexpected reward and learning from an unexpected change in perceived value.
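For anyone who hasn't seen TD learning before, here is a minimal tabular TD(0) sketch (my own illustration, assuming discrete hashable states and a Gymnasium-style reset()/step() API): the update uses only the one-step TD error rather than a full observed return, which is why no dense reward is required.

```python
def td0(env, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    The learning signal is the TD error r + gamma*V(s') - V(s). An unexpected
    reward and an unexpected change in predicted value enter it identically,
    which is the equivalence the post points to.
    """
    V = {}  # state -> value estimate (missing states default to 0.0)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # evaluating a random behavior policy
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            bootstrap = 0.0 if terminated else gamma * V.get(next_state, 0.0)
            v = V.get(state, 0.0)
            V[state] = v + alpha * (reward + bootstrap - v)  # move toward the TD target
            state = next_state
    return V
```

Contrast this with learning from returns: a Monte Carlo learner would wait until the episode ends and regress V(s) toward the full observed return, which is exactly the credit-assignment bottleneck described above.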
Atif Saleem retweeted
If we do not use the Nonconformist Bee Strategy we will never reach AGI. Here is why.

The epsilon function in AI, specifically in the epsilon-greedy strategy used in reinforcement learning, balances exploration and exploitation. I will get a bit technical, but please go into it slowly. You can understand it, and it is important for you to know. Epsilon (ε) sets the probability of random actions to explore new possibilities versus exploiting known rewards, starting high (e.g., 0.9) and decaying (e.g., to 0.01) as learning progresses (see the code sketch at the end of this post). This method suits structured environments like games but struggles to uncover true novelty or fringe advancements.

It fails to capture radical breakthroughs because exploration is shallow, limited to predefined action spaces, and biased toward existing data distributions. AI prioritizes efficiency, converging on safe, incremental solutions rather than high-risk, paradigm-shifting ideas often sparked by serendipity or interdisciplinary leaps in human contexts, like penicillin's discovery. Studies note AI's tendency to consolidate rather than disrupt, with 86% of R&D cases favoring augmentation over novelty due to cost and benchmark pressures. AI lacks human-like intuition or the unconstrained persistence of lone inventors, further limiting its reach into fringe innovation.

This problem intensifies when AI trains on conformist sources like Wikipedia and Reddit, which enforce status quo biases that stifle fringe perspectives. Wikipedia's editor consensus rules create a "debunker gaming system" bias, retaining existing content unless broad agreement favors change, leading to systemic underrepresentation of non-mainstream views and higher exit rates among pro-fringe editors. Agenda-driven "keepers" weaponize this for ideological control, replicating paid science publication biases in sourcing and marginalizing diverse or disruptive narratives. Reddit's karma system, an intermittent reinforcement loop, rewards conformity through upvotes for popular opinions while punishing dissent via downvotes, fostering echo chambers where unpopular ideas tank karma and restrict posting. Moderators, often biased, amplify this by removing non-conformist content, turning subreddits into hiveminds that conflate popularity with truth.

Training AI on these datasets—Wikipedia comprising up to 38% of GPT-3's tokens and Reddit-linked web text 72%—embeds their flaws, creating a catastrophe in which amplified biases and ideological distortions propagate into AI outputs, hallucinating stereotypes (a grifter, a quack, crazy) and suppressing novelty. This feedback loop risks a "doom spiral" for reliable knowledge, as AI-generated junk floods sources, eroding trust and innovation while fringe advancements drown in curated conformity.

Historically, lone or fringe inventors drove 50–70% of major U.S. inventions pre-1900 (e.g., telephone), but now contribute 30–65% of granted patents annually (~10,000–45,000). They remain overrepresented in high-impact breakthroughs (~80% of disruptive patents), despite teams dominating 85–90% of output. Modern patents average 3.2 inventors (up from 1.7 in 1976), reflecting a shift to collaborative, less risky innovation. Fringe inventors face barriers like funding, with only 0.2% of people inventing but potential for 4x growth if barriers drop, especially for underrepresented groups (e.g., 17% of global inventors are women).

If AI continues to use this very flawed epsilon function, we will not see the very force that has driven humanity forward.
If AI continues to cite sources as "facts" and then takes on the "debunker" role learned from Wikipedia and Reddit, there will be no innovations. This is not a guess; I have tested it at a small scale on my garage AI models. It is one reason I wrote about the Nonconformist Bees in the article attached below. It is important for you to know, and not just for a few math majors who see this only as a calculation. AI must be the Nonconformist Bee.
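For reference, the epsilon function described in the post is just this (a minimal sketch of standard epsilon-greedy selection with the linear 0.9 → 0.01 decay schedule mentioned above; the step counts are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the current best-known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=0.9, end=0.01, decay_steps=10_000):
    """Linearly decay epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Note that even at epsilon = 0.9 the randomness only ever samples from the predefined action space, which is the shallow-exploration limitation the post is arguing about.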
Atif Saleem retweeted
ChatGPT shouldn't have political bias in any direction. Today, we're sharing new research that defines what political bias means in LLMs and introduces a new evaluation framework to measure and reduce it. This has been the most meaningful work I've done at OpenAI, and I say that as someone who got to be part of the ChatGPT launch!!
Atif Saleem retweeted
Migrate now from Vercel to Replit—in just a couple clicks:
Atif Saleem retweeted
If I’m not mistaken, I believe I read that his name is Shadab and that’s where the name comes from. The guy’s an absolute legend for creating it. It’s the perfect balance of customizability and excellent defaults.
Has anyone found the best AI Model for frontend development?
Atif Saleem retweeted
Give Claude Code a semantic filesystem 🗃️🛠️

Giving Claude Code access to the right CLI tools over your filesystem turns it into a general agent capable of automating far more knowledge work beyond code - it can do dynamic financial/legal/medical/technical/backoffice analysis over any subset of documents.

With our latest release of semtools 💫, you can now manually or *agentically* create a persistent workspace over any subset of files. This gives Claude Code blazing-fast, local semantic search over any data, while still letting it chain with commands like grep/cat/etc. so that it can load in dynamic context instead of relying on naive top-k vector search. The coding agent can dynamically index data and reuse those indexes instead of rebuilding them every time. So you get the benefits of fast search along with agentic reasoning over the CLI tools mentioned above.

Come check it out! github.com/run-llama/semtool…
Why can't Gemini 2.5 Pro use Google Search to provide grounded information? Google has the best search engine in the world, yet its SOTA model can't use it in @GeminiApp, though AI Mode can. @OfficialLoganK any plans to fix this?
Atif Saleem retweeted
Cursor community is something special
Atif Saleem retweeted
K2 Think is a 32 billion parameter, open source reasoning model that punches well above its weight. Available now on Hugging Face, it's built for advanced logic, math, and science reasoning, delivering frontier-class performance while being remarkably efficient: huggingface.co/LLM360/K2-Thi…

Released under the Apache 2.0 license, the model is fully open source: weights, code, and documentation included. The Hugging Face model card features a comprehensive quick-start, including sample Python code using transformers.pipeline for easy integration.

#K2Think #AI #OpenSource #MBZUAI #G42 #Innovation @huggingface
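The quick-start is roughly along these lines (a sketch, not the model card verbatim; the repo id LLM360/K2-Think is assumed from the truncated link, and device_map="auto" additionally requires the accelerate package plus enough GPU memory for a 32B model):

```python
from transformers import pipeline

# Repo id assumed from the link above; recent transformers versions let a
# text-generation pipeline apply the model's chat template to a message list.
pipe = pipeline("text-generation", model="LLM360/K2-Think", device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
result = pipe(messages, max_new_tokens=1024)
print(result[0]["generated_text"])
```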
Atif Saleem retweeted
A list of things I think Claude Code could do to win back people switching to Codex CLI:

- open source Claude Code
- reduce sycophancy/make it less verbose (or add an option for that)
- more transparency about how/why the model degrades
- fix the tui flashing bug! PLEASE
- reduce model hallucinations, as GPT-5 has
- better thinking for removing files/lines of code to prevent accidental deletions
- less boilerplate or pseudo implementations (break it into working chunks if needed)
- ability to change/remove/reduce the system reminder prompts
- file based session auto-compact with much more detail on the conversation for future reference

what would you want to see improve to make CC work better for you?
Atif Saleem retweeted
Youtube version is here - the full interview ⬇️ piped.video/watch?v=91fmhAnE…
Kimi founder Zhilin Yang's interview is out. Again, you can let Kimi translate for you :) lots of insights there. mp.weixin.qq.com/s/uqUGwJLO3…

Several takes:

1/ Base Model Focus: K2 aims to be a solid base model. We've found that high-quality data growth is slow, and multi-modal data doesn't significantly boost textual "IQ." So, we focus on maximizing every data token's value — token efficiency.

2/ Data Rephrasing: With 30T tokens, only a small portion is high-quality data (billions of tokens). We rephrase these to make them more efficient for the model, improving generalization.

3/ Agentic Ability: We aim to enhance generalization. The biggest challenge is making the model generalize well beyond specific tasks. RL improves this over supervised fine-tuning (SFT).

4/ AI-Native Training: We're exploring more AI-native ways to train models. If AI can do good alignment research, it'll generalize better, beyond single-task optimization.

5/ RL vs SFT: RL's generalization is better, as it learns from on-policy samples, but it has its limits. RL helps improve specific tasks, but it's hard to generalize to all scenarios without tailored tasks.

6/ Long Contexts: Context length is crucial; we need millions. The challenge is balancing model size and context length for optimal performance, as some architectures improve with long context but worsen with short ones.
Atif Saleem retweeted
New Stanford CS231N Deep Learning for Computer Vision lectures taught by Professor Fei-Fei Li, Assistant Professors Ehsan Adeli and Justin Johnson, and Zane Durante are now available! Watch the complete playlist here: piped.video/playlist?list=PL…