Thanh-Dung Le retweeted
15 courses for understanding neural networks, v/@tut_ml: shorturl.at/60Se8
Haozhe's paper is worth a read, really nice use of fixed point theorems. The new one about 1-to-1 seems almost immediate though. Just from reading the thread, I would guess the proof is as follows: say your input space is discrete, with j \in [n] represented by x_j, and the embedding is E: [n] \to R^d. For almost all E, the E x_j are distinct. A transformer f is composed of analytic building blocks; composition of analytic functions is analytic, and analyticity is preserved under many algebraic operations (o-minimal stuff). An analytic function is either identically zero or vanishes only on a measure-zero set (it cannot be zero on a set of positive measure). Thus, as long as f is not uniformly zero, the map should be injective (not necessarily bijective, since it's not onto).
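The measure-zero argument can be written out compactly; a sketch, with the pairwise reduction being my paraphrase of the tweet's reasoning:

```latex
% For each pair i \neq j, consider the function of the parameters
%   g_{ij}(E, \theta) = f_\theta(E x_i) - f_\theta(E x_j),
% which is analytic because f is a composition of analytic blocks.
% An analytic function is either identically zero or vanishes on a set of
% measure zero, so if no g_{ij} \equiv 0, then
\mu\!\left( \bigcup_{i \neq j} \bigl\{ (E, \theta) : g_{ij}(E, \theta) = 0 \bigr\} \right) = 0,
% since a finite union over the \binom{n}{2} pairs of measure-zero sets is
% still measure zero. Hence for almost every (E, \theta),
f_\theta(E x_i) \neq f_\theta(E x_j) \quad \text{for all } i \neq j,
% i.e. the map x_j \mapsto f_\theta(E x_j) is injective on the discrete input set.
```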
(1/7) Glad to see that people are following up on our work studying topological properties of modern neural network architectures. It was cool to see that widely used neural architectures can almost always generate any output given appropriate inputs, a.k.a. are surjective.
New Nvidia + UC San Diego paper builds an automated way to read huge research fields and tell where to work. It automates high-quality field surveys and trend tracking.

Research moves too fast for humans to track 10,000+ papers a year. The authors create Real Deep Research, a pipeline that gathers papers from top venues, filters scope with targeted prompts, and turns each paper into a compact structured summary.

For foundation models it logs input, modeling, output, objective, and training recipe, which means data in, how the model works, what it produces, what it learns toward, and how it is trained. For robotics it logs sensor, body, joint outputs, action space, and environment, which together describe how a robot senses, moves, and acts in the world.

It embeds these summaries as vectors so similar work clusters together. It then auto-builds surveys, maps topic trends over time, and links clusters across fields. It also supports semantic retrieval so newcomers get high-quality starting papers.

In expert pairwise tests the system achieves average rank 1.30 and wins across many subdomains. The trend read shows teleoperation and dexterous manipulation rising while classic reinforcement learning slows. Researchers get a current map, fast orientation, and concrete entry points.

Paper: arxiv.org/abs/2510.20809
Paper Title: "Real Deep Research for AI, Robotics and Beyond"
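The "embed structured summaries as vectors so similar work clusters together" step can be sketched in a few lines. This is a toy stand-in, not the paper's pipeline: the hash-based embedding, the field names, and the use of k-means are all my assumptions for illustration (the real system would use a learned text embedder).

```python
import numpy as np

def embed(summary: dict) -> np.ndarray:
    # Toy embedding: hash each (field, value) pair into a fixed-size vector,
    # then L2-normalize. A real pipeline would use a learned text encoder.
    vec = np.zeros(64)
    for key, value in summary.items():
        vec[hash((key, value)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def kmeans(X: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    # Plain Lloyd's algorithm so similar summaries land in the same cluster.
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical structured summaries, mimicking the paper's per-field schema.
papers = [
    {"input": "text", "modeling": "transformer", "objective": "next token"},
    {"input": "text", "modeling": "transformer", "objective": "contrastive"},
    {"sensor": "rgb", "body": "arm", "action_space": "joint torques"},
]
X = np.stack([embed(p) for p in papers])
labels = kmeans(X, k=2)   # one cluster label per paper
```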
I built a biologically inspired spiking neural network from scratch, and it learned to do addition with 5% accuracy :) There is no backpropagation and no artificial loss function, just spikes, synapses, and dopamine-like reward signals. It uses STDP (Spike-Timing-Dependent Plasticity) with modulated rewards. This is super fun, and I will try to get it to learn with better accuracy. I also need to better understand how all the moving parts fit together. Link to the source code in the comments; it has a detailed README and an HTML page with animations explaining how it all works.
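Reward-modulated STDP, as described above, can be sketched in a few lines: a pre/post spike-timing window builds an eligibility, and a dopamine-like reward signal gates the actual weight change, with no backprop or loss. All constants and the all-pairs pairing rule here are illustrative assumptions, not the tweet author's implementation:

```python
import math

A_PLUS, A_MINUS = 0.01, 0.012   # potentiation / depression amplitudes (assumed)
TAU = 20.0                      # STDP time constant in ms (assumed)

def stdp_window(dt):
    """Eligibility from one pre/post spike pair; dt = t_post - t_pre.
    Pre-before-post (dt > 0) potentiates, post-before-pre depresses."""
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)

def update_weight(w, pre_spikes, post_spikes, reward, lr=1.0):
    """Sum eligibility over all spike pairs, then gate it by the
    dopamine-like reward signal (no backprop, no loss function)."""
    eligibility = sum(stdp_window(t_post - t_pre)
                      for t_pre in pre_spikes for t_post in post_spikes)
    return min(max(w + lr * reward * eligibility, 0.0), 1.0)

# Causal pairing (pre fires at 10 ms, post at 15 ms) plus positive reward
# strengthens the synapse; reversed timing weakens it.
w_new = update_weight(0.5, pre_spikes=[10.0], post_spikes=[15.0], reward=1.0)
```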
this updated my prior
Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8
Discovering state-of-the-art reinforcement learning algorithms

Reinforcement learning agents usually learn with rules we program by hand (TD, Q-learning, PPO…). But humans didn’t hand-design our learning rules—evolution did. What if we let machines discover their own RL update rules from experience?

Junhyuk Oh and coauthors present exactly that. They train a population of agents across many environments and use meta-learning to optimize a meta-network that outputs the targets an agent should learn toward—effectively learning the agent’s loss and bootstrapping scheme end-to-end. The agent still emits a policy and predictions, but the semantics of those predictions are discovered rather than hard-coded.

The outcome is striking: a discovered rule (“DiscoRL”) that sets a new bar on long-standing benchmarks. On Atari, a version trained on the 57 games (Disco57) exceeds the performance of hand-engineered algorithms while being more wall-clock efficient. Even more interesting, the same rule generalizes: without being tuned for them, it delivers state-of-the-art results on ProcGen and competitive performance on DMLab, NetHack, Crafter, and Sokoban. Scaling the discovery process to a more diverse set of environments (Disco103) makes the rule stronger still—performance improves simply by exposing it to more varied worlds.

Under the hood, the learned predictions behave differently from classic value functions: they spike before salient events (big rewards, abrupt policy shifts) and are explicitly used to bootstrap and update the policy—showing the system has invented useful intermediate quantities rather than rediscovering old ones. The discovery process is also practical: a few hundred million steps per environment were enough to find a top rule, and the learned rule transfers to larger networks at evaluation time.
This points to a compelling future: instead of manually crafting ever more intricate RL losses and targets, we can train agents whose learning algorithms are themselves learned—improving as we add compute, data diversity, and richer environments. Fewer knobs, more capability. Paper: nature.com/articles/s41586-0…
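The core idea, a meta-network that emits the target the agent regresses toward instead of a hand-written rule, can be shown schematically. Everything here is a placeholder: the tiny linear "meta-network" is random rather than meta-learned, and the feature set is my assumption, not DiscoRL's:

```python
import numpy as np

rng = np.random.default_rng(0)
# Meta-network weights: in the paper these are meta-learned across a
# population of agents and environments; here they are just small and random.
meta_params = rng.normal(size=3) * 0.1

def meta_target(reward, pred, next_pred):
    # The meta-network replaces a hand-coded rule: it maps transition
    # features to the target the agent should learn toward. A hand-written
    # TD rule would instead output reward + gamma * next_pred.
    feats = np.array([reward, pred, next_pred])
    return float(meta_params @ feats)

def agent_update(pred, target, lr=0.1):
    # The agent's own step is plain regression toward the emitted target;
    # the *semantics* of `pred` are whatever the meta-network discovers.
    return pred + lr * (target - pred)

pred = 0.0
for _ in range(5):
    target = meta_target(reward=1.0, pred=pred, next_pred=0.5)
    pred = agent_update(pred, target)
```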
I have a thing for empirical deep dives into learning dynamics like the one in this paper. It sounds like µP mostly helps early training, while weight decay affects the long term.
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
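The "LR transfer" that µP enables is usually wired up as a width-dependent rescaling of the learning rate tuned on a small model. This is a minimal sketch of that recipe for hidden (matrix-like) layers with Adam; the 1/width rule, the constants, and the restriction to hidden layers are assumptions for illustration, not this paper's contribution:

```python
def mup_lr(base_lr: float, base_width: int, width: int) -> float:
    # Hidden-layer LR under µP-style scaling with Adam: tune base_lr at
    # base_width, then shrink it proportionally to 1 / width at the target.
    return base_lr * base_width / width

tuned = 3e-3                                         # found by sweeping the small model
big_lr = mup_lr(tuned, base_width=256, width=4096)   # transferred, not re-tuned
```

At the base width the rule is the identity, so the small-model sweep is reused directly at every larger width.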
Back-propagation through discrete variables seemed crazy at the time. But that's why you work on it as a researcher. Yes, 2015-2017 were special at Google Brain.
Google Brain around 2016 was also a very special place. People were pursuing a ton of diverse, exploratory, and ambitious directions to push the field forward. Here's a section of @JeffDean's Google Brain "2017 Look-back"; see if you can spot the Transformer :) The full document is in the link below and is full of wisdom. It also features many of the ideas that are now finally becoming mainstream, as well as some alternative approaches that have been forgotten by the community. Needless to say, many of the current "big shots" in AI were at Brain during that period (or had just left, @ilyasut!), often as interns (like me) or AI residents.
Cool math insight on Weisfeiler–Lehman color refinement and Attention. Really nicely done!
Following his first blog post on "Attention Sinks from the Graph perspective", @tensorqt has now released a new blog post, titled "Beyond Attention as a Graph".

First and foremost, tensorqt explains why standard neural networks require depth, despite the issues depth introduces in sequence modeling (most notably, gradient instabilities). In the specific case of Transformers, however, depth, while still problematic, is easy to justify: considering that "attention is an operation that message-passes between pairs of tokens in a graph", depth (intended as the number of decoder layers) ends up approximating n-hop information transmission between pairs of tokens.

However, what if these n hops of information passing between pairs of tokens could be approximated without resorting to depth? Detailing the existing literature, 2-Simplicial Attention (and, more generally, High-order Attention) is introduced. The intuition is the following: instead of considering just one key for the query to attend to, one can project the key tokens into two subspaces, considering K = XW_k and K' = XW_k', which renders the attention calculation a multilinear product. The result is that while standard attention scores A_ij "represented the link weight going from node i to node j, now each entry can instead be seen as the collective weight assigned to the triangle determined by the (directed) walk from node i, passing through node j, and ending up in node s".

This idea can also be extended to n > 2 key projections, with the equations describing the resulting n-order attention scores attached here (the case described before is n = 2). It is immediate, though, that the (already) quadratic cost of ordinary attention explodes to O(L**(n+1)), where L is the sequence length and n the attention order.
One proposed way to address this builds on DeepSeek Sparse Attention (DSA): first, compute the dot product of each query vector of token i at each head h with a (shared across heads) key for each token j. Pass the result through a ReLU and multiply by a per-head learned weight. Sum the resulting scores across heads, and retain for the attention calculation only the k keys with the largest scores for q to attend to: the final computational complexity, in the context of standard attention, goes from O(L**2) to O(Lk).

As such, while the original paper sparsifies 2-simplicial attention using a local sliding window, tensorqt adapts DSA to n-order attention, testing his framework in the 2-simplicial case: first substituting the ReLU with a softmax, and then simplifying the scoring by directly using the standard QK' attention scores from previous layers, keeping the top-k based on those. This takes the cost from O(L**(n+1)) to O(L*k**n).

All in all, given the potential of High-order Attention, further research into rendering it computationally tractable is welcome. Link to the blog below; in the picture, the aforementioned equations governing n-order attention.
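The two ingredients above, the trilinear 2-simplicial scores and a DSA-style top-k selector, can be sketched concretely. Shapes, normalization, and the use of plain QK scores for selection are illustrative assumptions, not the blog's exact formulation:

```python
import numpy as np

def two_simplicial_scores(Q, K, K2):
    # A[i, j, s] = sum_d Q[i, d] * K[j, d] * K2[s, d], a trilinear product:
    # the weight of the directed walk i -> j -> s, versus the pairwise
    # i -> j edge weight of ordinary attention.
    return np.einsum("id,jd,sd->ijs", Q, K, K2)

def sparsify_topk(Q, K, k):
    # DSA-flavored selector: keep, per query, the k keys with the largest
    # ordinary QK scores, shrinking the O(L^3) tensor toward O(L * k^2).
    scores = Q @ K.T                            # (L, L) pairwise scores
    return np.argsort(-scores, axis=1)[:, :k]   # indices of the top-k keys

L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, K2 = rng.normal(size=(3, L, d))
A = two_simplicial_scores(Q, K, K2)   # full O(L^3) score tensor: (8, 8, 8)
keep = sparsify_topk(Q, K, k=3)       # (8, 3) retained key indices per query
```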
👁️ Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields We find that visual-only features (DINO) outperform visual-geometry features (VGGT) in spatial tasks! 👇
🎓 Stanford CME295 Transformers & LLMs. Nice to see the release of this new course on Transformers and LLMs. A great way to catch up on the world of LLMs and AI Agents. Covers topics ranging from the basics of attention to mixture-of-experts and agents. Excited to see more on evals. The first lessons are available now. cme295.stanford.edu/syllabus…
How does training data shape model behavior? Well, it’s complicated… 1/10
TL;DR: I made a Transformer that conditions its generation on latent variables. To do so, a decoder Transformer only needs a source of randomness during generation, but it then needs an encoder for training, as in a [conditional] VAE. 1/5
The Free Transformer is such a "Transformer VAE". It mitigates the overhead by sharing half the layers between the encoder and the decoder, and having a single block specific to the latter. arxiv.org/abs/2510.17558 2/5
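The VAE-style split described in this thread, random latents at generation time and an encoder posterior only during training, can be shown schematically. This is a generic conditional-VAE sketch under assumed shapes and a Gaussian prior, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_generation(d=8):
    # Generation: no encoder needed, just draw from the prior p(z) = N(0, I)
    # and condition the decoder on the sample.
    return rng.normal(size=d)

def sample_latent_training(mu, log_var):
    # Training: the encoder outputs q(z|x) = N(mu, sigma^2); reparameterize
    # (z = mu + sigma * eps) so gradients can flow through the sample.
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

z_gen = sample_latent_generation()                                  # inference path
z_train = sample_latent_training(mu=np.zeros(8), log_var=np.zeros(8))  # training path
```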
New paper: You can make ChatGPT 2x as creative with one sentence. Ever notice how LLMs all sound the same? They know 100+ jokes but only ever tell one. Every blog intro: "In today's digital landscape..." We figured out why – and how to unlock the rest 🔓 Copy-paste prompt: 🧵
We cut the cost of training a diffusion model from months of rent to one night out. TREAD matches ImageNet performance of a DiT with 97% fewer A100 hours! No extra components. No extra losses. Training‑time only. Inference remains unchanged. Accepted at ICCV2025🌺
🧠 Are brain-inspired algorithms inherently unscalable compared to backpropagation? 🎉 Pleased to share that our work on scaling predictive coding to 100+ layer networks has been accepted at NeurIPS 2025. 💻 Notebook: thebuckleylab.github.io/jpc/… 📄 Paper: arxiv.org/abs/2505.13124
2/3 Francesco Innocenti @InnocFrancesco - a deeply suspicious (& brilliant) Ph.D. student @SussexUni @SussexAI - has now (w/ @drclbuckley) developed a new predictive coding method that scales to >100 layers w/ highly competitive benchmark performance
Compression techniques I’d study if I wanted small but smart LLMs. Bookmark this.
1. Quantization
2. Distillation
3. Low-Rank Adaptation
4. Weight Sharing
5. Sparse Matrices
6. Layer Dropping
7. Knowledge Transfer
8. Embedding Compression
9. Mixed Sparsity
10. Progressive Shrinking
11. Structured Pruning
12. AutoML Compression
Follow @asmah2107 to update your game on LLM optimisations.
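The first item on the list, quantization, can be shown in its simplest post-training form: symmetric per-tensor int8 rounding plus dequantization. A minimal sketch, with the single-scale scheme chosen for illustration (real toolchains use per-channel scales, calibration, etc.):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric post-training quantization: one scale per tensor, values
    # rounded into [-127, 127] and stored as int8 (4x smaller than fp32).
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0   # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights for compute.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # approximately w, at 8 bits per value
```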
'Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness', by Ye Tian, Yuqi Gu, Yang Feng. jmlr.org/papers/v26/23-0902.… #outlier #task #tasks
since the day it was announced, i've been dying to get my hands on DGX Spark; a small but powerful machine i can put on my desk to run the latest open models of almost any size. thanks to @nvidia, the dream came true a few weeks ago. look at this cutie sitting on my desk at NYU Global AI Frontier Lab. (1/6)