Following his first blog on "Attention Sinks from the Graph perspective",
@tensorqt has now released a new blogpost, titled "Beyond Attention as a Graph".
First and foremost, tensorqt explains why standard neural networks require depth, despite the issues it introduces in sequence modeling (most notably, gradient instabilities).
In the specific case of Transformers, however, depth, while still problematic, is easy to justify: since "attention is an operation that message-passes between pairs of tokens in a graph", depth (intended as the number of decoder layers) ends up approximating n-hop information transmission between pairs of tokens.
But what if these n hops of information passing between pairs of tokens could be approximated without resorting to depth?
To answer this, drawing on existing literature, 2-Simplicial Attention (and, more generally, Higher-order Attention) is introduced.
The intuition is the following: instead of giving the query a single key per token to attend to, one can project the tokens into two key subspaces, K = XW_k and K' = XW_k', which turns the attention calculation into a multilinear (here, trilinear) product.
The result is that while standard attention scores A_ij "represented the link weight going from node i to node j, now each entry can instead be seen as the collective weight assigned to the triangle determined by the (directed) walk from node i, passing through node j, and ending up in node s".
This idea can also be extended to n > 2 key projections; the equations describing the resulting n-order attention scores are in the attached picture (the case described above is n = 2).
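To make the n = 2 case concrete, here is a minimal single-head sketch. It is not taken from the blog: the trilinear score, the joint softmax over key pairs, and the element-wise combination of two value projections are assumptions about one plausible formulation.

```python
import torch

def two_simplicial_attention(x, W_q, W_k, W_k2, W_v, W_v2):
    """Minimal sketch of 2-simplicial attention (single head, no masking).

    Assumed formulation: the score of the triangle (i, j, s) is the
    trilinear form <q_i, k_j, k'_s>, softmax is taken jointly over (j, s),
    and two value projections are combined element-wise.
    """
    q  = x @ W_q    # (L, d)
    k1 = x @ W_k    # (L, d)  first key projection  K  = X W_k
    k2 = x @ W_k2   # (L, d)  second key projection K' = X W_k'
    v1 = x @ W_v    # (L, d)
    v2 = x @ W_v2   # (L, d)

    L, d = q.shape
    # Trilinear scores: A[i, j, s] = sum_c q[i, c] * k1[j, c] * k2[s, c]
    # (scaling by sqrt(d) is an assumption, mirroring standard attention)
    scores = torch.einsum("ic,jc,sc->ijs", q, k1, k2) / d ** 0.5

    # Joint softmax over all (j, s) pairs for each query i
    attn = torch.softmax(scores.reshape(L, L * L), dim=-1).reshape(L, L, L)

    # Combine the two value streams element-wise and aggregate
    return torch.einsum("ijs,jc,sc->ic", attn, v1, v2)
```

The L x L x L score tensor built by the first einsum already hints at the cost problem discussed next.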
It is immediate, though, that the already quadratic cost of ordinary attention explodes to O(L**(n+1)), where L is the sequence length and n the attention order.
One proposed way to address this builds on DeepSeek Sparse Attention (DSA): first, compute the dot product between the query vector of token i at each head h and a (shared across heads) indexer key for each token j, pass the result through a ReLU, and multiply it by a per-head learned weight.
Sum the resulting scores across heads and, for the actual attention computation, only retain for each query the k keys with the largest scores: the final computational complexity, in the context of standard attention, goes from O(L**2) to O(Lk).
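A rough sketch of that indexer-plus-top-k step, with names and shapes assumed for illustration (this is not DeepSeek's actual implementation):

```python
import torch

def dsa_topk_indices(x, W_q_idx, w_k_idx, head_weights, k):
    """Sketch of a DSA-style indexer that picks k candidate keys per query.

    x:            (L, d_model) token representations
    W_q_idx:      (H, d_model, d_idx) per-head indexer query projections
    w_k_idx:      (d_model, d_idx) key projection shared across heads
    head_weights: (H,) learned per-head weights
    k:            number of keys kept per query
    """
    q_idx = torch.einsum("ld,hde->hle", x, W_q_idx)   # (H, L, d_idx)
    k_idx = x @ w_k_idx                               # (L, d_idx), shared across heads

    # Per-head query-key dot products, ReLU, per-head weight, sum over heads
    scores = torch.einsum("hie,je->hij", q_idx, k_idx)        # (H, L, L)
    scores = torch.relu(scores) * head_weights[:, None, None]
    index_scores = scores.sum(dim=0)                          # (L, L)

    # For each query i, keep the k keys with the largest index scores;
    # full attention is then computed only over these candidates.
    return index_scores.topk(k, dim=-1).indices               # (L, k)
```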
As such, while the original paper sparsifies 2-simplicial attention with a local sliding window, tensorqt adapts DSA to n-order attention, testing his framework in the 2-simplicial case: he first substitutes the ReLU with a softmax, and then simplifies the scoring further by directly reusing the standard QK attention scores from previous layers and taking the top-k keys based on those, bringing the cost from O(L**(n+1)) down to O(L*k**n).
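Under the same assumptions as the dense sketch above, that simplified variant might look like this, where prev_attn is an assumed name for the previous layer's standard attention scores:

```python
import torch

def sparse_two_simplicial_attention(q, k1, k2, v1, v2, prev_attn, topk):
    """Sketch of top-k sparsified 2-simplicial attention, assuming candidate
    keys are selected from the previous layer's standard QK attention.

    q, k1, k2, v1, v2: (L, d) projections as in the dense sketch
    prev_attn:         (L, L) attention scores from the previous layer
    topk:              number of candidate keys kept per query
    """
    L, d = q.shape
    idx = prev_attn.topk(topk, dim=-1).indices   # (L, topk) candidates per query

    k1_sel, k2_sel = k1[idx], k2[idx]            # (L, topk, d)
    v1_sel, v2_sel = v1[idx], v2[idx]            # (L, topk, d)

    # Trilinear scores only over the selected k x k pairs:
    # O(L * k^2) per layer instead of O(L^3) for the dense n = 2 case
    scores = torch.einsum("ic,ijc,isc->ijs", q, k1_sel, k2_sel) / d ** 0.5
    attn = torch.softmax(scores.reshape(L, -1), dim=-1).reshape(L, topk, topk)

    return torch.einsum("ijs,ijc,isc->ic", attn, v1_sel, v2_sel)
```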
All in all, given the potential of higher-order attention, further research into making it computationally tractable is welcome.
Link to the blog below: in the picture, the aforementioned equations governing n-order attention.