Math @ AdMU • NanoGPT speedrunner • Muon fan 🤍 • prev ML @ XPD • 2x IOI & 2x ICPC • admonymous.co/leloy

Joined November 2018
Pinned Tweet
I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon. --- The general solution turned out to be much simpler than I thought. And it should generalize to any combination of (underlying manifold, Finsler norm) and any number of extra constraints on the updates, so long as the feasible set for each constraint is convex. --- I now consider this class of problems sufficiently solved (by my definition of 'solved'), and thus I'm moving on to other things I'm interested in.
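A plausible way to write this problem class down (the notation here is illustrative, not necessarily the exact formulation in the thread):

```latex
% Plausible formalization: the steepest-descent direction at a point W on a
% manifold M, under a Finsler norm, subject to extra convex constraints C_1,...,C_k.
\[
A^{\star} \;=\; \operatorname*{arg\,max}_{A}\; \langle G, A \rangle
\quad \text{s.t.} \quad
\|A\|_{W} \le 1, \qquad
A \in T_{W}\mathcal{M}, \qquad
A \in C_{1} \cap \dots \cap C_{k},
\]
% where \|\cdot\|_W is the norm the Finsler structure assigns to the tangent
% space T_W M, and each extra constraint set C_i is convex.
```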
I managed to train a 1-Lipschitz, 2-layer MLP to grok on the Addition-Modulo-113 task (40-60% train-test split) in just 44 full-batch steps. This is more evidence that "we just need to scale up" is a brainworm, and that being smart about the choice of geometry we 'place' our weights in not only leads to really stable models but also accelerates generalization. --- Tbh, I haven't fully optimized my code yet, so I'm open-sourcing it & challenging you to beat my 'record'
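For concreteness, here's a minimal sketch (not the open-sourced Colab code) of what a 1-Lipschitz, 2-layer MLP can look like in PyTorch, with the bound enforced by clamping each weight matrix's singular values to at most 1; the layer setup and projection step are illustrative choices:

```python
import torch
import torch.nn as nn

def project_spectral_norm(weight: torch.Tensor, max_norm: float = 1.0) -> torch.Tensor:
    """Project a weight matrix onto the spectral-norm ball of radius max_norm
    by clamping its singular values."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U @ torch.diag(S.clamp(max=max_norm)) @ Vh

class LipschitzMLP(nn.Module):
    """2-layer MLP that is 1-Lipschitz whenever both weight matrices have
    spectral norm <= 1, since ReLU is itself 1-Lipschitz.
    (Biases don't change the Lipschitz constant; dropped here for brevity.)"""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden, bias=False)
        self.fc2 = nn.Linear(d_hidden, d_out, bias=False)

    @torch.no_grad()
    def enforce_lipschitz(self) -> None:
        # Call after every optimizer step to stay inside the constraint set.
        self.fc1.weight.copy_(project_spectral_norm(self.fc1.weight))
        self.fc2.weight.copy_(project_spectral_norm(self.fc2.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```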
This too!
We also showed this in February on modded nanoGPT. The difference is even more stark when you drop Adam completely (Scion).
This is not a new result btw! See:
Replying to @micahgoldblum
On top of hyperparameter robustness, small batch training makes training robust to the choice of optimizer too. We observe great performance with memory-efficient optimizers like Adafactor, and even vanilla SGD without momentum performs nearly as well as Adam. 7/n
I recommend reading this work by this awesome team. They also helped me test the claim that larger batch sizes widen the gap between optimizers like Muon and AdamW. Here's an early result (see pic). And I think this would be a decent interview question: can you explain why this happens and how this speeds up larger-scale pretraining?
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
leloy! retweeted
Amazing work by the PyTorch team!
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 hubs.la/Q03H6S9D0 #PyTorch #OpenSourceAI
This is a much-needed boost to my self-confidence atm. Thank you so much @_arohan_ !! Can vouch that these folks are looking at things in a different way. And I hope I can live up to expectations 🙏
Replying to @leloykun @kalomaze
Triton implementation by @WhyPhy_Labs !
We are excited to announce WhyPhy’s first open-source release: S3GD. S3GD is a PyTorch-compatible, fused-kernel implementation of the Smoothed SignSGD optimizer. Check out the repo on GitHub, and read more on our website or Hugging Face (links below)
Also check out @Jianlin_S's solution:
This turned out to have a rather simple solution taking advantage of the fact that the feasible sets of the constraints are convex: the primal-dual approach!
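Here's a minimal sketch of that primal-dual (dual-subgradient) idea for one concrete instance: spectral-norm steepest descent with the Stiefel tangent-space constraint W^T A + A^T W = 0. The constraint choice, step size, and iteration count are illustrative assumptions, not necessarily what's used in the thread:

```python
import torch

def spectral_lmo(M: torch.Tensor) -> torch.Tensor:
    """argmax_{||A||_2 <= 1} <M, A>  =  U @ V^T from the reduced SVD of M."""
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

def constrained_steepest_descent(G: torch.Tensor, W: torch.Tensor,
                                 steps: int = 100, eta: float = 0.1) -> torch.Tensor:
    """Dual-subgradient sketch for
        max_A <G, A>   s.t.   ||A||_2 <= 1   and   W^T A + A^T W = 0,
    i.e. a unit-spectral-norm update in the tangent space at W on the Stiefel
    manifold. Lam is the multiplier for the (symmetric) equality constraint.
    The returned A is only approximately feasible; in practice you'd average
    iterates or project at the end.
    """
    Lam = torch.zeros(W.shape[1], W.shape[1], dtype=W.dtype, device=W.device)
    A = spectral_lmo(G)
    for _ in range(steps):
        # Primal step: norm-constrained linear maximization on the shifted gradient.
        A = spectral_lmo(G - W @ (Lam + Lam.T))
        # Dual step: move the multiplier along the constraint residual.
        residual = W.T @ A + A.T @ W
        Lam = Lam + eta * residual
    return A
```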
Here's the problem we're trying to solve (see pic). More intuitively: given a "raw gradient" G, we want to find the optimal update A that (1) is maximal according to some norm, i.e., maximally aligned with G subject to a unit-norm bound, and (2) lies in the tangent space at the current point W on the manifold.
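Since the picture isn't reproduced here, a plausible reconstruction of the problem it shows:

```latex
% Plausible reconstruction of the problem in the referenced picture
% (the update is then roughly W <- retract(W - eta * A^*)):
\[
A^{\star} \;=\; \operatorname*{arg\,max}_{A \in T_{W}\mathcal{M},\; \|A\| \le 1} \langle G, A \rangle .
\]
```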
Finsler manifolds are a generalization of Riemannian manifolds: we just drop the requirement that the 'measure of distance' come from a bilinear form, and voilà. Think of it this way: imagine some Lovecraftian horror pulls you into its realm, and with you is a ruler. It 'looks' the same in any direction, but its ticks recalibrate as you rotate it. That's what it's like to be in a Finsler geometry. en.wikipedia.org/wiki/Finsle…
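In symbols, the difference is roughly just where the tangent-space norm comes from (standard definitions, paraphrased):

```latex
% Riemannian: the tangent-space norm comes from an inner product (a bilinear form),
%   so the ruler looks the same in every direction.
% Finsler (roughly): each tangent space just gets a norm F(x, .) that need not
%   come from any inner product, so the ruler can measure directions differently.
\[
\text{Riemannian: } \|v\|_{x} = \sqrt{g_{x}(v, v)}
\qquad \text{vs.} \qquad
\text{Finsler: } \|v\|_{x} = F(x, v).
\]
```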
But then...
1. What if we want to do steepest descent under other norms?
2. What if we want to constrain the weights to "live in" a manifold, to prevent them from blowing up to infinity or degenerating to 0?
Now we have to do optimization on Finsler manifolds.
First, here's a minimal construction of the Muon optimizer:
1. Take R^{m x n}
2. Equip the tangent spaces with the spectral norm
3. Do first-order optimization on the resulting manifold
4. Add momentum and voilà, you get Muon.
---
What I like about this approach is that we have solid reasons for doing each step:
1. Linear weights are matrices
2. arxiv.org/abs/2310.17813
3. Why do second-order optimization when first-order would suffice?
4. Momentum accelerates optimization + handles gradient noise
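A minimal PyTorch sketch of steps 1-4; the cubic Newton-Schulz iteration below stands in for the tuned quintic (and Nesterov-style momentum) used in practice, and the hyperparameters are illustrative:

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 10, eps: float = 1e-7) -> torch.Tensor:
    """Approximate U @ V^T from the SVD M = U S V^T, i.e. the steepest-descent
    direction under the spectral norm, without computing an explicit SVD."""
    X = M / (M.norm() + eps)             # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the short-and-wide shape (cheaper)
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic iteration pushes singular values toward 1
    return X.T if transposed else X

@torch.no_grad()
def muon_like_step(param: torch.Tensor, grad: torch.Tensor,
                   momentum_buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a 2D weight: momentum, then a unit-spectral-norm step."""
    momentum_buf.mul_(beta).add_(grad)                   # step 4: momentum
    update = newton_schulz_orthogonalize(momentum_buf)   # steps 2-3: spectral-norm steepest descent
    param.add_(update, alpha=-lr)                        # step 1: the weight is just a matrix in R^{m x n}
```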
I have so many interests I find it hard to focus on any of them. I wanna study algebraic topology, category theory, and optimization on Finsler manifolds, but I also wanna build. I can build the entire AI infra of an AI SaaS, even the UI. I've done it before. Yet here I am, working on something I'm not an expert at yet
Finally, here's the code: colab.research.google.com/dr… I challenge you to try to beat my record! Rules:
- You must reach a median 95% evaluation accuracy across (at least) 64 random seeds
- Your model has to be ~1-Lipschitz for a fair comparison
- Active parameters per token must be at most as many as in my current record
I'll set up another track for transformer models. Have fun!
Also check out this article by @Jianlin_S on solving the steepest descent problem on the Stiefel manifold. It's been fun going back-and-forth with him on the topic! It does seem that the performance gap between our methods is negligible at scale, although at much smaller scales one can easily find adversarial examples (e.g., via backpropagation) that widen the gap. x.com/Jianlin_S/status/19536…
If you wanna read more about our paper on training transformers with enforced Lipschitz bounds, please check out this awesome thread by @LakerNewhouse. These fellas were awesome collaborators 💙
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.