Math @ AdMU • NanoGPT speedrunner • Muon fan 🤍 • prev ML @ XPD • 2x IOI & 2x ICPC • admonymous.co/leloy

Joined November 2018
Pinned Tweet
I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon. --- The general solution turned out to be much simpler than I thought. And it should generalize to any combination of (underlying manifold, Finsler norm) and any number of extra constraints on the updates, so long as the feasible set for each constraint is convex. --- I now consider this class of problems sufficiently solved (by my definition of 'solved'), and thus I'm moving on to other things I'm interested in.
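A plausible way to write this problem class down (the notation here is illustrative, not necessarily the exact formulation in the thread):

```latex
% Plausible formalization: the steepest-descent direction at a point W on a
% manifold M, under a Finsler norm, subject to extra convex constraints C_1,...,C_k.
\[
A^{\star} \;=\; \operatorname*{arg\,max}_{A}\; \langle G, A \rangle
\quad \text{s.t.} \quad
\|A\|_{W} \le 1, \qquad
A \in T_{W}\mathcal{M}, \qquad
A \in C_{1} \cap \dots \cap C_{k},
\]
% where \|\cdot\|_W is the norm the Finsler structure assigns to the tangent
% space T_W M, and each extra constraint set C_i is convex.
```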
I managed to train a 1-Lipschitz, 2-layer MLP to grok on the Addition-Modulo-113 task (40-60% train-test split) in just 44 full-batch steps. This is more evidence that "we just need to scale up" is a brainworm, and that being smart about the choice of geometry we 'place' our weights in not only leads to really stable models but also accelerates generalization. --- Tbh, I haven't fully optimized my code yet, so I'm open-sourcing it & challenging you to beat my 'record'
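For concreteness, here's a minimal sketch (not the open-sourced Colab code) of what a 1-Lipschitz, 2-layer MLP can look like in PyTorch, with the bound enforced by clamping each weight matrix's singular values to at most 1; the layer setup and projection step are illustrative choices:

```python
import torch
import torch.nn as nn

def project_spectral_norm(weight: torch.Tensor, max_norm: float = 1.0) -> torch.Tensor:
    """Project a weight matrix onto the spectral-norm ball of radius max_norm
    by clamping its singular values."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U @ torch.diag(S.clamp(max=max_norm)) @ Vh

class LipschitzMLP(nn.Module):
    """2-layer MLP that is 1-Lipschitz whenever both weight matrices have
    spectral norm <= 1, since ReLU is itself 1-Lipschitz.
    (Biases don't change the Lipschitz constant; dropped here for brevity.)"""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden, bias=False)
        self.fc2 = nn.Linear(d_hidden, d_out, bias=False)

    @torch.no_grad()
    def enforce_lipschitz(self) -> None:
        # Call after every optimizer step to stay inside the constraint set.
        self.fc1.weight.copy_(project_spectral_norm(self.fc1.weight))
        self.fc2.weight.copy_(project_spectral_norm(self.fc2.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```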
This too!
We also showed this in February on modded nanoGPT. The difference is even more stark when you drop Adam completely (Scion).
This is not a new result btw! See:
Replying to @micahgoldblum
On top of hyperparameter robustness, small batch training makes training robust to the choice of optimizer too. We observe great performance with memory-efficient optimizers like Adafactor, and even vanilla SGD without momentum performs nearly as well as Adam. 7/n
I recommend reading this work by this awesome team. They also helped me test the claim that larger batch sizes widen the gap between optimizers like Muon and AdamW. Here's an early result (see pic). And I think this would be a decent interview question: can you explain why this happens and how this speeds up larger-scale pretraining?
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
leloy! retweeted
Amazing work by the PyTorch team!
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 hubs.la/Q03H6S9D0 #PyTorch #OpenSourceAI
This is a much-needed boost to my self-confidence atm. Thank you so much @_arohan_ !! Can vouch that these folks are looking at things in a different way. And I hope I can live up to expectations 🙏
Replying to @leloykun @kalomaze
Triton implementation by @WhyPhy_Labs !
We are excited to announce WhyPhy’s first open-source release: S3GD. S3GD is a PyTorch-compatible, fused-kernel implementation of the Smoothed SignSGD optimizer. Check out the repo on GitHub, and read more on our website or Hugging Face (links below)
Also check out @Jianlin_S's solution:
This turned out to have a rather simple solution taking advantage of the fact that the feasible sets of the constraints are convex: the primal-dual approach!
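Here's a minimal sketch of that primal-dual (dual-subgradient) idea for one concrete instance: spectral-norm steepest descent with the Stiefel tangent-space constraint W^T A + A^T W = 0. The constraint choice, step size, and iteration count are illustrative assumptions, not necessarily what's used in the thread:

```python
import torch

def spectral_lmo(M: torch.Tensor) -> torch.Tensor:
    """argmax_{||A||_2 <= 1} <M, A>  =  U @ V^T from the reduced SVD of M."""
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

def constrained_steepest_descent(G: torch.Tensor, W: torch.Tensor,
                                 steps: int = 100, eta: float = 0.1) -> torch.Tensor:
    """Dual-subgradient sketch for
        max_A <G, A>   s.t.   ||A||_2 <= 1   and   W^T A + A^T W = 0,
    i.e. a unit-spectral-norm update in the tangent space at W on the Stiefel
    manifold. Lam is the multiplier for the (symmetric) equality constraint.
    The returned A is only approximately feasible; in practice you'd average
    iterates or project at the end.
    """
    Lam = torch.zeros(W.shape[1], W.shape[1], dtype=W.dtype, device=W.device)
    A = spectral_lmo(G)
    for _ in range(steps):
        # Primal step: norm-constrained linear maximization on the shifted gradient.
        A = spectral_lmo(G - W @ (Lam + Lam.T))
        # Dual step: move the multiplier along the constraint residual.
        residual = W.T @ A + A.T @ W
        Lam = Lam + eta * residual
    return A
```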
Here's the problem we're trying to solve (see pic). More intuitively: given a "raw gradient" G, we want to find the optimal update A that (1) is maximal according to some norm, i.e., maximally aligned with G subject to a unit-norm bound, and (2) lies in the tangent space at the current point W on the manifold.
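Since the picture isn't reproduced here, a plausible reconstruction of the problem it shows:

```latex
% Plausible reconstruction of the problem in the referenced picture
% (the update is then roughly W <- retract(W - eta * A^*)):
\[
A^{\star} \;=\; \operatorname*{arg\,max}_{A \in T_{W}\mathcal{M},\; \|A\| \le 1} \langle G, A \rangle .
\]
```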
Finsler manifolds are a generalization of Riemannian manifolds: we just drop the requirement that the 'measure of distance' come from a bilinear form, and voilà. Think of it this way: imagine some Lovecraftian horror pulls you into its realm, and with you is a ruler. It 'looks' the same in any direction, but its ticks recalibrate as you rotate it. That's what it's like to be in a Finsler geometry. en.wikipedia.org/wiki/Finsle…
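In symbols, the difference is roughly just where the tangent-space norm comes from (standard definitions, paraphrased):

```latex
% Riemannian: the tangent-space norm comes from an inner product (a bilinear form),
%   so the ruler looks the same in every direction.
% Finsler (roughly): each tangent space just gets a norm F(x, .) that need not
%   come from any inner product, so the ruler can measure directions differently.
\[
\text{Riemannian: } \|v\|_{x} = \sqrt{g_{x}(v, v)}
\qquad \text{vs.} \qquad
\text{Finsler: } \|v\|_{x} = F(x, v).
\]
```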
But then...
1. What if we want to do steepest descent under other norms?
2. What if we want to constrain the weights to "live in" a manifold, to prevent them from blowing up to infinity or degenerating to 0?
Now we have to do optimization on Finsler manifolds.
First, here's a minimal construction of the Muon optimizer:
1. Take R^{m x n}
2. Equip the tangent spaces with the spectral norm
3. Do first-order optimization on the resulting manifold
4. Add momentum and voilà, you get Muon.
---
What I like about this approach is that we have solid reasons for doing each step:
1. Linear weights are matrices
2. arxiv.org/abs/2310.17813
3. Why do second-order optimization when first-order would suffice?
4. Momentum accelerates optimization + handles gradient noise
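A minimal PyTorch sketch of steps 1-4; the cubic Newton-Schulz iteration below stands in for the tuned quintic (and Nesterov-style momentum) used in practice, and the hyperparameters are illustrative:

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 10, eps: float = 1e-7) -> torch.Tensor:
    """Approximate U @ V^T from the SVD M = U S V^T, i.e. the steepest-descent
    direction under the spectral norm, without computing an explicit SVD."""
    X = M / (M.norm() + eps)             # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the short-and-wide shape (cheaper)
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic iteration pushes singular values toward 1
    return X.T if transposed else X

@torch.no_grad()
def muon_like_step(param: torch.Tensor, grad: torch.Tensor,
                   momentum_buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a 2D weight: momentum, then a unit-spectral-norm step."""
    momentum_buf.mul_(beta).add_(grad)                   # step 4: momentum
    update = newton_schulz_orthogonalize(momentum_buf)   # steps 2-3: spectral-norm steepest descent
    param.add_(update, alpha=-lr)                        # step 1: the weight is just a matrix in R^{m x n}
```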
I have so many interests I find it hard to focus on any of them. I wanna study algebraic topology, category theory, and optimization on Finsler manifolds, but I also wanna build. I can build the entire AI infra of an AI SaaS, even the UI. I've done it before. Yet here I am, working on something I'm not an expert at yet
Finally, here's the code: colab.research.google.com/dr… I challenge you to try to beat my record! Rules:
- You must reach a median 95% evaluation accuracy across (at least) 64 random seeds
- Your model has to be ~1-Lipschitz for a fair comparison
- Active parameters per token must be at most as many as in my current record
I'll set up another track for transformer models. Have fun!
Also check out this article by @Jianlin_S on solving the steepest descent problem on the Stiefel manifold. It's been fun going back-and-forth with him on the topic! It does seem that the performance gap between our methods is negligible at scale, although at much smaller scales one can easily find adversarial examples (e.g., via backpropagation) that widen the gap. x.com/Jianlin_S/status/19536…
If you wanna read more about our paper on training transformers with enforced Lipschitz bounds, please check out this awesome thread by @LakerNewhouse. These fellas were awesome collaborators 💙
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.