Torch C++ & CUDA optimization dev all day today + tomorrow, streaming here/YT/Twitch. The goals: 1) make PufferLib do RL at 10M steps/second, 2) eliminate hard-to-profile sources of potential bottlenecks, 3) see how simple we can make it. Some questions for GPU devs below.
Q: I have small nets and need to reduce kernel launches. My options are 1) suffer through CUDA graph hell, 2) write some big fused kernels, or 3) both. Fused kernels seem cool, but NVIDIA's cuBLAS matmul isn't open source. What do?
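For option 1, the core mechanism is stream capture: record the existing launch sequence once, then replay the whole thing with a single cudaGraphLaunch per step. A minimal sketch using the raw runtime API; the `scale` kernel and the 8-launches-per-step loop are placeholders for illustration, not PufferLib's actual workload:

```
#include <cuda_runtime.h>

// Hypothetical tiny kernel standing in for one op of a small net.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Capture requires a non-default stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the launch sequence once instead of paying launch
    // overhead for every kernel on every step.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 8; k++) {  // e.g. 8 small ops per step
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0001f, n);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature; on
    // CUDA 11 use cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0)

    // Replay: one launch per step instead of 8.
    for (int step = 0; step < 1000; step++) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

The "hell" part is that captured arguments are baked into the graph; if buffer pointers or shapes change between steps you need cudaGraphExecUpdate or a re-capture.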
Q: So far, fp32 kernels are pretty easy. Pretty much just writing C. What's the easiest way to add TF32, FP16, and BF16 support without making a bloody mess?

Nov 7, 2025 · 1:09 PM UTC

Q: My instinct is to avoid extra libraries unless absolutely necessary. I really, really don't like Triton from what I've seen, for instance (though I'd be less annoyed if it generated the kernels once, so I could include them statically in my project). I do need some level of tile-size tuning. What do?
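One library-free route to tile tuning: make the tile size a compile-time template parameter, instantiate the few candidate sizes statically, and dispatch at runtime. A minimal sketch; the naive tiled matmul body and the candidate sizes {8, 16, 32} are illustrative, not tuned:

```
#include <cuda_runtime.h>

// Tile size fixed at compile time so shared memory is statically
// sized and the inner loop fully unrolls.
template <int TILE>
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        As[threadIdx.y][threadIdx.x] = (row < N && t + (int)threadIdx.x < N)
            ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + (int)threadIdx.y < N && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

template <int TILE>
static void launch(const float* A, const float* B, float* C, int N,
                   cudaStream_t stream) {
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matmul_tiled<TILE><<<grid, block, 0, stream>>>(A, B, C, N);
}

// Runtime dispatch over the statically compiled variants.
void matmul_dispatch(const float* A, const float* B, float* C, int N,
                     int tile, cudaStream_t stream) {
    switch (tile) {
        case 8:  launch<8>(A, B, C, N, stream);  break;
        case 16: launch<16>(A, B, C, N, stream); break;
        default: launch<32>(A, B, C, N, stream); break;
    }
}
```

The "tuning" can then be as simple as timing each candidate once at startup and caching the winner per shape, which keeps everything statically compiled into the binary.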
Replying to @jsuarez5341
All FP16 please. See the "FP32 RL for LLMs" paper, or "how to just fix RL for fine tuning".
Not remotely relevant here. We can deploy these models in fp32 on a toaster.
Replying to @jsuarez5341
Stupid question, but why not just stick with fp32?
Replying to @jsuarez5341
Use C++ templates?
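For what it's worth, that's roughly the shape of an answer to the TF32/FP16/BF16 question above: one kernel body templated on the storage type, with tiny conversion shims so the math stays in fp32. A minimal sketch; the axpy kernel and the to_f32/from_f32 helpers are illustrative. Note TF32 isn't a storage type, it's a tensor-core matmul mode you opt in to (e.g. via cuBLAS math modes), so it doesn't need its own template variant:

```
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Conversion shims: load any storage type, compute in float, store back.
__device__ __forceinline__ float to_f32(float x)         { return x; }
__device__ __forceinline__ float to_f32(__half x)        { return __half2float(x); }
__device__ __forceinline__ float to_f32(__nv_bfloat16 x) { return __bfloat162float(x); }

template <typename T> __device__ __forceinline__ T from_f32(float x);
template <> __device__ __forceinline__ float from_f32<float>(float x) { return x; }
template <> __device__ __forceinline__ __half from_f32<__half>(float x) { return __float2half(x); }
template <> __device__ __forceinline__ __nv_bfloat16 from_f32<__nv_bfloat16>(float x) { return __float2bfloat16(x); }

// One kernel body for all three dtypes; arithmetic stays in fp32.
template <typename T>
__global__ void axpy(const T* __restrict__ x, T* __restrict__ y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = from_f32<T>(a * to_f32(x[i]) + to_f32(y[i]));
}

// Explicit instantiation keeps all variants in one translation unit.
template __global__ void axpy<float>(const float*, float*, float, int);
template __global__ void axpy<__half>(const __half*, __half*, float, int);
template __global__ void axpy<__nv_bfloat16>(const __nv_bfloat16*, __nv_bfloat16*, float, int);
```

Host side, a switch over a dtype enum picks the instantiation, the same pattern as the tile-size dispatch above.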