Torch C++ & CUDA optimization dev all day today + tomorrow, streaming here/yt/twitch. The goal is to
1) Make PufferLib do RL at 10M steps/second
3) Eliminate hard-to-profile sources of potential bottlenecks
3) See how simple we can make it
Some questions for GPU devs below
Q: So far, fp32 kernels are pretty easy. Pretty much just writing C. What's the easiest way to add TF32, FP16, and BF16 support without making a bloody mess?
Nov 7, 2025 · 1:09 PM UTC
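One pattern that might keep the precision question above from turning into a mess: template the kernel on the storage type, convert to fp32 on load, do all the math in fp32, and convert back on store. Below is a minimal CPU-side sketch of that idea, not a recommendation from any library. The `bf16` struct, `to_float`, `from_float`, and `scaled_add` are all hypothetical names made up for illustration. In a real CUDA build the storage types would be `__half` and `__nv_bfloat16`, with the conversions done by `__half2float` / `__float2half` and `__bfloat162float` / `__float2bfloat16`. TF32 is a tensor-core math mode on fp32 storage rather than a storage type, so it would not need its own instantiation here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical software bfloat16: the top 16 bits of an IEEE fp32 value.
// On the GPU this role would be played by __nv_bfloat16 from <cuda_bf16.h>.
struct bf16 { std::uint16_t bits; };

// Load-side conversions: one overload per storage type.
inline float to_float(float x) { return x; }
inline float to_float(bf16 x) {
    std::uint32_t u = static_cast<std::uint32_t>(x.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Store-side conversions: one specialization per storage type.
template <typename T> T from_float(float x);
template <> inline float from_float<float>(float x) { return x; }
template <> inline bf16 from_float<bf16>(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    // Assumption for brevity: truncate instead of round-to-nearest-even.
    return bf16{static_cast<std::uint16_t>(u >> 16)};
}

// Kernel body written once: load T, compute in fp32, store T.
// In a CUDA kernel the loop index would come from the thread/block indices.
template <typename T>
void scaled_add(const T* a, const T* b, T* out, float alpha, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float acc = to_float(a[i]) + alpha * to_float(b[i]);
        out[i] = from_float<T>(acc);
    }
}
```

With this shape, adding a new storage type means adding one `to_float` overload and one `from_float` specialization. The kernel bodies stay "pretty much just writing C".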
Q: My instinct is to avoid extra libraries unless absolutely necessary. Really, really don't like Triton from what I see, for instance (though I'd be less annoyed if it would generate the kernels once, so I could then include them statically in my project). I do need some level of tile size tuning. What do?
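One library-free way to get the tile size tuning asked about above might be to make the tile size a template parameter, instantiate a small fixed set of variants at compile time, and time them once at startup to pick one. Here is a CPU-side sketch of that idea under those assumptions. `matmul_tiled`, `Variant`, `kVariants`, and `pick_fastest` are hypothetical names, and the tile sizes 8/16/32 are arbitrary. In CUDA the same structure would presumably template the kernel on the shared-memory tile and time launches with CUDA events instead of `std::chrono`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices, blocked by a compile-time TILE.
template <int TILE>
void matmul_tiled(const float* A, const float* B, float* C, int n) {
    std::fill(C, C + n * n, 0.0f);
    for (int i0 = 0; i0 < n; i0 += TILE)
        for (int k0 = 0; k0 < n; k0 += TILE)
            for (int j0 = 0; j0 < n; j0 += TILE)
                for (int i = i0; i < std::min(i0 + TILE, n); ++i)
                    for (int k = k0; k < std::min(k0 + TILE, n); ++k)
                        for (int j = j0; j < std::min(j0 + TILE, n); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}

using MatmulFn = void (*)(const float*, const float*, float*, int);
struct Variant { int tile; MatmulFn fn; };

// All variants are compiled statically; there is no runtime code generation.
static const Variant kVariants[] = {
    {8, matmul_tiled<8>}, {16, matmul_tiled<16>}, {32, matmul_tiled<32>},
};

// Time each variant once on an n x n problem and return the fastest.
// A single timing run is noisy; a real tuner would warm up and repeat.
inline const Variant& pick_fastest(int n) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<float> A(count, 1.0f), B(count, 1.0f), C(count);
    const Variant* best = &kVariants[0];
    double best_seconds = 1e300;
    for (const Variant& v : kVariants) {
        auto t0 = std::chrono::steady_clock::now();
        v.fn(A.data(), B.data(), C.data(), n);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        if (dt.count() < best_seconds) { best_seconds = dt.count(); best = &v; }
    }
    return *best;
}
```

Calling `pick_fastest(n).fn` once per problem shape and caching the result keeps the tuning cost out of the hot loop. The tradeoff is binary size and compile time, which both grow with the number of instantiated variants.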



