Torch C++ & CUDA optimization dev all day today + tomorrow, streaming here/YT/Twitch. The goals: 1) make PufferLib do RL at 10M steps/second, 2) eliminate hard-to-profile sources of potential bottlenecks, 3) see how simple we can make it. Some questions for GPU devs below.
Q: I have small nets and need to reduce kernel launches. My options are 1) suffer through CUDA graph hell, 2) write some big fused kernels, or 3) both. Fused kernels seem cool, but NVIDIA's cuBLAS matmul isn't open source. What do?
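For option 1, the core mechanism is stream capture: record the existing launch sequence once, then replay the whole thing with a single cudaGraphLaunch per step. A minimal sketch using the raw runtime API; the `scale` kernel and the 8-launches-per-step loop are placeholders for illustration, not PufferLib's actual workload:

```
#include <cuda_runtime.h>

// Hypothetical tiny kernel standing in for one op of a small net.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Capture requires a non-default stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the launch sequence once instead of paying launch
    // overhead for every kernel on every step.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 8; k++) {  // e.g. 8 small ops per step
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0001f, n);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature; on
    // CUDA 11 use cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0)

    // Replay: one launch per step instead of 8.
    for (int step = 0; step < 1000; step++) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

The "hell" part is that captured arguments are baked into the graph; if buffer pointers or shapes change between steps you need cudaGraphExecUpdate or a re-capture.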
Q: So far, fp32 kernels are pretty easy. Pretty much just writing C. What's the easiest way to add TF32, FP16, and BF16 support without making a bloody mess?

Nov 7, 2025 · 1:09 PM UTC

Q: My instinct is to avoid extra libraries unless absolutely necessary. I really, really don't like Triton from what I've seen, for instance (though I'd be less annoyed if it generated the kernels once, so I could include them statically in my project). I do need some level of tile-size tuning. What do?
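One library-free route to tile tuning: make the tile size a compile-time template parameter, instantiate the few candidate sizes statically, and dispatch at runtime. A minimal sketch; the naive tiled matmul body and the candidate sizes {8, 16, 32} are illustrative, not tuned:

```
#include <cuda_runtime.h>

// Tile size fixed at compile time so shared memory is statically
// sized and the inner loop fully unrolls.
template <int TILE>
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        As[threadIdx.y][threadIdx.x] = (row < N && t + (int)threadIdx.x < N)
            ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + (int)threadIdx.y < N && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

template <int TILE>
static void launch(const float* A, const float* B, float* C, int N,
                   cudaStream_t stream) {
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matmul_tiled<TILE><<<grid, block, 0, stream>>>(A, B, C, N);
}

// Runtime dispatch over the statically compiled variants.
void matmul_dispatch(const float* A, const float* B, float* C, int N,
                     int tile, cudaStream_t stream) {
    switch (tile) {
        case 8:  launch<8>(A, B, C, N, stream);  break;
        case 16: launch<16>(A, B, C, N, stream); break;
        default: launch<32>(A, B, C, N, stream); break;
    }
}
```

The "tuning" can then be as simple as timing each candidate once at startup and caching the winner per shape, which keeps everything statically compiled into the binary.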
Replying to @jsuarez5341
All FP16 please. See the "FP32 RL for LLMs" paper, or "how to just fix RL for fine tuning".
Not remotely relevant here. We can deploy these models in fp32 on a toaster.
Replying to @jsuarez5341
Stupid question, but why not just stick with fp32?
Replying to @jsuarez5341
Use C++ templates?
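For what it's worth, that's roughly the shape of an answer to the TF32/FP16/BF16 question above: one kernel body templated on the storage type, with tiny conversion shims so the math stays in fp32. A minimal sketch; the axpy kernel and the to_f32/from_f32 helpers are illustrative. Note TF32 isn't a storage type, it's a tensor-core matmul mode you opt in to (e.g. via cuBLAS math modes), so it doesn't need its own template variant:

```
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Conversion shims: load any storage type, compute in float, store back.
__device__ __forceinline__ float to_f32(float x)         { return x; }
__device__ __forceinline__ float to_f32(__half x)        { return __half2float(x); }
__device__ __forceinline__ float to_f32(__nv_bfloat16 x) { return __bfloat162float(x); }

template <typename T> __device__ __forceinline__ T from_f32(float x);
template <> __device__ __forceinline__ float from_f32<float>(float x) { return x; }
template <> __device__ __forceinline__ __half from_f32<__half>(float x) { return __float2half(x); }
template <> __device__ __forceinline__ __nv_bfloat16 from_f32<__nv_bfloat16>(float x) { return __float2bfloat16(x); }

// One kernel body for all three dtypes; arithmetic stays in fp32.
template <typename T>
__global__ void axpy(const T* __restrict__ x, T* __restrict__ y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = from_f32<T>(a * to_f32(x[i]) + to_f32(y[i]));
}

// Explicit instantiation keeps all variants in one translation unit.
template __global__ void axpy<float>(const float*, float*, float, int);
template __global__ void axpy<__half>(const __half*, __half*, float, int);
template __global__ void axpy<__nv_bfloat16>(const __nv_bfloat16*, __nv_bfloat16*, float, int);
```

Host side, a switch over a dtype enum picks the instantiation, the same pattern as the tile-size dispatch above.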