day 67/100 of GPU Programming - practiced writing my very fast dot product kernel and an fp16 GEMM kernel
day 66/100 of GPU Programming - started reading the NVIDIA CUTLASS documentation - learnt how to write a CuTe DSL vector add kernel, currently the fastest on all available GPUs on leetgpu

Oct 6, 2025 · 6:29 PM UTC

Replying to @sadernoheart
Try warp-level primitives like __shfl_down_sync. Much faster.
on the dot product or fp16 gemm or both?
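
For the dot product case, the warp-level suggestion could look roughly like the sketch below. It is not the kernel from the posts above: the names `warp_reduce_sum`, `dot_kernel`, and `dot_on_gpu` are made up for illustration, the launch sizes are arbitrary, and error checking is left out. It assumes the block size is a multiple of 32 so every warp is full, and that the device supports `atomicAdd` on `float`. The thread does not establish whether this would beat the original kernel.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp using register shuffles.
// After the loop, lane 0 holds the total for the whole warp.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Hypothetical dot product kernel: grid-stride loop for per-thread partials,
// then a shuffle reduction per warp, then one atomicAdd per warp.
__global__ void dot_kernel(const float* a, const float* b, float* out, int n) {
    float partial = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        partial += a[i] * b[i];

    // Every thread in the warp reaches this call, so the full mask is safe.
    partial = warp_reduce_sum(partial);

    // Lane 0 of each warp adds its warp's total to the global result.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, partial);
}

// Host helper (illustrative only, no error checking).
float dot_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    if (n == 0) return 0.0f;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    int threads = 256;                       // multiple of 32: full warps only
    int blocks = (n + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;        // arbitrary cap; the loop covers the rest
    dot_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}

int main() {
    std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f);
    printf("dot = %.1f\n", dot_on_gpu(a, b));
    return 0;
}
```

The shuffle replaces the shared-memory tree reduction inside each warp, so no `__syncthreads()` is needed for that step. Because `atomicAdd` ordering varies between runs, the float result can differ in the last bits for general inputs.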