Been thinking of writing a longer blog post about it... specialized compute and general compute both have to deal with the same memory hierarchy (registers, scratchpad, L1, L2, near DRAM, far DRAM), and both pack a ton of flops. You get better utilization of those flops if your workload can stay in the closest memory for as long as possible, and that's what's primarily driving a ton of the innovation. While the core math engine looks more or less the same in GPUs vs specialized AI hardware, the CUDA/SIMT control plane enables a ton of creativity in exploiting the memory hierarchy, and it costs relatively little to keep in hardware. Rather than thinking specialized vs general, we should think in terms of a math engine, a memory hierarchy, and a control plane.
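To make the "keep the flops fed from the closest memory" point concrete, here's a rough roofline-style sketch in Python. The peak-flops and bandwidth numbers are made-up assumptions for illustration, not any particular chip.

```python
# Rough roofline-style sketch: why staying in close memory matters.
# The hardware numbers below are illustrative assumptions, not a real chip.

PEAK_FLOPS = 300e12      # assumed peak: 300 TFLOP/s
DRAM_BW    = 2e12        # assumed DRAM bandwidth: 2 TB/s

def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved if A (m x k), B (k x n), C (m x n) each touch DRAM once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def attainable_flops(intensity: float) -> float:
    """Roofline: you're either bandwidth-bound or compute-bound."""
    return min(PEAK_FLOPS, intensity * DRAM_BW)

for size in (128, 1024, 8192):
    ai = matmul_arithmetic_intensity(size, size, size)
    util = attainable_flops(ai) / PEAK_FLOPS
    print(f"{size}^3 matmul: {ai:7.1f} flops/byte -> at most {util:5.1%} of peak from DRAM")
```

Small problems are bandwidth-bound and big ones are compute-bound under these assumed numbers, which is why tiling and blocking to keep working sets in the closer levels of the hierarchy dominates kernel design on both kinds of hardware.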
On the surface you’d think that the convergence of model architecture to the Transformer would open the door for specialized hardware.
But somehow it feels like general purpose hardware (GP in GPGPU) is more useful now than ever.
Like back in the RNN and conv days it was relatively uncommon to need a new kernel, whereas now specialized kernels for new models are way more common.
I think it’s in part thanks to languages like Triton that make it easier, and in part that the hardware has gotten so fast that the overhead of implementing your SSM or attention in high-level ops is too high.
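For a sense of what Triton buys you, here's a minimal sketch of a fused row-wise softmax kernel, roughly in the style of the Triton tutorials; the block size, contiguous row layout, and test shapes are my assumptions, and it needs a CUDA device to run.

```python
# Minimal fused row-wise softmax in Triton (sketch, assumes contiguous rows).
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one row; the row stays in registers/SRAM,
    # so DRAM is touched once on the way in and once on the way out.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # numerically stable softmax
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, out, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out

x = torch.randn(64, 1000, device="cuda")
torch.testing.assert_close(softmax(x), torch.softmax(x, dim=-1))
```

The row never round-trips to DRAM between the max, exp, and sum, which is exactly the kind of memory-hierarchy exploitation the control plane makes cheap to express.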
But also there’s just a lot of interesting research and algorithmic change that needs custom kernels: MoEs, low-precision matmuls, variations on attention, linear state space models, …
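To give a flavor of why MoEs in particular push people toward custom kernels, here's a naive top-k routing pass written only in high-level PyTorch ops; the expert count, top_k, and dimensions are arbitrary assumptions for illustration. The per-expert loop with gather/scatter is the part that grouped-GEMM and fused dispatch kernels exist to replace.

```python
# Naive top-k MoE dispatch in plain PyTorch (illustrative sizes, not a real model).
import torch

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """x: (tokens, d), gate_w: (d, n_experts), expert_ws: (n_experts, d, d)."""
    scores = torch.softmax(x @ gate_w, dim=-1)              # routing probabilities
    weights, experts = torch.topk(scores, top_k, dim=-1)    # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the chosen experts
    out = torch.zeros_like(x)
    # The per-expert loop plus boolean gather/scatter below is what a fused
    # grouped-GEMM kernel collapses into a single pass over the tokens.
    for e in range(expert_ws.shape[0]):
        for slot in range(top_k):
            hit = experts[:, slot] == e
            if hit.any():
                out[hit] += weights[hit, slot].unsqueeze(-1) * (x[hit] @ expert_ws[e])
    return out

x = torch.randn(16, 32)
gate_w = torch.randn(32, 4)
expert_ws = torch.randn(4, 32, 32)
print(moe_forward(x, gate_w, expert_ws).shape)  # torch.Size([16, 32])
```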