Create, Clean, Consume is my aspirational routine. My interests: math, computer graphics, silicon, software, and music.

San Francisco
Joined December 2009
This is a tool we originally built for ourselves, and now we can't live without it in our daily development. If you need heterogeneous AI hardware (AMD, Apple, Intel, Nvidia, Tenstorrent...) at your fingertips for your development and creation workflows, and you switch between vendors frequently, OxCapsule is your tool. We are giving a select few users access to our hardware cluster now. Eventually we want to release this as a tool you can install on your own edge infra. Please watch the video to see how one of our creators uses this infrastructure... and yes, OxCapsule enables both compute and pixel streaming.
Raja Koduri retweeted
Firstly, thank you for all the love and appreciation for the teaser of #BaahubaliTheEternalWar. We have successfully passed the first step in a long journey ahead of bringing this film to theatres globally in 2027.

We loved the concept that @_Ishan_Shukla proposed to @ssrajamouli and me, and we put it into development immediately. While @sharma_sowmya18 and Ishan developed the story, we onboarded #ScottMosier (Dr. Seuss' The Grinch) to write a captivating screenplay! We roped in the knowledgeable @vinvaranasi as a mythology consultant to ensure authenticity, and @devakatta and @madhankarky to write the dialogues! @FireflyVFXIndia has helped with defining the rules for the new worlds. @mmkeeravaani garu is composing music to match the scale of the story!

To bring Ishan's vision to life, the team at @mihiravisualabs, which we co-founded, is building workflows, the tech stack, and infrastructure while working with talented Indian and international artists and animation studios, just like we did for @BaahubaliMovie. Some Indian artists working on concepts include Rupali Gatti, Gibby Joseph, Priyanka Chavan, Ajay Lele, and Sanjiv Waeerkar. The international studios we are working with to achieve world-class style and animation include @Aniventure_, #Zaratan, #alcyde, and #LesAndroidsAssociés. And of course, our excellent post-production partner @AnnapurnaStdios.

Again, we @arkamediaworks are trying to push the boundaries of scale and ambition, and with these incredible partners and a few more joining us soon, we feel we are in the right hands. Your appreciation has only increased our responsibility. Exciting times ahead! piped.video/GJKsRRX3v1Q?si=epDZ…
A break from politics - speaking about #AI and related topics in a chat with @RajaXg - as always, enjoyed listening to young startups & talking tech
Raja Koduri retweeted
Scaling Elections with GPUs and Mojo 🔥 I love occasionally optimizing obscure algorithms no one cares about. One of those, with O(n³) complexity like matrix multiplication, is the Schulze method for ranked-choice voting. So last year, at one of the AGI hackathons, I tried accelerating it across CPUs and GPUs with CUDA C++, Numba, and Mojo 🔥 It was an epic event and I really wanted to share the results sooner, but without AMD GPUs they didn't feel complete! We are talking about plurality of choice, after all!

Now, after gathering AMD MI355X results, the story is out. The Mojo variant didn't require a single line change, ran out of the box, and at scale delivered 4x the throughput of the H100. Don't take the article too seriously. It's just a fun exploration of modern GPU tech beyond LLMs: tropical semirings, blocked Floyd-Warshall, and the participation paradox in voting!

Blogpost: ashvardanian.com/posts/scali… Code: github.com/ashvardanian/Scal…

Overall, it was one of the best cozy gatherings that year: @clattner_llvm talked about Mojo's GPU support, @RajaXg explained the reasons for the success of the CUDA ecosystem over the last decade, @tri_dao gave the first public talk on FlashAttention-3 (now superseded by his FlashAttention-4), and @dylan522p shared some insights on the data-center market!

Thanks to @bztree, @verdagon, and Pradeep Ramani for collaborating with me on this one! Thanks to @JvNixon, @khoomeik, and @kylejohnmorris for bringing the AGI house together and hosting all of us, and to @nebiusai for providing the compute! Can't wait to gather again!
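For readers unfamiliar with the method, here is a minimal pure-Python sketch of the computation the post refers to. The names and layout are mine, not the article's or the repo's; the point is only to show the O(n³), Floyd-Warshall-shaped max-min ("tropical") strongest-path loop that the CUDA, Numba, and Mojo versions accelerate.

```python
import itertools

def schulze_winners(ballots, num_candidates):
    """Schulze method: pairwise preference counts, then widest (strongest) paths.

    ballots: list of complete rankings, each a list of candidate indices, best first.
    """
    n = num_candidates
    # d[a][b] = number of voters who rank a above b
    d = [[0] * n for _ in range(n)]
    for ballot in ballots:
        rank = {c: i for i, c in enumerate(ballot)}
        for a, b in itertools.combinations(range(n), 2):
            if rank[a] < rank[b]:
                d[a][b] += 1
            else:
                d[b][a] += 1

    # p[a][b] = strength of the strongest path from a to b; the triple loop below
    # is the O(n^3), Floyd-Warshall-shaped max-min ("tropical") relaxation
    p = [[d[a][b] if d[a][b] > d[b][a] else 0 for b in range(n)] for a in range(n)]
    for k in range(n):
        for i in range(n):
            if i == k:
                continue
            for j in range(n):
                if j != i and j != k:
                    p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))

    # a wins if no other candidate has a stronger path against it
    return [a for a in range(n) if all(p[a][b] >= p[b][a] for b in range(n) if b != a)]

# 4 candidates, 3 ballots each ranking all candidates, best first
print(schulze_winners([[0, 2, 1, 3], [1, 0, 2, 3], [0, 1, 3, 2]], 4))  # -> [0]
```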
Reusable Chiplets across vendors (for AI)
One of the best statements I've seen in a while
Been thinking of writing a longer blog post about it... Specialized compute and general compute both have to deal with the same memory hierarchy (registers, scratchpad, L1, L2, near DRAM, far DRAM), and both pack a ton of flops. You get better utilization of those flops if your workload can stay in the closest memory for as long as possible, and this is primarily what is driving a ton of the improvisation. While the primary math engine looks more or less the same in GPUs vs. special AI hardware, the CUDA/SIMT control plane enables a ton of creativity in exploiting the memory hierarchy, and it costs relatively little to keep in hardware. Rather than thinking specialized vs. general, we should think in terms of a math engine, a memory hierarchy, and a control plane.
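A back-of-the-envelope roofline sketch of that point, with made-up numbers (none of the figures below describe a real chip): achievable throughput is capped by arithmetic intensity times the bandwidth of whichever level of the hierarchy the working set actually lives in.

```python
# Back-of-the-envelope roofline. All numbers are illustrative placeholders.
PEAK_TFLOPS = 400.0
BANDWIDTH_TBPS = {            # hypothetical bandwidth at each memory level, TB/s
    "registers": 100.0,
    "scratchpad": 20.0,
    "L2": 8.0,
    "near DRAM": 3.0,
    "far DRAM": 0.5,
}

def achievable_tflops(intensity_flops_per_byte: float, level: str) -> float:
    # TB/s * flops/byte = Tflops/s, so the bound is min(peak, intensity * bandwidth)
    return min(PEAK_TFLOPS, intensity_flops_per_byte * BANDWIDTH_TBPS[level])

# The same blocked GEMM tile (~128 flops/byte) is compute-bound if its operands
# stay in scratchpad, and badly bandwidth-bound if every tile streams from far DRAM.
for level in BANDWIDTH_TBPS:
    print(f"{level:>10}: {achievable_tflops(128.0, level):6.1f} Tflops achievable")
```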
On the surface you'd think that the convergence of model architectures to the Transformer would open the door for specialized hardware. But somehow it feels like general-purpose hardware (the GP in GPGPU) is more useful now than ever. Back in the RNN and conv days it was relatively uncommon to need a new kernel; specialized kernels for models are way more common now. I think it's in part thanks to languages like Triton, which make it easier. In part the hardware has gotten so fast that the overhead of implementing your SSM or attention in high-level ops is too high. But there's also just a lot of interesting research and algorithmic change that needs custom kernels: MoEs, low-precision matmuls, variations on attention and linear state-space models, …
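A toy NumPy sketch of the kind of restructuring such a kernel does (this is not any particular library's implementation): the naive version builds attention from high-level ops and materializes the full N x N score matrix, while the blocked version streams key/value tiles with an online softmax, which is the access pattern FlashAttention-style fused kernels keep on-chip.

```python
import numpy as np

def naive_attention(Q, K, V):
    """High-level-op attention: materializes the full N x N score matrix in memory."""
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def blocked_attention(Q, K, V, block=64):
    """Streams K/V in tiles with an online softmax, never forming the N x N matrix."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of each score row
    denom = np.zeros(N)             # running softmax denominator
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                        # N x block tile of scores
        new_max = np.maximum(row_max, S.max(axis=-1))
        rescale = np.exp(row_max - new_max)           # correct previously accumulated terms
        P = np.exp(S - new_max[:, None])
        out = out * rescale[:, None] + P @ Vb
        denom = denom * rescale + P.sum(axis=-1)
        row_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), blocked_attention(Q, K, V))
```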
Reminded me of this quote...
the other bitter lesson: hard work only gets you so far. own your narrative, or else someone else will own it for you.
jax-ml.github.io/scaling-boo… Aditya Wagh shared this link on LinkedIn... looks interesting
Genuine question: all the breakthrough optimizations I see - KV cache, flash attention, quantization - seem to originate from CUDA/GPU land. Are TPUs innovating differently, or is my feed just GPU-biased? Would love examples of TPU-first optimization techniques that later crossed over. Drop links if you've got them!
Raja Koduri retweeted
I wrote this piece on HBF and NAND. I hope you find it helpful. Why did I turn bullish on NAND? feat. HBF open.substack.com/pub/semico…
"tiny" box with 6 Radeons arrived today 💃
I might have picked up the last red box 😀
How price-sensitive is this market? We're in this for the long game; we want to offer compute so cheap that no one can compete. For a limited time, we're dropping our prices: $10k for the red, $25k for the green. That's below cost for the red. Act fast!
AI videos can’t compete 😀
Laughter in every second of this video!!!!
Raja Koduri retweeted
i just made the best ai ad on this whole website. mostly by ripping off great movies. ocean's 11, but it's dogs robbing a casino, to promote @barkbox (dog toys and treats) Breakdown 👇
"Ha! Well, I'll make sure to clean up any Codex-generated mess I find. 😄" Claude getting cheeky