Designing an inference chip for robots is actually very difficult.
In a data center, each chip is bathed in a jacuzzi and babysat by nannies. If one dies, it is quickly hot-swapped with one of its clones.
The fault rate of GPUs in data centers is actually quite high. The industry-average annual failure rate of the H100 is around 9%. Ideal conditions might bring it down to 2%, but it never drops below the single digits.
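To see what those rates mean at scale, here is a back-of-the-envelope sketch. The 9% and 2% annual rates come from the text above; the fleet size of 1,000 GPUs is a hypothetical round number for illustration, not vendor data.

```python
# Back-of-the-envelope arithmetic on annual GPU failure rates.
# Rates (9% industry average, 2% ideal) are from the text; the
# fleet size is a hypothetical example.

def expected_failures(fleet_size: int, annual_rate: float) -> float:
    """Expected number of chip failures per year across a fleet."""
    return fleet_size * annual_rate

def mean_days_between_failures(fleet_size: int, annual_rate: float) -> float:
    """Average days between failures anywhere in the fleet."""
    return 365.0 / expected_failures(fleet_size, annual_rate)

fleet = 1000  # hypothetical data-center fleet
print(round(expected_failures(fleet, 0.09)))              # ~90 failures/year
print(round(mean_days_between_failures(fleet, 0.09), 1))  # a failure every ~4 days
print(round(mean_days_between_failures(fleet, 0.02), 1))  # even "ideal" is ~18 days
```

At a 9% annual rate, a 1,000-GPU fleet sees a failure roughly every four days, which is exactly why data centers are built around hot spares and monitoring.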
Fault recovery for a GPU node can also take a while, anywhere from minutes to hours. It is not instantaneous.
In robots, the chips are out in the cold and they need rapid self-recovery. The fault-tolerance requirements are in a different league. It is not uncommon for robotics companies to struggle to keep a chip running for more than a few hours without a reboot.
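The "rapid self-recovery" requirement can be sketched as a watchdog loop around the inference chip: heartbeat the device, and if it stops responding, power-cycle it and reload the model without any human in the loop. All of the callbacks below (`check_heartbeat`, `power_cycle`, `load_model`) are hypothetical placeholders, not a real driver API; this is a minimal sketch of the pattern, not a production implementation.

```python
# Minimal watchdog sketch for an on-robot inference chip.
# The three callbacks are hypothetical hooks into a device driver.
import time

MAX_MISSED_BEATS = 3  # tolerate brief stalls before forcing recovery

def watchdog_loop(check_heartbeat, power_cycle, load_model,
                  interval_s=0.5, max_ticks=None):
    """Poll the chip; after MAX_MISSED_BEATS missed heartbeats,
    power-cycle it and reload the model (rapid self-recovery)."""
    missed = 0
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        if check_heartbeat():
            missed = 0
        else:
            missed += 1
            if missed >= MAX_MISSED_BEATS:
                # No hot spare in the rack, no nanny: recover in place.
                power_cycle()
                load_model()
                missed = 0
        ticks += 1
        time.sleep(interval_s)
```

In a data center the same failure would be handled by swapping in another node; on a robot, this loop is the whole safety net, which is why the recovery path has to be fast and fully automatic.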
For chip companies, this is great: they can simply tell robotics companies to buy more chips for hot swapping.
For robotics companies, this is bad: it is obviously not a scalable solution, yet they are stuck in endless back-and-forth JIRA tickets with vendors.