Nvidia's Audio2Face-3D blows me away, but it needs dedicated inference servers + an Nvidia AI Enterprise license (ouch for solo builders)
Meta's Oculus Lipsync is lightweight enough to run locally on pretty much anything, but it only does basic visemes, no expressions
Is there a middle ground?🤔