DLLMs seem promising... but parallel generation is not always possible
Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one.
This makes diffusion-based LLMs highly attractive when we need fast generation with less compute.
A big question is … can we generate tokens in parallel without losing modeling accuracy?
In general, no. There are fundamental limits on how much parallelism we can achieve.
Consider this example:
“Pick one city uniformly at random from the following four cities:
New York, New Orleans, Mexico City, or Panama City.”
Then,
P(Y₁ = New, Y₂ = York) = 1/4,
P(Y₁ = New, Y₂ = Orleans) = 1/4, and so on.
Thus, P(Y₁ = New) = 1/2, P(Y₂ = City) = 1/2.
If you choose to generate Y₁ and Y₂ in parallel, no matter which decoding algorithm you use …
you’re doomed to sample “New City”: with probability 1/4 if you sample from the marginals independently, and every single time if you decode greedily.
None of today’s DLLMs can generate these two words correctly without giving up parallelism.
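You can check this yourself with a few lines of simulation. A minimal sketch using only the toy marginals above (no actual DLLM involved):

```python
# Toy simulation of parallel decoding on the example above: sample Y1 and Y2
# independently from their per-position marginals (which is all a single
# parallel step can use) and count how often the impossible "New City" appears.
import random
from collections import Counter

p_y1 = {"New": 0.5, "Mexico": 0.25, "Panama": 0.25}   # marginal of the first word
p_y2 = {"York": 0.25, "Orleans": 0.25, "City": 0.5}   # marginal of the second word

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

counts = Counter((sample(p_y1), sample(p_y2)) for _ in range(100_000))
print(counts["New", "City"] / 100_000)  # ~0.25, yet "New City" has probability 0 under the true joint
# Greedy parallel decoding is even worse: argmax gives Y1="New", Y2="City" every single time.
```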
-----
Why is this the case?
In fact, we never train LLMs to learn the joint distribution over multiple tokens in a single forward pass.
We only teach them a single-token marginal distribution conditioned on the context.
(The same holds for autoregressive models too.)
Therefore, sampling multiple tokens at once is only possible when those tokens are mutually independent given the current context.
And this limitation of parallel sampling can be precisely formalized.
One can derive an information-theoretic limit that’s decoding-strategy agnostic, and also derive strategy-specific limits.
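To get a feel for the strategy-agnostic part, here's the toy example worked out numerically (just arithmetic on the toy case, not the formal bound from the paper): independent parallel sampling sits at total variation distance 0.5 from the true joint, and half of its probability mass lands on strings that aren't one of the four cities at all.

```python
# Quantify the gap on the toy example: compare the true joint over (Y1, Y2)
# with the product of the per-position marginals, i.e. what any decoder that
# samples the two positions independently from those marginals produces.
from itertools import product

joint = {
    ("New", "York"): 0.25,
    ("New", "Orleans"): 0.25,
    ("Mexico", "City"): 0.25,
    ("Panama", "City"): 0.25,
}
p_y1 = {"New": 0.5, "Mexico": 0.25, "Panama": 0.25}
p_y2 = {"York": 0.25, "Orleans": 0.25, "City": 0.5}

# Distribution induced by independent parallel sampling.
indep = {(y1, y2): p_y1[y1] * p_y2[y2] for y1, y2 in product(p_y1, p_y2)}

tv = 0.5 * sum(abs(joint.get(pair, 0.0) - p) for pair, p in indep.items())
invalid = sum(p for pair, p in indep.items() if pair not in joint)

print(f"TV(joint, independent) = {tv}")      # 0.5
print(f"P(not a real city)     = {invalid}") # 0.5
```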
-----
So are DLLMs doomed? No!
They have huge potential to save compute and time.
But:
(1) we need to be aware of their fundamental limitations, and
(2) we need to design better training and decoding strategies.
In particular, there’s huge room for improvement in decoding.
Why?
Ideally, we want the model to control the degree of parallelism during generation.
At the same time, it should choose a subset of future tokens that are almost mutually independent given the current context.
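In practice, today's decoders mostly rely on much cruder signals. A common recipe is confidence-thresholded unmasking: commit every masked position whose top-token probability clears a threshold, and leave the rest for later steps. Here's a minimal sketch of that idea; the `model.token_distributions` interface is a hypothetical stand-in, not any particular library's API.

```python
# Sketch of confidence-thresholded parallel unmasking for a masked-diffusion LLM.
# `model.token_distributions` is a hypothetical method returning per-position
# distributions over the vocabulary; the point is only the control flow.
import torch

def decode(model, prompt_ids, gen_len, mask_id, threshold=0.9, max_steps=64):
    # Start with the whole completion masked.
    x = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)])
    for _ in range(max_steps):
        masked = (x == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        probs = model.token_distributions(x)      # [seq_len, vocab_size], hypothetical call
        conf, pred = probs[masked].max(dim=-1)    # per-position confidence and argmax token
        commit = conf >= threshold                # parallelism is set by a confidence threshold ...
        if not commit.any():                      # ... but always commit at least one token
            commit[conf.argmax()] = True
        x[masked[commit]] = pred[commit]
    return x

# The catch: high marginal confidence does not mean the committed tokens are
# (nearly) independent given the context, which is what parallel sampling
# actually needs. "New" and "City" can both be very confident.
```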
Are current decoding strategies good at this?
Hard to tell.
Most DLLMs were never stress-tested for it.
-----
That’s why we introduced a synthetic benchmark to stress-test DLLMs.
We call it ParallelBench.
The idea is simple: these are natural language tasks, but carefully designed so that parallel generation is inherently difficult.
(Think “New City”, but more natural, real tasks.)
What did we find?
We tested popular DLLMs with various decoding algorithms, and none came close to “oracle” performance: the performance you’d get if the model could optimally adjust its parallelism during decoding.
-----
Takeaway:
(1) Parallel generation is not always possible; check out our paper for more details :)
(2) If you can design a DLLM that matches oracle performance on our benchmark, well, who knows, you might just get a call from someone in Menlo Park. 😉