Excited to share these two papers on parallelizing MCMC and theory for parallelizing nonlinear sequence models! This is a fun line of work and there is lots more to explore in these areas.
Collabs with @SkylerWu9, @xavierjgonzalez, @Leokoz8, Ken Clarkson & @scott_linderman
📣Announcing 2 @NeurIPSConf papers!
"Parallelizing MCMC Across the Sequence Length": uses Newton iterations to parallelize MCMC! 🤯
But can we parallelize any nonlinear state space model?
"Predictability Enables Parallelizability": proves what systems we can parallelize. 🧵
Great collaborations with awesome people! I am really excited about this line of work. IMHO this is a fundamentally new way to think about evaluating nonlinear dynamical systems (nonlinear SSMs), and one that can lead to massive practical speedups on GPU.
Parallelizing "inherently sequential" processes has become all the rage in the era of GPUs.
But can we really parallelize anything?
In work led by me and @Leokoz8, we show that the "predictability" of a dynamical system determines whether it can be parallelized efficiently.
@dmzoltowski and @SkylerWu9 put the theory into practice by parallelizing Markov chain Monte Carlo! Turns out MCMC chains are often predictable in practice!
Really exciting work, with clear implications. And yet still so much to discover and explore!
We can use the ungulates (quasi-DEER and more!) to parallelize MCMC (arxiv.org/pdf/2508.18413).
And we have theoretical insight as to which SSMs we can parallelize, and how we should design them (arxiv.org/pdf/2508.16817)
Bright future for parallel nonlinear dynamics! [17/17]
For example, here's a predictable (blue) and unpredictable (red) RNN, evaluated sequentially (top) and in parallel (bottom).
Each parallel step is slower than a sequential one! So we need to converge in a small # of them [15/17]
Consequences:
- predictable dynamics, LLE < 0: as T grows, the PL constant converges to a nonzero limit -> can parallelize quickly ✅
- unpredictable dynamics, LLE > 0: as T grows, the PL constant goes to zero exponentially quickly -> sequential is faster than parallel ❌
[14/17]
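To make both bullets concrete, here is a toy scalar example (my own illustration under simplifying assumptions, not the paper's theorem): the linear chain x_t = a x_{t-1}, whose LLE is λ = log|a|. Its merit function is quadratic, so the PL constant can be computed directly:

```latex
\[
\mathcal{L}(x_{1:T}) \;=\; \tfrac12 \sum_{t=1}^{T} (x_t - a\,x_{t-1})^2 \;=\; \tfrac12 \|Bx - c\|^2,
\qquad
\mu \;=\; \sigma_{\min}(B)^2,
\]
% B is lower bidiagonal (1 on the diagonal, -a on the subdiagonal); \lambda = \log|a| is the LLE.
\[
|a| < 1 \ (\lambda < 0):\quad \mu \ \ge\ (1-|a|)^2 \ \ \text{(no dependence on } T\text{)},
\qquad
|a| > 1 \ (\lambda > 0):\quad \mu \ \le\ |a|^{-2(T-1)} \;=\; e^{-2\lambda(T-1)}.
\]
```

The theorem in the next tweet is the nonlinear generalization of this picture.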
We proved that the merit function's PL constant is bounded above and below by a simple function of the LLE!
We were surprised to find such a tight link between dynamics and optimization! [13/17]
We connect "predictability" and merit geometry rigorously.
For "predictability", the Largest Lyapunov Exponent (LLE) measures how quickly trajectories diverge.
For "flatness", we use the PL constant, which controls how small the gradient can get. [12/17]
What is predictability?
Say we give our state a nudge. Does that nudge blow up or go away?
If nudges blow up, it's hard to predict the future! Think chaos, butterfly effect, the weather.
If nudges go away, we can use nearby states to predict future dynamics! [11/17]
Main finding: predictable dynamics lead to well-conditioned merit landscapes. Easy to optimize ✅
Unpredictable dynamics lead to poorly-conditioned merit landscapes with flat, narrow valleys 😨. These are hard to optimize ❌
Look at this diagram that comes from 2 RNNs! [10/17]
We provide a sharp characterization of when you can and cannot parallelize a nonlinear SSM.
Plus a recipe for designing new fully nonlinear architectures that are parallelizable by design! [9/17]
We are only faster than sequential if DEER converges in a small # of iters.
Why does DEER converge quickly in MCMC?
Empirically, it turns out that MCMC dynamics are often contractive.
Time for our theory contribution: "predictability enables parallelizability"! [8/17]
Our contribution also opens the door to a variety of new approaches to MCMC, such as:
- stopping the DEER iterations early (approximate MCMC)
- mixing parallelization over sequence length and batch
- and more! [7/17]
Using these techniques, we get order-of-magnitude speed-ups in a variety of standard MCMC examples like
- multimodal distributions
- 8 schools
- Bayesian logistic regression (German Credit Dataset)
- Sentiment classification from LLM embeddings (D=768) [6/17]
But in examples with large state size, DEER scales badly. So, we developed new quasi-Newton iterations that:
- work with ~nondifferentiable~ accept/reject steps
- use Hutchinson’s estimator and JVPs to be extra memory and compute efficient
- and more! [5/17]
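To give a flavor of the Hutchinson + JVP trick (a minimal sketch, not the papers' code; the function name and toy dynamics are mine, and I'm assuming the quantity being probed is the Jacobian diagonal of a step): a handful of Rademacher probes and forward-mode JVPs estimate diag(∂f/∂x) without ever forming the full Jacobian.

```python
import jax
import jax.numpy as jnp

def hutchinson_jacobian_diag(f, x, key, num_probes=8):
    """Estimate diag(∂f/∂x) at x via Hutchinson-style probing:
    E[v * (J v)] for Rademacher v, with J v computed by a single JVP,
    so the full Jacobian is never materialized."""
    def one_probe(k):
        v = jax.random.rademacher(k, shape=x.shape, dtype=x.dtype)
        _, jv = jax.jvp(f, (x,), (v,))   # forward-mode product J @ v
        return v * jv                    # unbiased sample of the Jacobian diagonal
    keys = jax.random.split(key, num_probes)
    return jnp.mean(jax.vmap(one_probe)(keys), axis=0)

# Toy usage on a made-up nonlinear step (illustration only):
f = lambda x: jnp.tanh(x) + 0.1 * jnp.roll(x, 1)
diag_est = hutchinson_jacobian_diag(f, jnp.ones(768), jax.random.PRNGKey(0))
```

Each probe costs roughly one extra function evaluation and O(D) memory, which is the kind of saving that matters for large-state chains like the D=768 example above.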
We can put many of the major MCMC workhorses into state space form. These include
- Gibbs
- MALA
- HMC
We can then apply DEER to parallelize! DEER is Gauss-Newton optimization applied to an appropriately defined merit function. [4/17]
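A sketch of what "state space form" can mean here (my own minimal illustration, with hypothetical names like mala_step and log_prob; the papers' formulation may differ): pre-draw all of the randomness, and each MCMC step becomes a deterministic map of the previous state, i.e., a nonlinear SSM.

```python
import jax
import jax.numpy as jnp

def mala_step(x, noise, log_unif, log_prob, step_size=0.1):
    """One MALA step written as a deterministic function of the previous state
    and pre-drawn randomness (Gaussian proposal noise + a log-uniform for the
    accept/reject). With the randomness fixed, the chain x_0 -> x_1 -> ... is
    a nonlinear state space model."""
    grad = jax.grad(log_prob)
    mean_fwd = x + step_size * grad(x)                      # Langevin proposal mean
    prop = mean_fwd + jnp.sqrt(2.0 * step_size) * noise     # proposed state
    mean_rev = prop + step_size * grad(prop)

    # Metropolis-Hastings log acceptance ratio for the asymmetric proposal.
    log_q = lambda y, mean: -jnp.sum((y - mean) ** 2) / (4.0 * step_size)
    log_alpha = (log_prob(prop) + log_q(x, mean_rev)
                 - log_prob(x) - log_q(prop, mean_fwd))

    accept = log_unif < log_alpha            # nondifferentiable accept/reject step
    return jnp.where(accept, prop, x)
```

Scanning this step sequentially is just ordinary MALA; DEER instead guesses the whole trajectory and refines it in parallel, as sketched further down.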
People used to think MCMC was inherently sequential—so how were we able to parallelize it?
The 🪄magic is the DEER/DeepPCR algorithm.
DEER works by making a guess for *all* the states over the entire trajectory, and then iteratively refining them until convergence. [3/17]
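Here's a minimal sketch of that guess-and-refine loop for a generic step x_t = f(x_{t-1}) (a simplified illustration of the DEER idea, with hypothetical names; it assumes full Jacobians and a fixed number of iterations): each refinement linearizes the dynamics around the current guess and solves the resulting linear recurrence for every time step at once with a parallel associative scan.

```python
import jax
import jax.numpy as jnp

def deer_refine(f, x_guess, x0, num_iters=10):
    """Refine a guess for the whole trajectory x_{1:T} of x_t = f(x_{t-1}).
    Each iteration linearizes f around the current guess and solves the resulting
    *linear* recurrence for all time steps at once with an associative scan
    (the Newton / Gauss-Newton view of DEER, in simplified form)."""

    def one_iteration(x, _):
        prev = jnp.concatenate([x0[None], x[:-1]], axis=0)  # x_{t-1} under the current guess
        f_prev = jax.vmap(f)(prev)                          # f(x_{t-1})
        A = jax.vmap(jax.jacobian(f))(prev)                 # Jacobians of f, shape (T, D, D)
        b = f_prev - jnp.einsum("tij,tj->ti", A, prev)      # offsets of the linearization
        b = b.at[0].add(A[0] @ x0)                          # fold the known initial state in

        # Solve x_t = A_t x_{t-1} + b_t for all t in O(log T) parallel depth.
        def compose(left, right):
            A_l, b_l = left
            A_r, b_r = right
            return (jnp.einsum("...ij,...jk->...ik", A_r, A_l),
                    jnp.einsum("...ij,...j->...i", A_r, b_l) + b_r)

        _, x_new = jax.lax.associative_scan(compose, (A, b))
        return x_new, None

    # In practice you would stop once the update is small; a fixed count keeps the sketch simple.
    x_final, _ = jax.lax.scan(one_iteration, x_guess, None, length=num_iters)
    return x_final
```

In this naive form the per-step D×D Jacobians are expensive for large states, which is exactly the regime where the quasi-Newton and Hutchinson tricks above come in.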