Small language models (SLMs) are definitely the way to go. But what many are missing is that SLMs can't handle *nearly* the diversity and complexity that larger models can.
Our 2 full-time NLP researchers are currently working day and night on getting SLMs aligned with larger models. SFT, RLVR, hybrid approaches - what have you... (P.S. stay tuned for upcoming announcements).
Here's what we've found.
⚠️ Small models fail miserably at broad, fuzzy tasks - especially those with diverse inputs. They do quite well, however, at narrow, specific ones.
➡️ A 14b model can be great at one specific thing: "Tell me whether the following observation is true with respect to the current state of this conversation..."
➡️ A 7b model will need further breakdown - "Given ***this particular category*** of observations that you're trained on, does this observation hold in this conversation?"
➡️ A 3b model will need even more categories.
💡 The smaller your models, the more SLMs you need to manage and route between in order to replace your LLM setup.
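To make that concrete, here's a minimal sketch of category-based routing in Python. The category names and model IDs are hypothetical, and this isn't Parlant code; it just shows the shape of the idea: each narrow check gets routed to the smallest model trained for it, with a big model as the fallback.

```python
# Illustrative sketch only: route each observation check to a small model
# fine-tuned for its category, instead of sending everything to one large model.
# Category names and model IDs below are made up for the example.

from dataclasses import dataclass


@dataclass
class ObservationCheck:
    category: str      # e.g. "user_intent", "tool_precondition", "policy"
    observation: str
    conversation: str


# One narrowly fine-tuned SLM per observation category (hypothetical IDs).
CATEGORY_MODELS = {
    "user_intent": "slm-3b-user-intent",
    "tool_precondition": "slm-3b-tool-precondition",
    "policy": "slm-3b-policy",
}

# Categories no SLM has been trained on yet still go to a large general model.
FALLBACK_MODEL = "llm-70b-general"


def route(check: ObservationCheck) -> str:
    """Pick the smallest model that was trained for this narrow task."""
    return CATEGORY_MODELS.get(check.category, FALLBACK_MODEL)


def build_prompt(check: ObservationCheck) -> str:
    # The narrower the category, the simpler (and cheaper) the prompt can be.
    return (
        f"Conversation:\n{check.conversation}\n\n"
        f"Does this observation hold? {check.observation}\n"
        "Answer yes or no."
    )
```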
Yet it's so worth it in terms of cost and latency. If you play your cards right, you can get a 10-20x cost reduction and 2-10x latency improvements.
But here's the key point. What I'm saying here can *only be leveraged* if your agentic architecture lends itself to this type of task decomposition.
In Parlant, for example, we don't have one LLM request doing "aligned conversation" - unlike ~99% of CAI solutions.
We have six specialized categories just for guideline matching, separate categories for tool calls, multiple ways to compose an aligned response message, and a distinct stage for selecting a canned response (when canned responses are used).
Each is a narrow task. Each often uses a different model. Most importantly, each can be fine-tuned independently.
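For illustration, here's roughly what such a decomposed pipeline can look like. The stage names, model IDs, and the `run_model` hook are assumptions made for this sketch, not Parlant's actual internals; the point is that each stage is a narrow request bound to its own model.

```python
# Illustrative only: a decomposed pipeline where each narrow stage is bound to
# its own model and can be fine-tuned or replaced independently.
# Stage names and model IDs are hypothetical.

from typing import Callable

Stage = Callable[[dict], dict]


def make_stage(name: str, model_id: str,
               run_model: Callable[[str, dict], dict]) -> Stage:
    def stage(state: dict) -> dict:
        # Each stage sends a narrow, well-defined request to its own model
        # and adds its result to the shared state.
        result = run_model(model_id, {"stage": name, "state": state})
        return {**state, name: result}
    return stage


def build_pipeline(run_model: Callable[[str, dict], dict]) -> list[Stage]:
    return [
        make_stage("guideline_matching", "slm-7b-guidelines", run_model),
        make_stage("tool_call_selection", "slm-7b-tools", run_model),
        make_stage("response_composition", "llm-14b-responder", run_model),
        make_stage("canned_response_selection", "slm-3b-canned", run_model),
    ]


def respond(state: dict, run_model: Callable[[str, dict], dict]) -> dict:
    for stage in build_pipeline(run_model):
        state = stage(state)
    return state
```

Swapping "slm-7b-guidelines" for a newly fine-tuned 3B checkpoint is then a one-line change, which is exactly what makes independent fine-tuning of each sub-task practical.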
Now, suppose your agentic architecture treats the LLM as a magic black box that handles everything (e.g., general "planning" and "responding").
In that case, you won't be able to train SLMs, and you won't enjoy the dramatic cost and latency reductions they can deliver.
So build pedantically, obsessing over the intricacies of each sub-task in your system. If a task is sufficiently well-defined, you'll be able to train an SLM for it. Otherwise, you'll stay locked into expensive models for years to come. ☹️