Solving LLM Agent Alignment @ Parlant.io

Small language models (SLMs) are definitely the way to go. But what many are missing is that SLMs can't handle *nearly* the diversity and complexity that larger models can. Our 2 full-time NLP researchers are currently working day and night on getting SLMs aligned with larger models. SFT, RLVR, hybrid approaches - what have you... (P.S. stay tuned for upcoming announcements.)

Here's what we've found.

⚠️ Small models fail miserably at broad, fuzzy tasks - especially those with diverse inputs. They do quite well, however, at narrow, specific ones.

➡️ A 14b model can be great at one specific thing: "Tell me whether the following observation is true with respect to the current state of this conversation..."

➡️ A 7b model will need further breakdown - "Given ***this particular category*** of observations that you're trained on, does this observation hold in this conversation?"

➡️ A 3b model will need even more categories.

💡 The lower you go in terms of model size, the more SLMs you need to manage and route between to replace your LLM setup. Yet it's so worth it in terms of cost and latency. If you play your cards right, you can get a 10-20x cost reduction and 2-10x latency improvement.

But here's the key point. What I'm saying here can *only be leveraged* if your agentic architecture lends itself to this type of task decomposition.

In Parlant, for example, we don't have one LLM request doing "aligned conversation" - unlike ~99% of CAI solutions. We have six specialized categories just for guideline matching, different categories of tool calls, different ways to compose an aligned response message, and a separate stage for selecting a canned response (when they're used). Each is a narrow task. Each often uses a different model. Most importantly, each can be fine-tuned independently.

Now, suppose your agentic architecture treats the LLM as a magic black box that handles everything (e.g., general "planning" and "responding"). In that case, you won't be able to train SLMs, and you won't enjoy the cost and latency slashes they can deliver.

So build pedantically, obsessing over the intricacies of each sub-task in your system. If a task is sufficiently well-defined, you'll be able to train an SLM for it. Or stay locked into expensive models for years to come. ☹️
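To make the decomposition concrete, here's a minimal, hypothetical sketch (not Parlant's actual API - the sub-task names, model labels, and `call_model` stub are all assumptions) of routing narrow, well-defined sub-tasks to the smallest model that handles each reliably:

```python
# Hypothetical sketch: each narrow sub-task is mapped to the smallest model
# that handles it reliably. Names, model labels, and call_model are stand-ins.
from dataclasses import dataclass

@dataclass
class SubTask:
    name: str
    prompt_template: str
    model: str  # smallest model observed to handle this narrow task well

SUBTASKS = {
    # Broad observation checking: needs a mid-size SLM
    "observation_check": SubTask(
        name="observation_check",
        prompt_template="Is the following observation true given the conversation?\n{observation}",
        model="slm-14b",
    ),
    # Narrower, category-specific check: a smaller model suffices
    "refund_observation_check": SubTask(
        name="refund_observation_check",
        prompt_template="For refund-related observations only: does this hold?\n{observation}",
        model="slm-7b",
    ),
}

def call_model(model: str, prompt: str) -> str:
    """Stub for whatever inference client you actually use."""
    return f"[{model}] would answer: {prompt[:40]}..."

def run_subtask(task_key: str, observation: str, conversation: str) -> str:
    task = SUBTASKS[task_key]
    prompt = task.prompt_template.format(observation=observation)
    return call_model(task.model, f"{prompt}\n\nConversation:\n{conversation}")

print(run_subtask(
    "refund_observation_check",
    "The customer already received a refund",
    "User: I want my money back...",
))
```

The point of the sketch is only the shape: the narrower the sub-task, the smaller the model you can route it to, and each entry can be fine-tuned and swapped independently.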
"A $1,000 fine for each violation of a customer-facing AI." California just gave users a private right of action, meaning your customers can now sue you if your AI deceived them. Let's see... 80% accuracy on 1 million monthly conversations... I'm never the arithmetic expert, but I believe this would expose an organization to up to ...wait for it... *200 MILLION USD PER MONTH* in fines. Yikes. Yet most agentic solutions (especially those built in-house) still treat compliance as an afterthought. "Human reps aren't perfect either, so it's okay if the AI doesn't work," you say. Well, practice saying it, because soon enough, you might need to say it in court. 🤦 Look, it's just like with security: you can't bolt compliance on at the end. You need to: 1. Seriously understand the implications 2. Carefully define the accepted range of mistakes (not just frequency, but their potential severity) 3. Logically demonstrate that your solution cannot go outside this range 4. Architect around this approach from day one The good news is that Parlant will be there for you as soon as you're ready to come to terms with this reality - even if you're building in-house (long live open-source) :) cooley.com/news/insight/2025…
Yam Marcovic retweeted
I recently compared Parlant and LangGraph (the original post is quoted below). One of the most frequent questions readers asked was: "Isn't it possible to create a fanout graph in LangGraph that performs parallel guideline matching, like Parlant does?"

Yes, but it misses the point. While you can create any type of execution model with a generic graph, that doesn't actually help you implement the complexities of what a good guideline-matching graph does. Guideline matching goes far beyond a simple fanout graph or parallel LLM execution.

Parlant has a detailed post explaining what production-grade guideline matching truly is. You'll see why it requires more than just a fanout code snippet. This is one of the deepest context engineering case studies I've seen. Worth reading!

I've shared the link in the replies!
Every LangGraph user I know is making the same mistake!

They all use the popular supervisor pattern to build conversational agents. The pattern defines a supervisor agent that analyzes incoming queries and routes them to specialized sub-agents. Each sub-agent handles a specific domain (returns, billing, technical support) with its own system prompt. This works beautifully when there's a clear separation of concerns.

The problem is that it always selects just one route. For instance, if a customer asks: "I need to return this laptop. Also, what's your warranty on replacements?" The supervisor routes this to the Returns Agent, which knows returns perfectly but has no idea about warranties. So it either ignores the warranty question, admits it can't help, or, even worse, hallucinates an answer. None of these options are desirable.

This gets worse as conversations progress, because real users don't think categorically. They mix topics, jump between contexts, and still expect the agent to keep up. This isn't a bug you can fix, since this is fundamentally how router patterns work.

Now, let's see how we can solve this problem. Instead of routing between agents, first define some Guidelines. Think of Guidelines as modular pieces of instructions, like this:

```
agent.create_guideline(
    condition="Customer asks about refunds",
    action="Check order status first to see if eligible",
    tools=[check_order_status],
)
```

Each guideline has two parts:
- Condition: When does it get activated?
- Action: What should the agent do?

Based on the user's query, relevant guidelines are dynamically loaded into the agent's context. For instance, when a customer asks about returns AND warranties, both guidelines get loaded into context simultaneously, enabling coherent responses across multiple topics without artificial separation.

This approach is actually implemented in Parlant - a recently trending open-source framework (15k+ stars). Instead of routing between specialized agents, Parlant uses dynamic guideline matching. At each turn, it evaluates ALL your guidelines and loads only the relevant ones, maintaining coherent flow across different topics. You can see the full implementation and try it yourself.

That said, LangGraph and Parlant are not competitors. LangGraph is excellent for workflow automation where you need precise control over execution flow. Parlant is designed for free-form conversation where users don't follow scripts.

The best part? They work together beautifully. LangGraph can handle complex retrieval workflows inside Parlant tools, giving you conversational coherence from Parlant and powerful orchestration from LangGraph.

I have shared the repo in the replies!
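To see how this plays out for the laptop example, here's a minimal sketch that only extends the `create_guideline()` call shape shown in the post above. The `agent` setup is omitted, and `check_order_status` / `get_warranty_policy` are hypothetical stub tools, not part of any real integration:

```python
# Sketch only: reuses the create_guideline() call shape shown above.
# `agent` is assumed to be an already-constructed agent (setup omitted);
# the two tool functions are hypothetical stubs for illustration.
def check_order_status(order_id: str) -> str:
    return "eligible_for_return"            # stub

def get_warranty_policy(product: str) -> str:
    return "1-year replacement warranty"    # stub

agent.create_guideline(
    condition="Customer asks about returns or refunds",
    action="Check order status first to see if the item is eligible",
    tools=[check_order_status],
)

agent.create_guideline(
    condition="Customer asks about warranty coverage",
    action="Explain the warranty policy for the product in question",
    tools=[get_warranty_policy],
)

# "I need to return this laptop. Also, what's your warranty on replacements?"
# Both conditions match, so both guidelines (and their tools) are loaded into
# the same turn's context - no single-route supervisor decision required.
```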
"Context Engineering" 🤖⚙️ is all the rage right now in agentic development. Is it just a fancy new name for prompting, or is there more to it? In this post, I decided to go fully transparent and talk about what it took for us to create a high-quality implementation that enables developers and business stakeholders to work together on customer-facing agents that can achieve real business impact—and do so safely. Dive into what production-grade context engineering really looks like in the real world, inside Parlant's guideline matching engine. - The real challenges that had to be solved - The naive solutions we tried and failed at - The real solutions that started working, but didn't scale - How we optimized them until things fell into place - The consequent, emerging architecture of a production-grade agentic context-engineering engine We look at all this as a community effort in getting customer-facing agents under control - we're just part of this community. Would love to hear your feedback - feel free to DM me for deeper discussions. parlant.io/blog/inside-parla…
When I was a software architect at Microsoft Azure, SLAs were critical. Our team had to guarantee the number of nines our infrastructure would deliver in multiple aspects of service quality (e.g., 99.999% is "five nines"). To put this in perspective, in terms of time, five nines means *up to* about 315 seconds of service degradation per year.

But somehow we've now normalized this debate on whether an 80% SLA is suddenly "good enough" for AI serving customers in high-involvement communications. What gives?

If there's one main thing I've learned from working with large enterprises on customer-facing AI, it's this:

🔔 Bad service is objectively worse than no service.

➡️ Would you rather have your banking app go down for 30 seconds, or have it execute one unauthorized transaction?
➡️ Would you rather have your support chat be unavailable, or have it give one customer medically dangerous advice?

When your service is down, customers know something is wrong. They wait. They retry. They find alternatives. The failure mode is transparent.

When your service is bad, customers act on incorrect information. They follow advice that hurts them. They unwittingly trigger actions that get them or your business in trouble. And they're right to blame you for it—who else would they blame?

So how come 20% bad service is acceptable while 0.01% no service isn't? 🤔 It's not... and saying it is just sets everyone up for failure. Spread the word!

thedailyjuice.net/general-mo…
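In case the conversion isn't familiar, here's the quick math behind "five nines ≈ 315 seconds per year" (using a 365-day year):

```python
# Quick check of the "five nines" figure above (365-day year).
seconds_per_year = 365 * 24 * 3600                # 31,536,000
availability = 0.99999                             # "five nines"
allowed_degradation = seconds_per_year * (1 - availability)
print(f"~{allowed_degradation:.0f} seconds of degradation per year")  # ~315
```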
Here's a "LangChain vs LlamaIndex" comparison. Whose RAG is more accurate?

Guess what? 🔔 Ding-ding! RAG accuracy is not the main issue. Ding-ding again.

We're working with people who deploy agents handling hundreds of thousands to millions of customer interactions per month. Retrieval accuracy alone doesn't create customer trust or engagement. It's a component.

And here's the other component nobody talks about. If the conversation isn't managed and steered confidently and authoritatively, in a way that creates trust for your customer:

1. They won't trust your answers even if they're accurate
2. They often won't even get to the point where they ask their important questions—not before they escalate the chat to a human rep

If you want great production metrics, sure, go ahead and spend half of your time getting your RAG accurate. But to make a real difference, spend the other half optimizing domain-specific conversational steering and governance, focused on understanding your customers' needs and interaction patterns.

latenode.com/blog/platform-c…
On a simple level, Parlant seeks to solve one problem: getting conversational LLM agents under control. That can make it sound like a feature - like the missing piece for many existing frameworks (either blackbox or open-source). Why it actually isn't a feature becomes clearer once we start understanding the problem on a deeper level.

Two years ago, disruptive AI vendors promised the omniscient LLM that can do everything and replace anything. Now, some parts of the market, particularly certain vendors, still haven't given up on selling this illusion, even though most buyers and users have already awakened to it.

I've come to the conclusion that every time we (humans) face a new technological breakthrough, we're quick to assume that humanity has been decoded. For example, when mass construction got popular, we understood ourselves as such: "He's got a loose screw." When steam engines came out, "He ran out of steam." With electronics, "He burnt a fuse," and so on.

Just like we eventually understood that those things are poor "imitations" and are far from capturing the complexity of human faculties, more people are starting to understand that LLMs - in some vague, odd way - *only awkwardly capture some parts* of what it means to be and think like a human.

But most LLM frameworks are still stuck on the "omniscient LLM" illusion: that it will fill in the gaps and figure out how to connect the dots. Thus, the premise of today's many generic "LLM app frameworks" follows this reasoning:

1. You've got this "thing" that can do everything.
2. There are certain best practices on how to configure it - "we'll do that for you."
3. It needs to connect to your resources - "we'll provide quick integrations for you."
4. Done. The "thing" does the rest.

This is essentially why they are called "wrappers." In reality, the "thing" _so_ doesn't do "the rest."

Circling back, Parlant provides control and power over one thing in particular: Conversation Dynamics. It doesn't get your LLM to know the answer to complex questions. It doesn't make it easier for you to parse your knowledge bases and keep them up to date. Instead, it solves a (different) clear business problem: it allows you to make your agent a controlled, consistent, and effective communicator for your organization - according to your rules, however many you have.

lnkd.in/dD3GAKYe
"Let X." When you take this first step in a mathematical proposition, you've created something with infinite potential. X could be anything. At the same time, until you start adding constraints ("such that X is a natural number," "X such that some condition holds"), while it holds infinite potential, there's actually nothing you can discover about it or do with it. It's infinitely shallow. I've spent my career building frameworks, from low-level network protocols, hard-realtime pub/sub, to cloud platforms at Microsoft, and this same principle applies to framework design. The more assumptions your framework can make about its use case, the more powerful and optimized it becomes for that use case compared to generic solutions. 👉 Adding constraints and specializations isn't a curse. It's actually a blessing. The only question that matters once you've understood that is: do your chosen constraints leave you with a real and significant use case? With Parlant, for example, we've focused on the conversational, customer-facing use case. And we've discovered (and continuing to discover) enormous complexity in the world of chaotic semantics, which is the world of conversations. IMO, it's by far the solution most able to tame this chaos in today's market (though this is just the start). I'm saying this because many people recently, who love Parlant's performance in conversational control, have asked us to apply it to the world of automation agents. To that I say: it's a great idea, and I'd be happy to support anyone who goes down this path and share them what we've learned on getting LLMs under control. But I also know this: the world of conversational semantics is extremely complex. Getting it under the necessary level of control is the mission of a dedicated company, not a side hustle. If you're in this market - mark these words :) So we'll continue to delve deeper into the significance of controlling AI conversations at the largest scales. Because the more we dive deep, the more complexity and power we discover within it. Someone should do the same for workflow automation agents, since there still isn't a single framework I know of that's really taking control and reliability seriously enough in that area.
20 years ago, I wrote my own first web MVC app framework, from scratch. But there were already dozens around, and they each did the exact same thing. The main difference was that each thought a particular set of functions (which they all implemented the same way) sounded cooler with this or that name or naming convention.

And I can't help but notice that what's happening with AI agent frameworks today is just like what happened then. A new framework comes up with the value prop of "a slightly more aesthetic API" without tackling any real technical problem. Is that what makes a framework valuable..? Or is it that it comes packed with 100 integrations for external libraries (vector DBs, LLM APIs, etc.) — while, incidentally, each of these "integrations" is a limited abstraction over the external libraries' powerful APIs: ones that their designers put much thought into so they may solve many problems and edge cases?

There's a difference between "simplifying the problem" and being overly simplistic due to lack of experience and understanding. The latter is done by the inexperienced, but the former can only be done with deep expertise and understanding. "To simplify" something complex — properly — is really, really hard.

In software design, if we start from the (superficial level of) aesthetics — rather than from a real problem — we end up with an API that breaks the minute you deviate one step from the "getting started" tutorial. Seen it too many times.

Here's what actually matters when evaluating a framework:

// Purpose - what important challenges does this address that I can't easily solve myself?
// Design - does it make the hard things possible, or does it just make the easy things slightly easier?

ideas2it.com/blogs/ai-agent-…
After recent calls with large-scale users of Parlant (and some new leads who are exploring it after experiencing issues in their customer-facing agents), I've found that so much of what they struggle with comes down to the Supervisor Pattern... This is why I'm writing about it so much.

But recently, an idea for a new agent architecture came up that shows a lot of promise.

The initial problem is this: when you genuinely have complex, distinct departments in your AI support system, trying to cram everything into one omni-agent becomes unmanageable. It pretends to be one agent while fragmenting the conversation and mixing up contexts. Don't get me wrong... an omni-agent is a really cool moonshot concept. But it (unfortunately) fails due to hard technical limitations.

But there's another option: just build it like real customer service. You call support. A receptionist briefly clarifies what you need, then routes you *once* to the right department. Then an expert in that department handles your *entire* conversation.

I call it the "Receptionist Pattern," and it's how agentic architectures *should* handle complex support use cases.

The great thing about it is that it's so grounded in reality that:

1. Customers already understand it, so there's a natural alignment of expectation vs. reality in their usage patterns. The friction is minimal.
2. Operators/developers already understand it, which makes it so much easier to model agentic flows (and even development teams) around it.

In practice (see the sketch below):

- Simple routing at the entry point (not mid-conversation)
- Each department is a separate expert agent with full context (using department-specific dynamic context assembly)
- One coherent conversation per department
- If you need a different department, you need an explicit transfer (just like real service)

The alignment between expectation (which is based on existing habits) and reality (your product's UX) is what makes your AI feel natural and usable. Not the sci-fi features. Not sub-second latencies. Just providing better, quicker, and more manageable customer service.
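Here's a rough, hypothetical sketch of the Receptionist Pattern's shape - route once at the entry point, then let one department agent own the whole conversation. All names (`DepartmentAgent`, `receptionist_route`) are illustrative, not a Parlant API:

```python
# Hypothetical sketch of the "Receptionist Pattern" described above.
# Names and structure are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DepartmentAgent:
    name: str
    history: list[str] = field(default_factory=list)

    def handle(self, message: str) -> str:
        self.history.append(message)      # full conversation context stays in one place
        return f"[{self.name}] handling: {message}"

DEPARTMENTS = {
    "returns": DepartmentAgent("Returns"),
    "billing": DepartmentAgent("Billing"),
}

def receptionist_route(first_message: str) -> DepartmentAgent:
    """Route once, at the entry point, based on the opening request."""
    key = "billing" if "invoice" in first_message.lower() else "returns"
    return DEPARTMENTS[key]

# One routing decision, then the chosen department owns the whole conversation.
dept = receptionist_route("I was charged twice on my invoice")
print(dept.handle("I was charged twice on my invoice"))
print(dept.handle("Can you also update my card?"))   # stays with Billing
# Moving to another department would be an explicit, user-visible transfer (not shown).
```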
Graphs and flowcharts are a broken model for real conversations. It's a mistake to build your platform on this shaky foundation. parlant.io/blog/parlant-vs-l…
But real conversations don't respect our logical graph. They're messy: they're shaped by emotions, spontaneity, and attention deficit. The whole point of supporting "natural conversations" is dynamic context assembly, not rigid routing.
Here's the core design flaw. Spread the word around: 👉 When you route to a specialized node for topic X, that node is inherently ungrounded for topic Y.
1. Ignore the warranty question (bad UX)
2. Acknowledge you can't answer it (confusing and incoherent: the system can answer it, just routed wrong)
3. Hallucinate the warranty part of the answer (dangerous and all too common)
4. Force topic serialization: "Let me complete your return first, then we'll discuss whether your warranty will cover it." (kills trust; begs for human escalation)

And it gets even worse when topics intertwine throughout the conversation, which they do...
The Returns Agent handles returns perfectly. But warranties are the Warranty Agent's domain. So your architecture leaves your agent stuck with nothing but bad options...
Here's what happens when a customer says: "I need to return this laptop. What's your warranty on replacements?" Your router picks one node. Let's say the returns agent. Now what?
I've been talking to several teams that spent months building supervisor-pattern agents, and they all hit the same wall once they deploy: the commonly used "supervisor" architecture is fundamentally broken for real conversation.
99% of conversational agents today are built the wrong way. 🤷 Everyone reaches for the supervisor pattern: with a router at the top and specialized agents below (returns agent, billing agent, warranty agent). It looks clean and modular, but it breaks in production.
Once you've *bounded severity*, you can actually deploy your agent. Then you improve frequency through iteration from that safe & solid foundation. Compliance first. Optimization second. parlant.io/blog/how-parlant-…