Does anyone else feel evaluation/grounding is as hard as building the agent?
I helped a friend build an onboarding Q&A agent (“ask how we do X on Team Y, get the right steps + links”). The demo was shiny. The day after, I realized evaluation/grounding isn’t a checkbox; it’s the job. Nothing exploded. Instead there was a slow drip of “almost-right” answers: it quoted last quarter’s PTO pilot because the doc changed mid-week, plus other slight misses. None of them felt dramatic, just slippery. Without tight eval/grounding, the agent isn’t stable enough to trust.
What I learned (small, boring, effective):
1. Prompts are model-specific. A prompt that lifts Model A can tank Model B. If you swap models, re-run your eval set and re-optimize the prompt for the new model before trusting any results (rough sketch after the list).
2. Mirror-prod staging. Spin up a staging environment that mirrors production, so eval results actually predict what users will hit.
3. Extensive tests (not vibes). LLMs also work well as “user simulators” that fuzz phrasing and surface brittle prompts (second sketch below).
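
For (1), the habit that stuck was keeping a small fixed test set and re-running the prompt grid whenever the model changes. A rough sketch of what I mean, where `call_model`, the test cases, and the substring scoring are all placeholders for whatever client and scoring you actually use:

```python
# Sketch of "re-run the eval when you swap models or prompts".
# Everything here is illustrative: swap in your own client, test cases, and scorer.

TEST_CASES = [
    # (question, substring a correctly grounded answer should contain)
    ("How do I request PTO?", "current PTO policy"),
    ("Where do the deploy runbooks live?", "runbooks"),
]

PROMPT_VARIANTS = {
    "v1_terse": "Answer using only the linked docs. Cite the doc title.",
    "v2_steps": "Think step by step, then answer using only the linked docs.",
}

def call_model(model: str, system_prompt: str, question: str) -> str:
    """Placeholder: replace with your actual model client."""
    raise NotImplementedError

def score(answer: str, must_contain: str) -> int:
    """Crude grounding check: does the expected fact appear at all?"""
    return int(must_contain.lower() in answer.lower())

def evaluate(model: str) -> dict[str, float]:
    """Score every prompt variant for one model on the fixed test set."""
    results = {}
    for name, prompt in PROMPT_VARIANTS.items():
        total = sum(
            score(call_model(model, prompt, q), expected)
            for q, expected in TEST_CASES
        )
        results[name] = total / len(TEST_CASES)
    return results

# Run the same grid per candidate model and pick the prompt per model;
# don't assume the winner on Model A transfers to Model B.
# for model in ["model-a", "model-b"]:
#     print(model, evaluate(model))
```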
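
For (3), the user-simulator idea is just: ask a cheap model to rephrase each test question the way a new hire actually would, run the paraphrases through the agent, and flag the ones that lose the grounded fact. Again a sketch, with `call_agent` and `call_simulator` as stand-ins for your own clients:

```python
# Sketch of LLM-as-user-simulator fuzzing. `call_agent` and `call_simulator`
# are placeholders; the pass/fail check reuses the same substring idea as above.

def call_agent(question: str) -> str:
    """Placeholder: your onboarding Q&A agent."""
    raise NotImplementedError

def call_simulator(instruction: str) -> list[str]:
    """Placeholder: a cheap model prompted to return several paraphrases."""
    raise NotImplementedError

def fuzz_case(question: str, must_contain: str, n: int = 5) -> list[str]:
    """Return the paraphrases that made the agent drop the expected fact."""
    paraphrases = call_simulator(
        f"Rewrite this question {n} different ways a new hire might ask it: {question}"
    )
    failures = []
    for p in paraphrases:
        answer = call_agent(p)
        if must_contain.lower() not in answer.lower():
            failures.append(p)
    return failures

# Brittle prompts show up as clusters of failing paraphrases:
# fuzz_case("How do I request PTO?", "current PTO policy")
```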
Curious what it’s like for others: does this feel familiar? What evaluation/grounding habit actually made your agents stick in the real world? And are there ways to scale evaluation?