Co-founder, CEO @GentraceAI. Proud ice cream tester for @hieesuh.

San Francisco, CA
Joined February 2015
Agents are significantly more powerful than standalone LLM calls. But debugging them is a nightmare. You can trace their reasoning and tool use, but traces get huge and are impossible to parse. To solve this, we spent the last several months building Gentrace for Agents, which puts our own agents to work on yours. In Gentrace for Agents, you can: • Chat with AI to debug agent traces • Create smart monitoring columns • Build out tailored evaluations It's like a giant AI-powered spreadsheet over your trace data, with a Cursor-style chat sidecar. If it sounds a little meta, it is, but it's very powerful in practice. We recorded this video to show you how it works. Take a look, and let me know what you think:
.@linear changed their agents feature from "assign to agent" to "delegate to agent." So there is always still a human assignee. This is smart - prevents the issue where agent-assigned issues end up in no-man's land if not merged. (Shown here with @codegen)
Many engineers are becoming "AI junkies" - they use AI blindly, and don't understand what's happening in the code that gets generated. This is even happening to good engineers and eventually leads to loss of critical thinking and engineering ability. Use AI responsibly.
2x this week got the question: "how do I eval my MCP server?" By which they mean: how do I create tools that an agent will know how to use, and make changes to those tools (e.g. descriptions, organization) with confidence. My take: build a super basic agent on top of your MCP and eval it on common user queries across a handful of models. E.g. a task management company would eval the agent's ability to take a blob of notes and create the right tasks. A failure would be missing or "bad" (irrelevant, wrong data attributes) tasks (measured using LLM-as-a-judge). How do other people think we should eval MCP servers?
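Here's a rough Python sketch of what that harness could look like. Everything in it is illustrative: `run_agent` and `llm_judge` are stand-ins for your own agent loop over the MCP tools and your actual judge prompt, and the notes/tasks case is made up.

```python
# Sketch of an MCP server eval: run a basic agent over the server's tools
# on common user queries, across a few models, and score with a judge.
# run_agent / llm_judge are placeholders for your own implementations.

from dataclasses import dataclass, field

@dataclass
class Case:
    notes: str  # blob of user notes fed to the agent
    expected_tasks: list[str] = field(default_factory=list)  # tasks a human would create

CASES = [
    Case(
        notes="Standup notes: ship login fix by Friday; ask design for new icons",
        expected_tasks=["Ship login fix (due Friday)", "Request new icons from design"],
    ),
]

MODELS = ["model-a", "model-b"]  # eval across a handful of models

def run_agent(model: str, notes: str) -> list[str]:
    # Placeholder: run your basic agent loop (LLM + MCP tools) and return
    # the tasks it actually created. Canned output so the sketch runs.
    return ["Ship login fix (due Friday)"]

def llm_judge(expected: list[str], actual: list[str]) -> float:
    # Placeholder for LLM-as-a-judge: should flag missing tasks and "bad"
    # tasks (irrelevant, wrong data attributes). Simple overlap stand-in here.
    if not expected:
        return 1.0
    return len(set(expected) & set(actual)) / len(expected)

for model in MODELS:
    scores = [llm_judge(c.expected_tasks, run_agent(model, c.notes)) for c in CASES]
    print(f"{model}: avg score {sum(scores) / len(scores):.2f} over {len(scores)} cases")
```

The structure is the point: a small set of realistic user queries, an agent run per model, and a judge that penalizes missing or "bad" tasks.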
TIL how to use LLMs for nasty merge conflicts: 1. Get the branch point SHA - run `git log` on the feature branch; it's the latest SHA from main 2. Accept the feature branch 3. Run `git diff <branch point sha> main -- <conflict>.ts` 4. Copy the diff into Cursor, "apply this change to this file"
My AI Code Review experience so far: - Cursor BugBot: has only found false positives, not a single real issue yet (sample: 9 PRs) - Graphite Diamond: 1/3 hit rate - worth keeping but not amazing (sample: ~100 PRs) - Both: make ~0.5 comments per PR Any others we should try?
Had a fun chat about evals with Ben on The Chief AI Officer podcast. We discussed: • Why many companies don't have a trustworthy eval stack today • How to create great LLM-as-a-judge evals with an "unfair advantage" • Why "100% accuracy" is often a red flag Links below:
The ghibli stuff is cool and all, but tbh there have been versions of that forever. The "put my <whatever> <wherever>" is what's really blowing my mind.
Everyone is hyped on claude-3.7, but is o3-mini-high the current best? Such a common experience for me: - Ask Cursor (3.7 thinking), "why is this code broken" - Wrong answer - Paste code into ChatGPT (o3-mini-high), ask "why is this broken" - Gets it right on the first try
Most teams start building LLM evals by searching for the perfect "golden" dataset, only to find it goes stale. A better approach: start small, iterate continuously, and connect datasets to real-world data. Here's how to do it right: go.gentrace.ai/fyPOgVM
Feeling deja vu with agent frameworks: they're like the early days of JS frameworks (2009-2015). Everyone's building their own solution, but we haven't found our "React moment" yet. Start simple, build from scratch until patterns emerge.
The biggest trap when building agents that even OpenAI is struggling with: I was testing ChatGPT's search features yesterday and noticed something that keeps coming up: teams obsess over "did it pick the right tool?" while missing the bigger picture. Example: I asked it for the @gentraceai address. It correctly chose the search tool (great!) but then searched for a completely unrelated company (not great!). Tool selection was perfect, execution was useless. This pattern is everywhere: - Agents choose to edit your document but don't update the right section - Search results overflow the context window, losing critical info - Tools fail silently with no recovery strategy The reality is tool selection is just step one. The real engineering challenge is managing these interactions reliably at scale. You need trace validation, context management, and error recovery strategies. What's the worst agent failure mode you've seen?
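To make the "execution is the hard part" point concrete, here's a hypothetical Python sketch of a tool-call wrapper that adds the three things mentioned above: a trace entry per call, a crude context-window budget, and a recovery path so failures don't stay silent. The names (`web_search`, `MAX_TOOL_OUTPUT_CHARS`) are made up for illustration, not a real API.

```python
# Wrap each tool call with tracing, output validation, and a retry/recovery
# path, instead of only checking that the right tool was selected.

import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

MAX_TOOL_OUTPUT_CHARS = 8_000  # crude context-window budget for tool results

def web_search(query: str) -> str:
    """Placeholder tool; swap in your real search call."""
    return json.dumps({"results": [f"stub result for {query!r}"]})

def call_tool_with_recovery(tool, args: dict, retries: int = 2) -> str:
    """Run a tool call with tracing, truncation, and a simple retry strategy."""
    for attempt in range(retries + 1):
        try:
            log.info("tool=%s args=%s attempt=%d", tool.__name__, args, attempt)  # trace entry
            output = tool(**args)
            if not output:
                raise ValueError("empty tool output")  # don't let it fail silently
            if len(output) > MAX_TOOL_OUTPUT_CHARS:
                output = output[:MAX_TOOL_OUTPUT_CHARS]  # protect the context window
            log.info("tool=%s ok (%d chars)", tool.__name__, len(output))
            return output
        except Exception as exc:
            log.warning("tool=%s failed: %s", tool.__name__, exc)
    return f"[{tool.__name__} failed after {retries + 1} attempts]"  # surface the failure to the model

if __name__ == "__main__":
    print(call_tool_with_recovery(web_search, {"query": "Gentrace office address"}))
```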
First impressions using Deep Research: - It's really slow - It generates a very verbose answer full of links / citations - I'm too lazy to click the links and see if it's right, but I kinda don't trust it - Ok I'll stop being lazy and click on the links - The linking is actually really impressive; it took me to a snippet in a recorded podcast and showed me exactly what I was looking for - Ok this is pretty cool! Now I'm wondering, what's the point of Operator? If Deep Research could take actions, wouldn't it be better?
Keep it simple and add new tools incrementally to maintain full test coverage for each new workflow. We're advising this approach to our customers at @gentraceai - lmk what you think!
Here's how I build agents: 1. Start with a basic agent loop + function calling 2. Add comprehensive tracing for each tool call 3. Monitor which tool combos actually get used 4. Add guided workflows to address failing evals 5. Use unguided fallback for edge cases
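For steps 1 and 2, a minimal Python sketch of the shape I mean; `call_llm`, the tool, and the trace format are all stand-ins rather than a real SDK:

```python
# Bare agent loop with function calling (step 1) and one trace entry per
# tool call (step 2). Everything here is illustrative.

import json
from typing import Callable

def get_weather(city: str) -> str:
    """Placeholder tool; swap in real tools."""
    return json.dumps({"city": city, "forecast": "sunny"})

TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}
TRACE: list[dict] = []  # step 2: one entry per tool call

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for your model call. Returns either
    {"tool": name, "args": {...}} or {"answer": "..."}."""
    # Canned behavior so the sketch runs end to end.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "SF"}}
    return {"answer": "It's sunny in SF."}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # step 1: basic loop
        action = call_llm(messages)
        if "answer" in action:
            return action["answer"]
        tool, args = action["tool"], action["args"]
        result = TOOLS[tool](**args)
        TRACE.append({"tool": tool, "args": args, "result": result})  # step 2
        messages.append({"role": "tool", "content": result})
    return "Gave up after max_steps."

if __name__ == "__main__":
    print(run_agent("What's the weather in SF?"))
    print(json.dumps(TRACE, indent=2))
```

The TRACE list is what steps 3-5 build on: it shows which tool combos actually get used and where guided workflows are worth adding.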
Unguided agents are: - Simpler to implement (single LLM loop + tools) - More powerful by default (no artificial constraints) - Easier to test initially (fewer branching paths)
Most people get agent design backwards. They start by adding guided workflows when a simple, unguided agent (just an LLM loop + tools) would work better. Why?
The term "agent" is used for everything in AI right now, which creates confusion. Let me break it down: Classical ML definition: System that takes actions on your behalf Modern definition: System where the LLMs control application flow @openai Operator is an example of both. However, I prefer the modern definition - it's more relevant for how software is built and surfaces implementation challenges around testing and reliability.