Co-founder, CEO @GentraceAI. Proud ice cream tester for @hieesuh.

San Francisco, CA
Joined February 2015
Agents are significantly more powerful than standalone LLM calls. But debugging them is a nightmare. You can trace their reasoning and tool use, but traces get huge and are impossible to parse. To solve this, we spent the last several months building Gentrace for Agents, which puts our own agents to work on yours. In Gentrace for Agents, you can: • Chat with AI to debug agent traces • Create smart monitoring columns • Build out tailored evaluations It's like a giant AI-powered spreadsheet over your trace data, with a Cursor-style chat sidecar. If it sounds a little meta, it is, but it's very powerful in practice. We recorded this video to show you how it works. Take a look, and let me know what you think:
.@linear changed their agents feature from "assign to agent" to "delegate to agent." So there is always still a human assignee. This is smart - prevents the issue where agent-assigned issues end up in no-man's land if not merged. (Shown here with @codegen)
Many engineers are becoming "AI junkies" - they use AI blindly, and don't understand what's happening in the code that gets generated. This is even happening to good engineers and eventually leads to loss of critical thinking and engineering ability. Use AI responsibly.
2x this week got the question: "how do I eval my MCP server?" By which they mean: how do I create tools that an agent will know how to use, and make changes to those tools (e.g. descriptions, organization) with confidence. My take: build a super basic agent on top of your MCP and eval it on common user queries across a handful of models. E.g. a task management company would eval the agent's ability to take a blob of notes and create the right tasks. A failure would be missing or "bad" (irrelevant, wrong data attributes) tasks (measured using LLM-as-a-judge). How do other people think we should eval MCP servers?
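Here's a rough Python sketch of what that harness could look like. Everything in it is illustrative: `run_agent` and `llm_judge` are stand-ins for your own agent loop over the MCP tools and your actual judge prompt, and the notes/tasks case is made up.

```python
# Sketch of an MCP server eval: run a basic agent over the server's tools
# on common user queries, across a few models, and score with a judge.
# run_agent / llm_judge are placeholders for your own implementations.

from dataclasses import dataclass, field

@dataclass
class Case:
    notes: str  # blob of user notes fed to the agent
    expected_tasks: list[str] = field(default_factory=list)  # tasks a human would create

CASES = [
    Case(
        notes="Standup notes: ship login fix by Friday; ask design for new icons",
        expected_tasks=["Ship login fix (due Friday)", "Request new icons from design"],
    ),
]

MODELS = ["model-a", "model-b"]  # eval across a handful of models

def run_agent(model: str, notes: str) -> list[str]:
    # Placeholder: run your basic agent loop (LLM + MCP tools) and return
    # the tasks it actually created. Canned output so the sketch runs.
    return ["Ship login fix (due Friday)"]

def llm_judge(expected: list[str], actual: list[str]) -> float:
    # Placeholder for LLM-as-a-judge: should flag missing tasks and "bad"
    # tasks (irrelevant, wrong data attributes). Simple overlap stand-in here.
    if not expected:
        return 1.0
    return len(set(expected) & set(actual)) / len(expected)

for model in MODELS:
    scores = [llm_judge(c.expected_tasks, run_agent(model, c.notes)) for c in CASES]
    print(f"{model}: avg score {sum(scores) / len(scores):.2f} over {len(scores)} cases")
```

The structure is the point: a small set of realistic user queries, an agent run per model, and a judge that penalizes missing or "bad" tasks.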
TIL how to use LLMs for nasty merge conflicts: 1. Get the branch point SHA - run `git log` on the feature branch; it's the latest SHA from main 2. Accept the feature branch 3. Run `git diff <branch point sha> main -- <conflict>.ts` 4. Copy the diff into Cursor, "apply this change to this file"
My AI Code Review experience so far: - Cursor BugBot: has only found false positives, not a single real issue yet (sample: 9 PRs) - Graphite Diamond: 1/3 hit rate - worth keeping but not amazing (sample: ~100 PRs) - Both: make ~0.5 comments per PR Any others we should try?
Had a fun chat about evals with Ben on The Chief AI Officer podcast. We discussed: • Why many companies don't have a trustworthy eval stack today • How to create great LLM-as-a-judge evals with an "unfair advantage" • Why "100% accuracy" is often a red flag Links below:
The ghibli stuff is cool and all, but tbh there have been versions of that forever. The "put my <whatever> <wherever>" is what's really blowing my mind.
Everyone is hyped on claude-3.7, but is o3-mini-high the current best? Such a common experience for me: - Ask Cursor (3.7 thinking), "why is this code broken" - Wrong answer - Paste code into ChatGPT (o3-mini-high), ask "why is this broken" - Gets it right on the first try
Most teams start building LLM evals by searching for the perfect "golden" dataset, only to find it goes stale. A better approach: start small, iterate continuously, and connect datasets to real-world data. Here's how to do it right: go.gentrace.ai/fyPOgVM
Feeling deja vu with agent frameworks: they're like the early days of JS frameworks (2009-2015). Everyone's building their own solution, but we haven't found our "React moment" yet. Start simple, build from scratch until patterns emerge.
The biggest trap when building agents that even OpenAI is struggling with: I was testing ChatGPT's search features yesterday and noticed something that keeps coming up: teams obsess over "did it pick the right tool?" while missing the bigger picture. Example: I asked it for the @gentraceai address. It correctly chose the search tool (great!) but then searched for a completely unrelated company (not great!). Tool selection was perfect, execution was useless. This pattern is everywhere: - Agents choose to edit your document but don't update the right section - Search results overflow the context window, losing critical info - Tools fail silently with no recovery strategy The reality is tool selection is just step one. The real engineering challenge is managing these interactions reliably at scale. You need trace validation, context management, and error recovery strategies. What's the worst agent failure mode you've seen?
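To make the "execution is the hard part" point concrete, here's a hypothetical Python sketch of a tool-call wrapper that adds the three things mentioned above: a trace entry per call, a crude context-window budget, and a recovery path so failures don't stay silent. The names (`web_search`, `MAX_TOOL_OUTPUT_CHARS`) are made up for illustration, not a real API.

```python
# Wrap each tool call with tracing, output validation, and a retry/recovery
# path, instead of only checking that the right tool was selected.

import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

MAX_TOOL_OUTPUT_CHARS = 8_000  # crude context-window budget for tool results

def web_search(query: str) -> str:
    """Placeholder tool; swap in your real search call."""
    return json.dumps({"results": [f"stub result for {query!r}"]})

def call_tool_with_recovery(tool, args: dict, retries: int = 2) -> str:
    """Run a tool call with tracing, truncation, and a simple retry strategy."""
    for attempt in range(retries + 1):
        try:
            log.info("tool=%s args=%s attempt=%d", tool.__name__, args, attempt)  # trace entry
            output = tool(**args)
            if not output:
                raise ValueError("empty tool output")  # don't let it fail silently
            if len(output) > MAX_TOOL_OUTPUT_CHARS:
                output = output[:MAX_TOOL_OUTPUT_CHARS]  # protect the context window
            log.info("tool=%s ok (%d chars)", tool.__name__, len(output))
            return output
        except Exception as exc:
            log.warning("tool=%s failed: %s", tool.__name__, exc)
    return f"[{tool.__name__} failed after {retries + 1} attempts]"  # surface the failure to the model

if __name__ == "__main__":
    print(call_tool_with_recovery(web_search, {"query": "Gentrace office address"}))
```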
First impressions using Deep Research: - It's really slow - It generates a very verbose answer full of links / citations - I'm too lazy to click the links and see if it's right, but I kinda don't trust it - Ok I'll stop being lazy and click on the links - The linking is actually really impressive; it took me to a snippet in a recorded podcast and showed me exactly what I was looking for - Ok this is pretty cool! Now I'm wondering, what's the point of Operator? If Deep Research could take actions, wouldn't it be better?
Keep it simple and add new tools incrementally to maintain full test coverage for each new workflow. We're advising this approach to our customers at @gentraceai - lmk what you think!
Here's how I build agents: 1. Start with a basic agent loop + function calling 2. Add comprehensive tracing for each tool call 3. Monitor which tool combos actually get used 4. Add guided workflows to address failing evals 5. Use unguided fallback for edge cases
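For steps 1 and 2, a minimal Python sketch of the shape I mean; `call_llm`, the tool, and the trace format are all stand-ins rather than a real SDK:

```python
# Bare agent loop with function calling (step 1) and one trace entry per
# tool call (step 2). Everything here is illustrative.

import json
from typing import Callable

def get_weather(city: str) -> str:
    """Placeholder tool; swap in real tools."""
    return json.dumps({"city": city, "forecast": "sunny"})

TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}
TRACE: list[dict] = []  # step 2: one entry per tool call

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for your model call. Returns either
    {"tool": name, "args": {...}} or {"answer": "..."}."""
    # Canned behavior so the sketch runs end to end.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "SF"}}
    return {"answer": "It's sunny in SF."}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # step 1: basic loop
        action = call_llm(messages)
        if "answer" in action:
            return action["answer"]
        tool, args = action["tool"], action["args"]
        result = TOOLS[tool](**args)
        TRACE.append({"tool": tool, "args": args, "result": result})  # step 2
        messages.append({"role": "tool", "content": result})
    return "Gave up after max_steps."

if __name__ == "__main__":
    print(run_agent("What's the weather in SF?"))
    print(json.dumps(TRACE, indent=2))
```

The TRACE list is what steps 3-5 build on: it shows which tool combos actually get used and where guided workflows are worth adding.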
Unguided agents are: - Simpler to implement (single LLM loop + tools) - More powerful by default (no artificial constraints) - Easier to test initially (fewer branching paths)
Most people get agent design backwards. They start by adding guided workflows when a simple, unguided agent (just an LLM loop + tools) would work better. Why?
The term "agent" is used for everything in AI right now, which creates confusion. Let me break it down: Classical ML definition: System that takes actions on your behalf Modern definition: System where the LLMs control application flow @openai Operator is an example of both. However, I prefer the modern definition - it's more relevant for how software is built and surfaces implementation challenges around testing and reliability.