The fastest way to ship reliable AI apps - Evaluation, Experimentation, and Observability Platform

SF & NYC
Joined June 2021
Agent benchmarks are great, but you need to actually understand your end user's experience. We've released four new agentic metrics, enabling you to track how your agents behave in real-world scenarios and how users experience flow, efficiency, intent change, and conversation quality, all available out of the box.

Agent Flow: Measure the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. When building multi-step agents, this metric helps you catch deviations from expected behavior patterns.

Agent Efficiency: Assess the efficiency of your agentic workflows. An agentic session is considered efficient or optimal when the agent provides a precise answer or resolution to every user ask via an efficient path. This metric helps you identify unnecessary tool calls, redundant questions, and bloated workflows.

Conversation Quality: Evaluate the overall user experience across multi-turn conversations. Beyond accuracy, it measures whether interactions leave users satisfied or frustrated, which is critical for customer-facing applications.

Intent Change: Detect when users shift their goals mid-conversation. These shifts often indicate gaps in your agent's ability to handle the initial request, providing clear signals for improvement.

Each metric is designed for production agents, provides granular visibility into behavior that traditional evals miss, and can be customized to your specific application, further extending our agent evals capabilities (which you can also access via our new MCP!). These are available to use for free today in Galileo. Learn more about the new metrics below 👇
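For intuition on the kind of behavior the efficiency metric surfaces, here is a toy sketch that scans a logged agent session for redundant tool calls and unanswered user asks. The session format and heuristic are illustrative assumptions for this example only, not Galileo's actual scorer; see the docs below for the real metrics.

```python
# Toy illustration of what an efficiency-style metric looks at: for a logged
# agent session, flag redundant tool calls and unanswered user asks.
# This is a simplified heuristic for intuition only, NOT Galileo's scorer.
from collections import Counter

session = [
    {"role": "user", "content": "What's the refund policy for annual plans?"},
    {"role": "tool", "name": "search_docs", "args": "refund policy annual"},
    {"role": "tool", "name": "search_docs", "args": "refund policy annual"},  # duplicate call
    {"role": "assistant", "content": "Annual plans can be refunded within 30 days."},
]

def efficiency_report(steps):
    # Count identical (tool, args) pairs; anything beyond the first is redundant.
    tool_calls = Counter((s["name"], s["args"]) for s in steps if s["role"] == "tool")
    redundant = sum(count - 1 for count in tool_calls.values())
    user_asks = sum(1 for s in steps if s["role"] == "user")
    answers = sum(1 for s in steps if s["role"] == "assistant")
    return {
        "redundant_tool_calls": redundant,
        "unanswered_asks": max(user_asks - answers, 0),
    }

print(efficiency_report(session))  # {'redundant_tool_calls': 1, 'unanswered_asks': 0}
```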
Crazy seeing @rungalileo buses in the wild with @SFMTA_Muni
Galileo retweeted
Just wrapped up the @daytonaio HackSprint in San Francisco - an in-person hackathon focused on building AI agents with sharp reasoning and independent decision-making. Had the privilege to hack with @AnelyaGrant, our Co-Founder & CPO at @getjustpaid, where we vibe coded a design layout for finance professionals.

What made this stand out:
→ Location: 660 Market St, San Francisco
→ Partners: @AnthropicAI, @rungalileo, @browser_use, and @WorkOS
→ Focus: Designing agents that demonstrate originality, technical strength, and real-world impact
→ Challenge: Safe integration with industry-relevant tools

The shift toward Agentic AI continues to reshape how we think about automation. We're moving beyond simple task execution to systems that can reason, decide, and operate independently. This is the kind of event that reminds you why SF remains the epicenter for AI innovation. Big thanks to the Daytona team for putting this together.

#AI #AgenticAI #Hackathon #SanFrancisco
Happy to share what we've been up to for the past few weeks! Develop agents at a rapid pace with Galileo's MCP right in your favorite IDE.
What if you could get insights into why your agent failed, and apply the fix without ever leaving your IDE? We're launching our Agent Evals MCP to make this a reality. 💪

With one config file, you can now access our evaluation and observability capabilities directly in Cursor or VS Code. No context switching. No manual copy-paste. Just eval-powered insights where you actually build.

Our new MCP server enables:
🔍 Instant root cause analysis: Get logstream insights that pinpoint precisely where and why agents deviate
📊 Synthetic dataset generation: Create test data directly in your IDE with natural language requests
✍️ Prompt template management: Set up and validate templates without leaving your development environment
⚡ Real-time integration guidance: Your AI assistant can now suggest and apply Galileo instrumentation directly to your codebase

Agent reliability should start where you code. Get started with the docs below 👇
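For reference, a typical IDE-side MCP setup looks something like the sketch below, which writes a Cursor-style config entry. Only the `mcpServers` layout is the standard Cursor config shape; the server name, launch command, package, and environment variable are placeholders rather than Galileo's actual values, so consult the linked docs for the real configuration.

```python
# Minimal sketch: generate a Cursor-style MCP config entry for an MCP server.
# The server name, command, package, and env var below are PLACEHOLDERS --
# check Galileo's MCP docs for the real values. Only the "mcpServers" layout
# is the standard Cursor config shape.
import json
from pathlib import Path

config_path = Path(".cursor/mcp.json")
config_path.parent.mkdir(exist_ok=True)

config = {
    "mcpServers": {
        "galileo": {                                   # hypothetical server name
            "command": "npx",                          # placeholder launch command
            "args": ["-y", "@galileo/mcp-server"],     # hypothetical package name
            "env": {"GALILEO_API_KEY": "<your-api-key>"},
        }
    }
}

config_path.write_text(json.dumps(config, indent=2))
print(f"Wrote MCP config to {config_path}")
```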
👀 🚌 Spotted in SF! We just launched our new bus ads across San Francisco because if you want your AI to work reliably in production, you need thorough evaluations, monitoring, and guardrails 🔒

If you're in the Bay Area and want a chance at winning some Galileo swag, now's your chance. All you have to do is:
👀 Spot one of our buses in the city
📸 Take a photo
💻 Post it on LinkedIn or X
✅ Tag @Galileo

Our buses will be circulating for the next few weeks, and we'll announce the winner on Nov. 12th 📅
On Saturday, the inaugural @daytonaio HackSprint brought San Francisco's AI builder community together, and what the community built in just six hours was incredible 💪

Our Solutions Architect, Vyoma Gajjar, was onsite where she awarded the following builders for the best use of Galileo:
1️⃣ BioScout by Mahtabin Rodela – Connects biotech investors with emerging research and patents, making discovery faster and more accessible.
2️⃣ Smart Treasury Agent by @nicocapetillo – An AI copilot for CFOs that simulates treasury strategies in real time, bringing sophisticated financial modeling to more companies.
3️⃣ Peazy Trainer by Kushal Murthy & Komala Chenna – A voice-guided AI trainer that teaches software through live, hands-on sandboxes.

Each team excelled at using evaluations and observability to understand agent behavior, catch errors early, and ship reliable agents. Huge congrats to all three winners, and thank you to everyone who participated. The quality of work, the collaboration, and the energy in the room reminded us why we love this community.

Thanks to @JukicVedran and the Daytona team for creating a space where builders could do their best work, @WorkOS for hosting us in their amazing office, and the @AnthropicAI team for sponsoring as well 🤝

We can't wait to see what you build next, and we're excited to participate in more HackSprints soon 👀
Galileo retweeted
🚀 Yesterday we hosted the first-ever Daytona HackSprint in SF, powered by @AnthropicAI, @rungalileo, @browser_use & @WorkOS at the amazing WorkOS Office!

🏆 So many great projects - here are the winners ⬇️
🥇 A/B GPT – AI that finds & fixes UX issues autonomously using Daytona + Browser Use
🥈 PolySandbox – Unified sandbox orchestrator for AI code evals across backends
🥉 QoalA – Automated website QA with Browser-Use + Daytona

🎖️ Best Use of Galileo
1️⃣ BioScout – AI scouting biotech patents
2️⃣ Peazy Trainer – Voice-led AI training in sandboxes
3️⃣ Smart Treasury Agent – AI copilot for corporate finance

Huge congrats to all the winners, and a big thank you to everyone who participated, including the judges, volunteers, and partner representatives! 🚀
Galileo retweeted
We just kicked off our first @daytonaio HackSprint at the beautiful @WorkOS office in SF! After demos from our amazing partners @rungalileo, @browser_use, @WorkOS & @AnthropicAI, the hacking has begun. We can't wait to see what awesome projects our builders create today! 🚀
📅 HackSprint SF with @daytonaio, @AnthropicAI, @browser_use, and @WorkOS is tomorrow! If you're in SF, join us for a day of hacking agents and a chance to win prizes from a prize pool of over $40,000 💰 Register here: luma.com/bh7auv0t
On October 18th, we're excited to partner with @daytonaio for their HackSprint in SF, alongside @AnthropicAI, @browser_use, @WorkOS, and more. We're bringing together San Francisco's brightest AI builders for an intense one-day sprint to create projects at the frontier of AI.

The Challenge? Build AI agents with sharp reasoning, independent decision-making, and safe integration with industry-relevant tools. Participants will have six hours to build, and three minutes to present. Every participant gets $50 in Claude API credits from Anthropic and $100 in Daytona credits, and the prize pool exceeds $40,000 👀

Our Solutions Architect, Vyoma Gajjar, will be judging alongside @TobinSouth (Head of MCP and Agents at WorkOS), @JukicVedran (Co-founder & CTO at Daytona), and other leaders in AI.

Don't miss this. Register here: luma.com/bh7auv0t
Galileo retweeted
What does it really take to build an AI startup that lasts? Founders from @rungalileo, @JinaAI_, & @llama_index join @benchmark's @chetanp live for an honest convo on scaling, funding & building AI products that actually deliver. Save your seat: go.es.io/42JX9l9
Galileo retweeted
We recently welcomed new members to our Enterprise AI Ecosystem: @ServiceNow, @rungalileo, @Pulse__AI, and Fundamental. In doing so, we're able to deliver complete, production-ready AI solutions to customers. Our Abhas Ricky says it best: "We're only as good as our ecosystem and we take pride in that." Read more about this commitment in @CRN:
Partnering with JPMorganChase has been a blast 💥

The impact of agentic applications within large financial services organizations will be massive, but for financial service agents to be successful, every interaction needs to be monitored, evaluated, and protected in real time to foster trust among users.

Our platform is purpose-built for this with agent observability and runtime protection, which:
✅ Scales to billions of agent paths
✅ Stops errors before they reach users with millisecond latencies
✅ Can ingest millions of user queries a day

Excited to share more soon 👀
Galileo retweeted
Daytona's October lineup 🔥

📍 San Francisco 🌉
AI Builders - Oct 14
HackSprint - Oct 18

📍 New York City 🗽
AI Builders - Oct 20

Thanks to @AnthropicAI, @datadoghq, @brexHQ, @rungalileo, @browser_use & @WorkOS for powering the builder community! 🚀

RSVP links for the events in comments ⬇️
Production-ready agents require complex coordination across different model types, memory systems, and tool calls. This is the shift from prompt engineering to context engineering.

When you're building production AI systems, you're architecting systems that coordinate:
→ Multiple model sizes (small, mid-size, large)
→ Different training approaches (fine-tuned, distilled, proprietary, open-source)
→ Multi-modal capabilities across different models
→ Memory systems that persist and retrieve context
→ Tool calls that extend model capabilities

Each component of your system needs the right context at the right time; otherwise, agents can hallucinate, miss critical cues, or fail unpredictably.

@Aishwarya_Sri0 joined us on the Chain of Thought Podcast last week to break down the shift to architecting AI systems. Watch the full conversation with host @ConorBronsdon below 👇
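To make the idea concrete, here is a minimal sketch of context engineering: assembling only the relevant memory and tool results into each step's prompt and routing to a small or large model by task complexity. All names (the memory store, routing rule, model labels) are illustrative assumptions, not a specific framework's API.

```python
# Minimal sketch of "context engineering": assemble the right context for each
# step from memory and tool results, then pick a model size for the task.
# All names and heuristics here are illustrative, not a specific framework.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy long-term memory: naive keyword match over stored notes."""
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def recall(self, query: str, k: int = 3) -> list[str]:
        scored = [(sum(w in n.lower() for w in query.lower().split()), n) for n in self.notes]
        return [n for score, n in sorted(scored, reverse=True)[:k] if score > 0]

def pick_model(task: str) -> str:
    # Route longer or planning-heavy tasks to a larger model (names are placeholders).
    return "large-reasoning-model" if len(task.split()) > 40 or "plan" in task.lower() else "small-fast-model"

def build_context(task: str, memory: MemoryStore, tool_results: dict[str, str]) -> str:
    # Only the pieces relevant to *this* step go into the prompt.
    sections = [f"Task: {task}"]
    recalled = memory.recall(task)
    if recalled:
        sections.append("Relevant memory:\n" + "\n".join(f"- {m}" for m in recalled))
    if tool_results:
        sections.append("Tool results:\n" + "\n".join(f"- {k}: {v}" for k, v in tool_results.items()))
    return "\n\n".join(sections)

if __name__ == "__main__":
    memory = MemoryStore()
    memory.remember("User prefers concise answers with citations.")
    task = "Plan a migration of the billing service to the new payments API."
    prompt = build_context(task, memory, {"search": "3 internal docs found"})
    print(f"model={pick_model(task)}\n\n{prompt}")
```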
Your multi-agent system hits production and encounters edge cases you never tested. 47 LLM calls, 12 tool invocations, 8 agent handoffs, but how do you know which one failed?

Production-ready multi-agent systems need observability from day one. Our latest blog shows you how to build a LangGraph multi-agent system that routes queries to specialized agents. The architecture worked in testing, but in production, there was context loss between agent handoffs, redundant tool calls, and incorrect routing decisions.

The solution? Systematic observability at every layer:
Track every decision in real time: Observability tools show which agent handled each query, what tools they invoked, and where latency bottlenecks appear.
Debug with actual context: Our Insights Engine automatically surfaces failure patterns so you don't have to dig through traces manually.
Improve based on data: See which metrics offer clear performance targets for maintaining user satisfaction and operational excellence.

Read our blog to learn the complete implementation and techniques you can apply to any multi-agent architecture 👇
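As a rough sketch of the routing pattern described above, here is a minimal LangGraph graph that sends queries to specialized agents, with a print-based tracing wrapper standing in for real instrumentation. The agent logic, routing rule, and tracing are illustrative assumptions, not the blog's implementation; the blog covers wiring this up to Galileo's observability.

```python
# Minimal sketch: LangGraph router that dispatches queries to specialized agents,
# with a toy tracing wrapper in place of a real observability integration.
import time
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    query: str
    route: str
    answer: str

def traced(name, fn):
    """Wrap a node so every invocation is logged with its latency."""
    def wrapper(state: State) -> dict:
        start = time.perf_counter()
        out = fn(state)
        print(f"[trace] node={name} latency_ms={(time.perf_counter() - start) * 1000:.1f}")
        return out
    return wrapper

def router(state: State) -> dict:
    # Toy routing rule: billing-sounding queries go to the billing agent.
    route = "billing" if any(w in state["query"].lower() for w in ("invoice", "charge")) else "support"
    return {"route": route}

def billing_agent(state: State) -> dict:
    return {"answer": f"[billing agent] handling: {state['query']}"}

def support_agent(state: State) -> dict:
    return {"answer": f"[support agent] handling: {state['query']}"}

graph = StateGraph(State)
graph.add_node("router", traced("router", router))
graph.add_node("billing", traced("billing", billing_agent))
graph.add_node("support", traced("support", support_agent))
graph.add_edge(START, "router")
graph.add_conditional_edges("router", lambda s: s["route"], {"billing": "billing", "support": "support"})
graph.add_edge("billing", END)
graph.add_edge("support", END)
app = graph.compile()

result = app.invoke({"query": "Why was I charged twice on my invoice?", "route": "", "answer": ""})
print(result["answer"])
```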