Async Agents in Production: Failure Modes and Fixes

Abstract

As models improve, we are starting to build long-running, asynchronous agents such as deep research agents and browser agents that can execute multi-step workflows autonomously. These systems unlock new use cases, but they fail in ways that short-lived agents do not.

The longer an agent runs, the more early mistakes compound, and the more token usage grows through extended reasoning, retries, and tool calls. Patterns that work for request-response agents often break down, leading to unreliable behaviour and unpredictable costs.

This talk is aimed at use case developers, with secondary relevance to platform engineers. It covers the most common failure modes in async agents and practical design patterns for reducing error compounding and keeping token costs bounded in production.


Speaker

Meryem Arik

Co-founder and CEO @Doubleword (previously TitanML)

Meryem is the Co-founder and CEO of Doubleword (previously TitanML), a self-hosted AI inference platform empowering enterprise teams to deploy domain-specific or custom models in their private environment. An alumna of Oxford University, Meryem studied Theoretical Physics and Philosophy. She frequently speaks at leading conferences, including TEDx and QCon, sharing insights on inference technology and enterprise AI. Meryem has been recognized as a Forbes 30 Under 30 honoree for her contributions to the AI field.


From the same track

Session

Reliable Retrieval for Production AI Systems

Search is central to many AI systems. Everyone is building RAG and agents right now, but few are building reliable retrieval systems.


Lan Chu

AI Tech Lead and Senior Data Scientist

Session

Beyond Context Windows: Building Cognitive Memory for AI Agents

AI agents are rapidly changing how users interact with software, yet most agentic systems today operate with little to no intelligent memory, relying instead on brittle context-window heuristics or short-term state.


Karthik Ramgopal

Distinguished Engineer & Tech Lead of the Product Engineering Team @LinkedIn, 15+ Years of Experience in Full-Stack Software Development

Session

Refreshing Stale Code Intelligence

Coding models are helping software developers move faster than ever, yet paradoxically, they are not keeping up with that pace of progress. The models that power code generation are often trained on snapshots of open source code that are months or even years old.


Jeff Smith

CEO & Co-Founder @ 2nd Set AI, AI Engineer, Researcher, Author, Ex-Meta/FAIR

Session

Rewriting All of Spotify's Code Base, All the Time

We don't need LLMs to write new code. We need them to clean up the mess we already made. In mature organizations, we have to maintain and migrate the existing codebase. Engineers are constantly balancing new feature development with endless software upkeep.


Jo Kelly-Fenton

Engineer @Spotify


Aleksandar Mitic

Software Engineer @Spotify

Session

Building an AI Gateway Without Frameworks: One Platform, Many Agents

Early AI integrations often start small: wrap an inference API, add a prompt, ship a feature. At Zoox, that approach grew into Cortex, a production AI gateway supporting multiple model providers, multiple modalities, and agentic workflows with dozens of tools, serving over 100 internal clients.


Amit Navindgi

Staff Software Engineer @Zoox