Are We All on the Same Page? Let’s Fix That - With AI Assistance

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.

The presentation addresses incident response failures in distributed systems. These failures often stem from a lack of proper team mobilization rather than missing signals. The talk explores how AI can aid in debugging distributed systems by acting as a teammate, thereby reducing cognitive load and fostering a shared understanding among teams. AI helps correlate signals, converge on probable causes, and surface remediation options in real time, without replacing engineers.

Key Topics Covered:

  • Ownership and Collaboration: Effective incident response depends on how teams are structured and collaborate during failures.
  • Adaptive Paging: Introduced as a method to route incidents to the team closest to the problem, based on trace causality and signal alignment.
  • AI Assistance: AI is utilized to identify probable root causes and assist in incident response, providing actionable clarity on what needs to be done to solve issues.
  • Real-World Application: Drawing from experience, the speaker elaborates on intelligent routing and AI-assisted investigation in transforming incident response into a coordinated effort.

Challenges and Solutions: The talk also highlights the challenges in implementing distributed tracing and gaining organizational buy-in for instrumentation in large systems.

Conclusion:

The approach presented is vendor-neutral and architecture-agnostic, suitable for modern microservice environments. AI aids in reducing the cognitive load during incident response, resulting in faster recovery and more confident teams that can manage their systems effectively in production.

This is the end of the AI-generated content.


Abstract

In distributed systems, incident response rarely fails because of missing signals - it fails because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure. Customer-facing teams absorb pages for downstream failures, ownership blurs, and valuable time is lost coordinating humans rather than solving problems.

This talk takes a sociotechnical view of debugging distributed systems. Beyond traces and metrics, effective incident response depends on how teams are structured, how ownership is defined, and how people collaborate during failure. Routing incidents to the team closest to the problem is necessary - but it’s only the starting point.

We explore how trace causality can be used to align alerts with ownership, and how AI can act as a teammate during incident response - helping teams converge on probable causes, correlate signals, and surface remediation options directly in the moment of paging. Rather than replacing engineers, AI reduces cognitive load and accelerates shared understanding when it matters most.

Drawing on real-world experience, we’ll show how combining intelligent routing, collaborative debugging practices, and AI-assisted investigation transforms incident response from a noisy escalation chain into a coordinated team effort. The result isn’t just faster recovery - it’s teams that can confidently own their systems in production.

The approach is vendor-neutral and architecture-agnostic, applicable across modern microservice environments and observability stacks.

Interview

What is your session about, and why is it important for senior software developers?

The session explores incident response as a sociotechnical problem, not just a technical one. Most incidents fail because teams can’t mobilize quickly enough or align on the root cause under pressure - not because monitoring is missing signals. Senior developers will learn practical patterns to combine trace causality with intelligent alert routing so the right people own the right problems, and how AI can act as a thinking partner to accelerate that shared understanding.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

As we head into 2026, the complexity of distributed platforms is growing faster than the operational capacity of the teams building them. Alert fatigue and unclear ownership are becoming the primary brake on incident response. Organizations are already investing heavily in observability infrastructure, but without the sociotechnical glue - structured ownership models and intelligent routing - those signals drown teams rather than help them. 2026 is the year to move past “better metrics” to “better human coordination.”

What are the common challenges developers and architects face in this area?

The classic challenges: symptom-based alerting works great until you’re paging the same team for ten different root causes deeper in the distributed system. Trace data exists, but teams lack the ownership structures or automation to route alerts based on causality rather than topology. And when an incident happens, people spend a lot of time gathering context and coordinating rather than solving. Even after locating where the actual problem may be, effective mitigation requires skills that are not abundant - and that context assembly is where AI can add genuine value, not as a replacement but as a thinking partner.
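To make "causality rather than topology" concrete, here is a minimal sketch of the idea in Python. All names (`Span`, `find_root_cause_span`, the example services) are illustrative assumptions, not any particular tracing library or the speaker's actual implementation: the point is simply that the deepest failing span in a trace, not the span closest to the user, identifies who should be paged.

```python
# Hypothetical sketch: pick a page target from trace causality, not topology.
# The Span shape and service names are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str                    # service that emitted this span
    error: bool                     # did this span fail?
    children: list = field(default_factory=list)

def find_root_cause_span(span: Span):
    """Walk the trace depth-first and return the deepest failing span:
    the failure closest to the cause, not the one closest to the user."""
    if not span.error:
        return None
    for child in span.children:
        deeper = find_root_cause_span(child)
        if deeper is not None:
            return deeper
    return span  # no failing child, so this span is the likely root cause

# A checkout failure that is really a downstream payments failure:
trace = Span("frontend", True, [
    Span("checkout", True, [
        Span("payments", True),     # the actual root cause
        Span("inventory", False),
    ]),
])

root = find_root_cause_span(trace)
print(root.service)  # → payments: page the payments team, not frontend
```

Topology-based routing would page whoever owns `frontend`, because that is where the customer-facing symptom fired; causality-based routing follows the failing edge down the trace instead.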

What’s one thing you hope attendees will implement immediately after your talk?

Audit your alert rules against your actual team ownership model. Most organizations alert based on technical symptoms (service error rates) without aligning that to which team actually owns fixing it. Start with symptom-based alerting on your critical paths, then layer in one adaptive paging rule that follows trace causality to the root service owner. You’ll immediately see fewer pages and faster resolution.
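The suggested audit can be sketched in a few lines. This is a hypothetical illustration, assuming you have (or can export) a service-to-team ownership map and a list of alert rules with their page targets; the data shapes here are assumptions, not any vendor's API.

```python
# Hypothetical sketch of the audit: flag alert rules that page a team
# other than the one that owns the service the rule fires on.
# Service, team, and rule names are made up for illustration.

ownership = {                # service -> owning team (e.g. a service catalog)
    "checkout": "team-checkout",
    "payments": "team-payments",
}

alert_rules = [              # each rule: the service it fires on, the team it pages
    {"name": "checkout-5xx", "service": "checkout", "pages": "team-checkout"},
    {"name": "payments-latency", "service": "payments", "pages": "team-checkout"},
]

def audit(rules, ownership):
    """Return rules whose page target is not the service's owner."""
    return [
        r for r in rules
        if ownership.get(r["service"]) not in (None, r["pages"])
    ]

for r in audit(alert_rules, ownership):
    print(f"{r['name']}: pages {r['pages']}, "
          f"but {r['service']} is owned by {ownership[r['service']]}")
```

Here the audit flags `payments-latency`: it fires on a payments symptom but pages the checkout team, which is exactly the mismatch the talk suggests fixing first.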

What makes QCon stand out as a conference for senior software professionals?

QCon brings together practitioners who’ve actually built and operated these systems at scale. The talks move beyond theory - they’re grounded in real constraints: cost, team capacity, organizational structures, and the messy reality of production systems. That’s exactly the context you need when thinking through incident response, because no solution works in isolation from your team’s structure and culture.


Speaker

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Director of Digital Foundation at ASOS and self-confessed “SRE Charmer”. Previously Senior Director of Developer Platform at Delivery Hero and Head of SRE at Zalando, where he created Adaptive Paging to route incidents to the team closest to the problem.


Date

Tuesday Mar 17 / 05:05PM GMT ( 50 minutes )

Location

Mountbatten (6th Fl.)

Topics

Observability, Alerting, Monitoring, Paging, Troubleshooting, Debugging, RCA, AI-Assisted

From the same track

Session Sociotechnical Leadership

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

Tuesday Mar 17 / 10:35AM GMT

Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.

Hazel Weakly

Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering

Session Distributed Tracing

How Eve Online Leverages Head Based Sampling to Observe "Fun"

Tuesday Mar 17 / 11:45AM GMT

A unique pattern in video game software is real-time interaction that expresses the personality of users. Here we will talk about how we instrument the universe of New Eden to identify the traffic that matters, even the "fun" parts!

Nicholas Herring

Technical Director, Eve Online @CCP Games, Refiner of Internet Spaceships and Explorer of Feral Gordian Knots of Python

Session

Can Claude Fix Itself? Using LLMs for Incident Response

Tuesday Mar 17 / 02:45PM GMT

Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.

Alex Palcuie

Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform

Session Observability

Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Tuesday Mar 17 / 03:55PM GMT

Observability is supposed to help you tame complexity, but your Observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.

Colin Douch

Site Reliability Engineer @DuckDuckGo

Session

Unconference: Debugging Distributed Systems

Tuesday Mar 17 / 01:35PM GMT