Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.
The presentation addresses incident response failures in distributed systems, which often stem from failing to mobilize the right people quickly rather than from missing signals. It explores how AI can act as a teammate when debugging distributed systems, reducing cognitive load and fostering a shared understanding among teams. AI helps correlate signals, converge on probable causes, and surface remediation options at the moment of paging, without replacing engineers.
Key Topics Covered:
- Ownership and Collaboration: Effective incident response depends on how teams are structured and collaborate during failures.
- Adaptive Paging: A method for routing incidents to the team closest to the problem, based on trace causality and signal alignment.
- AI Assistance: AI helps identify probable root causes during incident response and clarifies which actions are needed to resolve the issue.
- Real-World Application: Drawing on experience, the speaker shows how intelligent routing and AI-assisted investigation turn incident response into a coordinated effort.
- Challenges and Solutions: The talk also highlights the challenges of implementing distributed tracing and gaining organizational buy-in for instrumentation in large systems.
Conclusion:
The approach presented is vendor-neutral and architecture-agnostic, suitable for modern microservice environments. AI aids in reducing the cognitive load during incident response, resulting in faster recovery and more confident teams that can manage their systems effectively in production.
This is the end of the AI-generated content.
Abstract
In distributed systems, incidents rarely fail because of missing signals - they fail because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure. Customer-facing teams absorb pages for downstream failures, ownership blurs, and valuable time is lost coordinating humans rather than solving problems.
This talk takes a sociotechnical view of debugging distributed systems. Beyond traces and metrics, effective incident response depends on how teams are structured, how ownership is defined, and how people collaborate during failure. Routing incidents to the team closest to the problem is necessary - but it’s only the starting point.
We explore how trace causality can be used to align alerts with ownership, and how AI can act as a teammate during incident response - helping teams converge on probable causes, correlate signals, and surface remediation options directly in the moment of paging. Rather than replacing engineers, AI reduces cognitive load and accelerates shared understanding when it matters most.
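As a rough illustration of aligning alerts with ownership through trace causality, the sketch below walks a failing trace from its root span down the chain of erroring child spans and pages the owner of the deepest failing service. It is a minimal sketch under stated assumptions, not the speaker's implementation: the `Span` shape, the `OWNERS` map, the team names, and the functions `probable_cause_service` and `page_target` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified view of one span in a distributed trace."""
    span_id: str
    parent_id: Optional[str]  # None marks the root (customer-facing) span
    service: str
    error: bool


# Assumed ownership map; in practice this would come from a service catalogue.
OWNERS: Dict[str, str] = {
    "checkout": "team-checkout",
    "payments": "team-payments",
    "ledger-db": "team-storage",
}


def probable_cause_service(spans: List[Span]) -> Optional[str]:
    """Follow erroring spans from the root to the deepest failing service."""
    children: Dict[str, List[Span]] = {}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    if root is None or not root.error:
        return None  # nothing failed at the edge, so there is nothing to route
    current = root
    while True:
        failing = [c for c in children.get(current.span_id, []) if c.error]
        if not failing:
            return current.service
        # Simplification: when several children fail, follow the first one.
        current = failing[0]


def page_target(spans: List[Span], default: str = "team-on-call") -> Optional[str]:
    """Return the team to page, or None when the trace shows no failure."""
    service = probable_cause_service(spans)
    if service is None:
        return None
    # Fall back to a default rota when the service has no recorded owner.
    return OWNERS.get(service, default)
```

With a trace in which `checkout` fails because `payments` fails because `ledger-db` fails, `page_target` returns the owner of `ledger-db` rather than the customer-facing team. A real system would also need to weigh timing, retries, and multiple failing branches, which this sketch ignores.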
Drawing on real-world experience, we’ll show how combining intelligent routing, collaborative debugging practices, and AI-assisted investigation transforms incident response from a noisy escalation chain into a coordinated team effort. The result isn’t just faster recovery - it’s teams that can confidently own their systems in production.
The approach is vendor-neutral and architecture-agnostic, applicable across modern microservice environments and observability stacks.
Interview:
What is your session about, and why is it important for senior software developers?
The session explores incident response as a sociotechnical problem, not just a technical one. Most incidents fail because teams can’t mobilize quickly enough or align on the root cause under pressure - not because monitoring is missing signals. Senior developers will learn practical patterns to combine trace causality with intelligent alert routing so the right people own the right problems, and how AI can act as a thinking partner to accelerate that shared understanding.
Why is it critical for software leaders to focus on this topic right now, as we head into 2026?
As we head into 2026, the complexity of distributed platforms is growing faster than the operational capacity of the teams building them. Alert fatigue and unclear ownership are becoming the primary brake on incident response. Organizations are already investing heavily in observability infrastructure, but without the sociotechnical glue - structured ownership models and intelligent routing - those signals drown teams rather than help them. 2026 is the year to move past “better metrics” to “better human coordination.”
What are the common challenges developers and architects face in this area?
The classic challenge is that symptom-based alerting works well until you’re paging the same team for ten different root causes deeper in the distributed system. Trace data exists, but teams lack the ownership structures or automation to route alerts based on causality rather than topology. When an incident happens, people spend a lot of time gathering context and coordinating rather than solving. Even once the likely location of the problem is known, effective mitigation requires skills that are in short supply. That context assembly is where AI can add genuine value, not as a replacement but as a thinking partner.
What’s one thing you hope attendees will implement immediately after your talk?
Audit your alert rules against your actual team ownership model. Most organizations alert based on technical symptoms (service error rates) without aligning that to which team actually owns fixing it. Start with symptom-based alerting on your critical paths, then layer in one adaptive paging rule that follows trace causality to the root service owner. You’ll immediately see fewer pages and faster resolution.
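The audit suggested above could start as a simple cross-check between alert rules and an ownership map. The following is a minimal sketch, assuming alert rules can be exported as records naming the alerting service and the team they page. The `AlertRule` shape, the `audit_alert_rules` function, and the example names are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical export of one alert rule."""
    name: str
    service: str     # service whose symptom triggers the alert
    pages_team: str  # team the rule currently pages


def audit_alert_rules(
    rules: List[AlertRule], service_owners: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (rule name, finding) pairs for rules that disagree with ownership."""
    findings: List[Tuple[str, str]] = []
    for rule in rules:
        owner = service_owners.get(rule.service)
        if owner is None:
            findings.append((rule.name, f"no owner recorded for {rule.service}"))
        elif owner != rule.pages_team:
            findings.append(
                (rule.name,
                 f"pages {rule.pages_team}, but {rule.service} is owned by {owner}")
            )
    return findings


# Illustrative data only; the names below are made up.
rules = [
    AlertRule("checkout-5xx", "checkout", "team-checkout"),
    AlertRule("payments-latency", "payments", "team-checkout"),
    AlertRule("search-errors", "search", "team-search"),
]
owners = {"checkout": "team-checkout", "payments": "team-payments"}

for name, finding in audit_alert_rules(rules, owners):
    print(f"{name}: {finding}")
```

Rules that come back with findings are candidates for either an ownership fix or a routing rule that follows trace causality to the owning team.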
What makes QCon stand out as a conference for senior software professionals?
QCon brings together practitioners who’ve actually built and operated these systems at scale. The talks move beyond theory - they’re grounded in real constraints: cost, team capacity, organizational structures, and the messy reality of production systems. That’s exactly the context you need when thinking through incident response, because no solution works in isolation from your team’s structure and culture.
Speaker
Luis Mineiro
Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando
Director of Digital Foundation at ASOS and self-confessed “SRE Charmer”. Previously Senior Director of Developer Platform at Delivery Hero and Head of SRE at Zalando, where he created Adaptive Paging to route incidents to the team closest to the problem.