Abstract
In distributed systems, incident response rarely fails because of missing signals. It fails because the right people aren’t mobilised quickly enough and teams struggle to build a shared understanding under pressure. Customer-facing teams absorb pages for downstream failures, ownership blurs, and valuable time is lost coordinating humans rather than solving problems.
This talk takes a sociotechnical view of debugging distributed systems. Beyond traces and metrics, effective incident response depends on how teams are structured, how ownership is defined, and how people collaborate during failure. Routing incidents to the team closest to the problem is necessary, but it’s only the starting point.
We explore how trace causality can be used to align alerts with ownership, and how AI can act as a teammate during incident response, helping teams converge on probable causes, correlate signals, and surface remediation options at the moment of paging. Rather than replacing engineers, AI reduces cognitive load and accelerates shared understanding when it matters most.
Drawing on real-world experience, we’ll show how combining intelligent routing, collaborative debugging practices, and AI-assisted investigation transforms incident response from a noisy escalation chain into a coordinated team effort. The result isn’t just faster recovery; it’s teams that can confidently own their systems in production.
The approach is vendor-neutral and architecture-agnostic, applicable across modern microservice environments and observability stacks.
Speaker
Luis Mineiro
Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando
Director of Digital Foundation at ASOS and self-confessed “SRE Charmer”. Previously Senior Director of Developer Platform at Delivery Hero and Head of SRE at Zalando, where he created Adaptive Paging to route incidents to the team closest to the problem.