Are We All on the Same Page? Let’s Fix That - With AI Assistance

Abstract

In distributed systems, incident response rarely fails because of missing signals - it fails because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure. Customer-facing teams absorb pages for downstream failures, ownership blurs, and valuable time is lost coordinating humans rather than solving problems.

This talk takes a sociotechnical view of debugging distributed systems. Beyond traces and metrics, effective incident response depends on how teams are structured, how ownership is defined, and how people collaborate during failure. Routing incidents to the team closest to the problem is necessary - but it’s only the starting point.

We explore how trace causality can be used to align alerts with ownership, and how AI can act as a teammate during incident response - helping teams converge on probable causes, correlate signals, and surface remediation options directly in the moment of paging. Rather than replacing engineers, AI reduces cognitive load and accelerates shared understanding when it matters most.

Drawing on real-world experience, we’ll show how combining intelligent routing, collaborative debugging practices, and AI-assisted investigation transforms incident response from a noisy escalation chain into a coordinated team effort. The result isn’t just faster recovery - it’s teams that can confidently own their systems in production.

The approach is vendor-neutral and architecture-agnostic, applicable across modern microservice environments and observability stacks.


Speaker

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Director of Digital Foundation at ASOS and self-confessed “SRE Charmer”. Previously Senior Director of Developer Platform at Delivery Hero and Head of SRE at Zalando, where he created Adaptive Paging to route incidents to the team closest to the problem.

Date

Tuesday Mar 17 / 05:05PM GMT (50 minutes)

Location

Windsor (5th Fl.)

Topics

Observability, Alerting, monitoring, Paging, Troubleshooting, debugging, RCA, AI-Assisted

From the same track

Session Sociotechnical Leadership

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

Tuesday Mar 17 / 10:35AM GMT

Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.

Hazel Weakly

Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering

Session Distributed Tracing

How Eve Online Leverages Head Based Sampling to Observe "Fun"

Tuesday Mar 17 / 11:45AM GMT

A unique pattern in video game software is the use of real-time interactions to express the personality of users. Here we will talk about how we instrument the universe of New Eden to identify the traffic that matters, even the "fun" parts!

Nicholas Herring

Technical Director, Eve Online @CCP Games, Refiner of Internet Spaceships and Explorer of Feral Gordian Knots of Python

Session

Can Claude Fix Itself? Using LLMs for Incident Response

Tuesday Mar 17 / 02:45PM GMT

Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.

Alex Palcuie

Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform

Session Observability

Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Tuesday Mar 17 / 03:55PM GMT

Observability is supposed to help you tame complexity, but your Observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.

Colin Douch

Site Reliability Engineer @DuckDuckGo

Session

Unconference: Debugging Distributed Systems

Tuesday Mar 17 / 01:35PM GMT