Abstract

In distributed systems, incident response rarely fails because of missing signals - it fails because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure. Customer-facing teams absorb pages for downstream failures, ownership blurs, and valuable time is lost coordinating humans rather than solving problems.

This talk takes a sociotechnical view of debugging distributed systems. Beyond traces and metrics, effective incident response depends on how teams are structured, how ownership is defined, and how people collaborate during failure. Routing incidents to the team closest to the problem is necessary - but it’s only the starting point.

We explore how trace causality can be used to align alerts with ownership, and how AI can act as a teammate during incident response - helping teams converge on probable causes, correlate signals, and surface remediation options directly in the moment of paging. Rather than replacing engineers, AI reduces cognitive load and accelerates shared understanding when it matters most.

Drawing on real-world experience, we’ll show how combining intelligent routing, collaborative debugging practices, and AI-assisted investigation transforms incident response from a noisy escalation chain into a coordinated team effort. The result isn’t just faster recovery - it’s teams that can confidently own their systems in production.

The approach is vendor-neutral and architecture-agnostic, applicable across modern microservice environments and observability stacks.

Interview:

What is your session about, and why is it important for senior software developers?

The session explores incident response as a sociotechnical problem, not just a technical one. Most incident responses fail because teams can’t mobilize quickly enough or align on the root cause under pressure - not because monitoring is missing signals. Senior developers will learn practical patterns for combining trace causality with intelligent alert routing so the right people own the right problems, and will see how AI can act as a thinking partner to accelerate that shared understanding.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

As we head into 2026, the complexity of distributed platforms is growing faster than the operational capacity of the teams building them. Alert fatigue and unclear ownership are becoming the primary brake on incident response. Organizations are already investing heavily in observability infrastructure, but without the sociotechnical glue - structured ownership models and intelligent routing - those signals drown teams rather than help them. 2026 is the year to move past “better metrics” to “better human coordination.”

What are the common challenges developers and architects face in this area?

The classic challenges are familiar. Symptom-based alerting works well until you’re paging the same team for ten different root causes that sit deeper in the distributed system. Trace data exists, but teams lack the ownership structures or automation to route alerts based on causality rather than topology. When an incident happens, people spend much of their time gathering context and coordinating rather than solving the problem. Even once the likely location of the problem is known, effective mitigation requires skills that are in short supply. That context assembly is where AI can add genuine value - not as a replacement, but as a thinking partner.
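To make "routing on causality rather than topology" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the speaker's Adaptive Paging implementation: the `Span` shape, the `OWNERS` map, the team names, and the rule "page the owner of the deepest failing span" are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Simplified trace span (hypothetical shape, not a real tracing API)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


# Hypothetical ownership map; a real one would come from a service catalogue.
OWNERS = {
    "checkout": "team-checkout",
    "payments": "team-payments",
    "ledger-db": "team-storage",
}


def probable_cause_services(spans):
    """Services of failing spans that have no failing child span."""
    # A failing span whose child also failed is more likely a symptom than a cause.
    parents_of_failures = {s.parent_id for s in spans if s.error and s.parent_id}
    return sorted(
        {s.service for s in spans if s.error and s.span_id not in parents_of_failures}
    )


def teams_to_page(spans, fallback="team-checkout"):
    """Owners of the probable-cause services, or the fallback team if none is known."""
    services = probable_cause_services(spans)
    teams = sorted({OWNERS[svc] for svc in services if svc in OWNERS})
    return teams or [fallback]


# Example: the error surfaces in checkout but originates two hops downstream.
trace = [
    Span("a", None, "checkout", True),
    Span("b", "a", "payments", True),
    Span("c", "b", "ledger-db", True),
    Span("d", "a", "catalog", False),
]
print(teams_to_page(trace))  # ['team-storage']
```

A topology-based rule would page the checkout team, because that is where the symptom appears. Following the failing spans down the trace pages the team that owns the deepest failure instead, and falls back to the symptom owner when no owner is registered.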

What’s one thing you hope attendees will implement immediately after your talk?

Audit your alert rules against your actual team ownership model. Most organizations alert on technical symptoms (such as service error rates) without aligning those alerts with the team that actually owns the fix. Start with symptom-based alerting on your critical paths, then layer in one adaptive paging rule that follows trace causality to the root service owner. You should quickly see fewer pages and faster resolution.
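The audit step can start as a small script. The sketch below assumes alert rules are available as dictionaries with `name`, `service`, and `notify` keys, and that ownership is a simple service-to-team map. Both shapes, and every name in the example, are hypothetical; adapt them to whatever your alerting tool and service catalogue export.

```python
def audit_alert_rules(rules, owners):
    """Return (rule name, finding) pairs for rules that do not match ownership."""
    findings = []
    for rule in rules:
        owner = owners.get(rule["service"])
        if owner is None:
            # Nobody is registered as owning the service this rule watches.
            findings.append((rule["name"], "service has no registered owner"))
        elif owner != rule["notify"]:
            # The rule pages a team other than the one that owns the fix.
            findings.append(
                (rule["name"], f"pages {rule['notify']} but {owner} owns {rule['service']}")
            )
    return findings


# Hypothetical example data.
owners = {"checkout": "team-checkout", "payments": "team-payments"}
rules = [
    {"name": "checkout-5xx", "service": "checkout", "notify": "team-checkout"},
    {"name": "payments-latency", "service": "payments", "notify": "team-checkout"},
    {"name": "search-errors", "service": "search", "notify": "team-search"},
]

for name, finding in audit_alert_rules(rules, owners):
    print(f"{name}: {finding}")
```

Each finding is either a rule that pages a team other than the owner or a rule on a service nobody has claimed. Both are candidates for the first adaptive paging rule.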

What makes QCon stand out as a conference for senior software professionals?

QCon brings together practitioners who’ve actually built and operated these systems at scale. The talks move beyond theory - they’re grounded in real constraints: cost, team capacity, organizational structures, and the messy reality of production systems. That’s exactly the context you need when thinking through incident response, because no solution works in isolation from your team’s structure and culture.

Speaker

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Director of Digital Foundation at ASOS and self-confessed “SRE Charmer”. Previously Senior Director of Developer Platform at Delivery Hero and Head of SRE at Zalando, where he created Adaptive Paging to route incidents to the team closest to the problem.

Are We All on the Same Page? Let’s Fix That - With AI Assistance

From the same track

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

How Eve Online Leverages Head Based Sampling to Observe "Fun"

Can Claude Fix Itself? Using LLMs for Incident Response

Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Unconference: Debugging Distributed Systems
