Can Claude Fix Itself? Using LLMs for Incident Response

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.

This session explored the role of Large Language Models (LLMs), specifically Claude, in handling incident response.

Key Points:

  • Role of LLMs: The speaker, Alex Palcuie, uses LLMs as part of incident response and highlights both the successes and the limitations of deploying AI in such scenarios.
  • Capabilities of LLMs: LLMs can gather signals from systems efficiently, reading logs and identifying issues faster than humans because they can work through many sources in parallel.
  • Limitations: Although Claude can handle mundane tasks rapidly, it cannot resolve every incident. It often jumps to conclusions about root causes without understanding the complexities and historical context that human experts can discern.

Candid Observations:

  • The AI can propose "rollbacks" or other standard mitigations that are not always appropriate, because they may rest on faulty interpretations of correlations in the data.
  • While AI is superb at aggregating data (e.g., tracing system requests), humans are crucial for validating and making nuanced decisions, especially in uniquely complex incidents.

Opportunities and Challenges:

  • AI can aid humans in redefining incident management by reducing time spent on documentation and manual investigation, thus allowing more focus on strategic prevention.
  • The presentation acknowledges the evolving nature of AI capabilities, cautioning against over-relying on current technology for critical incident resolution.

Overall, the talk emphasizes the balance between using AI for efficiency and relying on human judgment for complex problem-solving.

This is the end of the AI-generated content.


Abstract

Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.

Interview:

What is your session about, and why is it important for senior software developers?

It's a field report on using LLMs for incident response. I run a production AI system and these days I reach for Claude before I reach for a dashboard. It's still taboo to say this, and sometimes it would have been better to just open the dashboard, but I want to discuss when reaching for Claude is the right call and when it isn't.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

If you decide to skip this because it's yet another talk about AI, I totally understand. I sometimes think there's too much talk about AI and not enough doing. This one's a field report; I'm definitely not going to try to convince you that Claude will solve all your problems.

What are the common challenges developers and architects face in this area?

Loss of control feels very scary. What once felt like a comfortable on-call rotation where you knew all the nooks and crannies now includes a large language model that sometimes finds the issue faster than you can and sometimes feels like an overconfident junior.

What's one thing you hope attendees will implement immediately after your talk?

Curiosity for experimenting. We're all learning together.

What makes QCon stand out as a conference for senior software professionals?

The audience has been paged at 3am and has worked with mission-critical systems. I can skip the intro and go straight to the interesting part.


Speaker

Alex Palcuie

Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform

Alex Palcuie is a Member of Technical Staff in AI Reliability Engineering at Anthropic, where he works on keeping Claude reliable at scale. He has the unenviable task of having to fix Claude without Claude when it goes down. Previously, he was a Staff Site Reliability Engineer on Google Cloud Platform (GCP) and a member of Google's Tech IRT (Incident Response Team), handling large-scale infrastructure incidents including the kind where datacentres flood.

From the same track

Session Sociotechnical Leadership

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

Tuesday Mar 17 / 10:35AM GMT

Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.

Hazel Weakly

Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering

Session Distributed Tracing

How Eve Online Leverages Head Based Sampling to Observe "Fun"

Tuesday Mar 17 / 11:45AM GMT

A unique pattern in video game software is real-time interactions to express the personality of users. Here we will talk about how we instrument the universe of New Eden to identify the traffic that matters, even the "fun" parts!

Nicholas Herring

Technical Director, Eve Online @CCP Games, Refiner of Internet Spaceships and Explorer of Feral Gordian Knots of Python

Session Observability

Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Tuesday Mar 17 / 03:55PM GMT

Observability is supposed to help you tame complexity, but your Observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.

Colin Douch

Site Reliability Engineer @DuckDuckGo

Session Observability

Are We All on the Same Page? Let’s Fix That - With AI Assistance

Tuesday Mar 17 / 05:05PM GMT

In distributed systems, incidents rarely fail because of missing signals - they fail because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure.

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Session

Unconference: Debugging Distributed Systems

Tuesday Mar 17 / 01:35PM GMT