Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.
This session explored the role of Large Language Models (LLMs), specifically Claude, in handling incident response.
Key Points:
- Role of LLMs: Alex uses LLMs as part of incident response, highlighting both the successes and limitations of deploying AI in such scenarios.
- Capabilities of LLMs: LLMs can efficiently gather signals from systems, reading logs and identifying issues faster than humans because they can process many sources in parallel.
- Limitations: Although Claude can handle mundane tasks rapidly, it cannot solve every incident scenario. It often jumps to conclusions about root causes without the grasp of system complexity and historical context that human experts bring.
Candid Observations:
- The AI may propose "rollbacks" or other standard mitigations that are not always appropriate, since it can mistake correlations in the data for causes.
- While AI is superb at aggregating data (e.g., tracing system requests), humans are crucial for validating and making nuanced decisions, especially in uniquely complex incidents.
Opportunities and Challenges:
- AI can aid humans in redefining incident management by reducing time spent on documentation and manual investigation, thus allowing more focus on strategic prevention.
- The presentation acknowledges the evolving nature of AI capabilities, cautioning against over-relying on current technology for critical incident resolution.
Overall, the talk emphasizes the balance between using AI for efficiency and relying on human judgment for complex problem-solving.
This is the end of the AI-generated content.
Abstract
Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.
Interview:
What is your session about, and why is it important for senior software developers?
It's a field report on using LLMs for incident response. I run a production AI system, and these days I reach for Claude before I reach for a dashboard. It's still taboo to say this, and sometimes it would have been better to just open the dashboard, but I want to discuss when reaching for the LLM helps and when it doesn't.
Why is it critical for software leaders to focus on this topic right now, as we head into 2026?
If you decide to skip this because it's yet another talk about AI, I totally understand. I sometimes think there's too much talk about AI and not enough doing. This one's a field report; I'm definitely not going to try to convince you that Claude will solve all your problems.
What are the common challenges developers and architects face in this area?
Loss of control feels very scary. What once felt like a comfortable on-call rotation, where you knew all the nooks and crannies, now includes a large language model that sometimes finds the issue faster than you can and sometimes feels like an overconfident junior.
What's one thing you hope attendees will implement immediately after your talk?
Curiosity for experimenting. We're all learning together.
What makes QCon stand out as a conference for senior software professionals?
The audience has been paged at 3am and has worked with mission-critical systems. I can skip the intro and go straight to the interesting part.
Speaker
Alex Palcuie
Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform
Alex Palcuie is a Member of Technical Staff in AI Reliability Engineering at Anthropic, where he works on keeping Claude reliable at scale. He has the unenviable task of having to fix Claude without Claude when it goes down. Previously, he was a Staff Site Reliability Engineer on Google Cloud Platform (GCP) and a member of Google's Tech IRT (Incident Response Team), handling large-scale infrastructure incidents including the kind where datacentres flood.