Abstract
Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.
Interview:
What is your session about, and why is it important for senior software developers?
It's a field report on using LLMs for incident response. I run a production AI system and these days I reach for Claude before I reach for a dashboard. It's still taboo to say this, and sometimes it would have been better to just open the dashboard, but I want to discuss when reaching for the LLM is the right call and when it isn't.
Why is it critical for software leaders to focus on this topic right now, as we head into 2026?
If you decide to skip this because it's yet another talk about AI, I totally understand. I sometimes think there's too much talk about AI and not enough doing. This one's a field report; I'm definitely not going to try to convince you that Claude will solve all your problems.
What are the common challenges developers and architects face in this area?
Loss of control feels very scary. What once felt like a comfortable on-call rotation, where you knew all the nooks and crannies, now includes a large language model that sometimes finds the issue faster than you can and sometimes feels like an overconfident junior engineer.
What's one thing you hope attendees will implement immediately after your talk?
Curiosity for experimenting. We're all learning together.
What makes QCon stand out as a conference for senior software professionals?
The audience has been paged at 3am and has worked with mission-critical systems. I can skip the intro and go straight to the interesting part.
Speaker
Alex Palcuie
Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform
Alex Palcuie is a Member of Technical Staff in AI Reliability Engineering at Anthropic, where he works on keeping Claude reliable at scale. He has the unenviable task of having to fix Claude without Claude when it goes down. Previously, he was a Staff Site Reliability Engineer on Google Cloud Platform (GCP) and a member of Google's Tech IRT (Incident Response Team), handling large-scale infrastructure incidents including the kind where datacentres flood.