Unconference: Debugging Distributed Systems
From the same track
Orienting, Understanding, Playing, Thriving: Debugging your Organisation
Tuesday Mar 17 / 10:35AM GMT
Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.
Hazel Weakly
Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering
How Eve Online Leverages Head Based Sampling to Observe "Fun"
Tuesday Mar 17 / 11:45AM GMT
A unique pattern in video game software is real-time interactions to express the personality of users. Here we will talk about how we instrument the universe of New Eden to identify the traffic that matters, even the "fun" parts!
Nicholas Herring
Technical Director, Eve Online @CCP Games, Refiner of Internet Spaceships and Explorer of Feral Gordian Knots of Python
Can Claude Fix Itself? Using LLMs for Incident Response
Tuesday Mar 17 / 02:45PM GMT
Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.
Alex Palcuie
Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform
Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability
Tuesday Mar 17 / 03:55PM GMT
Observability is supposed to help you tame complexity, but your observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.
Colin Douch
Site Reliability Engineer @DuckDuckGo
Are We All on the Same Page? Let’s Fix That - With AI Assistance
Tuesday Mar 17 / 05:05PM GMT
In distributed systems, incidents rarely fail because of missing signals - they fail because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure.
Luis Mineiro
Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando