Understanding Progressive Collapse: How To Avoid A Cascading Failure

Abstract

Small things going wrong can quickly snowball. The cascading failure is a nightmare scenario for any system. An initial problem that seems minor in isolation can kick off a chain reaction of ever-increasing failures, potentially leading to catastrophic results.

When the failure of a single component results in the failure of other connected elements, this is known as a progressive collapse. In this talk, Sam Newman looks at this phenomenon in more detail and examines how it has manifested in major disasters. Based on lessons learned from other industries, Sam will share three key techniques that can be used to mitigate progressive collapse in your own system.

This talk will help you understand how to architect your systems in such a way that small failures stay small.

Interview:

What is your session about, and why is it important for senior software developers?

My session explores what happens when a small initial problem causes a giant catastrophe. In the context of buildings, this is called progressive collapse. In my talk, I look at what happens when a building suffers a progressive collapse, how these collapses can be mitigated, and what parallels we can draw to deal with the cascading failures we see in distributed systems.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

My session is about how disparate parts of a system interact, especially in the context of increasingly distributed systems. How we write code may have changed a lot over the last couple of years, but the fundamentals of system design and the challenges of distributed systems still remain.

What are the common challenges developers and architects face in this area?

  1. When something goes wrong, they tend to look for one obvious cause, blame that, and move on without looking at wider systemic issues.
  2. Too much focus on stopping things from breaking, and not enough time spent on understanding how the system can continue to work when something does break.

What's one thing you hope attendees will implement immediately after your talk?

Stop looking for single causes of failure!

What makes QCon stand out as a conference for senior software professionals?

The curated tracks are what help QCon stand apart. You get far fewer clashes between tracks, and each individual track ends up having something for everyone.


Speaker

Sam Newman

Microservice, Cloud, CI/CD Expert, Author of "Building Microservices" and "Monolith to Microservices", 20+ Years Experience as a Developer

Sam Newman is an independent consultant who loves solving problems with technology. Focusing primarily on cloud, microservice architecture, and continuous delivery, Sam works with companies big and small all over the world. He is also an experienced conference speaker, and author of the O’Reilly books Monolith To Microservices, Building Microservices, and the forthcoming Building Resilient Distributed Systems.

From the same track

Session resilience

How to Find Resilience Bugs in Systems that Don't Exist

Wednesday Mar 18 / 10:35AM GMT

Building correct distributed systems takes thinking outside the box, and the fastest way to do that is to think inside a different box. One different box is "formal methods", the discipline of mathematically verifying software and systems.

Hillel Wayne

Author of "Logic for Programmers" and "Learn TLA+", Thought Leader in the Space of Empirical Software Engineering

Session decentralized

Spritely: Infrastructure for the Future of the Internet

Wednesday Mar 18 / 11:45AM GMT

Let's take back the internet! Learn about Spritely's work to re-decentralize the net with new foundational technologies that put users in control.

Christine Lemmer-Webber

Executive Director @Spritely Institute, Co-Author of ActivityPub

David Thompson

CTO @Spritely Institute

Session

Maintaining Data Integrity During Regional Outages

Wednesday Mar 18 / 02:45PM GMT

Details coming soon.

Session

Migrating Legacy Monoliths to Resilient Microservices Without Downtime

Wednesday Mar 18 / 03:55PM GMT

Details coming soon.