Presentation: How Many Is Too Much? Exploring Costs of Coordination During Outages
This presentation is now available to view on InfoQ.com
Watch video with transcript
Abstract
Service outages can attract a lot of attention from a wide range of participants, particularly when the service supports a business-critical function. These ‘stakeholders’ represent multiple roles with different experience, responsibilities, expertise and knowledge about how the system functions, be they users, management, engineers from other dependent services or the incident responders paged in to help with the response. Each stakeholder brings contributions that are necessary for maintaining reliable operations. But smoothly and effectively integrating those contributions, or sufficiently meeting stakeholders' needs for updates, task delegation or decisions, requires elaborate coordination, often under extreme time pressure. Prior research has shown that these coordinative efforts carry a significant cognitive cost (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) and require a distinct set of skills (Woods, 2017) to manage in concert with the demands of diagnosing and resolving the incident itself.
Presenting findings from her doctoral research and her experience working with site reliability engineers responsible for critical digital infrastructure (CDI), Laura will uncover the hidden costs of coordination, highlight how the challenges of modern IT infrastructure will continue to impede hitting four 9’s service reliability and show how resilient performance is directly tied to coordination. Along the way, she will examine problematic elements of an Incident Command System, use case study examples to describe helpful and harmful patterns of coordination and offer some promising directions for how to control the costs of coordination in your incident response practices. You will never look at incident response the same way!
What is the work you’re doing today?
I make invisible work visible.
I spent the last three years studying the incident response practices of reliability engineers across a Consortium of tech companies. My research shows that much of the cognitive work involved in detecting, diagnosing and resolving incidents across distributed teams is unacknowledged. As you might expect, it's unacknowledged because it is largely invisible. It's hard to trace the thinking and mental effort that goes into debugging code or investigating the sources of a cascading failure. Resilience Engineering gives us the methodologies to reveal this kind of effort and the capabilities to design better for it.
What are your goals for the talk?
I want developers to see what I see: that supporting the coordination of the multiple, diverse perspectives needed to cope with challenging problems is central to reliability and that the skills needed to do this are quite sophisticated. My goal is to give the audience a lens to start looking at problems of poor coordination so they can innovate their incident management practices.
What do you want people to leave the talk with?
My sense is that most people will leave the talk with a new appreciation for their work (or that of the teams they manage) and be inspired to rethink the tooling and practices for on-call engineers. My hope is that at next year's QCon we see presentations from attendees about how they are managing incidents differently and finding new ways to learn from them!
What do you think is the next big disruption in software?
I'm biased, but I think it will come from companies that recognize that, in order to move faster and scale bigger, you need to design collaborative automation that coordinates well with its human co-workers. Currently, we view automation and tooling as replacements for human activity. If we re-imagine automation instead as hiring a new team member, we start to understand the dynamic differently. It's difficult to partner with someone who has hard limits on understanding the context of problems, and who implicitly depends on human colleagues to be able to work effectively. Thinking about those interactions and how to coordinate them has the potential to get everyone moving faster and more accurately, which ultimately drives performance.
Last Year's Tracks
Monday, 2 March
-
Next Generation Microservices: Building Distributed Systems the Right Way
Microservice-based applications are everywhere, but well-built distributed systems are not so common. Early adopters of microservices share their insights on how to design systems the right way.
-
Streaming Data Architectures
Today's systems process huge volumes of continuously changing data. Hear how the innovators in this space are designing systems and leveraging modern data stream processing platforms.
-
Driving Full Cycle Engineering Teams at Every Level
"Full cycle developers" is not just another catch phrase; it's about engineers taking ownership and delivering value, and doing so with the support of their entire organisation. Learn more from the pioneers.
-
When Things Go Wrong: GDPR, Ethics, & Politics
Privacy, confidentiality, safety and security: learning from the frontlines, from both good and bad experiences.
-
JavaScript: Pushing the Client Beyond the Browser
JavaScript is not just the language of the web. Join this track to learn how the innovators are pushing the boundaries of this classic language and ecosystem.
-
Modern CS in the Real World
Head back to academia to solve today's problems in software engineering.
Tuesday, 3 March
-
Architectures You've Always Wondered About
Hard-earned lessons from the names you know on scalability, reliability, security, and performance.
-
The Future of the API: REST, gRPC, GraphQL and More
The humble web-based API is evolving. This track provides the what, how, and why of future APIs.
-
Building High Performing Teams
There are many discussions outlining the secret sauce of high-performing teams. Learn how to balance the essential ingredients of high performing teams such as trust and delegation, as well as recognising the pitfalls and problems that will ruin any recipe.
-
Machine Learning: The Latest Innovations
AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice.
-
Bare Knuckle Performance
Crushing latency and getting the most out of your hardware.
-
Modern Compilation Targets
Learn about the innovation happening in the compilation target space. WebAssembly is only the tip of the iceberg.
Wednesday, 4 March
-
Growing Unicorns in the EU: Building, Leading and Scaling Financial Tech Start Ups
Learn how EU FinTech innovators have designed, built, and led both their technologies and organisations.
-
Kubernetes and Cloud Architectures
Learn about cloud native architectural approaches from the leading industry experts who have operated Kubernetes and FaaS at scale, and explore the associated modern DevOps practices.
-
Chaos and Resilience: Architecting for Success
Making systems resilient involves people and tech. Learn about strategies being used, from cognitive systems engineering to chaos engineering.
-
Leading Distributed Teams
Remote and distributed working are increasing in popularity, but many organisations underestimate the leadership challenges. Learn from those who are doing this effectively.
-
Scaling Security, from Device to Cloud
Implementing effective security is vitally important, regardless of where you are deploying software applications.
-
Evolving Java
JVM futures, JIT directions and improvements to the runtime stack are the themes of this year’s JVM track.