You are viewing content from a past/completed QCon

Track: Chaos and Resilience: Architecting for Success

Location: Fleming, 3rd flr.

Day of week: Wednesday

Understanding the complex socio-technical systems found in software today is paramount to the future success of organizations and of the industry in general. In this track, you will learn from the stories of practitioners who are bringing change to their organizations in new and previously unheard-of ways through Chaos Engineering, Resilience Engineering, and critical reflection on Cognitive Systems Engineering and Human Factors techniques. The practitioners in this track will show you how bringing these disciplines to software helps organizations learn and grow in beneficial ways, such as making better architecture decisions, growing and distilling technical expertise, and having more confidence when disaster strikes. You will leave this track having explored more effective approaches and techniques for building the adaptive capacity, in both people and technology, to manage the consequences of failure successfully.

Track Host: Nora Jones

Senior Developer/ Engineer

Nora is a dedicated and driven technology leader and software engineer with a passion for people and reliable software, as well as the intersection between those two worlds. She truly believes that safety is pivotal to software development today. She co-wrote two O’Reilly books on Chaos Engineering and on how a product’s availability can be improved through intentional failure experimentation.

She has also shared her experiences helping organizations large and small reach crucial availability. In November of 2017 she keynoted at AWS re:Invent to share these experiences with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. Since then she has keynoted at several other conferences throughout the world, highlighting topics such as Resilience Engineering, Chaos Engineering, Human Factors, Site Reliability, and more from her work at Netflix, Slack, and Jet.com.

10:35am - 11:25am

Better Resilience Adoption through UX

Too often, attempts to bring resilience engineering to an organization fall flat. Perhaps there’s some initial interest, but that wavers under the crushing weight of JIRA queues and sprint reviews. The tools are there but no one’s using them.

This session will go over three case studies where teams achieved success (and a few where they didn't!) by focusing on the human element of engineering tooling. In each one, we’ll look at a specific UX technique the team employed to put their company on a path to resilience.

Randall Koutnik, UI Engineer

11:50am - 12:40pm

Preparing for the Unexpected

Convincing engineers to be on-call isn’t always straightforward. In 2019 the Customer Products group at the Financial Times set out to make their out-of-hours support process more sustainable after losing a number of people from their on-call team.

In this talk you’ll discover how to continuously learn from past incidents by applying your team’s most recent operational experience, increase the confidence of your team in handling live incidents away from the pressures of production, and convince them that, actually, joining the on-call team is a great idea!

Hear how the Financial Times is using incident workshops to prepare for the unexpected and make incident management a more consistent process by sharing the group’s wide range of operational knowledge and architectural insights.

Samuel Parkinson, Principal Engineer @FinancialTimes

1:40pm - 2:30pm

How Many Is Too Much? Exploring Costs of Coordination During Outages

Service outages can attract a lot of attention from a wide range of participants - particularly when the service is for a business-critical function. These ‘stakeholders’ represent multiple roles with different experience, responsibilities, expertise and knowledge about how the system functions - be they users, management, engineers from other dependent services or the incident responders paged in to help with the response. Each stakeholder brings important contributions that are necessary for maintaining reliable operations. But smoothly and effectively integrating their contributions, or sufficiently meeting their needs for updates, for task delegation or for decisions, requires elaborate coordination, often under extreme time pressure. Prior research has shown these coordinative efforts represent a significant cognitive cost (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) and require a distinct set of skills (Woods, 2017) to manage in concert with the demands of diagnosing and resolving the incident itself.

Presenting findings from her doctoral research and her experience working with site reliability engineers responsible for critical digital infrastructure (CDI), Laura will uncover the hidden costs of coordination, highlight how the challenges of modern IT infrastructure will continue to impede hitting four 9’s service reliability, and show how resilient performance is directly tied to coordination. Along the way, she will examine problematic elements of an Incident Command System, use case study examples to describe helpful and harmful patterns of coordination, and offer some promising directions for how to control the costs of coordination in your incident response practices. You will never look at incident response the same way!

Laura Maguire, Cognitive Systems Engineer & Researcher

2:55pm - 3:45pm

Growing Resilience: Serving Half a Billion Users Monthly at Condé Nast

Serving over half a billion monthly customers while keeping service availability high is a monumental task. Condé Nast operates in nearly 40 countries and is best known for its portfolio of household brands such as Vogue, Wired, Vanity Fair, and The New Yorker. Our globally distributed platforms run more than 15 Kubernetes clusters in more than 5 geographic regions, use a multi-CDN / edge architecture, and employ a microservices approach, multi-tenanted web applications, and high-throughput data streaming architectures, to outline just some of the technical challenges that we take on and operate daily.
 
Many of us are facing these challenges - so how do we cultivate and grow our organisation's capacity to adapt to the unknown in ever fluid, dynamic socio-technical systems? In this talk I will outline how Condé Nast practices Chaos Engineering, where this fits within the already established testing and verification ecosystem, and what emergent practices and tools are on the horizon. Observability is at the core of understanding and drawing inferences about our systems. We’ll dispel some myths about touted “Best Practices” for Observability, looking beyond to emergent trends in signals, metrics, and tracing. Last but not least, I’ll cover how to build up your organisation’s true superpower: Human Resilience.

Crystal Hirschorn, VP Engineering, Global Strategy & Operations @CondeNast

4:10pm - 5:00pm

Rethinking How the Industry Approaches Chaos Engineering

To determine and envision how to achieve the reliability and resilience that drive our businesses forward, organizations must be able to look back at past blunders unobscured by hindsight bias. Resilient organizations don’t take past successes as a reason for confidence. Instead, they use them as an opportunity to dig deeper, find underlying risks, and refine mental models of how their systems succeed and fail.

There are key components of Chaos Engineering beyond building tools for experimenting in production and running game days. Understanding each individual's concerns, ideas, and mental models of how the system is structured, and learning where your organization excels in technical and human resilience, are things that can’t be automated away by code. This talk will address the three different phases of Chaos Engineering and the hidden goals within each phase that might be the greatest benefit of all: using Chaos Engineering as a way to distill expertise.

The chronically under-invested phases of Chaos Engineering in our industry are the Before and After phases, and these tend to fall on a single individual to complete, usually a facilitator. This is someone who can act as a third party during the experiment, but who beforehand will educate themselves on what the team is going through, their systems, and how it all works. If we only optimize for finding issues before they become incidents, we miss the real point of Chaos Engineering, which is refining the mental models of our systems and distilling expertise.

In this talk we focus on the Before and After phases of developing Chaos Engineering experiments (whether they be gamedays or driven by software) and develop important questions to ask in each of these phases. We will also dig into some of the Ironies of Automation present in Chaos Engineering today.

Nora Jones, Senior Developer/ Engineer
