You are viewing content from a past/completed QCon

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Whittle, 3rd flr.

Day of week: Wednesday

Making systems resilient involves people and tech. Learn about the strategies being used, from chaos testing to distributed system clustering.

Track Host: Nicki Watt

Chief Technology Officer @OpenCredo

Nicki Watt currently serves as Chief Technology Officer at OpenCredo, a pragmatic, hands-on software consultancy with specialisms in data engineering, ML & cloud native solutions. Her technical career has seen her wear many hats, from Engineer and Systems & Technical Architect to Consultant and now CTO. She is a techie at heart, with involvement in the development, delivery and leading of large scale platform and application development projects. Nicki is also co-author of the graph database book Neo4j in Action.

10:35am - 11:25am

Building Resilient Serverless Systems

In this brave new world of serverless, we entrust our vendors with keeping the infrastructure up and running. However, when even cloud behemoths like Amazon Web Services and Google Cloud have outages and failures, how can we build resilient systems?

 

John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures, even while taking advantage of fully managed vendor services and platforms. He then leads an end-to-end demo of the resilience of a well-architected serverless system in the face of massive simulated failure. He further demonstrates how the system not only provides resilience to failure, but also has the side effect of improving the end-user experience.

Finally, John discusses some of the drawbacks and idiosyncrasies of the approach. All source code, infrastructure templates, and slides will be available for the audience to download and explore. While the examples largely focus on AWS (including API Gateway, CloudFormation, DynamoDB, Lambda, and Route 53), the techniques discussed are broadly applicable across cloud vendors.

Johnathan Chapin, Cloud Technology Consultant with expertise in Serverless Computing

11:50am - 12:40pm

Learning From Chaos: Architecting for Resilience

In this talk Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in exploring how their systems can improve through embracing failure.

 

Drawing from a collection of real-world examples and experience reports, Russ will show how you can set up your systems to learn from controlled failure and make resilience an important competitive edge for your organisation.

Russell Miles, CEO of @chaosiqio

1:40pm - 2:30pm

An Engineer's Guide to a Good Night's Sleep

As organisations look to empower engineers more and embrace DevOps practices, we have seen the support role change quite a bit too. Developers are moving from being purely third line support to working more collaboratively with engineers and operational staff. Also, as we move to cloud native microservice solutions, the increased complexity and diversity of our production landscape means operational staff may well rely more heavily on the engineers, in particular out of hours.

I have spent the last 18 years working across a plethora of industries, utilising a myriad of technologies and approaches. Having worked on everything from trading applications to content enrichment APIs, I have seen a lot of approaches and processes that try to help minimise operational support for developers.

In this talk, I will be exploring and discussing some of my top approaches and techniques to help reduce the risk of that dreaded 3am call! You will gain some practical insight into how to handle failure in today's more complex distributed microservice systems. This will include looking at approaches to resiliency, understanding your system, understanding the requirements for fault tolerance, and the developers' mindset necessary for this. I will be peppering this talk with real-world examples, and an occasional war story along the way too.

Nicky Wrightson, Principal Engineer @riverisland

2:55pm - 3:45pm

How Condé Nast Succeeds by a Culture That Embraces Failure

Systems architectures are increasingly diverse to serve the growing demands for scalability, fault tolerance, isolation, and extensibility. But the compromise is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true with the prevalence of microservices architectures and a growing reliance on vendor capabilities which are largely out of our control. While errors and incidents themselves cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Rigorous experimentation, as a cultural practice and habit, can identify constraints in the current design and inform predictions about newer patterns that handle failures gracefully, such as preventing failure cascades. Another important benefit is aligning people’s mental models of how the software is designed and operated. Crystal will walk through the findings from building a culture that embraces failure through Chaos Engineering practices as a daily routine, and what her teams have learned and adapted for their platforms at Condé Nast International, which currently serve in excess of 220 million unique users every month across the globe.

Crystal Hirschorn, Director of Engineering and Cloud Platforms @CondeNast

4:10pm - 5:00pm

Amplifying Sources of Resilience: What Research Says

Building robust software systems means anticipating how failures may occur with components and subsystems and developing answers to the question:

“What is needed for the design of systems that prevents or limits catastrophic failure?”

Investing in, developing, and sustaining the adaptive capacity to cope with unexpected situations is at the core of Resilience Engineering. In the software community, this means developing (continually!) ever-better answers to the question:

“When our preventative designs fail us, what are ways that teams of engineers successfully anticipate, resolve, and learn from those catastrophes?”

The Resilience Engineering community has been studying how people in high-consequence/high-tempo domains answer this latter question. Applying Resilience Engineering thinking and paradigms to the world of software engineering and operations is still in its infancy, but we have some promising routes for making progress. This talk will outline productive avenues to locate, amplify, support, and build this capacity that exists (sometimes invisibly) in the expertise of your organization. Spoiler: looking closely at the origins, handling, and perception of incidents is part of this story.

John Allspaw, DevOps/Resilience Engineering Thought Leader, Previously CTO @Etsy & Co-founder of @AdaptiveCLabs
