You are viewing content from a past/completed QCon

Presentation: Learning From Chaos: Architecting for Resilience

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Fleming, 3rd flr.

Duration: 11:50am - 12:40pm

Day of week: Wednesday


This presentation is now available to view on InfoQ.com

Watch video with transcript

What You’ll Learn

  1. Learn what Chaos Engineering is and, especially, what it is not.

  2. Hear how others in the real world are doing Chaos Engineering.

  3. Find out some of the free, simple ways Chaos Engineering can be done and how it can help an organization.

Abstract

In this talk Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in exploring how their systems can improve through embracing failure.

 

Drawing from a collection of real-world examples and experience reports, Russ will show how you can set up your systems to learn from controlled failure and make resilience an important competitive edge for your organisation.

Question: 

What are you doing today?

Answer: 

I am CEO at ChaosIQ. Our mission is to help everyone in the world do Chaos Engineering, regardless of the depth of their pockets, because we feel that it's probably the most important technique anyone can apply to improve the resilience and reliability of their systems.

Question: 

Do you have a product that you sell or is it an open source tool?

Answer: 

It's an open source project, but we have a product now as well. By the time of QCon, there will be two products. At the moment we have the Chaos Toolkit, which is very popular. It helps you do Chaos Engineering, and you can go a long way with it. We've got customers that are using it to extreme lengths. It's amazing. Then we have the Chaos Platform open source project, which has all of the neat collaboration features that Chaos Engineering benefits from. The platform is all about that side of things. They're both open source, and we've got a commercial product, also called Chaos Platform, which is the one that people buy. It's used by a whole bunch of big financial institutions now. In the biggest deployment, a thousand teams in one company are using it.

It's part of our mission to say everyone should be doing Chaos Engineering; it's not just for special Chaos engineers. It's for everybody. We realized that early on, and obviously we do training. Right now, Chaos Engineering is in education mode. People are becoming familiar with what it is and what it isn't. It's tremendously important to position it correctly, and not let it seem simpler than it is. One of my big beefs is that people think it's just about destroying virtual machines using Chaos Monkey or something like that.

I had a good discussion recently with Simon Wardley on that because he put out a tweet saying "Chaos Engineering is all about randomness." No. Of the experiments I encounter, and I encounter a lot, I would say probably 1% incorporate some sort of randomness. The rest are very controlled, very careful, exploring weaknesses, trying to find out who did that. That's what ChaosIQ is for. And I do not say "You have to have our product to do this." The first thing I say is "You don't have to have any product to do this, you can do this yourself. You could do one little technique and you'd be better tomorrow." Then I'd say, when you want to automate it, because doing it often is a good idea and maybe you've got other things to do with your life, you can use the free stuff and go an awful long way with that. It's been my standard approach for my entire career. You can use it, it's free. You can go an awful long way with it, but if you want support, there's a product. Same story here.

Question: 

What are the goals for your talk?

Answer: 

Sharing some real-world Chaos Engineering. We are still in the education phase. I think people are maturing quicker in this field than anywhere else I've seen. About two years ago I would have been saying to people, "This is Chaos Engineering," and I'd be fighting the misunderstandings of it. This year I've watched the audience change; the audience now knows what it should be. So now, defining it and framing it takes five minutes. What matters is how people are actually using it.

The whole point is to show two things: real Chaos experiments and the sorts of people doing them, because it's not a specialized Chaos Engineering force, and we're not enabling another security group model. Everybody can practice this. The subtext is that you can do this, and this is what it looks like when you do it. On top of that, how it relates to resilience engineering is a big takeaway. Resilience engineering is misunderstood as well, unfortunately. The basic message for anyone who's there is: if your organization doesn't invest in resilience, then it is missing a trick, because your competitors will. And if you're going fast, and I think everyone tries to go fast these days, then going fast and not breaking things is about resilience. They go hand in hand.

Question: 

What do you want someone to leave the talk with?

Answer: 

I want them to leave the talk caring about Chaos Engineering more. I want them to go back to base and say, "I know what it is, I know I can do it in a really small way." The best feedback I've had is from someone who came to the talk a few months ago. He said, "I went back into work this morning, I tried a few things and I found three weaknesses in production that we didn't even know were there." And he said thank you on behalf of the company. That would be ideal: people knowing they can do something tomorrow that is going to find weaknesses. It's not the first time either. And there's nothing magical in this, it is just knowing that you should perhaps look.

Speaker: Russell Miles

CEO of @chaosiqio

Russ Miles is CEO of ChaosIQ.io where he and his team build commercial and open source (ChaosToolkit.org) products and provide services to companies applying Chaos Engineering to build confidence in the resilience of their production systems. 

Russ is an international speaker, trainer and author. Most recently he has been writing the handbook for Chaos Engineering for O'Reilly and has published "Antifragile Software: Building Adaptable Software with Microservices", where he explores how to apply Chaos Engineering to construct and manage complex, distributed systems in production with confidence. He also delivers public and private courses on Chaos Engineering and Resilience Engineering around the world and online for O'Reilly Media.


Tracks

  • Architectures You've Always Wondered About

    Hard-earned lessons from the names you know on scalability, reliability, security, and performance.

  • Machine Learning: The Latest Innovations

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice.

  • Kubernetes and Cloud Architectures

    Learn about cloud native architectural approaches from the leading industry experts who have operated Kubernetes and FaaS at scale, and explore the associated modern DevOps practices.

  • Evolving Java

    JVM futures, JIT directions and improvements to the runtime stack are the theme of this year’s JVM track.

  • Next Generation Microservices: Building Distributed Systems the Right Way

    Microservice-based applications are everywhere, but well-built distributed systems are not so common. Early adopters of microservices share their insights on how to design systems the right way.

  • Chaos and Resilience: Architecting for Success

    Making systems resilient involves people and tech. Learn about strategies being used, from cognitive systems engineering to chaos engineering.

  • The Future of the API: REST, gRPC, GraphQL and More

    The humble web-based API is evolving. This track provides the what, how, and why of future APIs.

  • Streaming Data Architectures

    Today's systems move huge volumes of data. Hear how the innovators in this space are designing systems and leveraging modern data stream processing platforms.

  • Modern Compilation Targets

    Learn about the innovation happening in the compilation target space. WebAssembly is only the tip of the iceberg.

  • Leaving the Ivory Tower: Modern CS Research in the Real World

    Thoughts pushing software forward, including consensus, CRDTs, formal methods & probabilistic programming.

  • Bare Knuckle Performance

    Crushing latency and getting the most out of your hardware.

  • Leading Distributed Teams

    Remote and distributed working are increasing in popularity, but many organisations underestimate the leadership challenges. Learn from those who are doing this effectively.

  • Full Cycle Developers: Lead the People, Manage the Process & Systems

    "Full cycle developers" is not just another catch phrase; it's about engineers taking ownership and delivering value, and doing so with the support of their entire organisation. Learn more from the pioneers.

  • JavaScript: Pushing the Client Beyond the Browser

    JavaScript is not just the language of the web. Join this track to learn how the innovators are pushing the boundaries of this classic language and ecosystem.

  • When Things Go Wrong: GDPR, Ethics, & Politics

    Privacy, confidentiality, safety and security: learning from the frontlines, from both good and bad experiences.

  • Growing Unicorns in the EU: Building, Leading and Scaling Financial Tech Start Ups

    Learn how EU FinTech innovators have designed, built, and led both their technologies and organisations.

  • Building High Performing Teams

    To have a high-performing team, everybody on it has to feel and act like an owner. Learn about cultivating culture, creating psychological safety, sharing the vision effectively, and more.

  • Scaling Security, from Device to Cloud

    Implementing effective security is vitally important, regardless of where you are deploying software applications.