Presentation: Learning From Chaos: Architecting for Resilience

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Whittle, 3rd flr.

Duration: 11:50am - 12:40pm

Day of week: Wednesday

What You’ll Learn

  1. Learn what Chaos Engineering is and, especially, what it is not.

  2. Hear how others in the real world are doing Chaos Engineering.

  3. Find out about some of the free, simple ways Chaos Engineering can be done and how it can help an organization.

Abstract

In this talk Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in exploring how their systems can improve through embracing failure.
 
Drawing from a collection of real-world examples and experience reports, Russ will show how you can set up your systems to learn from controlled failure and make resilience an important competitive edge for your organisation.

Question: 

What are you doing today?

Answer: 

I am CEO at ChaosIQ. Our mission is to help everyone in the world do Chaos Engineering, regardless of the size of their pockets, because we feel that it's probably the most important technique anyone can apply to improve the resilience and reliability of their systems.

Question: 

Do you have a product that you sell or is it an open source tool?

Answer: 

It's an open source project, but we have a product now as well. By the time of QCon, there will be two products. At the moment we have the Chaos Toolkit, which is very popular. It helps you do Chaos Engineering, and you can go a long way with it. We've got customers that are using it to extreme lengths. It's amazing. Then we have the Chaos Platform open source project, which has all of the neat collaboration things that Chaos Engineering benefits from. The platform is all about that side of things. They're both open source, and we've got a commercial product, also Chaos Platform, which is the one that people buy. It's used by a whole bunch of big financial institutions now. In the biggest deployment, they've got a thousand teams in one company.

It's in our mission to say everyone should be doing Chaos Engineering; it's not just for special Chaos engineers. It's for everybody. We realized that early on, and obviously we do training. Right now, Chaos Engineering is in education mode. People are becoming familiar with what it is and what it isn't. It's tremendously important to position it correctly, and not let it seem simpler than it is. One of my big beefs is that people think it's just about destroying virtual machines using Chaos Monkey or something like that.

I had a good discussion recently with Simon Wardley on that, because he put out a tweet saying "Chaos Engineering is all about randomness." No. Of the experiments I encounter, and I encounter a lot, I would say probably 1% incorporate some sort of randomness. The rest is very controlled, very careful exploration of weaknesses, trying to find out who did that. That's what ChaosIQ is for. And I do not say "You have to have our product to do this." The first thing I say is "You don't have to have any product to do this, you can do this yourself. You could do one little technique and you'd be better tomorrow." Then I'd say, when you want to automate it, because doing it often is a good idea and maybe you've got other things to do with your life, then you can use the free stuff and go an awful long way with that. It's been my standard approach for my entire career. You can use it, it's free. You can go an awful long way with it, but if you want support, there's a product. Same story here.

Question: 

What are the goals for your talk?

Answer: 

Sharing some real-world Chaos Engineering. We are still in the education area. I think people are maturing quicker in this field than anywhere else I've seen. About two years ago I would have been saying to people, "This is Chaos Engineering," and I'd be fighting the misunderstandings of it. This year I've watched the audience change; the audience now knows what it should be. So now, defining it and framing it takes five minutes. What matters is how people are actually using it.

The whole point is to show two things: real Chaos experiments and the sorts of people doing them, because it's not a specialized Chaos Engineering force; we're not enabling another security group model. Everybody can practice this. The subtext is that you can do this, and this is what it looks like when you do it. On top of that, a big takeaway is how it relates to resilience engineering. Resilience engineering is misunderstood as well, unfortunately. The basic big message for anyone who's there is, if your organization doesn't invest in resilience then it is missing a trick, because your competitors will. And if you're going fast, and I think everyone tries to go fast these days, then going fast and not breaking things is about resilience. They go hand in hand.

Question: 

What do you want someone to leave the talk with?

Answer: 

I want them to leave the talk caring about Chaos Engineering more. I want them to go back to base and go, "I know what it is, I know I can do it in a really small way." The best one I've had is from someone who came to the talk a few months ago. This is the idea. He said, "I went back into work this morning, I tried a few things and I found three weaknesses in production that we didn't even know were there." And he said just thank you: on behalf of the company, thank you. So that's my deal. That would be ideal: people knowing they can do something tomorrow that is going to find weaknesses. I mean, it's not the first time either. And there's nothing magical in this; it is just knowing that you should perhaps look.

Speaker: Russ Miles

CEO of @chaosiqio

Russ Miles is CEO of ChaosIQ.io where he and his team build commercial and open source (ChaosToolkit.org) products and provide services to companies applying Chaos Engineering to build confidence in the resilience of their production systems. 
Russ is an international speaker, trainer and author. Most recently he has been writing the handbook for Chaos Engineering for O'Reilly and has published "Antifragile Software: Building Adaptable Software with Microservices", where he explores how to apply Chaos Engineering to construct and manage complex, distributed systems in production with confidence. He also delivers public and private courses on Chaos Engineering and Resilience Engineering around the world and online for O'Reilly Media.
