You are viewing content from a past/completed QCon

Presentation: How Condé Nast Succeeds by a Culture That Embraces Failure

Track: Architecting for Failure: Chaos, Complexity, and Resilience

Location: Churchill, G flr.

Duration: 2:55pm - 3:45pm

Day of week: Wednesday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Find out how Condé Nast developed a culture that embraces possible failure.
  2. Learn some of the arguments to use to advocate for Chaos Engineering.
  3. Hear about some of the tooling to leverage in Chaos Engineering.

Abstract

Systems architectures are increasingly diverse to serve the growing demands for scalability, fault tolerance, isolation, and extensibility. But the trade-off is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true with the prevalence of microservices architectures and a growing reliance on vendor capabilities which are largely out of our control. While errors and incidents themselves cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Experimentation rigour as a cultural practice and habit can identify constraints in the current design and support predictions about the emergence of newer patterns for handling failures gracefully, such as preventing failure cascades. Another important benefit is aligning people’s mental models of how the software is designed and operated. Crystal will walk through learnings from building a culture that embraces failure through Chaos Engineering practices as a daily routine, and what her teams have learned and adapted for their platforms at Condé Nast International, which currently serve in excess of 220 million unique users every month across the globe.

Question: 

Tell me a bit about the work that you do.

Answer: 

I'm the Director of Engineering and Cloud Platforms. I oversee the whole software engineering function at Condé Nast International, which is better known for its portfolio of magazines such as Vogue, GQ, Wired, Vanity Fair and Glamour. It's an international company: we have operations in 11 different countries around the world in Asia, Europe and South America. Additionally, we have many further licensee countries running the same publications around the world, making 28 countries in total. We have distributed engineering teams in all 11 countries, ranging from 7 to 30 engineers in each location, as well as a recently established headquarters in London. The London HQ began just over two years ago and I have grown the Engineering function in London from 4 engineers to 65 now, which is quite sizeable growth over two years. I have nearly 20 years of experience in the software industry. Before my current role as VP, I oversaw multiple teams as a Technical Lead and Principal Engineer at the BBC for many years. I've worked in almost all types of engineering including back-end, front-end, operations, and platform engineering. I believe this holistic experience has given me a deep understanding of the practice and of how to manage large interacting systems, teams and disciplines.

Question: 

What's the TLDR for your talk?

Answer: 

It covers practical experience of first understanding Resilience Engineering, and then of how to get practices and techniques such as Chaos Engineering adopted in your own workplace. Adoption is hard, but what’s harder is establishing this culture for the long term, so I’ll also discuss strategies for doing this. I will explain what we have done at Condé Nast using some real-world examples from a media / publishing company that has managed to adopt these practices. Along the way we'll talk about how we've used certain techniques around Chaos Engineering and what we've done in terms of setting up our observability practices to match.

Question: 

Was Chaos Engineering a top-down or bottom-up push?

Answer: 

I would say that I advocated heavily internally, especially for sponsorship from the executive level. I had used some of these practices and techniques at a previous company and had seen the huge benefits we reaped, so I was keen to establish this culture early on at Condé Nast. I like to use a technique called “nudging”, such as sending out links and interesting talks that people can go and watch. I started speaking to the software engineers, to people in Technology Service Operations and to other parts of Technology to ensure a good foundational understanding was being applied across the business. There were a couple of other people who were really enthusiastic and passionate about it as well, particularly within engineering, and who also did the advocacy once the ball got rolling. This is ideal. I would prefer it to be more grassroots, but often those efforts will only have limited influence without sponsorship from above.

Question: 

Can you give me an example of how you got consensus on moving towards Chaos Engineering?

Answer: 

When we got to a point where we were beginning to launch new websites and services, it became apparent that we needed to strengthen our resilience both within our systems and our teams. We also wanted to ensure we were effectively communicating, planning, architecting and operating with other parts of Technology and the wider business.

Having some experience in resilience engineering made it easier for me to influence my peers and those above to build in time for, and invest in, ensuring we could recover from failure. Failure is inevitable after all. How you mitigate and respond to it organisationally is what can set companies apart.

We also did data analysis over a period of about three months to see how our applications and services were performing, so that we had some hard evidence. We put a heavy emphasis on making sure all of our runbooks followed a standard template. We began doing role-playing Game Days to simulate failures hypothetically, and then tried to follow the processes, artifacts such as runbooks and diagrams, metrics, and communications to reach a speedy resolution. We did this many times before actually breaking anything, and I would suggest this to anyone starting out, as these steps alone have huge benefits.
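As a rough illustration of what enforcing a standard runbook template might look like, here is a minimal Python sketch. The section names in REQUIRED_SECTIONS and the "Heading:" line convention are assumptions made for this example; the interview does not describe the actual template used at Condé Nast.

```python
from typing import List, Sequence

# Hypothetical template: these section names are assumptions for illustration.
REQUIRED_SECTIONS = ("Summary", "Symptoms", "Diagnosis", "Mitigation", "Escalation")


def missing_sections(
    runbook_text: str, required: Sequence[str] = REQUIRED_SECTIONS
) -> List[str]:
    """Return the required sections that do not appear as 'Heading:' lines."""
    # Treat any line ending in a colon as a heading, compared case-insensitively.
    headings = {
        line.strip().rstrip(":").lower()
        for line in runbook_text.splitlines()
        if line.strip().endswith(":")
    }
    return [name for name in required if name.lower() not in headings]
```

A check like this could run before a Game Day so that everyone follows runbooks with the same shape; an empty result means the runbook has every required section.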

Finally we moved on to do Chaos Experiments not just on staging but production as well.
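For readers new to the idea, a chaos experiment is commonly structured as: confirm a steady state, inject a fault, check whether the steady state still holds, and always roll back. The Python sketch below shows that shape under stated assumptions. ToyService, run_experiment and ExperimentResult are hypothetical names invented for this example and are not the tooling discussed in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    rolled_back: bool


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject a fault, re-check, and always roll back."""
    if not steady_state():
        # Do not inject faults into a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    try:
        inject_fault()
        steady_during = steady_state()
    finally:
        # Roll back even if the injection or the check raised an error.
        rollback()
    return ExperimentResult(True, steady_during, True)


class ToyService:
    """Hypothetical stand-in for a service with several replicas."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total

    def serving(self) -> bool:
        # Assumed steady-state hypothesis: at least one replica is healthy.
        return self.healthy >= 1
```

Running run_experiment(svc.serving, svc.kill_one, svc.restore) against ToyService(3) would report that the steady state held during the fault, while ToyService(1) would show it breaking, which is the kind of finding an experiment is meant to surface before a real incident does.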

Question: 

Who are you targeting in the talk?

Answer: 

This talk is aimed at people who want to do the advocacy and get the buy-in, but perhaps don't know how. There will be practical advice about how to prove that this is a good idea for the company. I will also reveal some of the instrumentation, metrics and chaos tooling we use.

Question: 

What are some other key takeaways that you think the talk will offer?

Answer: 

It will show the tooling that we have implemented, what worked and what didn't, and the way that we've extended some of the tooling to work in our particular environment and context. I'll talk about the observability tools and practices that we've implemented, because things like tracing are not well implemented in a lot of companies.

Speaker: Crystal Hirschorn

Crystal Hirschorn is Director of Engineering and Cloud Platforms @CondeNast

Crystal Hirschorn is Director of Engineering and Cloud Platforms at Condé Nast, which is better known for its portfolio of global brands such as Vogue, GQ, Vanity Fair and Wired. She is building an awesome engineering organisation and technically leading a digital transformation to build unified technology platforms deployed across the globe. The majority of her career has been as a hands-on software engineer, with more than 15 years' experience working mainly within the media and government sectors, tackling the challenges of building complex system architectures and scaling software and infrastructure. Previously, she led the online technical strategy for many BBC News election events, including the last general election, which served more than 65 million requests in a 24-hour period, with traffic peaking at 3.2 million concurrent requests.
