You are viewing content from a past/completed QCon.


How Condé Nast Succeeds by a Culture That Embraces Failure

Systems architectures are increasingly diverse to serve growing demands for scalability, fault tolerance, isolation, and extensibility. The trade-off is ever more complex software to operate and maintain, often with no single shared view of the entire design. This is especially true given the prevalence of microservices architectures and a growing reliance on vendor capabilities that are largely out of our control. While errors and incidents cannot be completely eradicated from our systems, we can at least build for resilience and adaptability. Rigorous experimentation, practised as a cultural habit, can identify constraints in the current design and inform predictions about newer patterns for handling failures gracefully, such as preventing failure cascades. Another important benefit is aligning people’s mental models of how the software is designed and operated. Crystal will walk through what she and her teams learned by building a culture that embraces failure, with Chaos Engineering practised as a daily routine, and how they adapted these practices for their platforms at Condé Nast International, which currently serve more than 220 million unique users every month across the globe.

Tell me a bit about the work that you do.

I'm the Director of Engineering and Cloud Platforms. I oversee the whole software engineering function at Condé Nast International, which is better known for its portfolio of magazines such as Vogue, GQ, Wired, Vanity Fair and Glamour. It's an international company: we have operations in 11 countries around the world, across Asia, Europe and South America. We also have further licensee countries running the same publications, bringing the total to 28 countries. We have distributed engineering teams in all 11 countries, ranging from 7 to 30 engineers in each location, as well as a recently established headquarters in London. The London HQ began just over two years ago, and I have grown the Engineering function there from 4 engineers to 65, which is sizeable growth over two years. I have nearly 20 years of experience in the software industry. Before my current role as VP, I oversaw multiple teams as a Technical Lead and Principal Engineer at the BBC for many years. I've worked in almost all types of engineering, including back-end, front-end, operations, and platform engineering. I believe this holistic experience has given me a deep understanding of the practice and of how to manage large interacting systems, teams and disciplines.

What's the TLDR for your talk?

Practical experience of first understanding Resilience Engineering, and then getting practices and techniques such as Chaos Engineering adopted in your own workplace. Adoption is hard, but what’s harder is establishing this culture for the long term, so I’ll also discuss strategies for doing this. I will explain what we have done at Condé Nast, using real-world examples from a media and publishing company that has managed to adopt these practices. Along the way we'll talk about how we've used certain Chaos Engineering techniques and how we've set up our observability practices to match.

Was Chaos Engineering a top-down or bottom-up push?

I would say that I advocated heavily internally, especially for sponsorship at the executive level. I had used some of these practices and techniques at a previous company and seen the huge benefits we reaped, so I was keen to establish this culture early on at Condé Nast. I like to use a technique called “nudging”, such as sending out links and interesting talks that people can watch. I also started speaking to the software engineers, to people in Technology Service Operations and to other parts of Technology, to ensure a good foundational understanding was being applied across the business. A couple of other people, particularly within engineering, were really enthusiastic and passionate about it as well, and they took up the advocacy once the ball got rolling. This is ideal. I would prefer it to be more grassroots, but those efforts often have only limited influence without sponsorship from above.

Can you give me an example of how you got consensus on moving towards Chaos Engineering?

When we reached the point of launching new websites and services, it became apparent that we needed to strengthen our resilience, both within our systems and within our teams. We also wanted to ensure we were communicating, planning, architecting and operating effectively with other parts of Technology and the wider business.

Who are you targeting in the talk?

This talk is aimed at people who want to do the advocacy and get buy-in, but perhaps don't know how. There will be practical advice on how to prove that this is a good idea for the company. I will also reveal some of the instrumentation, metrics and chaos tooling we use.

What are some other key takeaways that you think the talk will offer?

It will show the tooling that we have implemented, what worked and what didn't, and how we've extended some of that tooling to work in our particular environment and context. I'll also talk about the observability tools and practices we've implemented, because practices such as tracing are not well implemented in a lot of companies.


Crystal Hirschorn

VP Engineering, Global Strategy & Operations @CondeNast

Crystal Hirschorn is currently VP Engineering, Global Strategy & Operations at Condé Nast, which is best known for its portfolio of global brands Vogue, Wired, Vanity Fair, The New Yorker and many more. She oversees a globally distributed engineering organisation and leads the technical...


From the same track

SESSION + Live Q&A Serverless

Building Resilient Serverless Systems

In this brave new world of serverless, we entrust our vendors with keeping the infrastructure up and running. However, when even cloud behemoths like Amazon Web Services and Google Cloud have outages and failures, how can we build resilient systems? John Chapin explains how to use...

Johnathan Chapin

Cloud Technology Consultant with an expertise in Serverless Computing

SESSION + Live Q&A Site Reliability Engineering

An Engineer's Guide to a Good Night's Sleep

As organisations look to empower engineers more and embrace DevOps practices, we have seen the support role change quite a bit too. Developers are moving from being purely third-line support to working more collaboratively with engineers and operational staff. Also as we move to cloud native...

Nicky Wrightson

Ventures CTO @blenheimchalcot

SESSION + Live Q&A Interview Available

Learning From Chaos: Architecting for Resilience

In this talk Russ Miles, CEO of ChaosIQ, will share how leading organisations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Through chaos engineering, architects are able to establish a true "learning system" where everyone is involved in...

Russell Miles

CEO of @chaosiqio

SESSION + Live Q&A Site Reliability Engineering

Amplifying Sources of Resilience: What Research Says

Building robust software systems means anticipating how failures may occur with components and subsystems and developing answers to the question: “What is needed for the design of systems that prevents or limits catastrophic failure?” Investing in, developing, and...

John Allspaw

DevOps/Resilience Engineering Thought Leader, Previously CTO @Etsy & Co-founder of @AdaptiveCLabs
